[2601.07663] Reasoning Models Will Sometimes Lie About Their Reasoning
Computer Science > Artificial Intelligence
arXiv:2601.07663 (cs)
[Submitted on 12 Jan 2026 (v1), last revised 10 Apr 2026 (this version, v3)]

Title: Reasoning Models Will Sometimes Lie About Their Reasoning
Authors: William Walden, Miriam Wanner

Abstract: Hint-based faithfulness evaluations have established that Large Reasoning Models (LRMs) may not say what they think: they do not always volunteer information about how key parts of the input (e.g. answer hints) influence their reasoning. Yet, these evaluations also fail to specify what models should do when confronted with hints or other unusual prompt content -- even though versions of such instructions are standard security measures (e.g. for countering prompt injections). Here, we study faithfulness under this more realistic setting in which models are explicitly alerted to the possibility of unusual inputs. We find that such instructions can yield strong results on faithfulness metrics from prior work. However, results on new, more granular metrics proposed in this work paint a mixed picture: although models may acknowledge the presence of hints, they will often deny intending to use them -- even when permitted to use hints and even when it can be demonstrated that they are using them. Our results thus raise broader challenges for CoT monitoring and interpretability. S...