[2603.21396] Mechanisms of Introspective Awareness
Computer Science > Machine Learning
arXiv:2603.21396 (cs) [Submitted on 22 Mar 2026]

Title: Mechanisms of Introspective Awareness
Authors: Uzay Macar, Li Yang, Atticus Wang, Peter Wallich, Emmanuel Ameisen, Jack Lindsey

Abstract: Recent work shows that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept, a phenomenon cited as evidence of "introspective awareness." But what mechanisms underlie this capability, and do they reflect genuine introspective circuitry or more shallow heuristics? We investigate these questions in open-source models and establish three main findings. First, introspection is behaviorally robust: detection achieves moderate true positive rates with 0% false positives across diverse prompts. We also find this capability emerges specifically from post-training rather than pretraining. Second, introspection is not reducible to a single linear confound: anomaly detection relies on distributed MLP computation across multiple directions, implemented by evidence carrier and gate features. Third, models possess greater introspective capability than is elicited by default: ablating refusal directions improves detection by 53pp and a trained steering vector by 75pp. Overall, our results suggest that introspective awareness is behaviorally robust, grounded in nontri...
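The basic setup the abstract describes, injecting a concept vector into residual-stream activations and detecting the anomaly with a false-positive rate held at zero, can be sketched in a toy form. This is a minimal illustration with synthetic activations, not the paper's implementation: the hidden size, injection strength `alpha`, and the projection-based detector are all assumptions made for the sketch (real steering vectors are typically derived from contrastive prompts, and the paper's detection is performed by the model itself, not an external probe).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden size for the toy residual stream

# Synthetic stand-in for clean residual-stream activations (100 samples)
baseline = rng.normal(size=(100, d))

# A unit-norm "concept" steering vector (illustrative; not a real model's vector)
v = rng.normal(size=d)
v /= np.linalg.norm(v)

def inject(acts, vec, alpha=8.0):
    """Add alpha * vec to each activation, mimicking steering-vector injection."""
    return acts + alpha * vec

steered = inject(baseline, v)

# A shallow detector: project activations onto v and flag values above a
# threshold calibrated on clean activations so that false positives are 0%.
proj_clean = baseline @ v
threshold = proj_clean.max()        # strictest zero-false-positive threshold
proj_steered = steered @ v

tpr = float((proj_steered > threshold).mean())
fpr = float((proj_clean > threshold).mean())
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}")  # → TPR=1.00, FPR=0.00
```

The point of the sketch is the asymmetry the abstract reports: with the threshold calibrated for zero false positives, detection rate then depends entirely on how strongly the injection shifts activations along the concept direction, which is why interventions that unblock the readout (like the refusal-direction ablation mentioned above) can raise the true positive rate substantially.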