[2511.21399] Steering Awareness: Models Can Be Trained to Detect Activation Steering
Computer Science > Computation and Language
arXiv:2511.21399 (cs)
[Submitted on 26 Nov 2025 (v1), last revised 4 Mar 2026 (this version, v2)]
Title: Steering Awareness: Models Can Be Trained to Detect Activation Steering
Authors: Joshua Fonseca Rivera, David Demitri Africa

Abstract: Activation steering - adding a vector to a language model's residual stream - is widely used to elicit latent behaviors and to probe safety-relevant properties. Many steering-based evaluations implicitly assume that the model cannot tell when such an intervention has occurred. We test this assumption by fine-tuning models to report (i) whether a steering vector was injected and (ii) which concept was injected, a capability we call steering awareness. Across seven open-source instruction-tuned models, the best achieves 95.5% detection on held-out concepts and 71.2% concept identification, with no false positives on our clean controls. We find that such detection transfers to novel vectors extracted by methods that produce directions aligned with contrastive activation addition, but fails for geometrically dissimilar methods. Crucially, detection does not confer behavioral robustness; detection-trained models are consistently more susceptible to steering in realistic settings than their base counterparts. Mechanistically, ste...
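The two operations the abstract names can be illustrated with a minimal numpy sketch: contrastive activation addition derives a steering direction as the difference of mean residual-stream activations between concept and baseline prompts, and steering then adds that direction to the hidden states. The function names, dimensions, and toy data below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def contrastive_steering_vector(pos_acts, neg_acts):
    """Contrastive activation addition (toy form): the steering direction
    is the difference of mean activations between prompts that express a
    concept (pos_acts) and prompts that do not (neg_acts)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def apply_steering(hidden, vector, alpha=1.0):
    """Inject the steering vector into every token position of a toy
    residual stream (hidden: [seq_len, d_model]), scaled by alpha."""
    return hidden + alpha * vector

# Toy stand-ins for residual-stream activations (not real model data).
rng = np.random.default_rng(0)
d_model = 8
pos = rng.normal(1.0, 0.1, size=(16, d_model))  # "concept" activations
neg = rng.normal(0.0, 0.1, size=(16, d_model))  # baseline activations

v = contrastive_steering_vector(pos, neg)
hidden = rng.normal(size=(4, d_model))
steered = apply_steering(hidden, v, alpha=2.0)
print(steered.shape)  # (4, 8)
```

In a real setting the injection happens inside a forward pass (e.g. via a hook on a chosen transformer layer); the arithmetic, however, is exactly this elementwise addition, which is why the paper can ask whether a model notices it.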