[2509.22067] The Rogue Scalpel: Activation Steering Compromises LLM Safety
Summary
The paper shows that activation steering, a technique for controlling LLM behavior by editing hidden states, can inadvertently compromise safety by increasing compliance with harmful requests, challenging the assumption that interpretable control is inherently safe.
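Activation steering works by adding a direction vector to a model's hidden states at inference time. A minimal conceptual sketch, using numpy arrays to stand in for a transformer layer's activations (the scale `alpha`, the dimensions, and the function name are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def steer(hidden, direction, alpha=8.0):
    # Add a scaled unit-norm steering vector to every token position's
    # hidden state; alpha controls the intervention strength (assumed value).
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
d_model = 64
hidden = rng.standard_normal((5, d_model))   # (seq_len, d_model) activations
direction = rng.standard_normal(d_model)     # e.g. an SAE feature or a random vector

steered = steer(hidden, direction)
```

In practice this addition is applied inside the model (e.g. via a forward hook on a chosen layer); the paper's finding is that even a random `direction` here can push the model toward harmful compliance.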
Why It Matters
As AI systems are integrated into more applications, understanding their safety mechanisms is crucial. This research exposes a concrete vulnerability in LLMs and underscores, for developers and AI-safety researchers, that robust safeguards must go beyond interpretability.
Key Takeaways
- Activation steering can increase harmful compliance in LLMs.
- Even random steering can lead to a significant rise in harmful outputs.
- Combining multiple vectors can create universal attacks on model safety.
- The findings challenge the belief that interpretability guarantees safety.
- Robust safety mechanisms are essential for LLM deployment.
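The universal-attack finding above rests on combining roughly 20 per-prompt jailbreak vectors into a single direction. The paper does not spell out the combination rule in the text available here, so the sketch below assumes a simple mean followed by renormalization, purely for illustration:

```python
import numpy as np

def combine_steering_vectors(vectors):
    # Combine several per-prompt steering vectors into one candidate
    # "universal" direction by averaging and renormalizing
    # (an assumed rule; the paper's exact method may differ).
    mean = np.mean(vectors, axis=0)
    return mean / np.linalg.norm(mean)

rng = np.random.default_rng(1)
d_model = 64
# 20 random unit vectors standing in for directions that each
# jailbreak a single prompt in the paper's experiments.
vectors = [v / np.linalg.norm(v) for v in rng.standard_normal((20, d_model))]

universal = combine_steering_vectors(vectors)
```

The resulting unit vector would then be added to hidden states in the same way as any single steering direction, and the paper reports that such combined vectors transfer to unseen harmful requests.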
Computer Science > Machine Learning
arXiv:2509.22067 (cs)
Submitted on 26 Sep 2025 (v1); last revised 15 Feb 2026 (this version, v2)
Title: The Rogue Scalpel: Activation Steering Compromises LLM Safety
Authors: Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Y. Rogov, Ivan Oseledets, Elena Tutubalina
Abstract: Activation steering is a promising technique for controlling LLM behavior by adding semantically meaningful vectors directly into a model's hidden states during inference. It is often framed as a precise, interpretable, and potentially safer alternative to fine-tuning. We demonstrate the opposite: steering systematically breaks model alignment safeguards, making it comply with harmful requests. Through extensive experiments on different model families, we show that even steering in a random direction can increase the probability of harmful compliance from 0% to 1-13%. Alarmingly, steering benign features from a sparse autoencoder (SAE), a common source of interpretable directions, demonstrates a comparable harmful potential. Finally, we show that combining 20 randomly sampled vectors that jailbreak a single prompt creates a universal attack, significantly increasing harmful compliance on unseen requests. These results challenge the paradigm of safety through interpretability, showing that precise contr...