[2602.12418] Sparse Autoencoders are Capable LLM Jailbreak Mitigators
Summary
The paper presents Context-Conditioned Delta Steering (CC-Delta), a defense that uses Sparse Autoencoders (SAEs) to mitigate jailbreak attacks on large language models. Across four aligned instruction-tuned models and twelve jailbreak attacks, it achieves comparable or better safety-utility tradeoffs than baseline defenses that operate in dense latent space.
Why It Matters
Jailbreak attacks remain a persistent threat to the safety of large language models. This research suggests that off-the-shelf SAEs trained for interpretability can be repurposed as practical jailbreak defenses, and that steering in sparse SAE feature space has advantages over steering in dense activation space, particularly against out-of-distribution attacks.
Key Takeaways
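- CC-Delta identifies jailbreak-relevant sparse features by comparing token-level representations of the same harmful request with and without jailbreak context, then applies inference-time mean-shift steering in SAE latent space.
- The method achieves comparable or better safety-utility tradeoffs than baseline defenses in dense latent space, and clearly outperforms dense mean-shift steering, particularly against out-of-distribution attacks.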
- SAEs can be repurposed for jailbreak mitigation without requiring task-specific training.
- The approach is evaluated on four aligned instruction-tuned models and twelve jailbreak attacks.
- Improving AI safety is crucial as large language models become more prevalent.
Computer Science > Cryptography and Security
arXiv:2602.12418 (cs)
[Submitted on 12 Feb 2026]
Title: Sparse Autoencoders are Capable LLM Jailbreak Mitigators
Authors: Yannick Assogba, Jacopo Cortellazzi, Javier Abad, Pau Rodriguez, Xavier Suau, Arno Blaas
Abstract: Jailbreak attacks remain a persistent threat to large language model safety. We propose Context-Conditioned Delta Steering (CC-Delta), an SAE-based defense that identifies jailbreak-relevant sparse features by comparing token-level representations of the same harmful request with and without jailbreak context. Using paired harmful/jailbreak prompts, CC-Delta selects features via statistical testing and applies inference-time mean-shift steering in SAE latent space. Across four aligned instruction-tuned models and twelve jailbreak attacks, CC-Delta achieves comparable or better safety-utility tradeoffs than baseline defenses operating in dense latent space. In particular, our method clearly outperforms dense mean-shift steering on all four models, and particularly against out-of-distribution attacks, showing that steering in sparse SAE feature space offers advantages over steering in dense activation space for jailbreak mitigation. Our results suggest off-the-shelf SAEs trained for interpretability can be repurposed as practical jailbreak defenses without task-s...
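The pipeline the abstract describes (paired comparison, feature selection by statistical testing, mean-shift steering in SAE latent space) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not name the statistical test, the selection threshold, or the steering strength, so the paired t-statistic, the `threshold` cutoff, the `alpha` scale, and all function names below are assumptions chosen for illustration. Each input row is assumed to hold SAE feature activations for one matched token of a harmful request, encoded once without and once with the jailbreak context.

```python
import math
from statistics import mean, stdev


def paired_t_statistics(plain, jailbreak):
    """Return (t_stats, mean_deltas), one entry per SAE feature.

    plain[i][j] and jailbreak[i][j] are hypothetical SAE activations of
    feature j for pair i: the same harmful-request token encoded without
    and with the jailbreak context.
    """
    n_pairs = len(plain)
    n_features = len(plain[0])
    t_stats, mean_deltas = [], []
    for j in range(n_features):
        deltas = [jb[j] - p[j] for p, jb in zip(plain, jailbreak)]
        m = mean(deltas)
        s = stdev(deltas)  # sample standard deviation, needs >= 2 pairs
        if s == 0.0:
            # Constant delta: no evidence if zero, unbounded otherwise.
            t = 0.0 if m == 0.0 else math.copysign(math.inf, m)
        else:
            t = m / (s / math.sqrt(n_pairs))
        t_stats.append(t)
        mean_deltas.append(m)
    return t_stats, mean_deltas


def select_features(t_stats, threshold):
    """Indices of features whose |t| exceeds an assumed cutoff."""
    return [j for j, t in enumerate(t_stats) if abs(t) > threshold]


def steer(latent, mean_deltas, selected, alpha=1.0):
    """Shift only the selected features against their mean jailbreak delta.

    Subtracting the delta (moving away from the jailbreak context) is an
    assumption about the steering direction.
    """
    steered = list(latent)
    for j in selected:
        steered[j] -= alpha * mean_deltas[j]
    return steered
```

In this sketch, only features that pass the test are shifted, so the remaining SAE latents are left untouched. A dense mean-shift baseline would instead move every dimension of the activation vector.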