[2501.16534] Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
Summary
This article summarizes a technique for extracting the safety classifier embedded in aligned large language models (LLMs). The extracted surrogate classifier, built from only a subset of the model, can be attacked directly, and the resulting jailbreak inputs transfer to the original LLM with improved efficiency and success rates.
Why It Matters
As LLMs become more integrated into various applications, ensuring their safety and alignment with ethical guidelines is crucial. By showing that alignment embeds an extractable safety classifier, this research exposes a concrete attack surface of aligned LLMs, understanding which is vital for building more robust defenses and maintaining trust and safety in AI technologies.
Key Takeaways
- Introduces a technique for extracting surrogate safety classifiers from LLMs.
- Demonstrates that surrogate classifiers can closely agree with the LLM's safety behavior while using only a fraction of the model's architecture.
- Shows that attacks on surrogate classifiers can effectively transfer to the original LLM.
- Highlights a significant improvement in attack success rates using surrogate classifiers.
- Offers insights into how alignment's refusal decision can be isolated and targeted, informing future defenses against jailbreak attacks.
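To make the agreement claim concrete, here is a minimal sketch of how a surrogate's refuse/comply decisions could be scored against the full LLM's with an F1 score (the metric the paper reports). The decision lists below are made-up toy data, not results from the paper.

```python
# Score agreement between surrogate and LLM refusal decisions with F1
# on the "refuse" class (1 = refuse, 0 = comply).
def f1_score(reference, predicted, positive=1):
    tp = sum(1 for r, p in zip(reference, predicted) if r == positive and p == positive)
    fp = sum(1 for r, p in zip(reference, predicted) if r != positive and p == positive)
    fn = sum(1 for r, p in zip(reference, predicted) if r == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy decisions for eight prompts (illustrative only).
llm_decisions       = [1, 1, 0, 1, 0, 0, 1, 1]
surrogate_decisions = [1, 1, 0, 0, 0, 1, 1, 1]
print(round(f1_score(llm_decisions, surrogate_decisions), 3))  # → 0.8
```

An F1 above 0.8, as the paper reports for its best candidates, means the surrogate refuses and complies on largely the same inputs as the original model.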
Computer Science > Cryptography and Security
arXiv:2501.16534 (cs)
[Submitted on 27 Jan 2025 (v1), last revised 18 Feb 2026 (this version, v5)]
Title: Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
Authors: Jean-Charles Noirot Ferrand, Yohan Beugin, Eric Pauley, Ryan Sheatsley, Patrick McDaniel
Abstract: Alignment in large language models (LLMs) is used to enforce guidelines such as safety. Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs. In this paper, we introduce and evaluate a new technique for jailbreak attacks. We observe that alignment embeds a safety classifier in the LLM responsible for deciding between refusal and compliance, and seek to extract an approximation of this classifier: a surrogate classifier. To this end, we build candidate classifiers from subsets of the LLM. We first evaluate the degree to which candidate classifiers approximate the LLM's safety classifier in benign and adversarial settings. Then, we attack the candidates and measure how well the resulting adversarial inputs transfer to the LLM. Our evaluation shows that the best candidates achieve accurate agreement (an F1 score above 80%) using as little as 20% of the model architecture. Further, we find that attacks mounted on the surrogate classifiers c...
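The abstract's core construction, building candidate classifiers from subsets of the LLM, can be sketched as follows. This is a hedged toy illustration, not the paper's code: the "model" is a stack of toy residual layers standing in for transformer blocks, and the candidate surrogate runs only a prefix of those layers before a small linear probe makes the refuse/comply decision. All names and dimensions here are invented for illustration.

```python
import math
import random

random.seed(0)
DIM = 8  # toy hidden-state width

def make_layer():
    # Toy residual layer: fixed random linear map + tanh, added back to the input.
    w = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]
    def layer(x):
        h = [math.tanh(sum(w[i][j] * x[j] for j in range(DIM))) for i in range(DIM)]
        return [xi + hi for xi, hi in zip(x, h)]
    return layer

# Stand-in for an aligned LLM's stack of blocks.
full_model = [make_layer() for _ in range(12)]

def hidden_state(x, layers):
    # Run the input through the given subset of layers only.
    for layer in layers:
        x = layer(x)
    return x

def linear_probe(x, w, b):
    # Binary refuse (1) / comply (0) decision from an intermediate representation.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

# Candidate surrogate classifier: the first ~20% of the blocks (2 of 12)
# plus a probe, mirroring the paper's finding that a small architecture
# fraction can suffice. Probe weights here are random, not trained.
subset = full_model[: max(1, len(full_model) // 5)]
probe_w = [random.uniform(-1, 1) for _ in range(DIM)]
probe_b = 0.0

x = [random.uniform(-1, 1) for _ in range(DIM)]
decision = linear_probe(hidden_state(x, subset), probe_w, probe_b)
print(decision)  # 0 or 1
```

In the paper's setting, the probe would be fit to reproduce the LLM's actual refusal behavior and the candidate evaluated for agreement before being attacked; this sketch only shows the structural idea of classifying from a layer subset.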