[2601.03868] What Matters For Safety Alignment?
Summary
This paper investigates safety alignment in large language models (LLMs) and large reasoning models (LRMs), identifying key factors that enhance their safety and reliability through empirical analysis.
Why It Matters
As AI systems become increasingly integrated into society, ensuring their safety and alignment with human values is crucial. This study provides empirical insights that can guide future AI development, highlighting vulnerabilities and potential safeguards necessary for secure AI deployment.
Key Takeaways
- Top-performing models for safety include GPT-OSS-20B and Qwen3-Next-80B-A3B-Thinking.
- Post-training processes can degrade safety alignment, necessitating explicit safety constraints.
- CoT attacks significantly increase the risk of unaligned behaviors in LLMs.
- Roleplay and prompt injection are key methods for eliciting unsafe model responses.
Computer Science > Computation and Language arXiv:2601.03868 (cs) [Submitted on 7 Jan 2026 (v1), last revised 24 Feb 2026 (this version, v2)] Title:What Matters For Safety Alignment? Authors:Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan View a PDF of the paper titled What Matters For Safety Alignment?, by Xing Li and 5 other authors View PDF HTML (experimental) Abstract:This paper presents a comprehensive empirical study on the safety alignment capabilities. We evaluate what matters for safety alignment in LLMs and LRMs to provide essential insights for developing more secure and reliable AI systems. We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques. Our large-scale evaluation is conducted using 32 recent, popular LLMs and LRMs across thirteen distinct model families, spanning a parameter scale from 3B to 235B. The assessment leverages five established safety datasets and probes model vulnerabilities with 56 jailbreak techniques and four CoT attack strategies, resulting in 4.6M API calls. Our key empirical findings are fourfold. First, we identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the top-three safest models, which substantiates the significant advantage of integrated reasoning and self-reflection mechanisms for robust safety alignment. Second, post-training and knowledge distillation may lead to a systematic degradati...