[2603.23268] SafeSeek: Universal Attribution of Safety Circuits in Language Models
Computer Science > Machine Learning
arXiv:2603.23268 (cs)
[Submitted on 24 Mar 2026]

Title: SafeSeek: Universal Attribution of Safety Circuits in Language Models
Authors: Miao Yu, Siyuan Fu, Moayad Aloqaily, Zhenhong Zhou, Safa Otoum, Xing Fan, Kun Wang, Yufei Guo, Qingsong Wen

Abstract: Mechanistic interpretability reveals that safety-critical behaviors (e.g., alignment, jailbreak, backdoor) in Large Language Models (LLMs) are grounded in specialized functional components. However, existing safety attribution methods struggle with generalization and reliability due to their reliance on heuristic, domain-specific metrics and search algorithms. To address this, we propose SafeSeek, a unified safety interpretability framework that identifies functionally complete safety circuits in LLMs via optimization. Unlike methods focusing on isolated heads or neurons, SafeSeek introduces differentiable binary masks to extract multi-granular circuits through gradient descent on safety datasets, while integrating Safety Circuit Tuning to utilize these sparse circuits for efficient safety fine-tuning. We validate SafeSeek in two key scenarios in LLM safety: (1) backdoor attacks, identifying a backdoor circuit with 0.42% sparsity, whose ablation reduces the Attack Success Rate (ASR) from 100% to 0.4% while reta...
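The core idea described in the abstract, relaxing a binary component mask to a differentiable one and learning it by gradient descent under a sparsity penalty, can be illustrated with a minimal toy sketch. This is not the paper's implementation: the linear "model", the component weights, the sparsity weight `lam`, and the 3-component ground-truth circuit are all hypothetical choices made for the example.

```python
import numpy as np

# Toy sketch of differentiable-mask circuit discovery: a hypothetical
# "model" is a linear map whose n components each carry a learnable
# mask logit; gradient descent on (task loss + sparsity penalty)
# drives the relaxed mask toward a sparse binary "circuit".
rng = np.random.default_rng(0)
n = 16
signs = rng.choice([-1.0, 1.0], size=n)
w = rng.uniform(0.5, 1.5, size=n) * signs        # fixed component weights
true_circuit = np.zeros(n)
true_circuit[:3] = 1.0                           # only 3 components matter
X = rng.normal(size=(200, n))
y = X @ (w * true_circuit)                       # behavior to be attributed

logits = np.zeros(n)                             # mask parameters (pre-sigmoid)
lam, lr = 0.05, 1.0                              # sparsity weight, step size
for _ in range(1000):
    m = 1.0 / (1.0 + np.exp(-logits))            # sigmoid relaxation of binary mask
    err = X @ (w * m) - y
    grad_task = (err[:, None] * (X * w)).mean(axis=0)  # d(0.5*MSE)/dm
    grad_m = grad_task + lam                     # plus d(lam * sum(m))/dm
    logits -= lr * grad_m * m * (1.0 - m)        # chain rule through the sigmoid
keep = (1.0 / (1.0 + np.exp(-logits))) > 0.5     # discretize to a circuit
print(keep.astype(int))                          # sparse mask over components
```

The sparsity term pushes every mask entry toward zero, so only components whose removal would hurt the task loss survive discretization; this mirrors how a low-sparsity circuit (0.42% in the paper's backdoor experiment) can be isolated and then ablated or fine-tuned.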