[2505.12186] Self-Destructive Language Model
Computer Science > Machine Learning
arXiv:2505.12186 (cs)
[Submitted on 18 May 2025 (v1), last revised 2 Mar 2026 (this version, v2)]

Title: Self-Destructive Language Model
Authors: Yuhui Wang, Rongyi Zhu, Ting Wang

Abstract: Harmful fine-tuning attacks pose a major threat to the security of large language models (LLMs), allowing adversaries to compromise safety guardrails with minimal harmful data. While existing defenses attempt to reinforce LLM alignment, they fail to address models' inherent "trainability" on harmful data, leaving them vulnerable to stronger attacks with increased learning rates or larger harmful datasets. To overcome this critical limitation, we introduce SEAM, a novel alignment-enhancing defense that transforms LLMs into self-destructive models with intrinsic resilience to misalignment attempts. Specifically, these models retain their capabilities on legitimate tasks while exhibiting substantial performance degradation when fine-tuned on harmful data. The protection is achieved through a novel loss function that couples the optimization trajectories of benign and harmful data, enhanced with adversarial gradient ascent to amplify the self-destructive effect. To enable practical training, we develop an efficient Hessian-free gradient estimate with theoretical error bounds. Extensive evaluation across LLMs and datasets demonstrates...
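The abstract describes a training objective that descends on benign data while applying adversarial gradient ascent on harmful data. The toy sketch below illustrates only that general descend/ascend structure on two linear-regression "tasks" standing in for benign and harmful data; the model, the `mse` losses, and the coupling weight `lam` are all illustrative assumptions, not the paper's actual SEAM loss or its Hessian-free estimator.

```python
import numpy as np

def mse(w, X, y):
    """Mean-squared error of a linear model with weights w."""
    return float(np.mean((X @ w - y) ** 2))

def grad_mse(w, X, y):
    """Gradient of the MSE above with respect to w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

# Toy stand-ins: a "benign" task and a "harmful" task as two regressions.
rng = np.random.default_rng(0)
Xb = rng.normal(size=(64, 3)); yb = Xb @ np.array([1.0, -2.0, 0.5])
Xh = rng.normal(size=(64, 3)); yh = Xh @ np.array([-1.0, 1.0, 2.0])

w = np.zeros(3)
lam = 0.2   # hypothetical coupling weight (kept small so descent converges)
lr = 0.05
for _ in range(200):
    # Descend on the benign loss, ascend on the harmful loss:
    # minimize  mse(w, Xb, yb) - lam * mse(w, Xh, yh).
    g = grad_mse(w, Xb, yb) - lam * grad_mse(w, Xh, yh)
    w -= lr * g

# After training, benign error has dropped while harmful error has grown
# relative to the zero-weight starting point.
print(mse(w, Xb, yb), mse(w, Xh, yh))
```

The sign flip on the harmful-gradient term is the whole trick: the single parameter vector is pulled toward fitting the benign task and pushed away from fitting the harmful one, a crude analogue of making a model that works normally yet resists learning from harmful data.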