[2508.06249] In-Training Defenses against Emergent Misalignment in Language Models
Computer Science > Machine Learning
arXiv:2508.06249 (cs)
[Submitted on 8 Aug 2025 (v1), last revised 5 Mar 2026 (this version, v2)]

Title: In-Training Defenses against Emergent Misalignment in Language Models
Authors: David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Esha Afzal, Robin Haselhorst, Lucie Flek, Florian Mai

Abstract: Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even when model weights are hidden behind a fine-tuning API, this inadvertently gives attackers access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API. We evaluate whether these safeguards a) prevent broad misalignment, b) allow narrow misalignment, c) learn well on benign tasks, and d) remain coherent. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $\ell_2$ distance in feature space, (iii) preventative steering with an evil persona vector, and (iv) interleaving trai...
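Intervention (i), KL-divergence regularization toward a safe reference model, can be sketched as a per-token penalty added to the ordinary fine-tuning loss: the fine-tuned model's next-token distribution is pulled back toward the frozen reference model's distribution. The sketch below uses numpy logits rather than a real LLM, and the function names, the coefficient `lam`, and the exact loss composition are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over the vocabulary dimension."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_regularized_loss(logits_ft, logits_ref, target_ids, lam=0.1):
    """Task cross-entropy plus lam * KL(p_ft || p_ref).

    logits_ft:  (T, V) logits from the model being fine-tuned
    logits_ref: (T, V) logits from the frozen safe reference model
    target_ids: (T,) gold next-token ids for the task loss
    lam:        regularization strength (illustrative default)
    """
    p_ft = softmax(logits_ft)
    p_ref = softmax(logits_ref)
    rows = np.arange(len(target_ids))
    # Standard next-token cross-entropy on the fine-tuning data.
    ce = -np.log(p_ft[rows, target_ids]).mean()
    # Mean per-token KL divergence pulling p_ft toward the reference.
    kl = (p_ft * (np.log(p_ft) - np.log(p_ref))).sum(axis=-1).mean()
    return ce + lam * kl

# When the fine-tuned model matches the reference, the penalty vanishes
# and only the task loss remains.
logits = np.array([[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]])
targets = np.array([0, 1])
loss = kl_regularized_loss(logits, logits, targets, lam=1.0)
```

In a real training loop the KL term would be computed on the model's output distributions at every step, with `lam` trading off task performance against drift from the aligned reference.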