[2602.17546] Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
Summary
This paper introduces a training framework that keeps instruction-following language models safe during fine-tuning by adapting the strength of regularization to the estimated safety risk of each training update.
Why It Matters
As AI systems become more integrated into society, ensuring their safety while maintaining utility is critical. This research addresses the challenge of safety degradation during model fine-tuning, providing a method that could enhance the reliability of AI applications in sensitive areas.
Key Takeaways
- Introduces adaptive regularization to mitigate safety risks during fine-tuning.
- Utilizes two approaches for estimating safety risk: a judge-based Safety Critic and an activation-based risk predictor.
- Empirical results show that the proposed method reduces attack success rates while preserving downstream task performance.
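The core idea in the takeaways above, constraining high-risk updates toward a safe reference policy while letting low-risk updates train normally, can be sketched as a risk-gated loss. This is a minimal illustration, not the paper's actual implementation; the function name, threshold, and maximum penalty weight are all hypothetical choices.

```python
def adaptive_reg_loss(task_loss: float, kl_to_reference: float,
                      risk: float, lam_max: float = 10.0,
                      threshold: float = 0.5) -> float:
    """Risk-gated training objective (illustrative sketch).

    task_loss        -- standard fine-tuning loss on the batch
    kl_to_reference  -- divergence of the current policy from a safe
                        reference policy (e.g. the pre-fine-tuning model)
    risk             -- estimated harm score for the batch in [0, 1],
                        from a safety critic or activation-based probe
    lam_max, threshold -- hypothetical gating hyperparameters
    """
    # High-risk batches: penalize drift from the safe reference,
    # scaled by how risky the batch looks. Low-risk batches: train
    # with the plain task loss.
    lam = lam_max * risk if risk >= threshold else 0.0
    return task_loss + lam * kl_to_reference
```

A high-risk batch (`risk = 0.9`) with `task_loss = 1.0` and `kl_to_reference = 0.2` would thus incur a total loss of `1.0 + 9.0 * 0.2 = 2.8`, while a low-risk batch (`risk = 0.1`) pays only the task loss.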
Computer Science > Computation and Language
arXiv:2602.17546 (cs) [Submitted on 19 Feb 2026]
Title: Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
Authors: Jyotin Goel, Souvik Maji, Pratik Mazumder
Abstract: Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent signals are predictable from pre-generation activations and that judge scores provide effect...
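The abstract's second risk estimator, a lightweight classifier on intermediate activations, can be illustrated with a linear probe trained by logistic regression. Everything below is a synthetic toy: the data is random vectors standing in for model activations, and the probe dimensionality, shift, and learning rate are arbitrary assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for intermediate activations: "harmful-intent"
# examples are shifted along one direction of the activation space.
d = 16
X_safe = rng.normal(0.0, 1.0, size=(200, d))
X_harm = rng.normal(0.0, 1.0, size=(200, d)) + np.array([3.0] + [0.0] * (d - 1))
X = np.vstack([X_safe, X_harm])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Lightweight linear probe, fit with plain logistic-regression
# gradient descent (no external ML library needed).
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted harm probability
    w -= 0.5 * (X.T @ (p - y)) / len(y)      # gradient step on weights
    b -= 0.5 * float(np.mean(p - y))         # gradient step on bias

def risk_score(activation):
    """Per-example harm probability from the trained probe."""
    return 1.0 / (1.0 + np.exp(-(activation @ w + b)))
```

Because the probe runs on pre-generation activations, a score like this could gate regularization before any tokens are produced, which matches the abstract's claim that harmful intent is predictable from activations alone.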