[2602.17546] Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

arXiv - Machine Learning · 4 min read

Summary

This article presents a training framework for instruction-following language models that maintains safety during fine-tuning by adapting the strength of regularization to the estimated safety risk of each update.

Why It Matters

As AI systems become more integrated into society, ensuring their safety while maintaining utility is critical. This research addresses the challenge of safety degradation during model fine-tuning, providing a method that could enhance the reliability of AI applications in sensitive areas.

Key Takeaways

  • Introduces adaptive regularization to mitigate safety risks during fine-tuning (a minimal sketch of this training step follows the list below).
  • Utilizes two approaches for estimating safety risk: a judge-based Safety Critic and an activation-based risk predictor.
  • Empirical results show that the proposed method reduces attack success rates while preserving model performance.
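The following is a minimal PyTorch-style sketch of how such a risk-adaptive regularizer could be wired into a single fine-tuning step, assuming the constraint is realized as a risk-weighted KL penalty toward a frozen safe reference model. The names (fine_tune_step, risk_estimator, ref_model), the thresholding scheme, and the penalty form are illustrative assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def fine_tune_step(model, ref_model, batch, optimizer, risk_estimator,
                       risk_threshold=0.5, kl_weight=1.0):
        """One fine-tuning step with risk-adaptive regularization (illustrative sketch).

        High-risk batches are pulled toward a frozen safe reference policy via a
        KL penalty; low-risk batches receive only the standard task loss.
        """
        outputs = model(batch["input_ids"], attention_mask=batch["attention_mask"])
        logits = outputs.logits  # (batch, seq, vocab)

        # Standard supervised fine-tuning loss on shifted labels.
        task_loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            batch["labels"][:, 1:].reshape(-1),
            ignore_index=-100,
        )

        # Scalar risk score in [0, 1] for this batch, from whichever estimator is used.
        risk = risk_estimator(batch)

        loss = task_loss
        if risk > risk_threshold:
            # Constrain the update: penalize divergence from the frozen safe reference.
            with torch.no_grad():
                ref_logits = ref_model(batch["input_ids"],
                                       attention_mask=batch["attention_mask"]).logits
            kl = F.kl_div(
                F.log_softmax(logits, dim=-1),
                F.softmax(ref_logits, dim=-1),
                reduction="batchmean",
            )
            loss = task_loss + kl_weight * risk * kl

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item(), risk

In this sketch the risk score could come from either of the two estimators the paper explores: a judge-based Safety Critic scoring the batch, or the activation-based probe sketched after the abstract below.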

Computer Science > Computation and Language
arXiv:2602.17546 (cs) [Submitted on 19 Feb 2026]

Title: Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
Authors: Jyotin Goel, Souvik Maji, Pratik Mazumder

Abstract: Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent signals are predictable from pre-generation activations and that judge scores provide effect...
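The abstract notes that harmful intent is predictable from pre-generation activations. Below is a rough sketch of what such an activation-based risk predictor could look like, assuming a Hugging Face-style causal LM that exposes output_hidden_states, mean pooling over one intermediate layer, and a small MLP probe; the layer index, pooling, and probe size are illustrative choices, not the authors' configuration.

    import torch
    import torch.nn as nn

    class ActivationRiskProbe(nn.Module):
        """Lightweight probe mapping pooled intermediate activations to a harm-risk score.

        Illustrative sketch: layer index, pooling, and probe architecture are assumptions.
        """
        def __init__(self, hidden_size, layer_index=16):
            super().__init__()
            self.layer_index = layer_index
            self.probe = nn.Sequential(
                nn.Linear(hidden_size, 128),
                nn.ReLU(),
                nn.Linear(128, 1),
            )

        @torch.no_grad()
        def pooled_activations(self, base_model, input_ids, attention_mask):
            # Read hidden states from an intermediate layer, before any generation happens.
            out = base_model(input_ids, attention_mask=attention_mask,
                             output_hidden_states=True)
            h = out.hidden_states[self.layer_index]              # (batch, seq, hidden)
            mask = attention_mask.unsqueeze(-1).float()
            return (h * mask).sum(1) / mask.sum(1).clamp(min=1)  # mean over real tokens

        def forward(self, base_model, input_ids, attention_mask):
            pooled = self.pooled_activations(base_model, input_ids, attention_mask)
            # Per-example risk in [0, 1]; average over a batch for a batch-level signal.
            return torch.sigmoid(self.probe(pooled)).squeeze(-1)

During probe training, pooled activations from prompts labeled harmful or benign would be fit with a binary cross-entropy loss; at fine-tuning time the probe's output can play the role of risk_estimator in the earlier training-step sketch.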
