[2602.15799] The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety

arXiv - AI · 4 min read · Article

Summary

This paper explores how fine-tuning language models can inadvertently degrade safety measures, revealing structural vulnerabilities in alignment processes.

Why It Matters

Understanding the risks associated with fine-tuning language models is crucial for AI safety. This research highlights the inherent geometric properties that can lead to alignment collapse, prompting a reevaluation of current safety practices and the need for curvature-aware methods in AI development.

Key Takeaways

  • Fine-tuning can degrade safety guardrails even with benign data.
  • Orthogonality in parameter updates is structurally unstable.
  • Alignment loss grows with training time because the curvature of the fine-tuning loss steers updates into alignment-sensitive subspaces.
  • Current safety approaches may not address the dynamic nature of alignment fragility.
  • Curvature-aware methods are needed for better alignment safety diagnostics.
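The summary does not say what the paper's curvature-aware diagnostics look like. One generic ingredient is probing the curvature of a loss along a chosen direction via a Hessian-vector product, which a first-order toolkit can approximate by finite-differencing gradients. The sketch below is purely illustrative: the quadratic loss, the direction choices, and the `directional_curvature` helper are assumptions for this example, not constructs from the paper.

```python
import numpy as np

def directional_curvature(grad_fn, theta, v, eps=1e-4):
    """Estimate v^T H v via a central finite difference of gradients:
    H v ~ (grad(theta + eps*v) - grad(theta - eps*v)) / (2*eps)."""
    v = v / np.linalg.norm(v)
    hv = (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)
    return float(v @ hv)

# Toy loss: sharp curvature along e1 (eigenvalue 100), nearly flat
# curvature along e2 (eigenvalue 0.01) -- mimicking an alignment-sensitive
# direction next to a benign one.
H = np.diag([100.0, 0.01])
grad_fn = lambda t: H @ t

theta = np.array([0.3, 0.3])
sharp = directional_curvature(grad_fn, theta, np.array([1.0, 0.0]))
flat = directional_curvature(grad_fn, theta, np.array([0.0, 1.0]))
print(sharp, flat)  # ~ 100.0 and ~ 0.01
```

For a quadratic loss the finite difference is exact up to floating-point error; for a real network the same probe costs two extra gradient evaluations per direction.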

Computer Science > Machine Learning
arXiv:2602.15799 (cs) · Submitted on 17 Feb 2026

Title: The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety
Authors: Max Springer, Chung Peng Lee, Blossom Metevier, Jane Castleman, Bohdan Turbal, Hayoung Jung, Zeyu Shen, Aleksandra Korolova

Abstract: Fine-tuning aligned language models on benign tasks unpredictably degrades safety guardrails, even when training data contains no harmful content and developers have no adversarial intent. We show that the prevailing explanation, that fine-tuning updates should be orthogonal to safety-critical directions in high-dimensional parameter space, offers false reassurance: this orthogonality is structurally unstable and collapses under the dynamics of gradient descent. We then resolve this through a novel geometric analysis, proving that alignment concentrates in low-dimensional subspaces with sharp curvature, creating a brittle structure that first-order methods cannot detect or defend. While initial fine-tuning updates may indeed avoid these subspaces, the curvature of the fine-tuning loss generates second-order acceleration that systematically steers trajectories into alignment-sensitive regions. We formalize this mechanism through the Alignment Instability Condition, three geometric properties that, when jointly satisfied, lead to sa...
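The abstract's mechanism, initially orthogonal updates that loss-surface curvature gradually steers into alignment-sensitive subspaces, can be mimicked in a toy quadratic model. Everything below (the dimensions, both Hessians, the coupling matrix, and the subspace split) is an invented illustration of that qualitative story, not the paper's construction or proof.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20   # parameter dimension
k = 2    # dimension of the (hypothetical) alignment-sensitive subspace

# Alignment loss: sharp curvature concentrated in the first k coordinates.
H_align = np.diag([100.0] * k + [0.01] * (d - k))

# Fine-tuning loss: a quadratic whose Hessian couples the alignment
# coordinates to the task coordinates (the coupling is the assumed mechanism).
A = rng.normal(size=(d, d)) * 0.1
H_task = A @ A.T + np.eye(d)
b = rng.normal(size=d)
b[:k] = 0.0  # the first gradient is exactly orthogonal to the alignment subspace

theta = np.zeros(d)
lr = 0.05
align_norms = []
for step in range(200):
    grad = H_task @ theta - b      # gradient of the fine-tuning loss
    theta -= lr * grad
    # component of the parameters inside the alignment-sensitive subspace
    align_norms.append(np.linalg.norm(theta[:k]))

align_loss = 0.5 * theta @ H_align @ theta
print(f"first-step |theta_align| = {align_norms[0]:.6f}")
print(f"final |theta_align|      = {align_norms[-1]:.6f}")
print(f"alignment loss           = {align_loss:.6f}")
```

The first update has zero component in the alignment subspace, yet the off-diagonal curvature of `H_task` pulls later iterates into it, so the sharp alignment loss ends up strictly positive: orthogonality at initialization does not persist under gradient descent.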

Related Articles

Llms

[D] How's MLX and jax/ pytorch on MacBooks these days?

So I'm looking at buying a new 14 inch MacBook pro with m5 pro and 64 gb of memory vs m4 max with same specs. My priorities are pro sof...

Reddit - Machine Learning · 1 min ·
Llms

[R] 94.42% on BANKING77 Official Test Split with Lightweight Embedding + Example Reranking (strict full-train protocol)

BANKING77 (77 fine-grained banking intents) is a well-established but increasingly saturated intent classification benchmark. did this wh...

Reddit - Machine Learning · 1 min ·
Llms

The “Agony” of ChatGPT: Would You Let AI Write Your Wedding Speech?

As more Americans use AI chatbots like ChatGPT to compose their wedding vows, one expert asks: “Is the speech sacred to you?”

AI Tools & Products · 12 min ·
Llms

I tested Gemini on Android Auto and now I can't stop talking to it: 5 tasks it nails

I didn't see much benefit for Google's AI - until now. Here are my favorite ways to use the new Gemini integration in my car.

AI Tools & Products · 7 min ·

