[2602.15799] The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety
Summary
This paper explores how fine-tuning language models can inadvertently degrade safety measures, revealing structural vulnerabilities in alignment processes.
Why It Matters
Understanding the risks of fine-tuning language models is crucial for AI safety. This research shows that alignment concentrates in sharply curved, low-dimensional regions of parameter space, a geometric structure that makes alignment collapse likely, prompting a reevaluation of current safety practices and motivating curvature-aware methods in AI development.
Key Takeaways
- Fine-tuning can degrade safety guardrails even with benign data.
- Orthogonality in parameter updates is structurally unstable.
- The curvature of the fine-tuning loss generates second-order acceleration that steers training trajectories into alignment-sensitive regions, so alignment loss grows with training time.
- Current safety approaches may not address the dynamic nature of alignment fragility.
- Curvature-aware methods are needed for better alignment safety diagnostics.
Computer Science > Machine Learning
arXiv:2602.15799 (cs)
[Submitted on 17 Feb 2026]
Title: The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety
Authors: Max Springer, Chung Peng Lee, Blossom Metevier, Jane Castleman, Bohdan Turbal, Hayoung Jung, Zeyu Shen, Aleksandra Korolova
Abstract: Fine-tuning aligned language models on benign tasks unpredictably degrades safety guardrails, even when training data contains no harmful content and developers have no adversarial intent. We show that the prevailing explanation, that fine-tuning updates should be orthogonal to safety-critical directions in high-dimensional parameter space, offers false reassurance: this orthogonality is structurally unstable and collapses under the dynamics of gradient descent. We then resolve this through a novel geometric analysis, proving that alignment concentrates in low-dimensional subspaces with sharp curvature, creating a brittle structure that first-order methods cannot detect or defend. While initial fine-tuning updates may indeed avoid these subspaces, the curvature of the fine-tuning loss generates second-order acceleration that systematically steers trajectories into alignment-sensitive regions. We formalize this mechanism through the Alignment Instability Condition, three geometric properties that, when jointly satisfied, lead to sa...
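The instability mechanism described in the abstract can be illustrated with a toy model (this sketch is not from the paper; the loss, the Hessian values, and the "safety direction" are all assumed for illustration). Take a 2-D quadratic fine-tuning loss whose Hessian couples a task axis to a hypothetical safety-critical axis. Even if the starting point is chosen so the very first gradient step is exactly orthogonal to the safety direction, the off-diagonal curvature injects a safety-axis component into later gradients, and the trajectory accumulates drift along that direction:

```python
# Toy sketch (not the paper's construction): quadratic loss
# L(theta) = 0.5 * theta^T H theta, where axis 2 plays the role of a
# hypothetical "safety-critical" direction and the off-diagonal entry
# of H (0.4, an assumed value) is the curvature coupling.
H = [[1.0, 0.4],
     [0.4, 0.2]]

def grad(theta):
    """Gradient of the quadratic loss: H @ theta."""
    return [H[0][0] * theta[0] + H[0][1] * theta[1],
            H[1][0] * theta[0] + H[1][1] * theta[1]]

# Start where the gradient is exactly orthogonal to the safety axis:
# 0.4*theta1 + 0.2*theta2 = 0  ->  theta = (1, -2).
theta = [1.0, -2.0]
g0 = grad(theta)
print(f"initial gradient along safety axis: {g0[1]:.3f}")  # 0.000

eta = 0.5                    # learning rate (assumed)
start_safety = theta[1]
for step in range(1, 21):
    g = grad(theta)
    theta = [theta[0] - eta * g[0], theta[1] - eta * g[1]]
    if step in (1, 5, 20):
        drift = theta[1] - start_safety
        print(f"step {step:2d}: cumulative drift along safety axis = {drift:.3f}")
```

First-order orthogonality holds only at the initial point: motion along the task axis feeds back through the off-diagonal Hessian term, so the drift along the safety axis starts at zero and then grows, which is the second-order acceleration the abstract refers to.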