[2605.01913] RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs
Computer Science > Machine Learning

arXiv:2605.01913 (cs) [Submitted on 3 May 2026]

Title: RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs
Authors: Sadia Asif, Mohammad Mohammadi Amiri

Abstract: Fine-tuning safety-aligned language models for downstream tasks often leads to substantial degradation of refusal behavior, making models vulnerable to adversarial misuse. While prior work has shown that safety-relevant features are encoded in structured representations within the model's activation space, how these representations change during fine-tuning and why alignment degrades remain poorly understood. In this work, we investigate the representation-level mechanisms underlying alignment degradation. Our analysis shows that standard fine-tuning induces systematic drift in safety-relevant representations, distorts their geometric structure, and introduces interference between task optimization and safety features. These effects collectively lead to increased harmful compliance. Motivated by these findings, we introduce RefusalGuard, a representation-level fine-tuning framework that preserves safety-relevant structure during model adaptation. Our approach constrains updates in hidden representation space, ensuring that safety-mediating components remain stable while allowing task-specific learning in compl...
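The abstract gives no implementation details, but the core idea it describes, constraining fine-tuning so that safety-mediating components of the hidden representations stay stable while task learning proceeds elsewhere, can be sketched as a regularized training objective. The sketch below is illustrative only: it assumes a single "refusal direction" estimated by difference of means over harmful vs. harmless prompt activations (a common construction in prior refusal-feature work; the paper's safety structure may be a richer subspace), a HuggingFace-style causal LM whose batch includes labels, and a hypothetical penalty weight `lambda_geo`. None of these names or choices come from the paper itself.

```python
import torch
import torch.nn.functional as F

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means estimate of a safety-relevant direction.

    harmful_acts, harmless_acts: activation matrices [n, dim] collected
    from the base model on harmful / harmless prompts.
    """
    d = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return d / d.norm()

def geometry_penalty(h_new: torch.Tensor,
                     h_base: torch.Tensor,
                     d: torch.Tensor) -> torch.Tensor:
    """Penalize drift of hidden states along the safety direction d.

    h_new:  hidden states of the model being fine-tuned  [batch, dim]
    h_base: hidden states of the frozen base model       [batch, dim]
    d:      unit safety direction                        [dim]
    """
    return F.mse_loss(h_new @ d, h_base @ d)

def training_step(model, base_model, batch, d, lambda_geo=1.0):
    # Task loss: ordinary next-token fine-tuning objective
    # (assumes `batch` contains `labels`, HuggingFace-style).
    out = model(**batch, output_hidden_states=True)
    task_loss = out.loss

    # Anchor projections onto d to the frozen base model;
    # no gradients flow through the reference forward pass.
    with torch.no_grad():
        base_out = base_model(**batch, output_hidden_states=True)
    h_new = out.hidden_states[-1][:, -1, :]   # last-token activations
    h_base = base_out.hidden_states[-1][:, -1, :]
    geo_loss = geometry_penalty(h_new, h_base, d)

    return task_loss + lambda_geo * geo_loss
```

The design intuition behind such a penalty is that constraining only the projection onto a low-dimensional safety direction (or subspace) leaves its orthogonal complement free for task-specific updates, which is the general separation between safety preservation and task learning that the abstract points to.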