[2603.04355] Efficient Refusal Ablation in LLM through Optimal Transport
Computer Science > Machine Learning
arXiv:2603.04355 (cs) [Submitted on 4 Mar 2026]

Title: Efficient Refusal Ablation in LLM through Optimal Transport
Authors: Geraldin Nanfack, Eugene Belilovsky, Elvis Dohmatob

Abstract: Safety-aligned language models refuse harmful requests through learned refusal behaviors encoded in their internal representations. Recent activation-based jailbreaking methods circumvent these safety mechanisms by applying orthogonal projections to remove refusal directions, but these approaches treat refusal as a one-dimensional phenomenon and ignore the rich distributional structure of model activations. We introduce a principled framework based on optimal transport theory that transforms the entire distribution of harmful activations to match harmless ones. By combining PCA with closed-form Gaussian optimal transport, we achieve efficient computation in high-dimensional representation spaces while preserving essential geometric structure. Across six models (Llama-2, Llama-3.1, Qwen-2.5; 7B-32B parameters), our method achieves up to 11% higher attack success rates than state-of-the-art baselines while maintaining comparable perplexity, demonstrating superior preservation of model capabilities. Critically, we discover that layer-selective intervention (applying optimal transport to 1-2 carefully ch...
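The abstract's core mechanism — projecting activations onto a PCA subspace and applying the closed-form optimal transport map between two Gaussians — can be sketched as follows. This is an illustrative reconstruction from the abstract alone, not the authors' released code: the function name `pca_gaussian_ot_map`, the rank `k`, and the regularization constant are all assumptions. The closed-form Monge map between N(m1, S1) and N(m2, S2) is T(x) = m2 + M (x - m1) with M = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2}.

```python
import numpy as np

def pca_gaussian_ot_map(X_harmful, X_harmless, k=8):
    """Fit a map sending 'harmful' activations toward the 'harmless'
    distribution, via closed-form Gaussian OT in a k-dim PCA subspace.
    Illustrative sketch only; names and defaults are assumptions."""
    # PCA basis from the pooled activations
    Z = np.vstack([X_harmful, X_harmless])
    mu = Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Z - mu, full_matrices=False)
    P = Vt[:k].T                       # (d, k) orthonormal projection basis

    A = (X_harmful - mu) @ P           # harmful activations in PCA coords
    B = (X_harmless - mu) @ P          # harmless activations in PCA coords

    m1, m2 = A.mean(axis=0), B.mean(axis=0)
    S1 = np.cov(A, rowvar=False) + 1e-6 * np.eye(k)  # small ridge for stability
    S2 = np.cov(B, rowvar=False) + 1e-6 * np.eye(k)

    def sqrtm_psd(S):
        # symmetric PSD square root via eigendecomposition
        w, V = np.linalg.eigh(S)
        return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

    # Closed-form Gaussian OT: M = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2}
    S1h = sqrtm_psd(S1)
    S1h_inv = np.linalg.inv(S1h)
    M = S1h_inv @ sqrtm_psd(S1h @ S2 @ S1h) @ S1h_inv

    def transport(x):
        """Map one full-dimensional activation: transform its PCA
        coordinates, then add the correction back in the original space."""
        z = (x - mu) @ P
        z_t = m2 + (z - m1) @ M.T      # M is symmetric; kept explicit
        return x + (z_t - z) @ P.T
    return transport
```

At inference time one would hook the chosen layer(s) and apply `transport` to each residual-stream activation; per the abstract, restricting this to 1-2 layers is the key to preserving perplexity.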