[2510.08646] Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy
Computer Science > Machine Learning
arXiv:2510.08646 (cs)
[Submitted on 9 Oct 2025 (v1), last revised 3 Mar 2026 (this version, v2)]

Title: Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy

Authors: Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan, Ranjie Duan, Wei Dong, Kai-Wei Chang, XiaoFeng Wang, Ying Nian Wu, Xinfeng Li

Abstract: Safety alignment of large language models currently faces a central challenge: existing alignment techniques often prioritize mitigating responses to harmful prompts at the cost of overcautious behavior, leading models to incorrectly refuse benign requests. A key goal of safety alignment is therefore to improve safety while simultaneously minimizing false refusals. In this work, we introduce Energy Landscape Steering (ELS), a novel, fine-tuning-free framework designed to resolve this challenge through dynamic, inference-time intervention. We train a lightweight external Energy-Based Model (EBM) to assign high energy to undesirable states (false refusal or jailbreak) and low energy to desirable states (helpful response or safe rejection). During inference, the EBM maps the LLM's internal activations to an energy landscape, and we use the gradient of the energy function to steer the hidden ...
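The abstract outlines the core mechanism: a lightweight external EBM maps the LLM's hidden activations to scalar energies, and the gradient of that energy is used to steer activations at inference time. Below is a minimal, hypothetical PyTorch sketch of such gradient-based activation steering. The EBM architecture (a two-layer MLP), the hooked layer index, the step size, and the number of steps are all illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class EnergyModel(nn.Module):
    """Lightweight external EBM: maps a hidden state to a scalar energy.
    Assumed trained so that undesirable states (false refusal / jailbreak)
    receive high energy and desirable states receive low energy."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 256),  # hidden width is an illustrative choice
            nn.SiLU(),
            nn.Linear(256, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h).squeeze(-1)


def steer_hidden_state(h: torch.Tensor, ebm: EnergyModel,
                       step_size: float = 0.1, num_steps: int = 3) -> torch.Tensor:
    """Nudge a hidden activation down the EBM's energy landscape with a few
    gradient-descent steps, moving it toward low-energy (desirable) regions.
    step_size and num_steps are placeholder hyperparameters."""
    with torch.enable_grad():  # generation usually runs under no_grad
        h = h.detach()
        for _ in range(num_steps):
            h = h.requires_grad_(True)
            energy = ebm(h).sum()                # total energy of the state(s)
            (grad,) = torch.autograd.grad(energy, h)
            h = (h - step_size * grad).detach()  # descend the energy landscape
    return h


# Hypothetical wiring: intervene at one transformer layer via a forward hook
# (the model object, attribute path, and layer index are placeholders).
def make_steering_hook(ebm: EnergyModel):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = steer_hidden_state(hidden, ebm)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# handle = model.model.layers[20].register_forward_hook(make_steering_hook(ebm))
```

In this sketch, detaching between steps keeps each update local to the EBM's energy landscape rather than backpropagating through the LLM itself, which is what would keep such an intervention cheap enough to run at inference time without any fine-tuning of the base model.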