[2511.05408] Steering Language Models with Weight Arithmetic
Computer Science > Computation and Language
arXiv:2511.05408 (cs)
[Submitted on 7 Nov 2025 (v1), last revised 27 Feb 2026 (this version, v2)]

Title: Steering Language Models with Weight Arithmetic
Authors: Constanza Fierro, Fabien Roger

Abstract: Providing high-quality feedback to Large Language Models (LLMs) on a diverse training distribution can be difficult and expensive, and providing feedback only on a narrow distribution can result in unintended generalizations. To better leverage narrow training data, we propose contrastive weight steering, a simple post-training method that edits the model parameters using weight arithmetic. We isolate a behavior direction in weight-space by subtracting the weight deltas of two small fine-tunes -- one that induces the desired behavior and another that induces its opposite -- and then add or remove this direction to modify the model's weights. We apply this technique to mitigate sycophancy and induce misalignment, and find that weight steering often generalizes further than activation steering, achieving stronger out-of-distribution behavioral control before degrading general capabilities. We also show that, in the context of task-specific fine-tuning, weight steering can partially mitigate undesired behavioral drift: it can reduce sycophancy and under-refusals introduced during fine-tuning…
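The weight arithmetic the abstract describes can be sketched in a few lines. A minimal illustration, assuming each checkpoint is a dict of NumPy arrays keyed by parameter name; the function names and the scalar `alpha` are hypothetical and not taken from the paper:

```python
import numpy as np

def behavior_direction(base, pos, neg):
    """Contrastive direction in weight-space: the difference between the
    weight deltas of a 'positive' fine-tune (desired behavior) and a
    'negative' fine-tune (opposite behavior), relative to the base model."""
    return {k: (pos[k] - base[k]) - (neg[k] - base[k]) for k in base}

def steer(base, direction, alpha):
    """Add (alpha > 0) or remove (alpha < 0) the behavior direction."""
    return {k: base[k] + alpha * direction[k] for k in base}

# Toy 2-parameter "models" standing in for full checkpoints.
base = {"w": np.array([1.0, 1.0]), "b": np.array([0.0])}
pos  = {"w": np.array([1.5, 1.0]), "b": np.array([0.2])}
neg  = {"w": np.array([0.5, 1.0]), "b": np.array([-0.2])}

d = behavior_direction(base, pos, neg)   # base cancels: pos - neg per parameter
steered = steer(base, d, alpha=0.5)
```

Note that the base weights cancel in the subtraction, so the direction reduces to `pos - neg` per parameter; keeping the two deltas explicit mirrors how the abstract frames the construction.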