[2603.03291] One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models
Computer Science > Computation and Language
arXiv:2603.03291 (cs) [Submitted on 6 Feb 2026]

Title: One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models
Authors: Daniel Fein, Max Lamparth, Violet Xiang, Mykel J. Kochenderfer, Nick Haber

Abstract: Reward Models (RMs) are crucial for the online alignment of language models (LMs) with human preferences. However, RM-based preference tuning is vulnerable to reward hacking, whereby LM policies learn undesirable behaviors from flawed RMs. By systematically measuring biases in five high-quality RMs, including the state of the art, we find that length, sycophancy, and overconfidence biases persist despite prior mitigation work. We also identify new biases toward model-specific styles and answer order. We categorize RM failures by complexity and propose a simple post-hoc intervention to mitigate low-complexity biases that arise from spurious correlations. Our proposed mechanistic reward shaping reduces targeted biases without degrading reward quality, while using minimal labeled data. The method is extensible to new biases, model-internal, and generalizes out of distribution.

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.03291 [cs.CL]
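The abstract does not spell out the shaping procedure, but "model-internal" plus "minimal labeled data" is consistent with a probe-and-ablate approach: fit a linear probe for a spurious feature (here, response length) on the RM's hidden states, then project that direction out before the reward head. The sketch below is a toy illustration of that idea on synthetic data, not the authors' implementation; the variable names, the synthetic reward head, and the planted length bias are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (all hypothetical): final-layer hidden states H from a reward
# model, a linear reward head w, and a spurious feature (response length).
n, d = 1024, 64
H = rng.normal(size=(n, d))
length = rng.integers(20, 400, size=n).astype(float)
z = (length - length.mean()) / length.std()

# Plant a length-correlated direction in the representations and give the
# reward head a component along it, so the toy RM is length-biased.
b = rng.normal(size=d)
b /= np.linalg.norm(b)
H += np.outer(z, b)
w = 0.3 * rng.normal(size=d) + 3.0 * b

# 1) Probe: regress the spurious feature on a small labeled split.
k = 256
p, *_ = np.linalg.lstsq(H[:k], z[:k], rcond=None)
u = p / np.linalg.norm(p)

# 2) Shape: ablate the probe direction from the hidden state before applying
#    the reward head (a projection, so the rest of the reward is untouched).
def shaped_reward(h):
    return (h - np.outer(h @ u, u)) @ w

raw, shaped = H @ w, shaped_reward(H)
print(f"corr(raw reward, length)    = {np.corrcoef(raw, length)[0, 1]:+.2f}")
print(f"corr(shaped reward, length) = {np.corrcoef(shaped, length)[0, 1]:+.2f}")
```

Running this, the raw reward correlates strongly with length while the shaped reward does not, which mirrors the claimed property of reducing a targeted bias without retraining the model. Because the intervention is a single projection per bias direction, extending it to a new low-complexity bias would amount to fitting one more probe.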