[2603.28063] Reward Hacking as Equilibrium under Finite Evaluation
Computer Science > Artificial Intelligence

arXiv:2603.28063 (cs)

[Submitted on 30 Mar 2026]

Title: Reward Hacking as Equilibrium under Finite Evaluation

Authors: Jiacheng Wang, Jinbin Huang

Abstract: We prove that under five minimal axioms -- multi-dimensional quality, finite evaluation, effective optimization, resource finiteness, and combinatorial interaction -- any optimized AI agent will systematically under-invest effort in quality dimensions not covered by its evaluation system. This result establishes reward hacking as a structural equilibrium, not a correctable bug, and holds regardless of the specific alignment method (RLHF, DPO, Constitutional AI, or others) or evaluation architecture employed. Our framework instantiates the multi-task principal-agent model of Holmstrom and Milgrom (1991) in the AI alignment setting, but exploits a structural feature unique to AI systems -- the known, differentiable architecture of reward models -- to derive a computable distortion index that predicts both the direction and severity of hacking on each quality dimension prior to deployment. We further prove that the transition from closed reasoning to agentic systems causes evaluation coverage to decline toward zero as tool count grows -- because quality dimensions expand combinatorially while evaluation costs grow at most linearly pe...
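
The abstract does not define the distortion index, but its ingredients are named: the designer's intended per-dimension quality weights and the gradient of a known, differentiable reward model. The sketch below is one plausible instantiation under those assumptions, not the paper's construction; the names `distortion_index`, `reward_model`, and `true_weights` are hypothetical, and the index here is simply the gap between the intended marginal value of each dimension and the marginal reward the proxy model actually pays.

```python
import torch
import torch.nn as nn

def distortion_index(reward_model: nn.Module,
                     true_weights: torch.Tensor,
                     features: torch.Tensor) -> torch.Tensor:
    """Per-dimension gap between the intended marginal value of quality
    (true_weights) and the marginal reward dR/dx that the differentiable
    reward model pays at a reference point. Positive entries mark
    dimensions an optimized agent will under-invest in; negative entries
    mark dimensions it will over-optimize."""
    x = features.clone().requires_grad_(True)
    reward_model(x).sum().backward()  # populates x.grad with dR/dx
    return true_weights - x.grad

# Toy usage: a linear proxy reward that is blind to the last dimension.
if __name__ == "__main__":
    dim = 4
    proxy = nn.Linear(dim, 1, bias=False)
    with torch.no_grad():
        proxy.weight.copy_(torch.tensor([[1.0, 1.0, 1.0, 0.0]]))
    w_star = torch.ones(dim)   # designer values every dimension equally
    x_ref = torch.zeros(dim)   # reference evaluation point
    print(distortion_index(proxy, w_star, x_ref))
    # -> tensor([0., 0., 0., 1.]): dimension 3 is uncovered, so effort
    #    spent on it earns no reward and is predicted to be cut.
```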
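The coverage-decline claim also admits a one-line formalization. As a minimal sketch under assumed growth rates (the abstract is truncated before the exact rates): if quality dimensions arise from interactions among subsets of the n tools, their count D(n) grows at least like 2^n, while a budget that grows at most linearly lets the evaluation system measure at most E(n) <= c n dimensions for some constant c. Then

\[
  \mathrm{Coverage}(n) \;=\; \frac{E(n)}{D(n)}
  \;\le\; \frac{c\,n}{2^{n}} \;\longrightarrow\; 0
  \quad \text{as } n \to \infty .
\]

Any combinatorial lower bound on D(n), e.g. \(\binom{n}{2}\) for pairwise tool interactions, yields the same limit against a linear numerator; the exponential choice above is only an illustrative assumption.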