[2603.28063] Reward Hacking as Equilibrium under Finite Evaluation
Computer Science > Artificial Intelligence

arXiv:2603.28063 (cs)

[Submitted on 30 Mar 2026]

Title: Reward Hacking as Equilibrium under Finite Evaluation

Authors: Jiacheng Wang, Jinbin Huang

Abstract: We prove that under five minimal axioms -- multi-dimensional quality, finite evaluation, effective optimization, resource finiteness, and combinatorial interaction -- any optimized AI agent will systematically under-invest effort in quality dimensions not covered by its evaluation system. This result establishes reward hacking as a structural equilibrium, not a correctable bug, and holds regardless of the specific alignment method (RLHF, DPO, Constitutional AI, or others) or evaluation architecture employed. Our framework instantiates the multi-task principal-agent model of Holmstrom and Milgrom (1991) in the AI alignment setting, but exploits a structural feature unique to AI systems -- the known, differentiable architecture of reward models -- to derive a computable distortion index that predicts both the direction and severity of hacking on each quality dimension prior to deployment. We further prove that the transition from closed reasoning to agentic systems causes evaluation coverage to decline toward zero as tool count grows -- because quality dimensions expand combinatorially while evaluation costs grow at most linearly pe...
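
The abstract does not define the distortion index, but its ingredients are named: the designer's intended per-dimension quality weights and the gradient of a known, differentiable reward model. The sketch below is one plausible instantiation under those assumptions, not the paper's construction; the names `distortion_index`, `reward_model`, and `true_weights` are hypothetical, and the index here is simply the gap between the intended marginal value of each dimension and the marginal reward the proxy model actually pays.

```python
import torch
import torch.nn as nn

def distortion_index(reward_model: nn.Module,
                     true_weights: torch.Tensor,
                     features: torch.Tensor) -> torch.Tensor:
    """Per-dimension gap between the intended marginal value of quality
    (true_weights) and the marginal reward dR/dx that the differentiable
    reward model pays at a reference point. Positive entries mark
    dimensions an optimized agent will under-invest in; negative entries
    mark dimensions it will over-optimize."""
    x = features.clone().requires_grad_(True)
    reward_model(x).sum().backward()  # populates x.grad with dR/dx
    return true_weights - x.grad

# Toy usage: a linear proxy reward that is blind to the last dimension.
if __name__ == "__main__":
    dim = 4
    proxy = nn.Linear(dim, 1, bias=False)
    with torch.no_grad():
        proxy.weight.copy_(torch.tensor([[1.0, 1.0, 1.0, 0.0]]))
    w_star = torch.ones(dim)   # designer values every dimension equally
    x_ref = torch.zeros(dim)   # reference evaluation point
    print(distortion_index(proxy, w_star, x_ref))
    # -> tensor([0., 0., 0., 1.]): dimension 3 is uncovered, so effort
    #    spent on it earns no reward and is predicted to be cut.
```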
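The coverage-decline claim also admits a one-line formalization. As a minimal sketch under assumed growth rates (the abstract is truncated before the exact rates): if quality dimensions arise from interactions among subsets of the n tools, their count D(n) grows at least like 2^n, while a budget that grows at most linearly lets the evaluation system measure at most E(n) <= c n dimensions for some constant c. Then

\[
  \mathrm{Coverage}(n) \;=\; \frac{E(n)}{D(n)}
  \;\le\; \frac{c\,n}{2^{n}} \;\longrightarrow\; 0
  \quad \text{as } n \to \infty .
\]

Any combinatorial lower bound on D(n), e.g. \(\binom{n}{2}\) for pairwise tool interactions, yields the same limit against a linear numerator; the exponential choice above is only an illustrative assumption.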