[2602.18037] Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Summary
This paper proposes gradient regularization as a way to prevent reward hacking in reinforcement learning: by biasing policy updates toward flatter regions of the optimization landscape, the reward model stays accurate throughout the training of language models.
Why It Matters
As reinforcement learning from human feedback becomes increasingly important in AI development, addressing reward hacking is crucial for ensuring that AI systems behave as intended. This research offers a new method that could improve the reliability of AI models, making them safer and more effective in real-world applications.
Key Takeaways
- Gradient regularization can bias policy updates towards more accurate reward regions.
- The study establishes a theoretical link between reward model accuracy and the flatness of the optimum reached at convergence.
- Empirical results show that gradient regularization outperforms traditional KL penalties in various reinforcement learning experiments.
- The approach helps prevent reward hacking in language models, enhancing their performance in tasks requiring human-like judgment.
- Reference resets of the KL penalty implicitly use gradient regularization to achieve better training outcomes.
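To make the core idea concrete, here is a minimal toy sketch (not the paper's code) of gradient regularization: adding a penalty λ‖∇f(θ)‖² to the objective amplifies the gradient in sharp (high-curvature) directions, so gradient descent flattens out sharp directions faster and is biased toward flat regions. The quadratic objective, step size, and λ below are illustrative choices.

```python
import numpy as np

# Toy quadratic objective f(theta) = 0.5 * theta^T H theta with one
# sharp and one flat coordinate direction.
H = np.diag([10.0, 0.1])  # curvatures: sharp vs. flat

def grad_f(theta):
    return H @ theta

def grad_regularized(theta, lam):
    # Gradient of f(theta) + lam * ||grad f(theta)||^2.
    # For this quadratic: H theta + 2 * lam * H (H theta).
    g = grad_f(theta)
    return g + 2.0 * lam * (H @ g)

theta = np.array([1.0, 1.0])
lam, lr = 0.5, 0.01
for _ in range(100):
    theta = theta - lr * grad_regularized(theta, lam)

# The sharp coordinate sees an effective curvature of
# 10 * (1 + 2 * lam * 10) = 110 and collapses quickly, while the flat
# coordinate (effective curvature ~0.11) is barely penalized.
print(theta)
```

The sharpness-dependent amplification (the `2 * lam * H @ g` term) is what makes the regularized dynamics prefer flat optima, which the paper connects to regions where the reward model remains accurate.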
arXiv:2602.18037 [cs.LG] · Submitted on 20 Feb 2026
Authors: Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama
Abstract: Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs). A common problem is reward hacking, where the policy exploits inaccuracies of the reward and learns an unintended behavior. Most previous works address this by limiting the policy update with a Kullback-Leibler (KL) penalty towards a reference model. We propose a different framing: train the LM in a way that biases policy updates towards regions in which the reward is more accurate. First, we derive a theoretical connection between the accuracy of a reward model and the flatness of an optimum at convergence. Gradient regularization (GR) can then be used to bias training towards flatter regions and thereby maintain reward model accuracy. We confirm these results by showing that the gradient norm and reward accuracy are empirically correlated in RLHF. We then show that Reference Resets of the KL penalty implicitly use GR to find flatter reg...
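For contrast, the baseline the abstract mentions is the standard KL-penalized RLHF objective, where the scalar reward is shifted by a per-token log-ratio penalty against a frozen reference model. The sketch below is a hypothetical illustration of that shaping (names and β value are not from the paper), using the common Monte-Carlo estimate of the sequence-level KL.

```python
import numpy as np

def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Return reward - beta * estimated KL(policy || reference).

    reward      -- reward-model score for the sampled response
    logp_policy -- log-probs of the sampled tokens under the policy
    logp_ref    -- log-probs of the same tokens under the reference model
    """
    # Sum of per-token log-ratios is a Monte-Carlo estimate of the
    # sequence-level KL divergence for this sample.
    kl_est = np.sum(logp_policy - logp_ref)
    return reward - beta * kl_est

# Policy assigns higher probability than the reference to its own
# sample, so the penalty reduces the effective reward.
r = kl_shaped_reward(1.0, np.array([-0.5, -1.0]), np.array([-0.7, -1.1]))
print(r)  # 1.0 - 0.1 * 0.3 = 0.97
```

The paper's point is that this penalty constrains how far the policy moves, whereas gradient regularization instead shapes *where* it moves, toward flat regions where the reward stays accurate.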