[2506.10947] Spurious Rewards: Rethinking Training Signals in RLVR
Summary
The paper examines spurious rewards in reinforcement learning with verifiable rewards (RLVR), showing that such rewards can substantially improve mathematical reasoning in certain language models even when they have little, no, or negative correlation with answer correctness.
Why It Matters
This research challenges conventional assumptions about reward signals in reinforcement learning. Because gains from spurious rewards appear to amplify behaviors already learned during pretraining rather than teach new capabilities, and because those gains vary sharply across model families, RLVR results validated on a single model can mislead developers and researchers about what a training method actually contributes.
Key Takeaways
- Spurious rewards can improve performance in RLVR despite low correlation with correct answers.
- The effectiveness of spurious rewards is strongly model-dependent: gains appear for Qwen2.5-Math models but do not reliably transfer to other model families.
- RLVR methods should therefore be validated across diverse models, since conclusions drawn from a single model family may not generalize.
- Counterintuitive results underscore the need to understand pretrained behaviors, such as the code-reasoning pattern identified in Qwen2.5-Math, before crediting the reward signal itself.
- Random rewards can yield gains comparable to ground-truth rewards in some settings: on MATH-500, random rewards improve Qwen2.5-Math-7B by 21.4 percentage points versus 29.1 for ground truth (a minimal sketch of such a reward follows this list).
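As a concrete illustration, below is a minimal sketch of what a spurious reward function looks like next to a verifiable ground-truth reward. This is an illustrative assumption, not the paper's released code: the names (random_reward, extract_final_answer, ground_truth_reward) are hypothetical, and the paper also studies other spurious signals beyond coin flips.

    import random

    def random_reward(prompt: str, completion: str) -> float:
        # Spurious signal: a fair coin flip, independent of whether the
        # completion actually answers the prompt correctly.
        return float(random.random() < 0.5)

    def extract_final_answer(completion: str) -> str:
        # Stand-in parser: treat the last whitespace-separated token as
        # the model's final answer (real verifiers are more careful).
        tokens = completion.strip().split()
        return tokens[-1] if tokens else ""

    def ground_truth_reward(completion: str, answer: str) -> float:
        # Verifiable signal: 1.0 if the parsed final answer matches the
        # reference answer, else 0.0.
        return float(extract_final_answer(completion) == answer)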
Computer Science > Artificial Intelligence
arXiv:2506.10947 (cs)
[Submitted on 12 Jun 2025 (v1), last revised 25 Feb 2026 (this version, v2)]
Title: Spurious Rewards: Rethinking Training Signals in RLVR
Authors: Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, Yulia Tsvetkov, Hannaneh Hajishirzi, Pang Wei Koh, Luke Zettlemoyer
Abstract: We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR training with GRPO improves MATH-500 performance for Qwen2.5-Math-7B by 21.4 percentage points using randomly assigned rewards, nearly matching the 29.1-point gain from ground-truth rewards. To explain this counterintuitive observation, we show that GRPO exhibits a clipping bias from the clip term, which can amplify high-prior behaviors learned during pretraining even without informative rewards. As a case study, we identify one such behavior in Qwen2.5-Math models, which we call code reasoning: reasoning in code without actual code execution; code-reasoning frequency increases from 65 percent to over 90 percent with spurious rewards. However, the presence of such amplifiable ...
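The clipping bias mentioned in the abstract enters through the PPO-style clipped surrogate that GRPO optimizes. The sketch below shows where that clip term sits, assuming group-normalized advantages and a symmetric clip range eps; it is a minimal reconstruction for exposition, not the authors' implementation.

    import torch

    def grpo_clipped_loss(logp_new: torch.Tensor,
                          logp_old: torch.Tensor,
                          rewards: torch.Tensor,
                          eps: float = 0.2) -> torch.Tensor:
        # Group-relative advantages: rewards are normalized within the
        # group of rollouts sampled for one prompt. With a random reward,
        # these advantages are pure noise.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

        # Importance ratio between the current and behavior policies.
        ratio = torch.exp(logp_new - logp_old)

        # PPO-style clipped surrogate. Clipping the ratio to
        # [1 - eps, 1 + eps] treats up- and down-weighting asymmetrically,
        # which, per the paper's analysis, can amplify completions the
        # pretrained model already favors even when rewards are
        # uninformative.
        unclipped = ratio * adv
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
        return -torch.min(unclipped, clipped).mean()

Feeding this loss random Bernoulli rewards (e.g. torch.bernoulli(torch.full((8,), 0.5))) reproduces the random-reward setting in miniature: the advantages carry no information about correctness, yet the clipped update is not neutral with respect to the model's pretrained prior.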