[2506.10947] Spurious Rewards: Rethinking Training Signals in RLVR


arXiv - Machine Learning · 4 min read · Article

Summary

The paper examines spurious rewards in reinforcement learning with verifiable rewards (RLVR), showing that they can improve model performance even when they have little or no correlation with correct answers.

Why It Matters

This research is significant as it challenges conventional beliefs about reward structures in reinforcement learning, highlighting the need for diverse validation across different models. It suggests that reliance on spurious rewards can lead to misleading conclusions about model capabilities, which is critical for developers and researchers in AI.

Key Takeaways

  • Spurious rewards can improve performance in RLVR despite low correlation with correct answers.
  • The effectiveness of spurious rewards is model-dependent, with significant variations across different AI models.
  • Validation of RL methods should encompass a range of models to avoid over-reliance on specific training signals.
  • Counterintuitive results highlight the importance of understanding underlying model behaviors.
  • Random rewards can yield performance gains comparable to ground-truth rewards in certain contexts.
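The reward variants described in the takeaways above can be sketched as simple scoring functions over model outputs. This is an illustrative reconstruction, not the authors' code; the function names and the coin-flip probability are assumptions:

```python
import random

def ground_truth_reward(answer: str, gold: str) -> float:
    # Standard RLVR signal: reward 1 only when the extracted
    # answer matches the gold label.
    return 1.0 if answer.strip() == gold.strip() else 0.0

def random_reward(answer: str, gold: str, p: float = 0.5) -> float:
    # Spurious reward: 1 with probability p, independent of
    # whether the answer is correct.
    return 1.0 if random.random() < p else 0.0

def incorrect_reward(answer: str, gold: str) -> float:
    # Spurious reward negatively correlated with correctness:
    # rewards only wrong answers.
    return 1.0 if answer.strip() != gold.strip() else 0.0
```

Despite carrying no (or inverted) information about correctness, the paper reports that training against signals like these can still lift MATH-500 accuracy on certain models.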

Computer Science > Artificial Intelligence
arXiv:2506.10947 (cs)
[Submitted on 12 Jun 2025 (v1), last revised 25 Feb 2026 (this version, v2)]

Title: Spurious Rewards: Rethinking Training Signals in RLVR
Authors: Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, Yulia Tsvetkov, Hannaneh Hajishirzi, Pang Wei Koh, Luke Zettlemoyer

Abstract: We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR training with GRPO improves MATH-500 performance for Qwen2.5-Math-7B by 21.4 percentage points using randomly assigned rewards, nearly matching the 29.1-point gain from ground-truth rewards. To explain this counterintuitive observation, we show that GRPO exhibits a clipping bias from the clip term, which can amplify high-prior behaviors learned during pretraining even without informative rewards. As a case study, we identify one such behavior in Qwen2.5-Math models, which we call code reasoning -- reasoning in code without actual code execution; code-reasoning frequency increases from 65 percent to over 90 percent with spurious rewards. However, the presence of such amplifiable ...
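The clipping bias the abstract attributes to GRPO arises from the PPO-style clipped surrogate combined with group-normalized advantages. A minimal single-token sketch of those two ingredients, assuming the standard GRPO formulation rather than the paper's own implementation:

```python
import math

def group_advantages(rewards):
    # Group-relative advantages: standardize rewards within one
    # sampled group of rollouts (GRPO's replacement for a critic).
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + 1e-8) for r in rewards]

def clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    # PPO-style clipped surrogate for a single token:
    #   min(r * A, clip(r, 1 - eps, 1 + eps) * A)
    # The clip is asymmetric in effect: it caps how much a token can be
    # pushed up, but a token with negative advantage and small ratio is
    # still penalized at the clipped value. Even with uninformative
    # rewards, these asymmetric updates can keep reinforcing
    # high-probability (high-prior) behaviors from pretraining.
    clipped_ratio = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

Note that with a random 0/1 reward, the group-normalized advantages are still nonzero, so the policy receives gradient updates; the clip term then shapes which behaviors those updates favor.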

Related Articles

Llms

Is the Mirage Effect a bug, or is it Geometric Reconstruction in action? A framework for why VLMs perform better "hallucinating" than guessing, and what that may tell us about what's really inside these models

Last week, a team from Stanford and UCSF (Asadi, O'Sullivan, Fei-Fei Li, Euan Ashley et al.) dropped two companion papers. The first, MAR...

Reddit - Artificial Intelligence · 1 min ·
Llms

Paper Finds That Leading AI Chatbots Like ChatGPT and Claude Remain Incredibly Sycophantic, Resulting in Twisted Effects on Users

https://futurism.com/artificial-intelligence/paper-ai-chatbots-chatgpt-claude-sycophantic Your AI chatbot isn’t neutral. Trust its advice...

Reddit - Artificial Intelligence · 1 min ·
Llms

Claude Code leak exposes a Tamagotchi-style ‘pet’ and an always-on agent | The Verge

Anthropic says “human error” resulted in a leak that exposed Claude Code’s source code. The leaked code, which has since been copied to G...

The Verge - AI · 4 min ·
Llms

You can now use ChatGPT with Apple’s CarPlay | The Verge

ChatGPT is now accessible from your CarPlay dashboard if you have iOS 26.4 or newer and the latest version of the ChatGPT app.

The Verge - AI · 3 min ·

