[2511.03710] Shrinking the Variance: Shrinkage Baselines for Reinforcement Learning with Verifiable Rewards

arXiv - Machine Learning · 3 min read

Summary

This article summarizes a paper that reduces the variance of policy-gradient estimators in reinforcement learning with verifiable rewards (RLVR) by replacing the standard per-prompt empirical-mean baseline with a shrinkage baseline, improving training stability at no additional computational cost.

Why It Matters

The research addresses a practical challenge in RLVR: with only a few generations per prompt, the empirical per-prompt mean reward is a noisy baseline, and that noise inflates the variance of the policy gradient. By improving the accuracy of per-prompt mean estimation, the method supports more stable post-training of large reasoning models, which matters for applications that depend on verified outcomes.

Key Takeaways

  • Shrinkage estimators improve the accuracy of per-prompt mean estimation in reinforcement learning.
  • The proposed shrinkage baseline reduces variance in policy-gradient estimators without additional computation.
  • Empirical results show that shrinkage baselines outperform traditional empirical-mean baselines, enhancing training stability.
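
The takeaways above can be sketched numerically. The snippet below is a minimal, hypothetical illustration of shrinking per-prompt mean rewards toward the across-prompt (batch) mean; the fixed weight `lam` and the function name `shrinkage_means` are simplifications for illustration, not the paper's exact estimator, which chooses its shrinkage weights from the data.

```python
import numpy as np

def shrinkage_means(rewards, lam=0.5):
    """Shrink per-prompt mean rewards toward the batch-wide mean.

    rewards: (num_prompts, num_generations) array of verifiable rewards.
    lam: shrinkage weight in [0, 1]; lam=0 recovers the per-prompt
         empirical mean, lam=1 uses only the across-prompt mean.
    """
    per_prompt = rewards.mean(axis=1)   # empirical mean per prompt
    global_mean = rewards.mean()        # across-prompt (batch) mean
    return (1 - lam) * per_prompt + lam * global_mean

# Two prompts, four generations each, binary verifiable rewards.
rewards = np.array([[1.0, 0.0, 1.0, 1.0],   # prompt A: mostly solved
                    [0.0, 0.0, 0.0, 1.0]])  # prompt B: mostly failed

print(shrinkage_means(rewards, lam=0.0))  # per-prompt means: [0.75 0.25]
print(shrinkage_means(rewards, lam=0.5))  # shrunk toward 0.5: [0.625 0.375]
```

With few generations per prompt, each per-prompt mean is noisy, and pooling toward the batch mean trades a little bias for a larger reduction in estimation variance, in the spirit of Stein's paradox.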

Computer Science > Machine Learning
arXiv:2511.03710 (cs)
[Submitted on 5 Nov 2025 (v1), last revised 18 Feb 2026 (this version, v2)]

Title: Shrinking the Variance: Shrinkage Baselines for Reinforcement Learning with Verifiable Rewards
Authors: Guanning Zeng, Zhaoyi Zhou, Daman Arora, Andrea Zanette

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for post-training large reasoning models (LRMs) using policy-gradient methods such as GRPO. To stabilize training, these methods typically center trajectory rewards by subtracting the empirical mean reward for each prompt. Statistically, this centering acts as a control variate (baseline), reducing the variance of the policy-gradient estimator. In practice, the mean reward is estimated using per-prompt empirical averages computed from the generations for each prompt in a batch. Motivated by Stein's paradox, we propose shrinkage estimators that combine per-prompt and across-prompt means to improve per-prompt mean estimation accuracy, especially in the low-generation regime typical of RLVR. Theoretically, we construct a shrinkage-based baseline that provably yields lower-variance policy-gradient estimators across algorithms. Our baseline is a drop-in replacement for standard per-prompt mean basel...
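
The "drop-in replacement" described in the abstract can be illustrated as follows. This is an assumed sketch, not the paper's exact formula: a single `lam` parameter interpolates between the standard per-prompt empirical-mean baseline (`lam=0`) and a baseline shrunk toward the batch mean, and GRPO's additional per-prompt standard-deviation normalization is omitted for clarity.

```python
import numpy as np

def advantages(rewards, lam=0.0):
    """Center trajectory rewards by a (possibly shrunk) per-prompt baseline.

    rewards: (num_prompts, num_generations) array.
    lam=0 recovers the standard per-prompt empirical-mean baseline;
    lam>0 shrinks that baseline toward the across-prompt mean.
    """
    baseline = (1 - lam) * rewards.mean(axis=1) + lam * rewards.mean()
    return rewards - baseline[:, None]  # broadcast baseline over generations

rewards = np.array([[1.0, 1.0, 0.0],
                    [0.0, 1.0, 0.0]])

A_standard = advantages(rewards)         # empirical-mean baseline
A_shrunk = advantages(rewards, lam=0.3)  # shrinkage baseline, same cost
```

Because the baseline is computed from quantities the algorithm already has (per-prompt and batch mean rewards), swapping it in adds no extra generations or model evaluations.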

