[2511.03710] Shrinking the Variance: Shrinkage Baselines for Reinforcement Learning with Verifiable Rewards
Summary
This article presents an approach to reducing the variance of policy-gradient estimators in reinforcement learning with verifiable rewards (RLVR) via shrinkage baselines, improving training stability and efficiency when post-training large reasoning models.
Why It Matters
The research addresses a critical challenge in reinforcement learning by proposing a method that improves the accuracy of per-prompt mean-reward estimation, which determines the quality of the baseline used to center rewards and is essential for the stable training of large reasoning models. This advancement can lead to more reliable AI systems, particularly in applications requiring verified outcomes.
Key Takeaways
- Shrinkage estimators improve the accuracy of per-prompt mean estimation in reinforcement learning.
- The proposed shrinkage baseline reduces variance in policy-gradient estimators without additional computation.
- Empirical results show that shrinkage baselines outperform traditional empirical-mean baselines, enhancing training stability.
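The idea behind the takeaways above can be sketched in code. The exact shrinkage weights used in the paper are not reproduced here; the sketch below uses a generic James-Stein-style estimator that pulls each per-prompt empirical mean toward the across-prompt (grand) mean, then uses the shrunk means as baselines when centering rewards. Function names and the variance estimate are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def shrinkage_baselines(rewards, sigma2=None):
    """James-Stein-style shrinkage of per-prompt mean rewards.

    rewards: (n_prompts, k) array of trajectory rewards, k generations
    per prompt. Returns per-prompt baselines shrunk toward the grand
    mean. Note: a generic sketch; the paper's shrinkage weight may differ.
    """
    n, k = rewards.shape
    m = rewards.mean(axis=1)            # per-prompt empirical means
    m_bar = m.mean()                    # across-prompt (grand) mean
    if sigma2 is None:
        # pooled estimate of the per-generation reward variance
        sigma2 = rewards.var(axis=1, ddof=1).mean()
    s2 = np.sum((m - m_bar) ** 2)
    # James-Stein shrinkage weight, clipped to [0, 1]; sigma2 / k is
    # the variance of each per-prompt mean estimate
    lam = 0.0 if s2 == 0 else min(1.0, (n - 3) * sigma2 / k / s2)
    return (1.0 - lam) * m + lam * m_bar

def centered_advantages(rewards):
    """Center rewards with the shrinkage baseline (drop-in replacement
    for subtracting the per-prompt empirical mean, as in GRPO)."""
    b = shrinkage_baselines(rewards)
    return rewards - b[:, None]
```

When the number of generations per prompt `k` is small (the low-generation regime the paper targets), `sigma2 / k` is large, so `lam` grows and each prompt's baseline borrows more strength from the other prompts in the batch.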
Computer Science > Machine Learning
arXiv:2511.03710 (cs)
[Submitted on 5 Nov 2025 (v1), last revised 18 Feb 2026 (this version, v2)]
Title: Shrinking the Variance: Shrinkage Baselines for Reinforcement Learning with Verifiable Rewards
Authors: Guanning Zeng, Zhaoyi Zhou, Daman Arora, Andrea Zanette
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for post-training large reasoning models (LRMs) using policy-gradient methods such as GRPO. To stabilize training, these methods typically center trajectory rewards by subtracting the empirical mean reward for each prompt. Statistically, this centering acts as a control variate (baseline), reducing the variance of the policy-gradient estimator. In practice, the mean reward is estimated using per-prompt empirical averages computed from the generations for each prompt in a batch. Motivated by Stein's paradox, we propose shrinkage estimators that combine per-prompt and across-prompt means to improve per-prompt mean estimation accuracy, especially in the low-generation regime typical of RLVR. Theoretically, we construct a shrinkage-based baseline that provably yields lower-variance policy-gradient estimators across algorithms. Our baseline is a drop-in replacement for standard per-prompt mean basel...