[2602.12579] VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction
Summary
The paper introduces VI-CuRL, a framework that stabilizes verifier-independent reinforcement learning (RL) through confidence-guided variance reduction, addressing the training instability that arises when RL is run without external verifiers.
Why It Matters
As reinforcement learning with verifiable rewards becomes more prevalent, the limitations of relying on external verifiers hinder scalability. VI-CuRL offers a solution by enabling RL to operate independently of these verifiers, promoting stability and efficiency in training models, which is crucial for advancing AI capabilities.
Key Takeaways
- VI-CuRL addresses critical issues of gradient variance in verifier-independent RL.
- The framework leverages intrinsic model confidence to enhance training stability.
- Empirical results show that VI-CuRL outperforms existing verifier-independent methods across multiple benchmarks.
- Theoretical analysis confirms the asymptotic unbiasedness of the proposed estimator.
- This approach could significantly advance the scalability of RL applications in AI.
Computer Science > Machine Learning
arXiv:2602.12579 (cs) [Submitted on 13 Feb 2026]
Title: VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction
Authors: Xin-Qiang Cai, Masashi Sugiyama
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a dominant paradigm for enhancing Large Language Model (LLM) reasoning, yet its reliance on external verifiers limits its scalability. Recent findings suggest that RLVR primarily functions by eliciting latent capabilities, motivating the development of verifier-free algorithms. However, in such settings, standard methods like Group Relative Policy Optimization face a critical challenge: destructive gradient variance that often leads to training collapse. To address this issue, we introduce Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that leverages the model's intrinsic confidence to construct a curriculum independent of external verifiers. By prioritizing high-confidence samples, VI-CuRL effectively manages the bias-variance trade-off, specifically targeting the reduction of action and problem variance. We provide a rigorous theoretical analysis, proving that our estimator guarantees asymptotic unbiasedness. Empirically, VI-CuRL promotes stability...
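The abstract describes prioritizing high-confidence samples to build a verifier-free curriculum. As a rough illustration of that idea, the sketch below scores each sampled response by its mean token log-probability (one common proxy for a model's intrinsic confidence; the paper's exact confidence measure and curriculum rule are not specified here, so the threshold-and-sort scheme, the `sequence_confidence` helper, and the toy log-probabilities are all illustrative assumptions):

```python
import math

def sequence_confidence(token_logprobs):
    """Mean token log-probability, mapped to a (0, 1] confidence score.

    A common proxy for a model's intrinsic confidence in a sampled
    response; the paper's actual confidence measure may differ.
    """
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

def confidence_curriculum(samples, threshold):
    """Keep responses whose confidence exceeds `threshold`, highest first.

    `samples` is a list of (response, token_logprobs) pairs. Prioritizing
    high-confidence samples mirrors the curriculum idea in the abstract;
    the hard threshold here is a simplification for illustration.
    """
    scored = [(resp, sequence_confidence(lp)) for resp, lp in samples]
    # Sort so the highest-confidence samples come first in the curriculum.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [resp for resp, conf in scored if conf >= threshold]

# Toy usage with hypothetical token log-probabilities.
samples = [
    ("answer A", [-0.1, -0.2, -0.1]),   # high confidence
    ("answer B", [-2.0, -1.5, -2.5]),   # low confidence
]
print(confidence_curriculum(samples, threshold=0.5))  # only "answer A" passes
```

Filtering out low-confidence responses before the policy update is one simple way such a scheme could reduce gradient variance, at the cost of some sampling bias, which is the bias-variance trade-off the paper targets.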