[2602.17616] Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs
Summary
The paper presents VCPO (Variance Controlled Policy Optimization), a method that stabilizes off-policy reinforcement learning for large language models by controlling the high gradient variance that arises in asynchronous training.
Why It Matters
Asynchronous reinforcement learning can substantially increase the throughput of training large language models, but learning on stale rollouts often becomes unstable because gradient variance is inflated. This research proposes a remedy that could make such training both faster and more reliable, which is crucial for scaling RL-based post-training of language models.
Key Takeaways
- VCPO reduces variance in off-policy reinforcement learning, enhancing stability.
- The method scales learning rates based on effective sample size to mitigate unreliable updates.
- Empirical results show VCPO improves performance across various reasoning tasks.
- The approach can significantly reduce training time while matching the performance of synchronous training.
- Effective control of policy-gradient variance is essential for reliable asynchronous training.
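The ESS-based learning-rate scaling in the takeaways above can be sketched as follows. The function names, the log-sum-exp stabilization, and the use of the ESS fraction as a direct multiplier on the learning rate are illustrative assumptions, not the paper's exact recipe; the standard ESS of importance weights is (Σw)² / Σw², reported here as a fraction of the batch size.

```python
import numpy as np

def effective_sample_size(log_ratios):
    """ESS of the importance weights, as a fraction of batch size in (0, 1].

    Equal log-ratios (on-policy batch) give 1.0; a batch dominated by a
    few heavy-tailed ratios gives a value near 1/n.
    """
    # Subtract the max before exponentiating for numerical stability;
    # ESS is invariant to this shift because the weights are normalized.
    w = np.exp(log_ratios - np.max(log_ratios))
    w = w / w.sum()
    return float(1.0 / (len(w) * np.sum(w ** 2)))

def scaled_lr(base_lr, log_ratios, floor=0.0):
    """Dampen the update when the batch's ESS is low (stale, heavy-tailed
    importance ratios), so unreliable gradients move the policy less."""
    return max(floor, base_lr * effective_sample_size(log_ratios))
```

On a fresh on-policy batch (all log-ratios zero) the ESS fraction is 1 and the base learning rate is used unchanged; as rollouts go stale and a few samples dominate, the multiplier shrinks toward 1/n.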
Computer Science > Machine Learning
arXiv:2602.17616 (cs) [Submitted on 19 Feb 2026]
Title: Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs
Authors: Luke Huang, Zhuoyang Zhang, Qinghao Hu, Shang Yang, Song Han
Abstract: Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks, and asynchronous RL training is attractive because it increases end-to-end throughput. However, for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly higher variance: training on stale rollouts creates heavy-tailed importance ratios, causing a small fraction of samples to dominate updates. This amplification makes gradients noisy and learning unstable relative to matched on-policy training. Across math and general reasoning benchmarks, we find collapse is reliably predicted by effective sample size (ESS) and unstable gradient norms. Motivated by this diagnosis, we propose Variance Controlled Policy Optimization (VCPO), a general stabilization method for REINFORCE/GRPO-style algorithms that (i) scales learning rate based on effective sample size to dampen unreliable updates, and (ii) applies a closed-form minimum-variance baseline for the off-pol...
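The abstract's closed-form minimum-variance baseline is truncated in this excerpt, so the paper's exact formula is not reproduced here. As a hedged reference point only: for an importance-weighted REINFORCE estimator, a classical variance-minimizing scalar baseline (treating score magnitudes as constant across samples) weights rewards by squared importance ratios, b* = E[w²R] / E[w²]. The sketch below implements that textbook form under those stated assumptions; it should not be read as VCPO's actual baseline.

```python
import numpy as np

def min_variance_baseline(rewards, log_ratios):
    """Textbook variance-minimizing scalar baseline for an
    importance-weighted REINFORCE estimator: b* = E[w^2 R] / E[w^2].

    Assumes per-sample score magnitudes are comparable; this is an
    illustrative stand-in, not the paper's closed form.
    """
    # Shift log-ratios before exponentiating; the ratio of sums below
    # cancels the shift, so the baseline is unaffected.
    w = np.exp(log_ratios - np.max(log_ratios))
    w2 = w ** 2
    return float(np.sum(w2 * rewards) / np.sum(w2))
```

With uniform weights this reduces to the batch-mean reward (the usual GRPO-style group baseline); with heavy-tailed ratios it shifts the baseline toward the rewards of the dominant samples, which is what reduces the variance they contribute.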