[2602.17616] Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs

arXiv - Machine Learning · 4 min read

Summary

The paper presents VCPO (Variance Controlled Policy Optimization), a method that stabilizes off-policy reinforcement learning for large language models by addressing the high-variance gradient estimates that arise in asynchronous training.

Why It Matters

Asynchronous reinforcement learning can substantially increase the throughput of training large language models, but stale rollouts inflate the variance of policy-gradient estimates and destabilize learning. This research proposes a remedy that improves both the reliability and the speed of training, which matters for advancing AI applications.

Key Takeaways

  • VCPO reduces variance in off-policy reinforcement learning, enhancing stability.
  • The method scales learning rates based on effective sample size to mitigate unreliable updates.
  • Empirical results show VCPO improves performance across various reasoning tasks.
  • The approach can reduce training time significantly while matching the performance of synchronous training.
  • Effective control of policy-gradient variance is essential for reliable asynchronous training.
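
The ESS-based learning-rate scaling described in the takeaways can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; in particular, the `ESS / n` scaling rule and the function names here are assumptions:

```python
import numpy as np

def effective_sample_size(log_ratios):
    """ESS of importance weights: (sum w)^2 / sum w^2.

    Equals n when all weights are equal (on-policy) and approaches 1
    when a single heavy-tailed ratio dominates the batch.
    """
    w = np.exp(np.asarray(log_ratios) - np.max(log_ratios))  # stabilized exp
    return w.sum() ** 2 / (w ** 2).sum()

def scaled_learning_rate(base_lr, log_ratios):
    """Dampen updates when ESS is low.

    Hypothetical scaling rule: multiply the base learning rate by
    ESS / n, so fully on-policy batches keep base_lr and batches
    dominated by a few samples take proportionally smaller steps.
    """
    n = len(log_ratios)
    return base_lr * effective_sample_size(log_ratios) / n
```

On a batch of uniform importance ratios the scale factor is 1; a batch where one stale rollout dominates drives the ESS, and hence the step size, toward its minimum.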

Computer Science > Machine Learning
arXiv:2602.17616 (cs) · Submitted on 19 Feb 2026

Title: Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs
Authors: Luke Huang, Zhuoyang Zhang, Qinghao Hu, Shang Yang, Song Han

Abstract: Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks, and asynchronous RL training is attractive because it increases end-to-end throughput. However, for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly higher variance: training on stale rollouts creates heavy-tailed importance ratios, causing a small fraction of samples to dominate updates. This amplification makes gradients noisy and learning unstable relative to matched on-policy training. Across math and general reasoning benchmarks, we find collapse is reliably predicted by effective sample size (ESS) and unstable gradient norms. Motivated by this diagnosis, we propose Variance Controlled Policy Optimization (VCPO), a general stabilization method for REINFORCE/GRPO-style algorithms that (i) scales learning rate based on effective sample size to dampen unreliable updates, and (ii) applies a closed-form minimum-variance baseline for the off-pol...
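
The "closed-form minimum-variance baseline" mentioned in the abstract is truncated above. As a hedged sketch only, one textbook-style choice for an importance-weighted REINFORCE estimator weights each return by the squared importance ratio; whether this matches the paper's exact form is an assumption:

```python
import numpy as np

def importance_weighted_baseline(rewards, log_ratios):
    """Sketch of a variance-reducing baseline for off-policy REINFORCE.

    Weighting each return by the squared importance ratio w^2 gives
    b = E[w^2 R] / E[w^2], which minimizes Var(w * (R - b)) under the
    simplifying assumption that the estimator's mean shift from b is
    ignored. Illustrative formula only, not necessarily the paper's.
    """
    log_ratios = np.asarray(log_ratios)
    w2 = np.exp(2.0 * (log_ratios - np.max(log_ratios)))  # stabilized w^2
    r = np.asarray(rewards, dtype=float)
    return float((w2 * r).sum() / w2.sum())
```

When all ratios equal 1 (on-policy), this reduces to the plain mean return, i.e. the usual REINFORCE baseline; as ratios grow heavy-tailed, the baseline tracks the returns of the samples that dominate the gradient.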

Related Articles

  • [2604.01676] GPA: Learning GUI Process Automation from Demonstrations (arXiv - AI · 3 min)
  • [2604.01413] Adaptive Stopping for Multi-Turn LLM Reasoning (arXiv - AI · 4 min)
  • [2603.11749] Truth as a Compression Artifact in Language Model Training (arXiv - AI · 4 min)
  • [2603.10047] Toward Epistemic Stability: Engineering Consistent Procedures for Industrial LLM Hallucination Reduction (arXiv - AI · 4 min)