[2602.10693] VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
Summary
The paper introduces VESPO, an approach for stable off-policy training of large language models (LLMs) that addresses the training instability caused by policy staleness and distribution shift between the behavior policy and the current policy.
Why It Matters
Training stability is a central obstacle in reinforcement learning for large language models. VESPO targets common failure modes such as policy divergence and the high variance of importance sampling, making it relevant for researchers and practitioners training LLMs with off-policy or asynchronous RL.
Key Takeaways
- VESPO addresses training stability issues in LLMs caused by policy staleness.
- The method incorporates variance reduction into a variational framework.
- Experiments show VESPO maintains stability under high staleness ratios and asynchronous execution.
- The approach provides consistent performance improvements across various model types.
- Code for VESPO is publicly available, promoting further research and application.
Computer Science > Machine Learning

arXiv:2602.10693 (cs)
Submitted on 11 Feb 2026 (v1); last revised 24 Feb 2026 (this version, v2)
Authors: Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, Xing Yu

Abstract: Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance; existing remedies such as token-level clipping and sequence-level normalization lack a unified theoretical foundation. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By incorporating variance reduction into a variational formulation over proposal distributions, VESPO derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution, an...
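To make the variance problem in the abstract concrete, here is a minimal sketch of why sequence-level importance weights become heavy-tailed under staleness, and how a smooth reshaping of the weights reduces variance. The power-tempering used below is a generic stand-in for illustration only; the paper's actual closed-form reshaping kernel is not given in this excerpt, and the drift model and parameters are assumptions.

```python
import math
import random

random.seed(0)

def sequence_log_weight(num_tokens, per_token_drift=0.05):
    # Each token contributes a small log-prob gap between the current and the
    # (stale) behavior policy. The gaps sum over the sequence, so the
    # sequence-level weight exp(sum) grows heavy-tailed with length -- the
    # source of high importance-sampling variance.
    return sum(random.gauss(0.0, per_token_drift) for _ in range(num_tokens))

def raw_weight(log_w):
    # Unmodified sequence-level importance weight.
    return math.exp(log_w)

def tempered_weight(log_w, alpha=0.5):
    # Hypothetical smooth reshaping (power tempering, w**alpha with
    # 0 < alpha <= 1): shrinks the tails continuously, trading a controlled
    # bias for lower variance, unlike hard token-level clipping.
    return math.exp(alpha * log_w)

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

log_ws = [sequence_log_weight(256) for _ in range(10_000)]
raw = [raw_weight(lw) for lw in log_ws]
soft = [tempered_weight(lw) for lw in log_ws]

print(variance(raw) > variance(soft))  # tempering reduces variance
```

The sketch only illustrates the variance mechanism; VESPO's contribution is deriving the specific reshaping from a variational formulation over proposal distributions rather than choosing one ad hoc.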