[2602.10693] VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training


arXiv - Machine Learning

Summary

The paper introduces VESPO, a novel approach for stable off-policy training of large language models (LLMs) that addresses training stability issues caused by policy staleness and distribution shifts.

Why It Matters

Training stability in reinforcement learning is crucial for the effective deployment of large language models. VESPO offers a solution to common challenges like policy divergence and high variance in importance sampling, making it relevant for researchers and practitioners in AI and machine learning.
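The high variance mentioned above is easy to see numerically: a sequence-level importance weight is the product of per-token probability ratios, so even a small per-token drift between the behavior and current policies compounds with sequence length. A minimal toy sketch (the Gaussian drift model and its parameters are illustrative assumptions, not the paper's setup):

```python
import math
import random
import statistics

random.seed(1)

def seq_weight(length, drift=0.1):
    """Importance weight for one sequence: the per-token log-ratio between
    the current and behavior policies is modeled as a small Gaussian drift
    (a toy assumption for illustration)."""
    return math.exp(sum(random.gauss(0.0, drift) for _ in range(length)))

# The variance of sequence-level weights blows up with sequence length,
# which is the instability that importance-sampling remedies must tame.
for length in (10, 100, 1000):
    weights = [seq_weight(length) for _ in range(2000)]
    print(f"len={length:5d}  weight variance={statistics.variance(weights):.4f}")
```

Longer sequences accumulate more drift, so the weight distribution becomes extremely heavy-tailed, motivating reshaping or clipping of the weights.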

Key Takeaways

  • VESPO addresses training stability issues in LLMs caused by policy staleness.
  • The method incorporates variance reduction into a variational framework.
  • Experiments show VESPO maintains stability under high staleness ratios and asynchronous execution.
  • The approach provides consistent performance improvements across various model types.
  • Code for VESPO is publicly available, promoting further research and application.

Computer Science > Machine Learning
arXiv:2602.10693 (cs)
[Submitted on 11 Feb 2026 (v1), last revised 24 Feb 2026 (this version, v2)]

Title: VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
Authors: Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, Xing Yu

Abstract: Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance; existing remedies such as token-level clipping and sequence-level normalization lack a unified theoretical foundation. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By incorporating variance reduction into a variational formulation over proposal distributions, VESPO derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution, an...
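To ground the terms in the abstract: the sequence-level importance weight is the exponentiated sum of per-token log-probability differences between the current and behavior policies, and the baseline remedy the paper contrasts against is hard clipping of that weight. The sketch below shows only these generic ingredients; VESPO's closed-form reshaping kernel is not reproduced here, and the function names and sample numbers are illustrative assumptions:

```python
import math
import random

random.seed(0)

def sequence_importance_weight(logp_current, logp_behavior):
    """Sequence-level importance weight: exp of the summed per-token
    log-prob differences between the current policy and the (possibly
    stale) behavior policy that generated the sequence."""
    return math.exp(sum(c - b for c, b in zip(logp_current, logp_behavior)))

def clipped_weight(w, eps=0.2):
    """Baseline remedy: hard-clip the weight into [1 - eps, 1 + eps],
    a PPO-style heuristic rather than VESPO's variational kernel."""
    return max(1.0 - eps, min(1.0 + eps, w))

# Hypothetical per-token log-probs for a few sequences sampled under a
# slightly drifted behavior policy (numbers are for illustration only).
for _ in range(5):
    logp_b = [math.log(random.uniform(0.05, 0.5)) for _ in range(20)]
    logp_c = [lb + random.gauss(0.0, 0.1) for lb in logp_b]  # small drift
    w = sequence_importance_weight(logp_c, logp_b)
    print(f"raw weight: {w:8.3f}   clipped: {clipped_weight(w):.3f}")
```

Hard clipping bounds the variance but discards the gradient signal of sequences outside the clip range; the paper's contribution is a smooth, theoretically derived reshaping of these same sequence-level weights instead.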
