[2512.00499] ESPO: Entropy Importance Sampling Policy Optimization
Summary
The paper presents ESPO, a reinforcement learning framework for post-training large language models that uses entropy-based importance sampling to improve training stability and efficiency.
Why It Matters
As reinforcement learning becomes integral to enhancing large language models, the ESPO framework offers a solution to the trade-off between training stability and efficiency. By improving gradient utilization, it can lead to better performance on complex reasoning tasks, which is crucial for advancing AI capabilities.
Key Takeaways
- ESPO combines fine-grained updates with stable training through entropy-based methods.
- The framework addresses gradient underutilization, enhancing training efficiency.
- Extensive experiments show ESPO accelerates convergence and improves accuracy on mathematical reasoning benchmarks.
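The excerpt does not spell out ESPO's exact grouping rule, but the entropy signal it builds on is standard: the Shannon entropy of the policy's next-token distribution at each position. The sketch below computes per-token entropies from logits and then partitions tokens with a median split; the grouping threshold, shapes, and random logits are illustrative assumptions, not details from the paper.

```python
import numpy as np

def token_entropies(logits):
    """Shannon entropy of the policy's next-token distribution at each position.

    logits: array of shape (seq_len, vocab_size).
    """
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

rng = np.random.default_rng(1)
logits = rng.normal(0.0, 2.0, size=(16, 1000))  # toy sequence of 16 tokens
ent = token_entropies(logits)

# Hypothetical grouping rule: split tokens at the median entropy.
threshold = np.median(ent)
low_group = np.where(ent <= threshold)[0]   # low-entropy (confident) tokens
high_group = np.where(ent > threshold)[0]   # high-entropy (uncertain) tokens
print(len(low_group), len(high_group))
```

Any per-group treatment (e.g. different clipping ranges or weighting) would then be applied to `low_group` and `high_group` separately; how ESPO does this is not recoverable from the excerpt.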
Computer Science > Machine Learning
arXiv:2512.00499 (cs)
[Submitted on 29 Nov 2025 (v1), last revised 15 Feb 2026 (this version, v2)]
Title: ESPO: Entropy Importance Sampling Policy Optimization
Authors: Yuepeng Sheng, Yuwei Huang, Shuman Liu, Anxiang Zeng, Haibo Zhang
Abstract: Reinforcement learning (RL) has become a central component of post-training for large language models (LLMs), particularly for complex reasoning tasks that require stable optimization over long generation horizons. However, achieving performance at scale often introduces a fundamental trade-off between training stability and training efficiency. Token-level optimization applies fine-grained updates at the level of individual tokens, but is prone to high variance in gradient estimation, which can result in unstable training dynamics. In contrast, sequence-level optimization often relies on aggressive clipping mechanisms to ensure stable updates; such a design may discard a large fraction of valid training samples, leading to inefficient gradient utilization and reduced training efficiency. We refer to this phenomenon as gradient underutilization. In this work, we propose Entropy Importance Sampling Policy Optimization (ESPO), a novel framework that aims to combine fine-grained updates with stable training. ESPO decomposes sequences into groups based on pr...
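The gradient-underutilization effect the abstract describes can be seen numerically: a sequence-level importance ratio is a product of per-token ratios, so even small per-token drift compounds over a long horizon and pushes many sequences outside a PPO-style clipping range, zeroing their gradients. The sketch below is a toy illustration only; the batch shape, noise scale, and clipping range eps = 0.2 are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.2  # PPO-style clipping range (assumed, for illustration)

# Hypothetical batch: 64 sequences of 128 tokens, with per-token
# log-ratios between the new and old policy drawn as small noise.
log_ratios = rng.normal(0.0, 0.05, size=(64, 128))

# Token-level importance ratios: one per token (fine-grained, high variance).
token_ratios = np.exp(log_ratios)

# Sequence-level importance ratios: product over tokens = exp of the sum,
# so per-token drift compounds over the generation horizon.
seq_ratios = np.exp(log_ratios.sum(axis=1))

# Fraction of ratios that fall outside [1 - eps, 1 + eps]; for a
# positive advantage, clipping zeroes the gradient of these samples.
token_clipped = np.mean((token_ratios < 1 - eps) | (token_ratios > 1 + eps))
seq_clipped = np.mean((seq_ratios < 1 - eps) | (seq_ratios > 1 + eps))

print(f"token-level clipped fraction: {token_clipped:.3f}")
print(f"sequence-level clipped fraction: {seq_clipped:.3f}")
```

Under these assumed numbers, almost no individual tokens are clipped while a large majority of whole sequences are, which is the stability-versus-efficiency trade-off ESPO targets.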