[2604.08865] SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
Computer Science > Artificial Intelligence

arXiv:2604.08865 (cs) [Submitted on 10 Apr 2026]

Title: SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

Authors: Tianyi Wang, Yixia Li, Long Li, Yibiao Chen, Shaohan Huang, Yun Chen, Peng Li, Yang Liu, Guanhua Chen

Abstract: Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) on reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimation, severely limiting training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that harmonizes the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without multi-sampling. Extensive experiments on mathematical benchmarks demonstrate that SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-b...
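To make the abstract's core idea concrete: treating the full response as a single bandit action means one importance ratio and one advantage per sequence, with a scalar prompt-conditioned value as the baseline instead of GRPO's multi-sample group mean. Below is a minimal PyTorch sketch of such a sequence-level clipped surrogate; the function and tensor names are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn.functional as F

def sequence_level_ppo_loss(logp_new, logp_old, rewards, values, clip_eps=0.2):
    """Sketch of a sequence-level clipped PPO loss (names are assumptions).

    logp_new, logp_old: summed log-probabilities of each full response under
        the current and behavior policies, shape (batch,).
    rewards: verifiable outcome rewards, one per sequence, shape (batch,).
    values: scalar baseline V(prompt) from a decoupled critic, shape (batch,).
    """
    # Sequence-level advantage: outcome reward minus a scalar baseline,
    # so no multi-sample group estimate (as in GRPO) is required.
    advantages = rewards - values.detach()
    # One importance ratio per sequence instead of per token.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # The critic regresses the outcome reward directly.
    value_loss = F.mse_loss(values, rewards)
    return policy_loss, value_loss

Because the critic outputs a single scalar per prompt rather than a per-token value, it can be far smaller than a token-level value model, which is consistent with the memory argument made in the abstract.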