[2604.08865] SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
Computer Science > Artificial Intelligence

arXiv:2604.08865 (cs) [Submitted on 10 Apr 2026]

Title: SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

Authors: Tianyi Wang, Yixia Li, Long Li, Yibiao Chen, Shaohan Huang, Yun Chen, Peng Li, Yang Liu, Guanhua Chen

Abstract: Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) on reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimation, severely limiting training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that harmonizes the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without multi-sampling. Extensive experiments on mathematical benchmarks demonstrate that SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-b...
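To make the abstract's core idea concrete: treating the full response as a single bandit action means one importance ratio and one advantage per sequence, with a scalar prompt-conditioned value as the baseline instead of GRPO's multi-sample group mean. Below is a minimal PyTorch sketch of such a sequence-level clipped surrogate; the function and tensor names are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn.functional as F

def sequence_level_ppo_loss(logp_new, logp_old, rewards, values, clip_eps=0.2):
    """Sketch of a sequence-level clipped PPO loss (names are assumptions).

    logp_new, logp_old: summed log-probabilities of each full response under
        the current and behavior policies, shape (batch,).
    rewards: verifiable outcome rewards, one per sequence, shape (batch,).
    values: scalar baseline V(prompt) from a decoupled critic, shape (batch,).
    """
    # Sequence-level advantage: outcome reward minus a scalar baseline,
    # so no multi-sample group estimate (as in GRPO) is required.
    advantages = rewards - values.detach()
    # One importance ratio per sequence instead of per token.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # The critic regresses the outcome reward directly.
    value_loss = F.mse_loss(values, rewards)
    return policy_loss, value_loss

Because the critic outputs a single scalar per prompt rather than a per-token value, it can be far smaller than a token-level value model, which is consistent with the memory argument made in the abstract.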