[2602.15620] STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

arXiv - AI · 4 min read

Summary

The paper presents STAPO, an approach that stabilizes reinforcement learning for large language models by silencing the rare spurious tokens that cause training instability.

Why It Matters

This research addresses a critical issue in reinforcement learning for large language models, where spurious tokens can lead to performance collapse. By silencing these tokens, STAPO improves training stability and reasoning quality, in particular guarding against the late-stage performance collapse that existing heuristic methods suffer.

Key Takeaways

  • STAPO targets spurious tokens that negatively impact reinforcement learning stability.
  • The method improves reasoning performance by an average of 7.13% across multiple benchmarks.
  • Training instability is linked to a small fraction of tokens, which STAPO effectively silences (see the numeric sketch after this list).
  • The research provides a new framework for refining large-scale language models.
  • STAPO demonstrates superior entropy stability compared to existing methods.
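
As a concrete aside on the takeaways above: for a softmax policy, the gradient of log pi(a) with respect to the logits is onehot(a) minus pi, so a sampled token that is both low-probability and drawn from an otherwise confident (low-entropy) distribution receives the largest update. The tiny PyTorch check below illustrates that identity; the probability values are invented for the example and do not come from the paper.

import torch

# Gradient of log softmax w.r.t. the logits: onehot(a) - pi.
# Its norm grows as pi(a) shrinks and the remaining mass concentrates.
def grad_norm(probs, a):
    g = -probs.clone()
    g[a] += 1.0
    return g.norm().item()

peaked = torch.tensor([0.9899, 0.0100, 0.0001])  # confident context, token 2 rare
flat = torch.tensor([1 / 3, 1 / 3, 1 / 3])       # high-entropy context

print(grad_norm(peaked, 2))  # ~1.41: rare token under a confident policy
print(grad_norm(flat, 2))    # ~0.82: same token index, high-entropy context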

Computer Science > Computation and Language

arXiv:2602.15620 (cs) [Submitted on 17 Feb 2026]

Title: STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

Authors: Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng, Jiang Wu, Yinuo Wang, Yang Guan, Kehua Sheng, Bo Zhang, Keqiang Li, Jingliang Duan, Shengbo Eben Li

Abstract: Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often experience late-stage performance collapse, leading to degraded reasoning quality and unstable training. We derive that the magnitude of token-wise policy gradients in RL is negatively correlated with token probability and local policy entropy. Building on this result, we prove that training instability is driven by a tiny fraction of tokens, approximately 0.01%, which we term spurious tokens. When such tokens appear in correct responses, they contribute little to the reasoning outcome but inherit the full sequence-level reward, leading to abnormally amplified gradient updates. Motivated by this observation, we propose Spurious-Token-Aware Policy Optimization (STAPO) for large-scale m...
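
The abstract's mechanism invites a concrete reading: if a handful of low-probability tokens in otherwise confident contexts dominate the gradient, one remedy is to mask their contribution so they stop inheriting the sequence-level reward. The PyTorch sketch below is a minimal illustration of that idea under stated assumptions: the detection rule (bottom ~0.01% token probability combined with below-median local entropy), the quantile threshold, and all function and tensor names are invented for this example; the paper's exact criterion and update rule are in the full text.

import torch

def spurious_token_masked_loss(logits, token_ids, advantages, pad_mask,
                               spurious_quantile=1e-4):
    # logits:     (batch, seq, vocab) current-policy logits
    # token_ids:  (batch, seq) sampled response tokens
    # advantages: (batch,) sequence-level advantage (e.g. reward minus baseline)
    # pad_mask:   (batch, seq) 1.0 for real tokens, 0.0 for padding
    log_probs = torch.log_softmax(logits, dim=-1)
    tok_logp = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)

    # Local policy entropy at each position: H = -sum_v p(v) log p(v)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)

    # Illustrative spurious-token rule (an assumption, not the paper's):
    # flag tokens whose probability sits in the bottom ~0.01% while the
    # local distribution is confident (below-median entropy), i.e. the
    # model strongly preferred something else at that position.
    tok_p = tok_logp.exp()
    low_prob = tok_p <= tok_p.detach().flatten().quantile(spurious_quantile)
    confident = entropy <= entropy.detach().median()
    spurious = (low_prob & confident).float()

    # "Silence" spurious tokens: zero their gradient contribution so they
    # no longer inherit the full sequence-level reward.
    keep = pad_mask * (1.0 - spurious)
    per_token = -(advantages.unsqueeze(-1) * tok_logp) * keep
    return per_token.sum() / keep.sum().clamp_min(1.0)

Hard masking makes the silencing explicit: a flagged token contributes exactly zero gradient, in contrast to the entropy-regularization and reweighting heuristics the abstract says existing methods lean on.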

Related Articles

Llms

Claude Max 20x usage hit 40% by Monday noon — how does Codex CLI compare?

I'm on Claude Max (the $100/mo plan) and noticed something that surprised me. By Monday noon I had already used 40% of the 20x monthly li...

Reddit - Artificial Intelligence · 1 min
Llms

How to use the new ChatGPT app integrations, including DoorDash, Spotify, Uber, and others | TechCrunch

Learn how to use Spotify, Canva, Figma, Expedia, and other apps directly in ChatGPT.

TechCrunch - AI · 10 min
Llms

Anthropic Restricts Claude Agent Access Amid AI Automation Boom in Crypto

AI Tools & Products · 7 min
Llms

Is cutting ‘please’ when talking to ChatGPT better for the planet? An expert explains

AI Tools & Products · 5 min

