[2602.15620] STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
Summary
The paper presents STAPO, a novel approach to stabilize reinforcement learning in large language models by silencing rare spurious tokens that cause training instability.
Why It Matters
This research addresses a critical issue in reinforcement learning for large language models, where a tiny fraction of spurious tokens can trigger late-stage performance collapse. By silencing these tokens, STAPO preserves training stability and reasoning quality, which is essential for reliably fine-tuning large models with RL.
Key Takeaways
- STAPO targets spurious tokens that negatively impact reinforcement learning stability.
- The method improves reasoning performance by an average of 7.13% across multiple benchmarks.
- Training instability is linked to a small fraction of tokens, which STAPO effectively silences.
- The research provides a new framework for refining large-scale language models.
- STAPO demonstrates superior entropy stability compared to existing methods.
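The abstract below does not spell out STAPO's exact silencing rule, but the core idea in the takeaways (excluding a small set of suspect tokens from the sequence-level update) can be sketched as follows. The `prob_floor` threshold and the probability-based criterion here are illustrative assumptions, not the paper's actual method:

```python
import numpy as np

def masked_sequence_loss(token_logprobs, reward, prob_floor=1e-3):
    """Illustrative sketch (not STAPO's exact rule): drop the loss
    contribution of tokens whose probability falls below a floor, so a
    rare token in a rewarded sequence cannot inherit the full
    sequence-level reward and produce an outsized gradient update."""
    probs = np.exp(token_logprobs)
    keep = probs >= prob_floor          # silence suspected spurious tokens
    # REINFORCE-style surrogate: reward times the sum of kept log-probs
    return -reward * token_logprobs[keep].sum()

# Third token is "rare" (probability 1e-5) yet sits in a correct,
# fully rewarded response.
logps = np.log(np.array([0.9, 0.8, 1e-5, 0.7]))
loss_all = -1.0 * logps.sum()                     # unmasked surrogate loss
loss_mask = masked_sequence_loss(logps, reward=1.0)
print(loss_all, loss_mask)
```

With the rare token silenced, the surrogate loss (and hence its gradient) is no longer dominated by a single low-probability token.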
Computer Science > Computation and Language
arXiv:2602.15620 (cs)
[Submitted on 17 Feb 2026]
Title: STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
Authors: Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng, Jiang Wu, Yinuo Wang, Yang Guan, Kehua Sheng, Bo Zhang, Keqiang Li, Jingliang Duan, Shengbo Eben Li
Abstract: Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often experience late-stage performance collapse, leading to degraded reasoning quality and unstable training. We derive that the magnitude of token-wise policy gradients in RL is negatively correlated with token probability and local policy entropy. Building on this result, we prove that training instability is driven by a tiny fraction of tokens, approximately 0.01%, which we term spurious tokens. When such tokens appear in correct responses, they contribute little to the reasoning outcome but inherit the full sequence-level reward, leading to abnormally amplified gradient updates. Motivated by this observation, we propose Spurious-Token-Aware Policy Optimization (STAPO) for large-scale m...
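The abstract's claim that token-wise policy-gradient magnitude is negatively correlated with token probability can be checked directly for the log-softmax case: the gradient of log π(token) with respect to the logits is one_hot(token) − softmax(logits), whose norm shrinks as the token's probability grows. A minimal numerical check (the 8-way vocabulary and random logits are arbitrary choices for illustration):

```python
import numpy as np

def logprob_grad_norm(logits, target):
    """Return (p_target, ||grad||) where grad is the gradient of
    log softmax(logits)[target] w.r.t. the logits, which equals
    one_hot(target) - softmax(logits)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    g = -p.copy()
    g[target] += 1.0                    # add the one-hot term
    return p[target], np.linalg.norm(g)

rng = np.random.default_rng(0)
logits = rng.normal(size=8)             # toy 8-token vocabulary

# Sort tokens by probability: the gradient norm falls as p rises.
probs_and_norms = sorted(logprob_grad_norm(logits, t) for t in range(8))
for p_t, g in probs_and_norms:
    print(f"p(token) = {p_t:.4f}   ||grad log pi|| = {g:.4f}")
```

Since ||grad||² = (1 − p_t)² + Σ_j p_j² − p_t² = 1 − 2·p_t + Σ_j p_j², the norm decreases strictly in p_t, so a rare (low-probability) token that inherits a full sequence reward receives a disproportionately large update, matching the abstract's account of spurious tokens.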