[2602.15620] STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
Summary
The paper presents STAPO, a novel approach to stabilize reinforcement learning in large language models by silencing rare spurious tokens that cause training instability.
Why It Matters
This research addresses a critical issue in reinforcement learning for large language models, where a tiny fraction of spurious tokens can trigger late-stage performance collapse. By silencing these tokens, STAPO preserves training stability and reasoning quality, which is essential for reliably fine-tuning large models with RL.
Key Takeaways
- STAPO targets spurious tokens that negatively impact reinforcement learning stability.
- The method improves reasoning performance by an average of 7.13% across multiple benchmarks.
- Training instability is linked to a small fraction of tokens, which STAPO effectively silences.
- The research provides a new framework for refining large-scale language models.
- STAPO demonstrates superior entropy stability compared to existing methods.
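The abstract below does not spell out STAPO's exact silencing rule, but the core idea in the takeaways (excluding a small set of suspect tokens from the sequence-level update) can be sketched as follows. The `prob_floor` threshold and the probability-based criterion here are illustrative assumptions, not the paper's actual method:

```python
import numpy as np

def masked_sequence_loss(token_logprobs, reward, prob_floor=1e-3):
    """Illustrative sketch (not STAPO's exact rule): drop the loss
    contribution of tokens whose probability falls below a floor, so a
    rare token in a rewarded sequence cannot inherit the full
    sequence-level reward and produce an outsized gradient update."""
    probs = np.exp(token_logprobs)
    keep = probs >= prob_floor          # silence suspected spurious tokens
    # REINFORCE-style surrogate: reward times the sum of kept log-probs
    return -reward * token_logprobs[keep].sum()

# Third token is "rare" (probability 1e-5) yet sits in a correct,
# fully rewarded response.
logps = np.log(np.array([0.9, 0.8, 1e-5, 0.7]))
loss_all = -1.0 * logps.sum()                     # unmasked surrogate loss
loss_mask = masked_sequence_loss(logps, reward=1.0)
print(loss_all, loss_mask)
```

With the rare token silenced, the surrogate loss (and hence its gradient) is no longer dominated by a single low-probability token.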
Computer Science > Computation and Language
arXiv:2602.15620 (cs)
[Submitted on 17 Feb 2026]
Title: STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
Authors: Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng, Jiang Wu, Yinuo Wang, Yang Guan, Kehua Sheng, Bo Zhang, Keqiang Li, Jingliang Duan, Shengbo Eben Li
Abstract: Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often experience late-stage performance collapse, leading to degraded reasoning quality and unstable training. We derive that the magnitude of token-wise policy gradients in RL is negatively correlated with token probability and local policy entropy. Building on this result, we prove that training instability is driven by a tiny fraction of tokens, approximately 0.01%, which we term spurious tokens. When such tokens appear in correct responses, they contribute little to the reasoning outcome but inherit the full sequence-level reward, leading to abnormally amplified gradient updates. Motivated by this observation, we propose Spurious-Token-Aware Policy Optimization (STAPO) for large-scale m...
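The abstract's claim that token-wise policy-gradient magnitude is negatively correlated with token probability can be checked directly for the log-softmax case: the gradient of log π(token) with respect to the logits is one_hot(token) − softmax(logits), whose norm shrinks as the token's probability grows. A minimal numerical check (the 8-way vocabulary and random logits are arbitrary choices for illustration):

```python
import numpy as np

def logprob_grad_norm(logits, target):
    """Return (p_target, ||grad||) where grad is the gradient of
    log softmax(logits)[target] w.r.t. the logits, which equals
    one_hot(target) - softmax(logits)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    g = -p.copy()
    g[target] += 1.0                    # add the one-hot term
    return p[target], np.linalg.norm(g)

rng = np.random.default_rng(0)
logits = rng.normal(size=8)             # toy 8-token vocabulary

# Sort tokens by probability: the gradient norm falls as p rises.
probs_and_norms = sorted(logprob_grad_norm(logits, t) for t in range(8))
for p_t, g in probs_and_norms:
    print(f"p(token) = {p_t:.4f}   ||grad log pi|| = {g:.4f}")
```

Since ||grad||² = (1 − p_t)² + Σ_j p_j² − p_t² = 1 − 2·p_t + Σ_j p_j², the norm decreases strictly in p_t, so a rare (low-probability) token that inherits a full sequence reward receives a disproportionately large update, matching the abstract's account of spurious tokens.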