[2602.17550] MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning
Summary
The paper presents MASPO (Mass-Adaptive Soft Policy Optimization), a framework that targets three weaknesses of existing Reinforcement Learning with Verifiable Rewards (RLVR) algorithms for Large Language Models (LLMs): inefficient gradient utilization, insensitivity to probability mass, and asymmetric signal reliability.
Why It Matters
RLVR algorithms such as GRPO rely on rigid, uniform, and symmetric trust region mechanisms, which the authors argue are misaligned with the optimization dynamics of LLMs. By treating these limitations in one unified framework, MASPO aims to make RLVR training for LLM reasoning more robust and sample-efficient.
Key Takeaways
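- MASPO addresses three challenges in RLVR: inefficient gradient utilization caused by the binary cutoff of hard clipping, insensitivity to probability mass caused by uniform ratio constraints, and asymmetric signal reliability between positive and negative samples.
- The framework uses a differentiable soft Gaussian gating mechanism to maximize gradient utility, together with a mass-adaptive limiter.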
- Extensive evaluations show that MASPO significantly outperforms existing RLVR baselines.
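
The contrast between hard clipping and a soft Gaussian gate can be illustrated with a small sketch. This is not the paper's formulation: the gate's functional form, the `sigma` parameter, and both function names are assumptions chosen for illustration, and the hard-clipping mask follows the standard PPO-style clipped surrogate that GRPO-like methods commonly use.

```python
import math


def hard_clip_gradient_mask(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Return 1.0 if the PPO-style clipped surrogate
    min(r * A, clip(r, 1 - eps, 1 + eps) * A) passes gradient through the
    ratio term, and 0.0 if clipping zeroes it out (the binary cutoff)."""
    if advantage > 0 and ratio > 1.0 + eps:
        return 0.0  # positive sample already pushed past the upper bound
    if advantage < 0 and ratio < 1.0 - eps:
        return 0.0  # negative sample already pushed past the lower bound
    return 1.0


def gaussian_gate_weight(ratio: float, sigma: float = 0.2) -> float:
    """Hypothetical soft gate (an assumption, not the paper's definition):
    a Gaussian centered at ratio = 1 that decays smoothly instead of
    cutting the gradient off at a fixed threshold."""
    return math.exp(-((ratio - 1.0) ** 2) / (2.0 * sigma ** 2))


if __name__ == "__main__":
    # Compare the two weightings for a positive-advantage token.
    for r in (1.0, 1.1, 1.3, 1.5):
        print(r, hard_clip_gradient_mask(r, advantage=1.0), gaussian_gate_weight(r))
```

Under these assumptions, a token with ratio 1.3 and positive advantage contributes no gradient under hard clipping, while the soft gate still assigns it a small nonzero weight that shrinks smoothly as the ratio moves further from 1.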
Computer Science > Machine Learning
arXiv:2602.17550 (cs)
[Submitted on 19 Feb 2026]
Title: MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning
Authors: Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Binbin Zheng, Chaowen Hu, Zekai Shao, Cong Qin, Lu Pan, Ke Zeng, Xunliang Cai
Abstract: Existing Reinforcement Learning with Verifiable Rewards (RLVR) algorithms, such as GRPO, rely on rigid, uniform, and symmetric trust region mechanisms that are fundamentally misaligned with the complex optimization dynamics of Large Language Models (LLMs). In this paper, we identify three critical challenges in these methods: (1) inefficient gradient utilization caused by the binary cutoff of hard clipping, (2) insensitive probability mass arising from uniform ratio constraints that ignore the token distribution, and (3) asymmetric signal reliability stemming from the disparate credit assignment ambiguity between positive and negative samples. To bridge these gaps, we propose Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework designed to harmonize these three dimensions. MASPO integrates a differentiable soft Gaussian gating to maximize gradient utility, a mass-adaptive limiter to balance explora...