[2602.17550] MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning

arXiv - Machine Learning

Summary

The paper presents MASPO (Mass-Adaptive Soft Policy Optimization), a framework that addresses inefficiencies in existing Reinforcement Learning with Verifiable Rewards (RLVR) algorithms for Large Language Models (LLMs) along three axes: gradient utilization, sensitivity to token probability mass, and the reliability of reward signals.

Why It Matters

This research tackles fundamental limitations in current RLVR methods, which are central to improving the reasoning performance and sample efficiency of LLMs. By unifying fixes for these limitations in a single framework, MASPO could benefit AI applications that depend on robust reasoning capabilities.

Key Takeaways

  • MASPO targets three failure modes of rigid trust-region mechanisms in RLVR: inefficient gradient utilization from hard clipping, insensitivity to token probability mass from uniform ratio constraints, and asymmetric signal reliability between positive and negative samples.
  • The framework replaces hard clipping with a differentiable soft Gaussian gating mechanism to maximize gradient utility.
  • In the authors' evaluations, MASPO significantly outperforms existing RLVR baselines.

Computer Science > Machine Learning
arXiv:2602.17550 (cs) · Submitted on 19 Feb 2026

Title: MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning

Authors: Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Binbin Zheng, Chaowen Hu, Zekai Shao, Cong Qin, Lu Pan, Ke Zeng, Xunliang Cai

Abstract: Existing Reinforcement Learning with Verifiable Rewards (RLVR) algorithms, such as GRPO, rely on rigid, uniform, and symmetric trust region mechanisms that are fundamentally misaligned with the complex optimization dynamics of Large Language Models (LLMs). In this paper, we identify three critical challenges in these methods: (1) inefficient gradient utilization caused by the binary cutoff of hard clipping, (2) insensitive probability mass arising from uniform ratio constraints that ignore the token distribution, and (3) asymmetric signal reliability stemming from the disparate credit assignment ambiguity between positive and negative samples. To bridge these gaps, we propose Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework designed to harmonize these three dimensions. MASPO integrates a differentiable soft Gaussian gating to maximize gradient utility, a mass-adaptive limiter to balance explora...
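The abstract's first challenge, the binary cutoff of hard clipping, can be made concrete with a small sketch. The paper's exact gating formula is not given in this excerpt, so the following is an illustrative assumption: it contrasts a PPO/GRPO-style hard clip, which zeroes the gradient weight once the importance ratio leaves a trust window, with a hypothetical Gaussian gate that decays the weight smoothly and stays differentiable. The function names and the `eps`/`sigma` values are invented for illustration, not taken from MASPO.

```python
import math

def hard_clip_weight(r: float, eps: float = 0.2) -> float:
    """PPO/GRPO-style hard clipping: the gradient weight drops to zero
    as soon as the importance ratio r leaves [1 - eps, 1 + eps]."""
    return 1.0 if (1.0 - eps) <= r <= (1.0 + eps) else 0.0

def soft_gaussian_gate(r: float, sigma: float = 0.2) -> float:
    """Hypothetical soft Gaussian gate (an assumption, not MASPO's exact
    formula): the weight decays smoothly as r drifts from 1, so tokens
    outside the trust window still contribute a damped gradient."""
    return math.exp(-((r - 1.0) ** 2) / (2.0 * sigma ** 2))

# Compare the two weighting schemes across a range of importance ratios.
for r in (1.0, 1.1, 1.3, 2.0):
    print(f"r={r:.1f}  hard={hard_clip_weight(r):.2f}  soft={soft_gaussian_gate(r):.3f}")
```

Under this sketch, a token with ratio 1.3 gets zero gradient under hard clipping but a reduced, nonzero weight under the soft gate, which is the intuition behind "maximizing gradient utility" described in the abstract.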

Related Articles

[2604.01676] GPA: Learning GUI Process Automation from Demonstrations (arXiv - AI · 3 min)

[2604.01413] Adaptive Stopping for Multi-Turn LLM Reasoning (arXiv - AI · 4 min)

[2603.11749] Truth as a Compression Artifact in Language Model Training (arXiv - AI · 4 min)

[2603.10047] Toward Epistemic Stability: Engineering Consistent Procedures for Industrial LLM Hallucination Reduction (arXiv - AI · 4 min)

