[2602.22556] Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
Summary
The paper presents a two-stage framework for stable adaptive thinking in large reasoning models (LRMs): Hybrid Fine-Tuning followed by adaptive reinforcement learning, aimed at curbing overthinking on low-complexity queries without sacrificing accuracy on harder ones.
Why It Matters
Overthinking inflates inference cost and latency on queries that need little reasoning, while naive length penalties tend to destabilize training or hurt accuracy. By letting LRMs adapt their reasoning depth to query difficulty, this work improves both efficiency and accuracy, which matters for natural language processing and decision-making systems that deploy reasoning models at scale.
Key Takeaways
- Introduces a two-stage framework for stable adaptive thinking in LRMs.
- Uses Hybrid Fine-Tuning to expose the model to both thinking and no-thinking behaviors, providing a well-conditioned initialization for reinforcement learning.
- Applies Correctness-Preserving Advantage Shaping (CPAS) so efficiency pressure does not suppress correct long-chain reasoning.
- Adds Length-Aware Gradient Regulation (LAGR) to stabilize optimization under severe reasoning-length heterogeneity.
- Achieves up to +3.7/+3.6 accuracy points on Qwen2.5-1.5B/7B while reducing generated tokens by 40.6%/43.9%.
- Confirms robustness across varying problem difficulties and tasks.
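The paper's exact CPAS rule is not reproduced here; the sketch below is a minimal illustration under assumptions. It computes a group-relative (GRPO-style) normalized advantage, subtracts a length penalty, and then clips so that a correct rollout with a positive advantage is never driven negative by that penalty. The function name, the `lam` weight, and the clipping rule are all hypothetical.

```python
import numpy as np

def cpas_advantages(rewards, lengths, lam=0.1):
    # rewards: 1.0 for correct rollouts, 0.0 for incorrect (one sampled group)
    # lengths: generated token counts per rollout
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)

    # Group-relative baseline (GRPO-style normalized advantage).
    adv = rewards - rewards.mean()
    if rewards.std() > 0:
        adv = adv / rewards.std()

    # Length penalty, normalized within the group (lam is an assumed weight).
    pen = lam * (lengths - lengths.mean()) / (lengths.std() + 1e-8)
    shaped = adv - pen

    # Correctness-preserving clip (assumed rule): a correct rollout with a
    # positive advantage keeps a non-negative shaped advantage, so long but
    # correct reasoning chains are never actively punished.
    return np.where((rewards == 1.0) & (adv > 0), np.maximum(shaped, 0.0), shaped)
```

With rewards `[1, 1, 0, 0]` and lengths `[100, 400, 50, 60]`, the shorter correct rollout receives the larger shaped advantage, and even an extreme penalty weight cannot flip a correct rollout's advantage below zero.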
Computer Science > Machine Learning — arXiv:2602.22556 (cs) [Submitted on 26 Feb 2026]
Title: Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
Authors: Zihang Xu, Haozhi Xie, Ziqi Miao, Wuxuan Gong, Chen Qian, Lijun Li
Abstract: Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries. Existing efforts to mitigate this issue are fundamentally limited by unstable accuracy-efficiency trade-offs and poor robustness to heterogeneous reasoning behaviors. To address these challenges, we propose a two-stage framework for stable adaptive thinking in LRMs. The framework first applies Hybrid Fine-Tuning to expose the model to both thinking and no-thinking behaviors, establishing well-conditioned initialization. It then performs adaptive reinforcement learning with Correctness-Preserving Advantage Shaping (CPAS) to avoid suppressing correct long-chain reasoning, and Length-Aware Gradient Regulation (LAGR) to stabilize optimization under severe reasoning-length heterogeneity. Extensive experiments on Qwen2.5-1.5B and 7B show consistent improvements over strong baselines, achieving up to +3.7/+3.6 accuracy points while reducing generated tokens by 40.6%/43.9%. Further analyses...
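The abstract does not spell out LAGR's mechanism. One common way to regulate gradients under reasoning-length heterogeneity, sketched below purely as an assumption, is to average each rollout's policy-gradient term over its own tokens rather than over all tokens in the batch, so a very long chain does not dominate the update. The function name and interface are hypothetical.

```python
import numpy as np

def length_regulated_pg_loss(token_logps, advantages):
    # token_logps: list of 1-D arrays, one array of token log-probs per rollout
    # advantages: one scalar advantage per rollout
    # Per-sequence mean (instead of a global per-token mean) gives every
    # rollout the same total gradient weight regardless of its length --
    # one plausible reading of "length-aware gradient regulation".
    per_seq = [a * np.asarray(lp).mean() for a, lp in zip(advantages, token_logps)]
    return -float(np.mean(per_seq))
```

Under this weighting, a 1000-token rollout and a 20-token rollout with the same mean token log-prob and the same advantage contribute equally to the loss.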