[2602.22556] Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
Summary
The paper presents a two-stage framework for stable adaptive thinking in large reasoning models (LRMs): Hybrid Fine-Tuning followed by adaptive reinforcement learning, aimed at curbing overthinking on low-complexity queries without sacrificing accuracy on harder ones.
Why It Matters
Overthinking inflates inference cost and latency on queries that need little reasoning, while naive length penalties tend to destabilize training or hurt accuracy. By letting LRMs adapt their reasoning depth to query difficulty, this work improves both efficiency and accuracy, which matters for natural language processing and decision-making systems that deploy reasoning models at scale.
Key Takeaways
- Introduces a two-stage framework for stable adaptive thinking in LRMs.
- Uses Hybrid Fine-Tuning to expose the model to both thinking and no-thinking behaviors, providing a well-conditioned initialization for reinforcement learning.
- Applies Correctness-Preserving Advantage Shaping (CPAS) so efficiency pressure does not suppress correct long-chain reasoning.
- Adds Length-Aware Gradient Regulation (LAGR) to stabilize optimization under severe reasoning-length heterogeneity.
- Achieves up to +3.7/+3.6 accuracy points on Qwen2.5-1.5B/7B while reducing generated tokens by 40.6%/43.9%.
- Confirms robustness across varying problem difficulties and tasks.
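The paper's exact CPAS rule is not reproduced here; the sketch below is a minimal illustration under assumptions. It computes a group-relative (GRPO-style) normalized advantage, subtracts a length penalty, and then clips so that a correct rollout with a positive advantage is never driven negative by that penalty. The function name, the `lam` weight, and the clipping rule are all hypothetical.

```python
import numpy as np

def cpas_advantages(rewards, lengths, lam=0.1):
    # rewards: 1.0 for correct rollouts, 0.0 for incorrect (one sampled group)
    # lengths: generated token counts per rollout
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)

    # Group-relative baseline (GRPO-style normalized advantage).
    adv = rewards - rewards.mean()
    if rewards.std() > 0:
        adv = adv / rewards.std()

    # Length penalty, normalized within the group (lam is an assumed weight).
    pen = lam * (lengths - lengths.mean()) / (lengths.std() + 1e-8)
    shaped = adv - pen

    # Correctness-preserving clip (assumed rule): a correct rollout with a
    # positive advantage keeps a non-negative shaped advantage, so long but
    # correct reasoning chains are never actively punished.
    return np.where((rewards == 1.0) & (adv > 0), np.maximum(shaped, 0.0), shaped)
```

With rewards `[1, 1, 0, 0]` and lengths `[100, 400, 50, 60]`, the shorter correct rollout receives the larger shaped advantage, and even an extreme penalty weight cannot flip a correct rollout's advantage below zero.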
Computer Science > Machine Learning — arXiv:2602.22556 (cs) [Submitted on 26 Feb 2026]
Title: Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
Authors: Zihang Xu, Haozhi Xie, Ziqi Miao, Wuxuan Gong, Chen Qian, Lijun Li
Abstract: Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries. Existing efforts to mitigate this issue are fundamentally limited by unstable accuracy-efficiency trade-offs and poor robustness to heterogeneous reasoning behaviors. To address these challenges, we propose a two-stage framework for stable adaptive thinking in LRMs. The framework first applies Hybrid Fine-Tuning to expose the model to both thinking and no-thinking behaviors, establishing well-conditioned initialization. It then performs adaptive reinforcement learning with Correctness-Preserving Advantage Shaping (CPAS) to avoid suppressing correct long-chain reasoning, and Length-Aware Gradient Regulation (LAGR) to stabilize optimization under severe reasoning-length heterogeneity. Extensive experiments on Qwen2.5-1.5B and 7B show consistent improvements over strong baselines, achieving up to +3.7/+3.6 accuracy points while reducing generated tokens by 40.6%/43.9%. Further analyses...
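The abstract does not spell out LAGR's mechanism. One common way to regulate gradients under reasoning-length heterogeneity, sketched below purely as an assumption, is to average each rollout's policy-gradient term over its own tokens rather than over all tokens in the batch, so a very long chain does not dominate the update. The function name and interface are hypothetical.

```python
import numpy as np

def length_regulated_pg_loss(token_logps, advantages):
    # token_logps: list of 1-D arrays, one array of token log-probs per rollout
    # advantages: one scalar advantage per rollout
    # Per-sequence mean (instead of a global per-token mean) gives every
    # rollout the same total gradient weight regardless of its length --
    # one plausible reading of "length-aware gradient regulation".
    per_seq = [a * np.asarray(lp).mean() for a, lp in zip(advantages, token_logps)]
    return -float(np.mean(per_seq))
```

Under this weighting, a 1000-token rollout and a 20-token rollout with the same mean token log-prob and the same advantage contribute equally to the loss.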