[2602.22556] Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation

arXiv - AI · 3 min read

Summary

The paper presents a two-stage framework for stable adaptive thinking in large reasoning models (LRMs): Hybrid Fine-Tuning first exposes the model to both thinking and no-thinking behaviors, and an adaptive reinforcement learning stage then curbs overthinking on low-complexity queries.

Why It Matters

This research tackles the accuracy-efficiency trade-off in reasoning models: extended reasoning traces boost accuracy but waste tokens on easy queries. By cutting generated tokens substantially while improving accuracy, the work has practical implications for deploying LRMs in natural language processing and decision-making systems where latency and cost matter.

Key Takeaways

  • Introduces a two-stage framework to improve reasoning in LRMs.
  • Utilizes Hybrid Fine-Tuning to balance thinking behaviors.
  • Implements Correctness-Preserving Advantage Shaping for enhanced accuracy.
  • Reduces generated tokens by 40.6% (1.5B) / 43.9% (7B) while improving accuracy by up to +3.7/+3.6 points.
  • Confirms robustness across varying problem difficulties and tasks.

Computer Science > Machine Learning
arXiv:2602.22556 (cs) [Submitted on 26 Feb 2026]

Title: Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
Authors: Zihang Xu, Haozhi Xie, Ziqi Miao, Wuxuan Gong, Chen Qian, Lijun Li

Abstract: Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries. Existing efforts to mitigate this issue are fundamentally limited by unstable accuracy-efficiency trade-offs and poor robustness to heterogeneous reasoning behaviors. To address these challenges, we propose a two-stage framework for stable adaptive thinking in LRMs. The framework first applies Hybrid Fine-Tuning to expose the model to both thinking and no-thinking behaviors, establishing well-conditioned initialization. It then performs adaptive reinforcement learning with Correctness-Preserving Advantage Shaping (CPAS) to avoid suppressing correct long-chain reasoning, and Length-Aware Gradient Regulation (LAGR) to stabilize optimization under severe reasoning-length heterogeneity. Extensive experiments on Qwen2.5-1.5B and 7B show consistent improvements over strong baselines, achieving up to +3.7/+3.6 accuracy points while reducing generated tokens by 40.6%/43.9%. Further analyses...
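The abstract names CPAS and LAGR but does not give their formulas. As an illustration only, here is a hypothetical Python sketch of how such mechanisms could be wired into a policy-gradient update: a length penalty on the advantage that is clipped so it never turns a correct response's positive advantage negative (CPAS-like), and per-sample weights that normalize for response length so very long traces do not dominate the batch gradient (LAGR-like). All function names, parameters, and formulas below are assumptions, not the paper's actual method.

```python
def shaped_advantage(advantage, is_correct, length, budget, penalty=0.5):
    """Hypothetical correctness-preserving advantage shaping (CPAS-like).

    Penalize responses that exceed a token budget, but never flip a
    correct response's positive advantage to negative, so correct
    long-chain reasoning is not actively suppressed.
    """
    # Relative overshoot beyond the token budget (0 if within budget).
    over = max(0.0, (length - budget) / budget)
    shaped = advantage - penalty * over
    if is_correct and advantage > 0:
        # Clip at zero: shaping may shrink the learning signal for a
        # correct long answer, but never push against it.
        shaped = max(shaped, 0.0)
    return shaped


def length_aware_weights(lengths):
    """Hypothetical length-aware gradient regulation (LAGR-like).

    Scale each sample's gradient contribution inversely to its response
    length, so a batch mixing short no-thinking answers with long
    reasoning traces is not dominated by the long ones.
    """
    mean_len = sum(lengths) / len(lengths)
    return [mean_len / l for l in lengths]
```

Under this sketch, a correct response within budget keeps its full advantage, a correct over-budget response is at worst neutralized (never penalized into negative territory), and only incorrect over-budget responses receive the full length penalty.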

Related Articles

Machine Learning

AI Has Flooded All the Weather Apps | WIRED

Weather forecasting has gotten a big boost from machine learning. How that translates into what users see can vary.

Wired - AI · 8 min ·
LLMs

What I learned about multi-agent coordination running 9 specialized Claude agents

I've been experimenting with multi-agent AI systems and ended up building something more ambitious than I originally planned: a fully ope...

Reddit - Artificial Intelligence · 1 min ·
Machine Learning

The AI Chip War is Just Getting Started

Everyone talks about AI models, but the real bottleneck might be hardware. According to a recent study by Roots Analysis: AI chip market ...

Reddit - Artificial Intelligence · 1 min ·
Machine Learning

Exclusive: Runway launches $10M fund, Builders program to support early stage AI startups | TechCrunch

Runway is launching a $10 million fund and startup program to back companies building with its AI video models, as it pushes toward inter...

TechCrunch - AI · 7 min ·

