[2602.13407] On-Policy Supervised Fine-Tuning for Efficient Reasoning

arXiv - AI · 4 min read

Summary

The paper introduces on-policy supervised fine-tuning (SFT), a training strategy for large reasoning models that simplifies RL-style optimization while improving efficiency and preserving accuracy.

Why It Matters

This research addresses the inefficiency of current reinforcement learning methods for training large reasoning models, proposing a simpler approach that maintains performance while substantially reducing computational cost. That makes advanced reasoning capabilities cheaper to train and more broadly accessible.

Key Takeaways

  • On-policy SFT simplifies the training process for large reasoning models.
  • The new method reduces computational costs by lowering GPU memory usage by 50%.
  • It accelerates convergence by 70% compared to traditional RL methods.
  • The approach maintains accuracy while reducing output length by up to 80%.
  • Simplifying the reward structure leads to more stable training outcomes.

Computer Science > Artificial Intelligence

arXiv:2602.13407 (cs) [Submitted on 13 Feb 2026]

Title: On-Policy Supervised Fine-Tuning for Efficient Reasoning

Authors: Anhao Zhao, Ziyang Chen, Junlong Tong, Yingqi Fan, Fanghua Ye, Shuhao Li, Yunpu Ma, Wenjie Li, Xiaoyu Shen

Abstract: Large reasoning models (LRMs) are commonly trained with reinforcement learning (RL) to explore long chain-of-thought reasoning, achieving strong performance at high computational cost. Recent methods add multi-reward objectives to jointly optimize correctness and brevity, but these complex extensions often destabilize training and yield suboptimal trade-offs. We revisit this objective and challenge the necessity of such complexity. Through principled analysis, we identify fundamental misalignments in this paradigm: KL regularization loses its intended role when correctness and length are directly verifiable, and group-wise normalization becomes ambiguous under multiple reward signals. By removing these two items and simplifying the reward to a truncation-based length penalty, we show that the optimization problem reduces to supervised fine-tuning on self-generated data filtered for both correctness and conciseness. We term this simplified training strategy on-policy SFT. Despite its simplicity, on-policy SFT consistently defines the accuracy-efficien...
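The reduction the abstract describes is concrete enough to sketch. Below is a minimal, illustrative Python loop for one on-policy SFT iteration as stated there: sample responses from the current model, discard any that are incorrect or exceed a length budget (the truncation-based length penalty), and run plain supervised fine-tuning on the survivors. This is a sketch under those assumptions, not the authors' implementation; the names generate, is_correct, and sft_update are hypothetical stand-ins for a sampling backend, an answer verifier, and a standard SFT gradient step.

# Illustrative sketch of one on-policy SFT iteration (not the paper's code).
# generate, is_correct, and sft_update are hypothetical stand-ins.
def on_policy_sft_step(model, prompts, generate, is_correct, sft_update,
                       num_samples=8, max_tokens=1024):
    """Sample from the current policy, filter rollouts for correctness AND
    conciseness, then fine-tune on the survivors with ordinary SFT."""
    kept = []
    for prompt in prompts:
        for response, n_tokens in generate(model, prompt, num_samples, max_tokens):
            # Truncation-based length penalty: a rollout that hits the token
            # budget is simply dropped, so no KL term or group-wise reward
            # normalization is needed downstream.
            if n_tokens >= max_tokens:
                continue
            if is_correct(prompt, response):
                kept.append((prompt, response))
    # Standard cross-entropy fine-tuning on self-generated, filtered data.
    return sft_update(model, kept)

Because the training data is regenerated from the current model at each iteration, the procedure stays on-policy even though the loss itself is ordinary supervised fine-tuning.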

Related Articles

UMKC Announces New Master of Science in Artificial Intelligence
AI Infrastructure

UMKC announces a new Master of Science in Artificial Intelligence program aimed at addressing workforce demand for AI expertise, set to l...

AI News - General · 4 min ·
Alabama A&M University chosen for Amazon Web Services AI training program
Machine Learning

Alabama A&M University has been selected as one of just five institutions nationwide to participate in Amazon Web Services' Machine Learn...

AI News - General · 2 min ·
Interpretable machine learning model advances analysis of complex genetic traits
Machine Learning

A new study published in Genome Research presents an interpretable artificial intelligence framework that improves both the accuracy and ...

AI News - General · 6 min ·
Sam Altman's Coworkers Say He Can Barely Code and Misunderstands Basic Machine Learning Concepts
Machine Learning

The OpenAI CEO reportedly confuses basic coding and machine learning terms, numerous insiders have admitted.

AI News - General · 2 min ·
