[2602.13407] On-Policy Supervised Fine-Tuning for Efficient Reasoning
Summary
The paper presents on-policy supervised fine-tuning (SFT), a training strategy for large reasoning models that replaces complex multi-reward reinforcement learning with SFT on self-generated data, simplifying optimization while improving efficiency and preserving accuracy.
Why It Matters
This research addresses the inefficiencies of current reinforcement learning methods in training large reasoning models by proposing a simplified approach that maintains performance while significantly reducing computational costs. This has implications for AI development, making advanced reasoning capabilities more accessible.
Key Takeaways
- On-policy SFT simplifies the training process for large reasoning models.
- The new method reduces computational costs by lowering GPU memory usage by 50%.
- It accelerates convergence rates by 70% compared to traditional RL methods.
- The approach maintains accuracy while reducing output length by up to 80%.
- Simplifying the reward structure leads to more stable training outcomes.
Paper Details
Computer Science > Artificial Intelligence, arXiv:2602.13407 (cs)
Submitted on 13 Feb 2026
Title: On-Policy Supervised Fine-Tuning for Efficient Reasoning
Authors: Anhao Zhao, Ziyang Chen, Junlong Tong, Yingqi Fan, Fanghua Ye, Shuhao Li, Yunpu Ma, Wenjie Li, Xiaoyu Shen
Abstract: Large reasoning models (LRMs) are commonly trained with reinforcement learning (RL) to explore long chain-of-thought reasoning, achieving strong performance at high computational cost. Recent methods add multi-reward objectives to jointly optimize correctness and brevity, but these complex extensions often destabilize training and yield suboptimal trade-offs. We revisit this objective and challenge the necessity of such complexity. Through principled analysis, we identify fundamental misalignments in this paradigm: KL regularization loses its intended role when correctness and length are directly verifiable, and group-wise normalization becomes ambiguous under multiple reward signals. By removing these two items and simplifying the reward to a truncation-based length penalty, we show that the optimization problem reduces to supervised fine-tuning on self-generated data filtered for both correctness and conciseness. We term this simplified training strategy on-policy SFT. Despite its simplicity, on-policy SFT consistently defines the accuracy-efficien...
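The abstract's core recipe, SFT on self-generated outputs filtered for correctness and conciseness, can be sketched as a simple data-construction loop. This is an illustrative outline, not the paper's implementation: the policy, verifier, and token budget below (`sample_completions`, `is_correct`, `LENGTH_BUDGET`) are hypothetical stand-ins for whatever model, answer checker, and truncation threshold a real setup would use.

```python
import random

# Maximum completion length retained, standing in for the paper's
# truncation-based length penalty (value chosen arbitrarily here).
LENGTH_BUDGET = 64


def sample_completions(prompt, k=8):
    """Toy stand-in for sampling k on-policy completions from the model.

    Returns (answer, n_tokens) pairs; a real policy would generate full
    chain-of-thought text and the answer would be parsed from it.
    """
    rng = random.Random(prompt)
    return [("42" if rng.random() < 0.5 else "13", rng.randint(10, 120))
            for _ in range(k)]


def is_correct(prompt, answer):
    """Toy verifier; a real one would check the parsed final answer."""
    return answer == "42"


def build_sft_dataset(prompts, k=8):
    """Keep only self-generated completions that are correct AND concise."""
    dataset = []
    for prompt in prompts:
        dataset.extend(
            (prompt, ans)
            for ans, n_tokens in sample_completions(prompt, k)
            if is_correct(prompt, ans) and n_tokens <= LENGTH_BUDGET
        )
    return dataset


data = build_sft_dataset(["What is 6 * 7?"])
# Every retained pair is both correct and within the length budget;
# standard supervised fine-tuning would then run on `data`.
assert all(ans == "42" for _, ans in data)
```

Because the filtered data comes from the current policy itself, each round of this loop plus an SFT step plays the role that the multi-reward RL update played, without KL regularization or group-wise reward normalization.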