[2602.15260] Fast and Effective On-policy Distillation from Reasoning Prefixes
Summary
This paper proposes a prefix-based variant of on-policy distillation (OPD): the distillation objective is applied only to the prefix of each student-sampled output, cutting training cost while preserving model performance.
Why It Matters
The research addresses the high computational costs associated with traditional OPD methods by introducing a prefix-based distillation strategy. This advancement could significantly lower resource requirements while maintaining performance, making it relevant for researchers and practitioners in machine learning and AI.
Key Takeaways
- On-policy distillation (OPD) can improve generalization over off-policy methods.
- Training signals are concentrated in the prefixes of outputs, allowing for effective early termination during distillation.
- The proposed prefix distillation method reduces training cost by a factor of 2x to 47x without sacrificing performance.
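The core idea in the takeaways above, supervising only the first K tokens of each student-sampled trajectory, can be sketched as a prefix-truncated token-level distillation loss. This is an illustrative reconstruction, not the authors' code: the function name, the toy per-position log-probability representation, and the choice of forward KL are all assumptions (the paper's exact objective may differ).

```python
import math


def prefix_opd_loss(student_logprobs, teacher_logprobs, prefix_len):
    """Token-level distillation loss restricted to the first `prefix_len`
    positions of a student-sampled trajectory.

    Each element of `student_logprobs` / `teacher_logprobs` is a dict
    mapping token -> log-probability at that position (a toy stand-in
    for model logits). Returns the mean KL(teacher || student) over the
    prefix; positions beyond `prefix_len` contribute no training signal.
    """
    total, count = 0.0, 0
    pairs = list(zip(student_logprobs, teacher_logprobs))[:prefix_len]
    for s_lp, t_lp in pairs:
        # Per-token KL divergence: sum_t p_teacher(t) * (log p_teacher(t) - log p_student(t))
        kl = sum(math.exp(t) * (t - s_lp[tok]) for tok, t in t_lp.items())
        total += kl
        count += 1
    return total / max(count, 1)
```

In a real training loop, the same truncation would also let sampling terminate early at `prefix_len`, which is where the cost savings come from: the student never has to generate the long tail of each response during distillation.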
Computer Science > Machine Learning
arXiv:2602.15260 (cs) [Submitted on 16 Feb 2026]
Title: Fast and Effective On-policy Distillation from Reasoning Prefixes
Authors: Dongxu Zhang, Zhichao Yang, Sepehr Janghorbani, Jun Han, Andrew Ressler II, Qian Qian, Gregory D. Lyng, Sanjit Singh Batra, Robert E. Tillman
Abstract: On-policy distillation (OPD), which samples trajectories from the student model and supervises them with a teacher at the token level, avoids relying solely on verifiable terminal rewards and can yield better generalization than off-policy distillation. However, OPD requires expensive on-the-fly sampling of the student policy during training, which substantially increases training cost, especially for long responses. Our initial analysis shows that, during OPD, training signals are often concentrated in the prefix of each output, and that even a short teacher-generated prefix can significantly help the student produce the correct answer. Motivated by these observations, we propose a simple yet effective modification of OPD: we apply the distillation objective only to prefixes of student-generated outputs and terminate each sampling early during distillation. Experiments on a suite of AI-for-Math and out-of-domain benchmarks show that on-policy prefix distillation matches the performance of full OPD while reducing training ...