[2602.12125] Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Summary
The paper presents Generalized On-Policy Distillation (G-OPD), a framework that extends standard on-policy distillation with a tunable reward-scaling factor and a flexible reference model, enabling student models to improve beyond the teacher's performance.
Why It Matters
This research addresses a limitation of traditional on-policy distillation, where the student's performance is effectively capped by the teacher, by introducing a framework in which student models can exceed teacher performance. It has implications for model training more broadly, including reinforcement learning and knowledge distillation pipelines.
Key Takeaways
- G-OPD extends on-policy distillation by introducing a flexible reference model and reward scaling.
- Reward extrapolation (ExOPD) can lead to better performance than standard OPD.
- Combining knowledge from multiple domain experts can enhance student model capabilities.
- The choice of reference model impacts the accuracy of the reward signal in distillation.
- The findings suggest new avenues for research in optimizing model training strategies.
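The reduction behind the first takeaway can be illustrated numerically. The toy sketch below is illustrative only (the names `student`, `teacher`, `ref` and the single-step setup are assumptions, not the paper's code): with a per-token reward of log(teacher/ref) and scaling factor `alpha`, the generalized loss KL(student ‖ ref) − alpha · E_student[log teacher − log ref] collapses to the plain reverse KL against the teacher when `alpha = 1`, regardless of the reference model.

```python
import numpy as np

def kl(p, q):
    """Reverse KL divergence KL(p || q) between discrete distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def g_opd_loss(student, teacher, ref, alpha):
    """Toy generalized-OPD objective (illustrative, not the paper's code):
    KL(student || ref) minus alpha times the expected reward
    log(teacher / ref) under the student distribution."""
    reward = float(np.sum(student * (np.log(teacher) - np.log(ref))))
    return kl(student, ref) - alpha * reward

# Three arbitrary next-token distributions over a 4-symbol vocabulary.
student = np.array([0.40, 0.30, 0.20, 0.10])
teacher = np.array([0.25, 0.25, 0.25, 0.25])
ref     = np.array([0.10, 0.20, 0.30, 0.40])

# With alpha = 1 the reference model cancels out: the objective equals
# the standard OPD loss, i.e. reverse KL from student to teacher.
assert np.isclose(g_opd_loss(student, teacher, ref, alpha=1.0),
                  kl(student, teacher))

# With alpha > 1 (reward extrapolation, "ExOPD") the reward term is
# up-weighted relative to the KL regularizer, so the loss differs.
print(g_opd_loss(student, teacher, ref, alpha=2.0))
```

Note how the reference model drops out entirely only at `alpha = 1`; for any other scaling, the choice of reference shapes the effective reward, which is why the takeaways flag it as impacting the reward signal.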
Computer Science > Machine Learning
arXiv:2602.12125 (cs)
[Submitted on 12 Feb 2026 (v1), last revised 26 Feb 2026 (this version, v2)]
Title: Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Authors: Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, Yankai Lin
Abstract: On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation and reinforcement learning (RL) paradigms. In this work, we first theoretically show that OPD is a special case of dense KL-constrained RL where the reward function and the KL regularization are always weighted equally and the reference model can be any model. Then, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, c...
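The equivalence the abstract asserts can be sketched in one line (notation here is mine, not the paper's: \(\pi_\theta\) the student, \(\pi_t\) the teacher, \(\pi_{\mathrm{ref}}\) the reference, \(\alpha\) the reward scaling factor):

```latex
\mathcal{L}_{\text{G-OPD}}(\pi_\theta)
  = \mathrm{KL}\!\left(\pi_\theta \,\middle\|\, \pi_{\mathrm{ref}}\right)
  - \alpha\,\mathbb{E}_{y\sim\pi_\theta}\!\left[
      \log \frac{\pi_t(y)}{\pi_{\mathrm{ref}}(y)}\right].
% For alpha = 1 the reference terms cancel:
% E[log(pi_theta/pi_ref)] - E[log(pi_t/pi_ref)]
%   = E[log(pi_theta/pi_t)] = KL(pi_theta || pi_t),
% i.e. standard OPD, for ANY choice of reference model.
```

Under this reading, \(\alpha > 1\) (ExOPD) extrapolates the reward \(\log(\pi_t/\pi_{\mathrm{ref}})\) beyond the teacher, which is consistent with the abstract's claim that the student can surpass the teacher.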