[2602.12125] Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Summary
The paper presents Generalized On-Policy Distillation (G-OPD), a framework that extends standard on-policy distillation with a tunable reward-scaling factor and a flexible reference model, enabling student models to improve beyond the teacher's performance.
Why It Matters
This research addresses a limitation of traditional on-policy distillation, where the student's performance is effectively capped by the teacher, by introducing a framework in which student models can exceed teacher performance. It has implications for model training more broadly, including reinforcement learning and knowledge distillation pipelines.
Key Takeaways
- G-OPD extends on-policy distillation by introducing a flexible reference model and reward scaling.
- Reward extrapolation (ExOPD) can lead to better performance than standard OPD.
- Combining knowledge from multiple domain experts can enhance student model capabilities.
- The choice of reference model impacts the accuracy of the reward signal in distillation.
- The findings suggest new avenues for research in optimizing model training strategies.
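The reduction behind the first takeaway can be illustrated numerically. The toy sketch below is illustrative only (the names `student`, `teacher`, `ref` and the single-step setup are assumptions, not the paper's code): with a per-token reward of log(teacher/ref) and scaling factor `alpha`, the generalized loss KL(student ‖ ref) − alpha · E_student[log teacher − log ref] collapses to the plain reverse KL against the teacher when `alpha = 1`, regardless of the reference model.

```python
import numpy as np

def kl(p, q):
    """Reverse KL divergence KL(p || q) between discrete distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def g_opd_loss(student, teacher, ref, alpha):
    """Toy generalized-OPD objective (illustrative, not the paper's code):
    KL(student || ref) minus alpha times the expected reward
    log(teacher / ref) under the student distribution."""
    reward = float(np.sum(student * (np.log(teacher) - np.log(ref))))
    return kl(student, ref) - alpha * reward

# Three arbitrary next-token distributions over a 4-symbol vocabulary.
student = np.array([0.40, 0.30, 0.20, 0.10])
teacher = np.array([0.25, 0.25, 0.25, 0.25])
ref     = np.array([0.10, 0.20, 0.30, 0.40])

# With alpha = 1 the reference model cancels out: the objective equals
# the standard OPD loss, i.e. reverse KL from student to teacher.
assert np.isclose(g_opd_loss(student, teacher, ref, alpha=1.0),
                  kl(student, teacher))

# With alpha > 1 (reward extrapolation, "ExOPD") the reward term is
# up-weighted relative to the KL regularizer, so the loss differs.
print(g_opd_loss(student, teacher, ref, alpha=2.0))
```

Note how the reference model drops out entirely only at `alpha = 1`; for any other scaling, the choice of reference shapes the effective reward, which is why the takeaways flag it as impacting the reward signal.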
Computer Science > Machine Learning
arXiv:2602.12125 (cs)
[Submitted on 12 Feb 2026 (v1), last revised 26 Feb 2026 (this version, v2)]
Title: Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Authors: Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, Yankai Lin
Abstract: On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation and reinforcement learning (RL) paradigms. In this work, we first theoretically show that OPD is a special case of dense KL-constrained RL where the reward function and the KL regularization are always weighted equally and the reference model can be any model. Then, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, c...
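The equivalence the abstract asserts can be sketched in one line (notation here is mine, not the paper's: \(\pi_\theta\) the student, \(\pi_t\) the teacher, \(\pi_{\mathrm{ref}}\) the reference, \(\alpha\) the reward scaling factor):

```latex
\mathcal{L}_{\text{G-OPD}}(\pi_\theta)
  = \mathrm{KL}\!\left(\pi_\theta \,\middle\|\, \pi_{\mathrm{ref}}\right)
  - \alpha\,\mathbb{E}_{y\sim\pi_\theta}\!\left[
      \log \frac{\pi_t(y)}{\pi_{\mathrm{ref}}(y)}\right].
% For alpha = 1 the reference terms cancel:
% E[log(pi_theta/pi_ref)] - E[log(pi_t/pi_ref)]
%   = E[log(pi_theta/pi_t)] = KL(pi_theta || pi_t),
% i.e. standard OPD, for ANY choice of reference model.
```

Under this reading, \(\alpha > 1\) (ExOPD) extrapolates the reward \(\log(\pi_t/\pi_{\mathrm{ref}})\) beyond the teacher, which is consistent with the abstract's claim that the student can surpass the teacher.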