[2602.12125] Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

arXiv - Machine Learning 4 min read Article

Summary

The paper presents Generalized On-Policy Distillation (G-OPD), a framework that improves student model performance by scaling the reward term and allowing a flexible reference model in on-policy distillation.

Why It Matters

This research addresses a limitation of traditional on-policy distillation: the student's performance is effectively capped by the teacher's. By reweighting the reward term against the KL regularization, the proposed framework lets student models exceed teacher performance in some settings, with implications for reinforcement learning and model training more broadly.

Key Takeaways

  • G-OPD extends on-policy distillation by introducing a flexible reference model and reward scaling.
  • Reward extrapolation (ExOPD) can lead to better performance than standard OPD.
  • Combining knowledge from multiple domain experts can enhance student model capabilities.
  • The choice of reference model impacts the accuracy of the reward signal in distillation.
  • The findings suggest new avenues for research in optimizing model training strategies.

Computer Science > Machine Learning
arXiv:2602.12125 (cs) [Submitted on 12 Feb 2026 (v1), last revised 26 Feb 2026 (this version, v2)]
Title: Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Authors: Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, Yankai Lin
Abstract: On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation and reinforcement learning (RL) paradigms. In this work, we first theoretically show that OPD is a special case of dense KL-constrained RL in which the reward function and the KL regularization are always weighted equally and the reference model can be any model. Then, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, c...
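The abstract frames OPD as KL-constrained RL with a reward and a KL penalty weighted equally, and G-OPD as the generalization that scales the reward term and frees the choice of reference model. The sketch below illustrates that framing on a single student-sampled token, assuming a per-token objective of the form α·(log π_teacher − log π_ref) minus a sampled KL penalty (log π_student − log π_ref). The function name and the specific algebraic form are illustrative assumptions, not the paper's exact objective.

```python
def g_opd_token_loss(logp_student, logp_teacher, logp_ref, alpha=1.0):
    """Illustrative per-token G-OPD loss on a student-sampled token.

    reward  = alpha * (logp_teacher - logp_ref)  # scaled reward term
    penalty = logp_student - logp_ref            # sample estimate of KL(student || ref)
    The loss is the negated regularized reward; alpha > 1 is
    "reward extrapolation" (ExOPD) in the paper's terminology.
    """
    reward = alpha * (logp_teacher - logp_ref)
    penalty = logp_student - logp_ref
    return -(reward - penalty)

# With alpha = 1 the reference-model terms cancel, leaving
# logp_student - logp_teacher: the per-sample estimator of the reverse
# KL used by standard OPD, so the reference model drops out entirely.
standard = g_opd_token_loss(-1.2, -0.8, -2.0, alpha=1.0)       # -0.4
extrapolated = g_opd_token_loss(-1.2, -0.8, -2.0, alpha=2.0)   # -1.6
```

Note how the cancellation at α = 1 matches the paper's claim that in standard OPD "the reference model can be any model": only when α ≠ 1 does the choice of reference actually shape the reward signal.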

