[2604.00626] A Survey of On-Policy Distillation for Large Language Models
Computer Science > Machine Learning

arXiv:2604.00626 (cs) [Submitted on 1 Apr 2026]

Title: A Survey of On-Policy Distillation for Large Language Models
Authors: Mingyang Song, Mao Zheng

Abstract: Knowledge distillation has become a primary mechanism for transferring reasoning and domain expertise from frontier Large Language Models (LLMs) to smaller, deployable students. However, the dominant paradigm remains \textit{off-policy}: students train on static teacher-generated data and never encounter their own errors during learning. This train--test mismatch, an instance of \textit{exposure bias}, causes prediction errors to compound autoregressively at inference time. On-Policy Distillation (OPD) addresses this by letting the student generate its own trajectories and receive teacher feedback on these self-generated outputs, grounding distillation in the theory of interactive imitation learning. Despite rapid growth spanning divergence minimization, reward-guided learning, and self-play, the OPD literature remains fragmented with no unified treatment. This survey provides the first comprehensive overview of OPD for LLMs. We introduce a unified $f$-divergence framework over on-policy samples and organize the landscape along three orthogonal dimensions: \emph{feedback signal} (logit-based, outcome-based, or self-play), \emph...
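The core loop the abstract describes, sampling trajectories from the student and scoring each self-generated prefix against the teacher, can be sketched in a few lines. The following is a minimal illustration, not code from the paper: it uses reverse KL, one member of the $f$-divergence family the survey covers, and toy context-independent logit functions; all names (`on_policy_distill_loss`, `student_logits`, etc.) are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reverse_kl(p_student, p_teacher):
    # D_KL(student || teacher): mode-seeking; penalizes student mass
    # placed where the teacher assigns low probability.
    return sum(p * math.log(p / q) for p, q in zip(p_student, p_teacher) if p > 0)

def on_policy_distill_loss(student_logits, teacher_logits, seq_len, vocab, rng):
    # On-policy: the student generates its own trajectory, and the
    # teacher provides per-step feedback on those self-generated prefixes,
    # so the student is trained on the states it actually visits.
    ctx, total = [], 0.0
    for _ in range(seq_len):
        p_s = softmax(student_logits(ctx))
        p_t = softmax(teacher_logits(ctx))
        total += reverse_kl(p_s, p_t)
        ctx.append(rng.choices(vocab, weights=p_s)[0])  # sample from the STUDENT
    return total / seq_len

# Toy demo: identical models give zero divergence; mismatched models do not.
vocab = [0, 1, 2]
teacher = lambda ctx: [1.0, 0.5, 0.0]
matched = on_policy_distill_loss(teacher, teacher, 8, vocab, random.Random(0))
student = lambda ctx: [0.0, 0.0, 2.0]
mismatched = on_policy_distill_loss(student, teacher, 8, vocab, random.Random(0))
```

Contrast with off-policy distillation, where `ctx` would instead be extended with tokens sampled from the teacher's own outputs, so the student never sees the states its own errors lead to.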