[2603.25562] Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes
Computer Science > Machine Learning
arXiv:2603.25562 (cs)
[Submitted on 26 Mar 2026]

Title: Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes
Authors: Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, Dongbin Zhao

Abstract: On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL, but it has a much tighter worst-case variance bound; our toy study shows the same tradeoff empirically, with stronger future-reward coupling producing higher gradient variance and less stable learning. Empirically, we identify three failure modes of sampled-token OPD: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions caused by tokenizer or special-token mismatch. We address these issues with teacher top-K local support matching, implemented as trunc...
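The abstract is cut off before the implementation details, so the paper's exact formulation of teacher top-K local support matching is not shown here. As a rough illustration only, a per-position reverse KL restricted to the teacher's top-K token support might look like the following sketch (the renormalization over the top-K support and the function names are assumptions, not the authors' method):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def topk_reverse_kl(student_logits, teacher_logits, k=8):
    """Reverse KL(student || teacher) over the teacher's top-K token support.

    Hypothetical sketch: both distributions are renormalized on the
    teacher's top-K tokens before computing sum_i q_i * log(q_i / p_i),
    where q is the student distribution and p the teacher's.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    # Indices of the teacher's K most probable tokens (the "local support").
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    zp = sum(p[i] for i in top)  # teacher mass on the support
    zq = sum(q[i] for i in top)  # student mass on the support
    return sum((q[i] / zq) * math.log((q[i] / zq) / (p[i] / zp)) for i in top)
```

Unlike the sampled-token variant criticized in the abstract, a loss of this shape uses K teacher probabilities per position rather than a single sampled token, which is the kind of denser local signal the paper's fix appears to target.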