[2603.11178] PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence
Computer Science > Artificial Intelligence
arXiv:2603.11178 (cs)
[Submitted on 11 Mar 2026 (v1), last revised 9 Apr 2026 (this version, v3)]

Title: PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence
Authors: Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang

Abstract: Standard LLM distillation treats all training problems equally -- wasting compute on problems the student has already mastered or cannot yet solve. We empirically show that this inefficiency has a precise gradient-level signature: the cross-problem gradient signal-to-noise ratio (SNR) follows a bell curve over student pass rate, collapsing at both extremes. We propose PACED, which weights each problem by $w(p) = p(1{-}p)$, where $p$ is the student's empirical pass rate -- concentrating training on the zone of proximal development. This requires only student rollouts, no architectural changes, and no hyperparameters. We prove that the Beta kernel $w(p) = p^\alpha(1{-}p)^\beta$ is the leading-order optimal weight family arising from the SNR boundary-collapse structure, and is minimax-robust under misspecification (worst-case efficiency loss $O(\delta^2)$). Across the Qwen3, Qwen2.5, and Llama-3 families, PACED sets a new state of the art in our experimental setting on MATH-500, AIME 2024, and AIME...
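The weighting rule described in the abstract is simple enough to sketch directly. The snippet below is a minimal illustration, not the authors' released code: it estimates each problem's empirical pass rate $p$ from student rollouts and assigns the Beta-kernel weight $w(p) = p^\alpha(1{-}p)^\beta$, with $\alpha = \beta = 1$ recovering the default $w(p) = p(1{-}p)$. All function and variable names here are illustrative assumptions.

```python
from typing import Sequence


def pass_rate(rollout_correct: Sequence[bool]) -> float:
    """Empirical pass rate p: fraction of student rollouts that solve the problem."""
    return sum(rollout_correct) / len(rollout_correct)


def paced_weight(p: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Beta-kernel weight w(p) = p^alpha * (1 - p)^beta.

    With alpha = beta = 1 this is w(p) = p * (1 - p): zero for problems the
    student always solves (p = 1) or never solves (p = 0), and maximal at
    p = 0.5 -- the zone of proximal development the paper targets.
    """
    return (p ** alpha) * ((1.0 - p) ** beta)


# Hypothetical example: 8 student rollouts per problem; each problem's
# weight would then scale its contribution to the distillation loss.
rollouts = {
    "prob_a": [True] * 8,                      # mastered -> weight 0
    "prob_b": [True, False, True, False] * 2,  # p = 0.5 -> maximal weight
    "prob_c": [False] * 8,                     # unsolved -> weight 0
}
weights = {name: paced_weight(pass_rate(r)) for name, r in rollouts.items()}
print(weights)  # {'prob_a': 0.0, 'prob_b': 0.25, 'prob_c': 0.0}
```

Under these assumptions the weight vanishes at both pass-rate extremes, matching the SNR boundary collapse the abstract describes, and requires nothing beyond counting correct rollouts.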