[2601.18734] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
Computer Science > Machine Learning

arXiv:2601.18734 (cs)

[Submitted on 26 Jan 2026 (v1), last revised 5 Mar 2026 (this version, v2)]

Title: Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Authors: Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover

Abstract: Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified...
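The core mechanism the abstract describes, one set of weights playing both teacher and student, with the teacher differing only in that its context contains privileged information, can be sketched as a toy Python example. Everything here (the `HINT` marker, the three-token vocabulary, the logit shift, and the use of forward KL as the token-level loss) is an illustrative assumption, not the paper's actual implementation:

```python
import math

# Toy sketch of On-Policy Self-Distillation (OPSD): a single parameter set
# plays both roles; the "teacher" is the same policy conditioned on a
# privileged context (e.g. a verified solution). All details are assumptions.

VOCAB = ["a", "b", "c"]

def policy_logits(context, params):
    # Stand-in for an LLM forward pass: logits depend on the context string.
    # Privileged context shifts probability mass toward the "correct" token.
    base = [params[t] for t in VOCAB]
    if "HINT" in context:  # privileged information present in the context
        base = [x + (2.0 if t == "b" else 0.0) for x, t in zip(base, VOCAB)]
    return base

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def token_kl(p_teacher, p_student):
    # KL(teacher || student): a dense, token-level supervision signal,
    # available at every position of the student-sampled trajectory.
    return sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))

def opsd_loss(prompt, hint, params):
    # Student conditions on the bare prompt (its own on-policy context);
    # teacher is the SAME params conditioned on prompt + privileged hint.
    student = softmax(policy_logits(prompt, params))
    teacher = softmax(policy_logits(prompt + hint, params))
    return token_kl(teacher, student)

params = {"a": 0.0, "b": 0.0, "c": 0.0}
loss = opsd_loss("Q: 1+1? ", "HINT", params)
print(round(loss, 3))  # positive: the privileged context changes the policy
```

Minimizing this loss with respect to the student's context-free behavior would pull the unprivileged policy toward what the same model does when it can see the answer, which is the self-teaching intuition the abstract states.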