[2603.23871] HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation
Computer Science > Machine Learning
arXiv:2603.23871 (cs)
[Submitted on 25 Mar 2026]

Title: HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation
Authors: Ken Ding

Abstract: Large language models trained with reinforcement learning (RL) for mathematical reasoning face a fundamental challenge: on problems the model cannot solve at all ("cliff" prompts), the RL gradient vanishes entirely, preventing any learning signal from reaching these failure modes. We introduce Hybrid Distillation Policy Optimization (HDPO), which augments standard RL with privileged self-distillation targeting cliff prompts. On each training step, HDPO identifies prompts where all rollouts fail, generates privileged rollouts by providing the model with ground-truth information, filters for correct solutions, and distills the teacher's token-level distribution into the student. Because teacher and student share the same weights, differing only in their input, the realizability gap is provably bounded, unlike in cross-model distillation. We prove that R=1 filtered privileged generation recovers the optimal KL-regularized RL policy in the hard-threshold limit. Experiments on OpenMathInstruct-2 with Qwen2.5-Math-1.5B-Instruct show that HDPO consistently improves coverage metrics (pass@4 by +0.8-1.1%, pass@8 by +0.4-1.7%) while maintain...
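The per-step routing the abstract describes (standard RL where at least one rollout succeeds, privileged self-distillation on cliff prompts where all rollouts fail) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `rollout_fn`, `privileged_fn`, `rl_update`, and `distill_update` are hypothetical callables standing in for policy sampling, hint-conditioned sampling, the RL gradient step, and the token-level KL distillation step.

```python
def hdpo_step(prompts, rollout_fn, privileged_fn, rl_update, distill_update, k=4):
    """One HDPO-style training step (toy sketch, assumed interfaces).

    prompts: iterable of (prompt, ground_truth) pairs.
    rollout_fn(prompt) -> answer sampled from the current policy.
    privileged_fn(prompt, truth) -> answer sampled with a ground-truth hint
        (teacher shares the student's weights; only the input differs).
    """
    stats = {"rl": 0, "distill": 0, "skipped": 0}
    for prompt, truth in prompts:
        rollouts = [rollout_fn(prompt) for _ in range(k)]
        if any(a == truth for a in rollouts):
            # At least one success: the ordinary RL gradient is informative.
            rl_update(prompt, rollouts, truth)
            stats["rl"] += 1
        else:
            # "Cliff" prompt: every rollout failed, so the RL gradient vanishes.
            # Generate privileged rollouts and keep only correct ones.
            priv = [privileged_fn(prompt, truth) for _ in range(k)]
            correct = [a for a in priv if a == truth]
            if correct:
                # Distill the teacher's token-level distribution into the student.
                distill_update(prompt, correct)
                stats["distill"] += 1
            else:
                stats["skipped"] += 1
    return stats
```

In a real setup the correctness check would be an answer verifier rather than exact equality, and the two update functions would backpropagate through the same shared model.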