[2603.23871] HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation
Computer Science > Machine Learning
arXiv:2603.23871 (cs)
[Submitted on 25 Mar 2026]

Title: HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation
Authors: Ken Ding

Abstract: Large language models trained with reinforcement learning (RL) for mathematical reasoning face a fundamental challenge: on problems the model cannot solve at all ("cliff" prompts), the RL gradient vanishes entirely, preventing any learning signal from reaching these failure modes. We introduce Hybrid Distillation Policy Optimization (HDPO), which augments standard RL with privileged self-distillation targeting cliff prompts. On each training step, HDPO identifies prompts where all rollouts fail, generates privileged rollouts by providing the model with ground-truth information, filters for correct solutions, and distills the teacher's token-level distribution into the student. Because teacher and student share the same weights, differing only in their input, the realizability gap is provably bounded, unlike in cross-model distillation. We prove that R=1 filtered privileged generation recovers the optimal KL-regularized RL policy in the hard-threshold limit. Experiments on OpenMathInstruct-2 with Qwen2.5-Math-1.5B-Instruct show that HDPO consistently improves coverage metrics (pass@4 by +0.8-1.1%, pass@8 by +0.4-1.7%) while maintain...
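The per-step routing the abstract describes (standard RL where at least one rollout succeeds, privileged self-distillation on cliff prompts where all rollouts fail) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `rollout_fn`, `privileged_fn`, `rl_update`, and `distill_update` are hypothetical callables standing in for policy sampling, hint-conditioned sampling, the RL gradient step, and the token-level KL distillation step.

```python
def hdpo_step(prompts, rollout_fn, privileged_fn, rl_update, distill_update, k=4):
    """One HDPO-style training step (toy sketch, assumed interfaces).

    prompts: iterable of (prompt, ground_truth) pairs.
    rollout_fn(prompt) -> answer sampled from the current policy.
    privileged_fn(prompt, truth) -> answer sampled with a ground-truth hint
        (teacher shares the student's weights; only the input differs).
    """
    stats = {"rl": 0, "distill": 0, "skipped": 0}
    for prompt, truth in prompts:
        rollouts = [rollout_fn(prompt) for _ in range(k)]
        if any(a == truth for a in rollouts):
            # At least one success: the ordinary RL gradient is informative.
            rl_update(prompt, rollouts, truth)
            stats["rl"] += 1
        else:
            # "Cliff" prompt: every rollout failed, so the RL gradient vanishes.
            # Generate privileged rollouts and keep only correct ones.
            priv = [privileged_fn(prompt, truth) for _ in range(k)]
            correct = [a for a in priv if a == truth]
            if correct:
                # Distill the teacher's token-level distribution into the student.
                distill_update(prompt, correct)
                stats["distill"] += 1
            else:
                stats["skipped"] += 1
    return stats
```

In a real setup the correctness check would be an answer verifier rather than exact equality, and the two update functions would backpropagate through the same shared model.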