[2602.04942] Privileged Information Distillation for Language Models
Summary
This paper presents methods for distilling capabilities learned with privileged information into language models, focusing on improving performance in multi-turn agentic environments where the teacher's reasoning process is not directly observable.
Why It Matters
The study addresses a critical challenge in AI: how to leverage privileged information during training to enhance the capabilities of language models in real-world applications. The proposed methods, π-Distill and OPSD, offer innovative solutions that could significantly improve the effectiveness of reinforcement learning in complex tasks.
Key Takeaways
- Privileged information can enhance language model performance in challenging tasks.
- The π-Distill method effectively trains models using action-only privileged information.
- On-Policy Self-Distillation (OPSD) provides an alternative approach using reinforcement learning.
- Both methods outperform standard supervised fine-tuning and RL baselines.
- The research includes extensive analysis on factors enabling effective learning with privileged information.
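The joint teacher-student idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the same model is run twice, once conditioned on privileged information (the teacher pass) and once on the prompt alone (the student pass), and both passes imitate the observed action tokens. The names `pi_distill_loss`, `model`, and the `alpha` mixing weight are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def nll(logits, targets):
    # Mean negative log-likelihood of the target action tokens.
    logp = log_softmax(logits)
    return -np.mean(logp[np.arange(len(targets)), targets])

def pi_distill_loss(model, prompt, pi, actions, alpha=0.5):
    """Joint teacher-student objective on a single shared model (sketch)."""
    # Teacher pass: conditioned on the privileged information (PI).
    teacher_logits = model(prompt + pi)
    # Student pass: prompt only, matching inference-time conditions.
    student_logits = model(prompt)
    # Both passes imitate the same observed actions; alpha mixes the terms.
    return alpha * nll(teacher_logits, actions) + (1 - alpha) * nll(student_logits, actions)

# Toy usage: a stand-in "model" that returns uniform logits over 4 tokens.
toy_model = lambda tokens: np.zeros((2, 4))
loss = pi_distill_loss(toy_model, prompt=[1, 2], pi=[3], actions=[0, 1])
# Uniform logits give a per-token NLL of log(4), regardless of alpha.
```

The key point the sketch captures is that only action trajectories are needed as supervision: no teacher reasoning traces appear anywhere in the loss.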
Computer Science > Machine Learning
arXiv:2602.04942 (cs)
Submitted on 4 Feb 2026 (v1), last revised 16 Feb 2026 (this version, v3)
Title: Privileged Information Distillation for Language Models
Authors: Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, Massimo Caccia
Abstract: Training-time privileged information (PI) can enable language models to succeed on tasks they would otherwise fail, making it a powerful tool for reinforcement learning in hard, long-horizon settings. However, transferring capabilities learned with PI to policies that must act without it at inference time remains a fundamental challenge. We study this problem in the context of distilling frontier models for multi-turn agentic environments, which typically hide their internal reasoning and expose only action trajectories. This breaks standard distillation pipelines, since successful behavior is observable but the reasoning process is not. To address this, we introduce π-Distill, a joint teacher-student objective that trains a PI-conditioned teacher and an unconditioned student simultaneously using the same model. We also introduce On-Policy Self-Distillation (OPSD), an alternative approach that trains using Reinforcement Learning (RL) with a reverse-KL penalty between the student and the PI-co...
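The reverse-KL penalty mentioned for OPSD can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's training loop: it shapes a scalar task reward by subtracting KL(student || teacher) over next-token distributions, which penalizes the student for placing probability where the PI-conditioned teacher does not. The names `opsd_reward` and the `beta` coefficient are hypothetical.

```python
import numpy as np

def reverse_kl(student_probs, teacher_probs):
    # KL(student || teacher): mode-seeking direction, it heavily penalizes
    # student mass on tokens the PI-conditioned teacher considers unlikely.
    return float(np.sum(student_probs * np.log(student_probs / teacher_probs)))

def opsd_reward(task_reward, student_probs, teacher_probs, beta=0.1):
    """Shaped RL reward with a reverse-KL penalty toward the teacher (sketch)."""
    return task_reward - beta * reverse_kl(student_probs, teacher_probs)

# Toy usage over a 2-token vocabulary.
teacher = np.array([0.5, 0.5])
matched = opsd_reward(1.0, np.array([0.5, 0.5]), teacher)   # KL = 0, no penalty
drifted = opsd_reward(1.0, np.array([0.9, 0.1]), teacher)   # KL > 0, penalized
```

When the student matches the teacher the penalty vanishes and the agent keeps the full environment reward; as the student drifts, the penalty grows, pulling the on-policy updates back toward the teacher's behavior.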