[2602.12687] Trust the uncertain teacher: distilling dark knowledge via calibrated uncertainty
Summary
This paper introduces Calibrated Uncertainty Distillation (CUD), a novel approach to knowledge distillation that improves the transfer of the teacher's uncertainty information ('dark knowledge') to student models, improving accuracy and robustness in machine learning tasks.
Why It Matters
As machine learning models become increasingly complex, ensuring they can handle uncertainty is crucial for real-world applications. CUD addresses the limitations of traditional distillation methods, which often lead to overconfident predictions that can fail under distribution shifts. This framework not only improves model performance but also enhances reliability in ambiguous scenarios, making it a significant advancement in the field.
Key Takeaways
- CUD improves the transfer of 'dark knowledge' by emphasizing uncertainty.
- The framework helps students learn from calibrated targets rather than overconfident ones.
- CUD enhances model accuracy and robustness, especially in high-cardinality tasks.
- The approach balances confident predictions with structured uncertainty.
- Results show improved performance on diverse benchmarks, particularly with ambiguous inputs.
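The soft-target mechanism the takeaways refer to can be illustrated with standard temperature-based distillation, the baseline that CUD builds on (this is not the paper's own CUD objective, which is only partially described here). Softening an overconfident teacher's logits with a temperature exposes the inter-class structure that a sharp peak or a hard label hides; all names below are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher T spreads probability mass."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_kl(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradient magnitudes stay comparable across T."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# An overconfident teacher: at T=1 its distribution is a sharp peak
# that offers little beyond the hard label.
teacher = [9.0, 3.0, 2.5, -1.0]
sharp = softmax(teacher, temperature=1.0)
soft = softmax(teacher, temperature=4.0)  # dark knowledge becomes visible
```

At T=1 nearly all mass sits on the top class; at T=4 the relative ordering of the remaining classes (which ones the teacher considers plausible) carries through to the student's target.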
Abstract
Computer Science > Machine Learning, arXiv:2602.12687 (cs). Submitted on 13 Feb 2026. Authors: Jeonghyun Kim, SooKyung Kim, Richeng Xuan, Hyunsoo Cho.
The core of knowledge distillation lies in transferring the teacher's rich 'dark knowledge': subtle probabilistic patterns that reveal how classes are related and how uncertainty is distributed. While this idea is well established, teachers trained with conventional cross-entropy often fail to preserve such signals. Their distributions collapse into sharp, overconfident peaks that appear decisive but are in fact brittle, offering little beyond the hard label or subtly hindering representation-level transfer. This overconfidence is especially problematic in high-cardinality tasks, where the nuances among many plausible classes matter most for guiding a compact student. Moreover, such brittle targets reduce robustness under distribution shift, leaving students vulnerable to miscalibration in real-world conditions. To address this limitation, we revisit distillation from a distributional perspective and propose Calibrated Uncertainty Distillation (CUD), a framework designed to make dark knowledge more faithfully accessible. Instead of uncritically adopting ...
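The abstract's point about miscalibrated teachers can be made concrete with plain temperature scaling, a standard post-hoc calibration technique; the paper's actual CUD procedure is not specified in the text above, so this is only a sketch of the kind of calibration involved. The function names and the grid-search fitting strategy are illustrative choices, not from the paper.

```python
import math

def softmax(logits, T=1.0):
    """Softmax of logits divided by temperature T."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def nll(logits_batch, labels, T):
    """Average negative log-likelihood of held-out labels at temperature T."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        total -= math.log(softmax(logits, T)[y])
    return total / len(labels)

def fit_temperature(logits_batch, labels):
    """Pick the temperature minimizing held-out NLL by grid search
    (a simple stand-in for gradient-based fitting)."""
    grid = [0.5 + 0.25 * i for i in range(19)]  # 0.5 .. 5.0
    return min(grid, key=lambda T: nll(logits_batch, labels, T))

# An overconfident binary teacher: confidently right three times,
# confidently wrong once. Calibration prefers a softer T > 1.
held_out_logits = [[8.0, 0.0], [8.0, 0.0], [8.0, 0.0], [0.0, 8.0]]
held_out_labels = [0, 0, 0, 0]
T_star = fit_temperature(held_out_logits, held_out_labels)
```

A teacher calibrated this way hands the student targets whose confidence matches its actual accuracy, which is the property the abstract argues overconfident cross-entropy teachers lack.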