[2604.00223] Diversity-Aware Reverse Kullback-Leibler Divergence for Large Language Model Distillation
Computer Science > Machine Learning

arXiv:2604.00223 (cs)

[Submitted on 31 Mar 2026]

Title: Diversity-Aware Reverse Kullback-Leibler Divergence for Large Language Model Distillation

Authors: Hoang-Chau Luong, Dat Ba Tran, Lingwei Chen

Abstract: Reverse Kullback-Leibler (RKL) divergence has recently emerged as the preferred objective for large language model (LLM) distillation, consistently outperforming forward KL (FKL), particularly in regimes with large vocabularies and significant teacher-student capacity mismatch, where RKL focuses learning on dominant modes rather than enforcing dense alignment. However, RKL introduces a structural limitation that drives the student toward overconfident predictions. We first provide an analysis of RKL by decomposing its gradients into target and non-target components, and show that non-target gradients consistently push the target logit upward even when the student already matches the teacher, thereby reducing output diversity. In addition, RKL provides weak supervision over non-target classes, leading to poor tail alignment. To address these issues, we propose Diversity-aware RKL (DRKL), which removes this gradient effect and strengthens non-target supervision while preserving the optimization benefits of RKL. Extensive experiments across datasets a...
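As a concrete reference for the objective the abstract analyzes, the following is a minimal PyTorch sketch (not the authors' code) of the token-level RKL distillation loss, together with a numeric check of its per-logit gradient, which for a softmax student takes the closed form q_k * [(log q_k - log p_k) - RKL]. The function names, toy tensors, and this closed-form decomposition are illustrative assumptions used to make the gradient structure concrete; the paper's target/non-target split operates on exactly these per-logit terms, but DRKL itself is not reproduced here.

```python
# Minimal sketch of the RKL distillation objective, assuming a softmax
# student q and teacher p over the vocabulary. Not the authors' implementation.
import torch
import torch.nn.functional as F

def reverse_kl(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """RKL(q || p) = sum_v q(v) * (log q(v) - log p(v)), q = student, p = teacher."""
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    return (log_q.exp() * (log_q - log_p)).sum(dim=-1)

# Toy example: a single position over a 5-token vocabulary.
torch.manual_seed(0)
student = torch.randn(5, requires_grad=True)
teacher = torch.randn(5)

loss = reverse_kl(student, teacher)
loss.backward()

# Closed-form per-logit gradient (illustrative assumption):
# d RKL / d z_k = q_k * [(log q_k - log p_k) - RKL],
# i.e. each logit's gradient is its own log-ratio term offset by the
# distribution-wide RKL value, which autograd should reproduce.
with torch.no_grad():
    q = F.softmax(student, dim=-1)
    log_ratio = F.log_softmax(student, dim=-1) - F.log_softmax(teacher, dim=-1)
    closed_form = q * (log_ratio - loss)

print(torch.allclose(student.grad, closed_form, atol=1e-6))  # True
```

Under these assumptions, splitting the closed-form gradient at the target index versus the remaining vocabulary gives the target/non-target decomposition the abstract refers to; in batched training the same loss would be averaged over sequence positions before backpropagation.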