[2603.19266] Probing to Refine: Reinforcement Distillation of LLMs via Explanatory Inversion
Computer Science > Computation and Language
arXiv:2603.19266 (cs)
[Submitted on 26 Feb 2026]

Title: Probing to Refine: Reinforcement Distillation of LLMs via Explanatory Inversion
Authors: Zhen Tan, Chengshuai Zhao, Song Wang, Jundong Li, Tianlong Chen, Huan Liu

Abstract: Distilling robust reasoning capabilities from large language models (LLMs) into smaller, computationally efficient student models remains an unresolved challenge. Despite recent advances, distilled models frequently suffer from superficial pattern memorization and subpar generalization. To overcome these limitations, we introduce a novel distillation framework that moves beyond simple mimicry to instill deeper conceptual understanding. Our framework features two key innovations. First, to address pattern memorization, Explanatory Inversion (EI) generates targeted "explanatory probes" that compel the student to articulate the underlying logic behind an answer rather than merely memorizing it. Second, to improve generalization, Explanatory GRPO (EXGRPO) applies a reinforcement learning algorithm with a novel Dialogue Structure Utility Bonus, which explicitly rewards the student for maintaining a coherent reasoning process across these probes. Extensive evaluations on 12 datasets demonstrate significant improve...
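The abstract describes EXGRPO as a GRPO-style reinforcement learning objective augmented with a Dialogue Structure Utility Bonus. As a minimal sketch of that idea, the snippet below combines a task reward with a coherence bonus and normalizes within a sampled group, which is the core of group-relative advantage estimation in GRPO. The function name, the bonus values, and the weighting coefficient `beta` are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch: GRPO-style group-relative advantages with an added
# coherence bonus, loosely mirroring the "Dialogue Structure Utility Bonus"
# described in the abstract. Names and weights are assumptions.
from statistics import mean, pstdev

def grpo_advantages(task_rewards, structure_bonuses, beta=0.1):
    """Combine each response's task reward with a weighted coherence bonus,
    then normalize within the sampled group (mean-zero, unit-scale)."""
    combined = [r + beta * b for r, b in zip(task_rewards, structure_bonuses)]
    mu = mean(combined)
    sigma = pstdev(combined) or 1.0  # guard against a zero-variance group
    return [(c - mu) / sigma for c in combined]

# Example: four sampled student responses to one explanatory probe.
rewards = [1.0, 0.0, 1.0, 0.0]   # task correctness of each response
bonuses = [0.8, 0.2, 0.1, 0.9]   # hypothetical reasoning-coherence scores
advs = grpo_advantages(rewards, bonuses)
```

A response that is both correct and coherent across probes receives the largest advantage, so the policy update favors answers backed by a consistent reasoning process rather than answers that are merely correct.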