[2602.17686] Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO
Summary
This article presents a novel three-stage curriculum learning framework for distilling Chain-of-Thought reasoning from large language models into compact student models, improving accuracy and reducing verbosity.
Why It Matters
Large language models reason well but are expensive to run, so distilling their reasoning into smaller models matters for practical deployment. This research targets a specific obstacle in that process: keeping step-by-step reasoning interpretable while making the compact student model both more accurate and more concise.
Key Takeaways
- Introduces a three-stage curriculum learning framework for model distillation.
- Utilizes Structure-Aware Masking and Group Relative Policy Optimization (GRPO) to enhance learning.
- Enables Qwen2.5-3B-Base to achieve an 11.29% accuracy improvement while reducing output length by 27.4%.
- Addresses the challenge of compressing verbose teacher rationales into compact models.
- Demonstrates effectiveness through experiments on the GSM8K dataset.
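To make the GRPO takeaway concrete, here is a minimal sketch of the group-relative advantage computation that gives GRPO its name: several completions are sampled for the same prompt, and each reward is normalized against the mean and standard deviation of its own group. The reward function `brevity_aware_reward`, its `length_weight` and `max_tokens` parameters, and the sample data are assumptions made for illustration only. The source does not specify the paper's actual reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    This is the group-relative baseline used in GRPO: no learned value
    model, just the group mean and standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def brevity_aware_reward(is_correct, num_tokens, max_tokens=256, length_weight=0.2):
    """Hypothetical reward (an assumption, not the paper's formula).

    Gives 1.0 for a correct final answer and subtracts a small penalty
    that grows with the length of the completion.
    """
    length_penalty = length_weight * min(num_tokens, max_tokens) / max_tokens
    return (1.0 if is_correct else 0.0) - length_penalty


# Four sampled completions for one prompt: (answer correct?, token count).
# The values are made up for illustration.
group = [(True, 64), (True, 192), (False, 48), (False, 256)]
rewards = [brevity_aware_reward(correct, n) for correct, n in group]
advantages = group_relative_advantages(rewards)

for (correct, n), adv in zip(group, advantages):
    print(f"correct={correct!s:<5} tokens={n:<4} advantage={adv:+.3f}")
```

Under this assumed reward, a short correct completion receives a higher advantage than a long correct one, which is one simple way a policy could be pushed toward both accuracy and brevity.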
Computer Science > Machine Learning
arXiv:2602.17686 (cs)
[Submitted on 5 Feb 2026]
Authors: Bowen Yu, Maolin Wang, Sheng Zhang, Binhao Wang, Yi Wen, Jingtong Gao, Bowen Liu, Zimo Zhao, Wanyu Wang, Xiangyu Zhao
Abstract: Distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models presents a fundamental challenge: teacher rationales are often too verbose for smaller models to faithfully reproduce. Existing approaches compress reasoning into a single step, losing the interpretability that makes CoT valuable. We present a three-stage curriculum learning framework that addresses this capacity mismatch through progressive skill acquisition. First, we establish structural understanding via masked shuffled reconstruction. Second, we apply Group Relative Policy Optimization (GRPO) on masked completion tasks, enabling the model to discover its own balance between accuracy and brevity. Third, we identify persistent failure cases and guide the student to internalize teacher knowledge through targeted rewriting, again optimized with GRPO. Experiments on GSM8K demonstrate that our approach enables Qwen2.5-3B-Base to achieve an 11.29 percent accuracy...
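The first stage named in the abstract, masked shuffled reconstruction, can be pictured with a small data-construction sketch. The abstract gives no details of the procedure, so everything here is an assumption: the function name `make_masked_shuffled_example`, the `mask_ratio` and `mask_token` parameters, masking at the whitespace-token level, and the sample rationale are all hypothetical. The idea illustrated is only that the student sees reasoning steps that are both partially masked and out of order, and must recover the original rationale.

```python
import random


def make_masked_shuffled_example(steps, mask_ratio=0.3, mask_token="[MASK]", seed=0):
    """Build one training example from an ordered list of reasoning steps.

    Hypothetical construction: mask a fraction of whitespace tokens in each
    step, then shuffle the step order. The target is the original rationale.
    """
    rng = random.Random(seed)
    masked_steps = []
    for step in steps:
        tokens = step.split()
        # Mask at least one token per step, but never more than the step has.
        n_mask = min(len(tokens), max(1, round(mask_ratio * len(tokens))))
        masked_idx = set(rng.sample(range(len(tokens)), n_mask))
        masked_steps.append(
            " ".join(mask_token if i in masked_idx else tok for i, tok in enumerate(tokens))
        )

    # order[k] is the original index of the step shown at position k.
    order = list(range(len(steps)))
    rng.shuffle(order)
    return {
        "input": [masked_steps[i] for i in order],
        "target": list(steps),
        "order": order,
    }


# Made-up teacher rationale for a grade-school word problem.
rationale = [
    "Each box holds 12 pencils.",
    "There are 5 boxes, so 5 * 12 = 60 pencils.",
    "After giving away 15, 60 - 15 = 45 pencils remain.",
]
example = make_masked_shuffled_example(rationale, seed=1)
for line in example["input"]:
    print(line)
```

A task of this shape forces the student to attend to the logical dependencies between steps rather than copy the teacher's surface text, which is one plausible reading of "structural understanding" in the abstract.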