[2602.17686] Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO

Summary

The paper presents a three-stage curriculum learning framework for distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models. The reported outcome is higher accuracy with shorter outputs.

Why It Matters

As language models grow larger, distilling their reasoning into smaller models becomes important for practical deployment. This work targets the capacity mismatch between verbose teacher rationales and compact students, aiming to preserve interpretable multi-step reasoning while improving accuracy.

Key Takeaways

  • Introduces a three-stage curriculum learning framework for model distillation.
  • Utilizes Structure-Aware Masking and Group Relative Policy Optimization (GRPO) to enhance learning.
  • Achieves an 11.29% accuracy improvement while reducing output length by 27.4%.
  • Addresses the challenge of compressing verbose teacher rationales into compact models.
  • Demonstrates effectiveness through experiments on the GSM8K dataset.
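
To make the "group relative" part of GRPO concrete, the sketch below standardizes rewards within one group of sampled completions for the same prompt, so each sample is scored against its siblings rather than against a learned value function. The reward shown here (correctness minus a small length penalty) and the names `toy_reward` and `length_weight` are illustrative assumptions, not the reward used in the paper. Implementations also differ on details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev


def toy_reward(is_correct: bool, num_tokens: int, length_weight: float = 0.001) -> float:
    """Hypothetical reward: 1 for a correct answer, minus a small length penalty."""
    return (1.0 if is_correct else 0.0) - length_weight * num_tokens


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero-variance group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled completions: (answer correct?, completion length in tokens).
samples = [(True, 120), (True, 80), (False, 60), (False, 150)]
rewards = [toy_reward(correct, length) for correct, length in samples]
advantages = group_relative_advantages(rewards)

for (correct, length), adv in zip(samples, advantages):
    print(f"correct={correct!s:5} tokens={length:3d} advantage={adv:+.3f}")
```

Under this toy reward, the correct and shorter completion receives the largest advantage, which is one simple way a policy could be pushed toward both accuracy and brevity.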

Computer Science > Machine Learning
arXiv:2602.17686 (cs)
[Submitted on 5 Feb 2026]

Title: Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO

Authors: Bowen Yu, Maolin Wang, Sheng Zhang, Binhao Wang, Yi Wen, Jingtong Gao, Bowen Liu, Zimo Zhao, Wanyu Wang, Xiangyu Zhao

Abstract: Distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models presents a fundamental challenge: teacher rationales are often too verbose for smaller models to faithfully reproduce. Existing approaches compress reasoning into a single step, losing the interpretability that makes CoT valuable. We present a three-stage curriculum learning framework that addresses this capacity mismatch through progressive skill acquisition. First, we establish structural understanding via masked shuffled reconstruction. Second, we apply Group Relative Policy Optimization (GRPO) on masked completion tasks, enabling the model to discover its own balance between accuracy and brevity. Third, we identify persistent failure cases and guide the student to internalize teacher knowledge through targeted rewriting, again optimized with GRPO. Experiments on GSM8K demonstrate that our approach enables Qwen2.5-3B-Base to achieve an 11.29 percent accuracy...
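
The abstract does not spell out how the masked shuffled reconstruction task is built, so the following is only one plausible reading, sketched under stated assumptions: a rationale is split into steps, some numbers are replaced by a mask token, the steps are shuffled, and the student must reproduce the original ordered rationale. The function name `make_masked_shuffled_example`, the `[MASK]` token, the choice to mask only numbers, and `mask_prob` are all hypothetical.

```python
import random
import re

MASK = "[MASK]"  # hypothetical mask token


def make_masked_shuffled_example(steps: list[str], mask_prob: float = 0.5, seed: int = 0) -> dict:
    """Build one (input, target) pair: masked and shuffled steps in, original rationale out."""
    rng = random.Random(seed)  # seeded so the example is reproducible

    def mask_numbers(step: str) -> str:
        # Assumption: only numeric spans are masked, each with probability mask_prob.
        return re.sub(
            r"\d+(?:\.\d+)?",
            lambda m: MASK if rng.random() < mask_prob else m.group(0),
            step,
        )

    corrupted = [mask_numbers(step) for step in steps]
    rng.shuffle(corrupted)  # destroy the step order; the student must restore it
    return {"input": "\n".join(corrupted), "target": "\n".join(steps)}


# A made-up GSM8K-style rationale, for illustration only.
steps = [
    "Natalia sold 48 clips in April.",
    "She sold 48 / 2 = 24 clips in May.",
    "In total she sold 48 + 24 = 72 clips.",
]
example = make_masked_shuffled_example(steps)
print(example["input"])
print("---")
print(example["target"])
```

The target is always the clean, ordered rationale, so the same pair could serve either a supervised loss or a reward that checks the reconstruction.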
