[2602.20296] Learning to Solve Complex Problems via Dataset Decomposition
Summary
This paper presents a novel approach to curriculum learning by decomposing complex datasets into simpler components, enhancing model training through a teacher-student framework.
Why It Matters
The research addresses challenges in training machine learning models on complex tasks by proposing a systematic method to simplify data. This could lead to improved performance in various applications, particularly in fields requiring advanced reasoning and problem-solving capabilities.
Key Takeaways
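- Introduces a reverse curriculum generation approach that recursively decomposes complex datasets into simpler, more learnable components.
- Proposes a teacher-student framework in which a teacher that reasons step by step recursively generates easier versions of examples for the student.
- Develops a scoring system that rates data difficulty by structural complexity and conceptual depth.
- Reports that models trained on the generated curricula outperform standard training on the original datasets, on math (MATH and AIME) and code generation datasets.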
- Highlights the potential for improved training methodologies in machine learning.
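The recursive decomposition named in the takeaways can be illustrated with a small sketch. This is not the authors' implementation: the `Example` type, the `decompose` function, the `max_depth` cutoff, and the `toy_teacher` (which only splits a sum at its first `+`) are all hypothetical stand-ins, chosen so the recursion is easy to follow. In the paper the teacher is a model that reasons step by step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    problem: str
    depth: int = 0  # decomposition steps away from the original example


# Hypothetical teacher interface: maps one problem to easier sub-problems.
Teacher = Callable[[str], List[str]]


def decompose(example: Example, teacher: Teacher, max_depth: int = 2) -> List[Example]:
    """Collect an example plus every easier variant the teacher derives from it."""
    collected = [example]
    if example.depth >= max_depth:
        return collected  # assumed stopping rule, not taken from the paper
    for sub_problem in teacher(example.problem):
        child = Example(problem=sub_problem, depth=example.depth + 1)
        collected.extend(decompose(child, teacher, max_depth))
    return collected


def toy_teacher(problem: str) -> List[str]:
    """Stand-in for a reasoning model: split a sum at its first ' + '."""
    left, sep, right = problem.partition(" + ")
    return [left, right] if sep else []


pool = decompose(Example("1 + 2 + 3"), toy_teacher, max_depth=2)
for item in pool:
    print(item.depth, item.problem)
```

With the toy teacher, the original sum stays in the pool at depth 0 and its pieces appear at depths 1 and 2, so the student can be shown the deepest (easiest) items first.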
arXiv:2602.20296 [cs.LG], submitted on 23 Feb 2026
Title: Learning to Solve Complex Problems via Dataset Decomposition
Authors: Wanru Zhao, Lucas Caccia, Zhengyan Shi, Minseon Kim, Weijia Xu, Alessandro Sordoni
Abstract: Curriculum learning is a class of training strategies that organizes the data being exposed to a model by difficulty, gradually from simpler to more complex examples. This research explores a reverse curriculum generation approach that recursively decomposes complex datasets into simpler, more learnable components. We propose a teacher-student framework where the teacher is equipped with the ability to reason step-by-step, which is used to recursively generate easier versions of examples, enabling the student model to progressively master difficult tasks. We propose a novel scoring system to measure data difficulty based on its structural complexity and conceptual depth, allowing curriculum construction over decomposed data. Experiments on math datasets (MATH and AIME) and code generation datasets demonstrate that models trained with curricula generated by our approach exhibit superior performance compared to standard training on original datasets.
Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2602.20296 [cs.LG] (or arXiv:2602.20296v1 [cs.LG] for this versi...
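The abstract describes scoring difficulty by structural complexity and conceptual depth and then building a curriculum over the scored data, but it does not give the scoring formula. The sketch below therefore uses invented proxies: operator count for structural complexity, a made-up keyword list for conceptual depth, and arbitrary weights. The names `difficulty` and `build_curriculum` are hypothetical and only show the easy-to-hard ordering step.

```python
from typing import List

OPERATORS = set("+-*/^=")
# Made-up concept keywords; the paper's notion of conceptual depth is not specified here.
CONCEPTS = ("integral", "derivative", "modulo", "probability")


def structural_complexity(problem: str) -> int:
    """Proxy (assumption): number of operator characters in the problem text."""
    return sum(ch in OPERATORS for ch in problem)


def conceptual_depth(problem: str) -> int:
    """Proxy (assumption): number of listed concept keywords the text mentions."""
    text = problem.lower()
    return sum(concept in text for concept in CONCEPTS)


def difficulty(problem: str, w_struct: float = 1.0, w_concept: float = 2.0) -> float:
    """Weighted sum of the two proxies; the weights are arbitrary."""
    return w_struct * structural_complexity(problem) + w_concept * conceptual_depth(problem)


def build_curriculum(problems: List[str], num_stages: int = 2) -> List[List[str]]:
    """Sort problems from easy to hard and cut the ordering into training stages."""
    if not problems:
        return []
    ordered = sorted(problems, key=difficulty)
    stage_size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]


problems = [
    "Find the derivative of x^2 + 3*x",
    "2 + 3",
    "Compute 7 modulo 3",
    "4",
]
stages = build_curriculum(problems, num_stages=2)
for index, stage in enumerate(stages):
    print(index, stage)
```

A training loop would then present `stages[0]` before `stages[1]`. Any real use would replace the proxies with a scorer that reflects the paper's actual definitions.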