[2602.20532] Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training
Summary
The paper presents ACTOR-CURATOR, an automated curriculum learning framework that improves reinforcement learning post-training of large language models by dynamically selecting which training problems the policy sees next.
Why It Matters
Post-training with reinforcement learning typically draws on massive, heterogeneous problem banks, so choosing what to train on next matters as much as how to train. ACTOR-CURATOR addresses this curriculum selection problem with a scalable, fully automated approach that improves training efficiency and stability.
Key Takeaways
- ACTOR-CURATOR automates curriculum learning for reinforcement learning post-training.
- The framework outperforms uniform sampling and strong curriculum baselines in training efficiency.
- It achieves notable performance gains on challenging reasoning benchmarks.
- The approach is scalable and practical for large language models.
- Dynamic problem selection is framed as a non-stationary stochastic bandit problem.
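The bandit framing in the last takeaway can be made concrete with a minimal sketch. The snippet below implements a generic EXP3-style exponential-weights selector, a standard algorithm for non-stationary bandits with partial feedback. This is illustrative background, not the paper's actual curator (which is a learned neural model optimizing expected policy improvement); the class name, reward definition, and hyperparameters are assumptions.

```python
import math
import random

class Exp3Curator:
    """Illustrative EXP3-style problem selector (hypothetical, not the
    paper's method): each arm is a training problem, the reward is some
    measured policy improvement after training on it."""

    def __init__(self, n_arms, gamma=0.1, seed=0):
        self.n = n_arms
        self.gamma = gamma           # exploration rate
        self.w = [1.0] * n_arms      # exponential weights
        self.rng = random.Random(seed)

    def probs(self):
        total = sum(self.w)
        # Mix the weight distribution with uniform exploration so every
        # arm keeps nonzero probability (needed for unbiased estimates).
        return [(1 - self.gamma) * wi / total + self.gamma / self.n
                for wi in self.w]

    def select(self):
        # Sample an arm from the current probability distribution.
        r, acc = self.rng.random(), 0.0
        for i, pi in enumerate(self.probs()):
            acc += pi
            if r <= acc:
                return i
        return self.n - 1

    def update(self, arm, reward):
        # Partial feedback: only the pulled arm's reward is observed,
        # so use an importance-weighted estimate before the update.
        p = self.probs()[arm]
        estimate = reward / p
        self.w[arm] *= math.exp(self.gamma * estimate / self.n)

# Usage: reward could be, e.g., the change in mean return after one
# training step on the selected problem.
curator = Exp3Curator(n_arms=5)
arm = curator.select()
curator.update(arm, reward=0.2)
```

Exponential-weights updates of this kind are one way such a bandit curator can track a non-stationary reward signal, since recent rewards dominate the weights.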
Computer Science > Machine Learning

arXiv:2602.20532 (cs) [Submitted on 24 Feb 2026]

Title: Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training

Authors: Zhengyao Gu, Jonathan Light, Raul Astudillo, Ziyu Ye, Langzhou He, Henry Peng Zou, Wei Cheng, Santiago Paternain, Philip S. Yu, Yisong Yue

Abstract: Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement. We formulate problem selection as a non-stationary stochastic bandit problem, derive a principled loss function based on online stochastic mirror descent, and establish regret guarantees under partial feedback. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines across a wide range of challenging reasoning benchmarks, demonstrating improved training stability and ef...
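As background for the abstract's mention of online stochastic mirror descent, the standard connection between mirror descent on the probability simplex and exponential-weights updates (a textbook fact, not the paper's specific loss derivation) reads:

```latex
p_{t+1} = \arg\min_{p \in \Delta_K}
  \Big\{ \eta \langle \hat{\ell}_t, p \rangle + D_{\Phi}(p, p_t) \Big\},
\qquad
\Phi(p) = \sum_{i=1}^{K} p_i \log p_i ,
```

whose closed-form solution is the multiplicative update

```latex
p_{t+1,i} \;\propto\; p_{t,i} \exp\!\big(-\eta\, \hat{\ell}_{t,i}\big),
\qquad
\hat{\ell}_{t,i} = \frac{\ell_{t,i}\, \mathbf{1}\{i_t = i\}}{p_{t,i}} ,
```

where $D_{\Phi}$ is the Bregman divergence (here the KL divergence) and the importance-weighted estimator $\hat{\ell}_t$ handles partial (bandit) feedback: only the loss of the selected arm $i_t$ is observed. How the paper adapts this machinery into its curator loss is detailed in the full text.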