[2602.07906] AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
Summary
The paper presents AceGRPO, an approach that combines an adaptive curriculum with Group Relative Policy Optimization to train agents for autonomous machine learning engineering, targeting the execution latency and data selection problems that hinder reinforcement learning in this setting.
Why It Matters
As machine learning continues to evolve, optimizing agent performance over long horizons is crucial. AceGRPO offers a solution to common issues like behavioral stagnation and inefficient data selection, making it relevant for researchers and practitioners in AI and machine learning.
Key Takeaways
- AceGRPO introduces an Evolving Data Buffer that continuously repurposes execution traces into reusable training tasks.
- Adaptive Sampling prioritizes tasks based on Learnability Potential to enhance learning efficiency.
- The Ace-30B model demonstrates a 100% valid submission rate on MLE-Bench-Lite, indicating strong performance.
- The approach addresses execution latency and data selection inefficiencies in reinforcement learning.
- Despite its 30B size, the trained model outperforms larger open-source models and approaches the performance of proprietary frontier models.
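The paper does not state the functional form of the Learnability Potential. A minimal sketch of the idea, under the assumption that the score takes a variance-style form p(1 - p) over a task's empirical success rate (so tasks the agent solves roughly half the time dominate sampling), could look like this; the function and buffer layout here are illustrative, not the authors' implementation:

```python
import random

def learnability_potential(success_rate: float) -> float:
    # Assumed form: p * (1 - p) peaks at p = 0.5 and vanishes for
    # solved (p = 1) or hopeless (p = 0) tasks, concentrating
    # sampling at the agent's learning frontier.
    return success_rate * (1.0 - success_rate)

def sample_task(buffer):
    # buffer: list of (task, empirical_success_rate) pairs.
    weights = [learnability_potential(p) for _, p in buffer]
    if sum(weights) == 0.0:
        # No task is at the frontier; fall back to uniform sampling.
        return random.choice(buffer)[0]
    return random.choices([t for t, _ in buffer], weights=weights, k=1)[0]
```

With this score, a task at a 50% success rate is weighted strictly above one at 10% or 90%, which matches the paper's stated goal of prioritizing tasks at the learning frontier.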
Computer Science > Machine Learning
arXiv:2602.07906 (cs)
[Submitted on 8 Feb 2026 (v1), last revised 24 Feb 2026 (this version, v2)]
Title: AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
Authors: Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Siheng Chen
Abstract: Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons. While recent LLM-based agents show promise, current prompt-based agents for MLE suffer from behavioral stagnation due to frozen parameters. Although Reinforcement Learning (RL) offers a remedy, applying it to MLE is hindered by prohibitive execution latency and inefficient data selection. Recognizing these challenges, we propose AceGRPO with two core components: (1) an Evolving Data Buffer that continuously repurposes execution traces into reusable training tasks, and (2) Adaptive Sampling guided by a Learnability Potential function, which dynamically prioritizes tasks at the agent's learning frontier to maximize learning efficiency. Leveraging AceGRPO, our trained Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source models.
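AceGRPO builds on GRPO, whose defining step is normalizing each rollout's reward against the other rollouts in its group rather than against a learned value function. The standard group-relative advantage (background on the base algorithm, not a detail specific to this paper) can be sketched as:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO advantage: A_i = (r_i - mean(r)) / (std(r) + eps),
    computed over the group of rollouts sampled for one task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because advantages are zero-centered within each group, a group where every rollout succeeds (or every rollout fails) yields no learning signal, which is one reason curriculum methods like the paper's adaptive sampling focus on tasks with mixed outcomes.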