[2602.14338] Train Less, Learn More: Adaptive Efficient Rollout Optimization for Group-Based Reinforcement Learning
Summary
The paper presents Adaptive Efficient Rollout Optimization (AERO), an enhancement to Group Relative Policy Optimization (GRPO) for reinforcement learning fine-tuning of large language models, improving compute efficiency by 48% while maintaining performance.
Why It Matters
This research is significant because it addresses the inefficiencies of reinforcement learning fine-tuning for large language models, a critical area in AI development. AERO's ability to reduce computational cost while maintaining or improving performance can lead to more sustainable AI practices and faster model deployment.
Key Takeaways
- AERO improves compute efficiency by 48% compared to GRPO.
- The method reduces wall-clock time per step by approximately 45%.
- AERO maintains or improves performance metrics like Pass@8 and Avg@8.
- Adaptive strategies in AERO prevent zero-advantage dead zones.
- This approach is scalable and practical for RL-based LLM alignment.
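The zero-advantage dead zone in the takeaways above comes directly from GRPO's group-normalized advantage, $A_i = (r_i - \bar{r}) / \sigma_r$. A minimal sketch (the function name and binary rewards are illustrative, not from the paper) shows why a group whose rollouts all share the same outcome contributes no gradient signal:

```python
import statistics

def group_advantages(rewards):
    """Group-normalized advantages as in GRPO: A_i = (r_i - mean) / std.

    When every rollout in the group has the same reward (all correct or
    all incorrect), the standard deviation is zero, so every advantage is
    zero and the group yields no gradient signal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0] * len(rewards)  # zero-advantage dead zone
    return [(r - mean) / std for r in rewards]

print(group_advantages([1, 1, 0, 0]))  # mixed outcomes -> nonzero signal
print(group_advantages([1, 1, 1, 1]))  # all correct -> all zeros, compute wasted
```

Rollouts spent on such uniform-outcome groups are pure overhead, which is the waste AERO's adaptive strategy targets.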
Computer Science > Machine Learning
arXiv:2602.14338 (cs)
[Submitted on 15 Feb 2026]
Title: Train Less, Learn More: Adaptive Efficient Rollout Optimization for Group-Based Reinforcement Learning
Authors: Zhi Zhang, Zhen Han, Costas Mavromatis, Qi Zhu, Yunyi Zhang, Sheng Guan, Dingmin Wang, Xiong Zhou, Shuai Wang, Soji Adeshina, Vassilis Ioannidis, Huzefa Rangwala
Abstract: Reinforcement learning (RL) plays a central role in large language model (LLM) post-training. Among existing approaches, Group Relative Policy Optimization (GRPO) is widely used, especially for RL with verifiable rewards (RLVR) fine-tuning. In GRPO, each query prompts the LLM to generate a group of rollouts with a fixed group size $N$. When all rollouts in a group share the same outcome, either all correct or all incorrect, the group-normalized advantages become zero, yielding no gradient signal and wasting fine-tuning compute. We introduce Adaptive Efficient Rollout Optimization (AERO), an enhancement of GRPO. AERO uses an adaptive rollout strategy, applies selective rejection to strategically prune rollouts, and maintains a Bayesian posterior to prevent zero-advantage dead zones. Across three model configurations (Qwen2.5-Math-1.5B, Qwen2.5-7B, and Qwen2.5-7B-Instruct), AERO improves compute efficiency without ...
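The abstract says AERO maintains a Bayesian posterior to steer rollouts away from zero-advantage dead zones, but does not spell out the algorithm. The following is a hypothetical sketch of one way such a mechanism could work, using a Beta posterior over each prompt's per-rollout success rate; every name, prior, and threshold here is an assumption for illustration, not the authors' method:

```python
from dataclasses import dataclass

@dataclass
class PromptPosterior:
    """Beta(alpha, beta) posterior over a prompt's per-rollout success rate.

    Hypothetical sketch: prompts whose posterior mean sits near 0 or 1 are
    likely to produce all-incorrect or all-correct groups (zero advantage),
    so they are allocated fewer rollouts.
    """
    alpha: float = 1.0  # pseudo-count of correct rollouts (uniform prior)
    beta: float = 1.0   # pseudo-count of incorrect rollouts

    def update(self, successes: int, failures: int) -> None:
        # Conjugate Beta-Bernoulli update from the latest group's outcomes.
        self.alpha += successes
        self.beta += failures

    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def rollouts(self, n_max: int = 8, n_min: int = 2) -> int:
        """Allocate rollouts where a mixed-outcome group is most likely."""
        p = self.mean()
        mixed = 4.0 * p * (1.0 - p)  # peaks at p = 0.5, vanishes at 0 and 1
        return max(n_min, round(n_min + (n_max - n_min) * mixed))

post = PromptPosterior()
print(post.rollouts())             # uninformative prior -> full budget
post.update(successes=15, failures=1)
print(post.rollouts())             # near-saturated prompt -> reduced budget
```

The design intuition is that compute should concentrate on prompts where the outcome is uncertain, since only mixed-outcome groups produce nonzero group-normalized advantages.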