[2602.22817] Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks
Summary
This paper presents Hierarchy-of-Groups Policy Optimization (HGPO), a method that improves group-based reinforcement learning on long-horizon agentic tasks by addressing context inconsistency in stepwise advantage estimation.
Why It Matters
HGPO tackles a significant challenge in group-based reinforcement learning: when steps from different rollouts are grouped together despite having different historical contexts, the resulting advantage estimates are biased, which degrades policy optimization. Correcting this bias is particularly relevant for LLM agents and robotics applications, where long-horizon decision-making is crucial.
Key Takeaways
- HGPO improves stepwise advantage estimation by using hierarchical grouping.
- The method mitigates context inconsistency issues that affect policy optimization.
- Empirical evaluations demonstrate HGPO's superiority over existing methods in agentic tasks.
- The approach does not require additional models or rollouts, making it efficient.
- Code for HGPO is publicly available, promoting further research and application.
Computer Science > Machine Learning — arXiv:2602.22817 (cs)
[Submitted on 26 Feb 2026]
Authors: Shuo He, Lang Feng, Qi Wei, Xin Cheng, Lei Feng, Bo An
Abstract: Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks. To enable more fine-grained policy updates, recent research has increasingly shifted toward stepwise group-based policy optimization, which treats each step in a rollout trajectory independently while using a memory module to retain historical context. However, we find a key issue in estimating stepwise relative advantages, namely context inconsistency, where steps within the same group may differ in their historical contexts. Empirically, we reveal that this issue can lead to severely biased advantage estimation, thereby degrading policy optimization significantly. To address the issue, in this paper, we propose Hierarchy-of-Groups Policy Optimization (HGPO) for long-horizon agentic tasks. Specifically, within a group of rollout trajectories, HGPO assigns each step to multiple hierarchical groups according to the consistency of historical contexts. Then, for each step, HGPO computes distinct advantages within each group and aggregates them with...
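The abstract sketches the core mechanism: each step is assigned to multiple hierarchical groups based on how much historical context it shares with other steps, a per-group advantage is computed in each, and the results are aggregated. Below is a minimal, hypothetical Python sketch of that idea — not the authors' released code. The prefix-based grouping key, the mean-baseline advantage, and the uniform averaging across hierarchy levels are all assumptions for illustration.

```python
import statistics

def hierarchical_advantages(steps):
    """Hypothetical hierarchy-of-groups advantage estimation.

    steps: list of (context, reward), where context is a tuple of the
    step's historical decisions. Steps sharing a longer context prefix
    are grouped together at deeper hierarchy levels.
    Returns one aggregated advantage per step.
    """
    max_depth = max(len(ctx) for ctx, _ in steps)
    per_step = [[] for _ in steps]  # advantages collected per level

    # Level 0 groups every step together; level k groups steps that
    # share the first k elements of their historical context.
    for level in range(max_depth + 1):
        groups = {}
        for i, (ctx, reward) in enumerate(steps):
            groups.setdefault(ctx[:level], []).append((i, reward))
        for members in groups.values():
            baseline = statistics.mean(r for _, r in members)
            for i, reward in members:
                per_step[i].append(reward - baseline)

    # Aggregate with a simple (assumed) uniform average across levels.
    return [statistics.mean(advs) for advs in per_step]
```

For example, with three steps where two share the context `("a",)` and one has context `("b",)`, the level-0 group contains all three steps while level 1 splits them into two context-consistent groups, so each step's final advantage blends a coarse and a fine-grained baseline. The design choice illustrated here is that no extra critic model or additional rollouts are needed, consistent with the efficiency claim in the takeaways above.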