[2602.16165] HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents
Summary
The HiPER framework introduces a hierarchical approach to reinforcement learning for large language model agents, enhancing decision-making in complex tasks with sparse rewards.
Why It Matters
HiPER addresses the limitations of traditional flat reinforcement learning policies, which must propagate reward signals across entire trajectories, by providing a structured method for credit assignment in multi-turn decision-making tasks. This matters for deploying large language models in real-world applications, particularly in scenarios requiring long-horizon planning and execution.
Key Takeaways
- HiPER separates high-level planning from low-level execution in RL.
- Hierarchical advantage estimation (HAE) improves credit assignment.
- State-of-the-art performance achieved on benchmarks like ALFWorld and WebShop.
- Explicit hierarchical decomposition enhances scalability in RL training.
- Significant gains noted in long-horizon tasks requiring multiple subtasks.
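The hierarchical advantage estimation idea from the takeaways can be illustrated with a toy sketch: the planner is credited at the subgoal (macro-step) level, while the executor is credited only within its own subgoal, so neither level has to propagate credit across the full trajectory. This is not the paper's exact HAE formulation; the `Subgoal` container, the per-subgoal mean baseline, and the macro-level return recursion are illustrative assumptions.

```python
# Hypothetical sketch of hierarchical advantage estimation (not the paper's
# exact HAE): planner advantages over subgoals, executor advantages per step.
from dataclasses import dataclass


@dataclass
class Subgoal:
    step_rewards: list   # per-action rewards collected while pursuing this subgoal
    value: float         # critic's value estimate at the start of the subgoal


def hierarchical_advantages(subgoals, gamma=0.99):
    """Return (planner_advantages, executor_advantages_per_subgoal)."""
    # High level: each subgoal is one macro-step whose reward is the sum of
    # the step rewards earned while executing it.
    macro_rewards = [sum(sg.step_rewards) for sg in subgoals]
    planner_adv = []
    G = 0.0
    # Walk backwards over subgoals to form macro-level discounted returns.
    for sg, r in zip(reversed(subgoals), reversed(macro_rewards)):
        G = r + gamma * G
        planner_adv.append(G - sg.value)
    planner_adv.reverse()

    # Low level: credit each primitive action only within its own subgoal,
    # using the subgoal's mean step reward as a simple baseline.
    executor_adv = []
    for sg in subgoals:
        G = 0.0
        advs = []
        baseline = sum(sg.step_rewards) / max(len(sg.step_rewards), 1)
        for r in reversed(sg.step_rewards):
            G = r + gamma * G
            advs.append(G - baseline)
        advs.reverse()
        executor_adv.append(advs)
    return planner_adv, executor_adv
```

The point of the decomposition is visible in the shapes: the planner sees one advantage per subgoal, while the executor sees one advantage per action, computed locally.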
Computer Science > Machine Learning
arXiv:2602.16165 (cs), submitted on 18 Feb 2026
Authors: Jiangweizhi Peng, Yuanxin Liu, Ruida Zhou, Charles Fleming, Zhaoran Wang, Alfredo Garcia, Mingyi Hong
Abstract
Training LLMs as interactive agents for multi-turn decision-making remains challenging, particularly in long-horizon tasks with sparse and delayed rewards, where agents must execute extended sequences of actions before receiving meaningful feedback. Most existing reinforcement learning (RL) approaches model LLM agents as flat policies operating at a single time scale, selecting one action at each turn. In sparse-reward settings, such flat policies must propagate credit across the entire trajectory without explicit temporal abstraction, which often leads to unstable optimization and inefficient credit assignment. We propose HiPER, a novel Hierarchical Plan-Execute RL framework that explicitly separates high-level planning from low-level execution. HiPER factorizes the policy into a high-level planner that proposes subgoals and a low-level executor that carries them out over multiple action steps. To align optimization with this structure, we introduce a key t...
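The plan-execute factorization described in the abstract can be sketched as a rollout loop on two time scales: the planner emits a subgoal, then the executor takes several primitive actions before control returns to the planner. The function names (`plan`, `act`, `env_step`) and the termination signals below are hypothetical stand-ins, not HiPER's actual interface.

```python
# Hypothetical two-time-scale rollout for a plan-execute agent; the callable
# signatures are illustrative assumptions, not HiPER's real API.
from typing import Callable, List, Tuple


def run_episode(
    plan: Callable[[str], str],         # planner LLM: observation -> subgoal
    act: Callable[[str, str], str],     # executor LLM: (observation, subgoal) -> action
    env_step: Callable[[str], Tuple[str, bool, bool]],  # action -> (obs, subgoal_done, task_done)
    obs: str,
    max_subgoals: int = 5,
    max_steps: int = 10,
) -> List[Tuple[str, List[str]]]:
    """Roll out one episode, returning (subgoal, actions) pairs.

    The planner acts on a slower time scale than the executor: each planner
    call yields one subgoal, which the executor pursues for several steps.
    """
    trajectory = []
    task_done = False
    for _ in range(max_subgoals):
        subgoal = plan(obs)              # high-level decision (one macro-step)
        actions = []
        for _ in range(max_steps):
            action = act(obs, subgoal)   # low-level decision within the subgoal
            obs, subgoal_done, task_done = env_step(action)
            actions.append(action)
            if subgoal_done or task_done:
                break
        trajectory.append((subgoal, actions))
        if task_done:
            break
    return trajectory
```

Grouping the trajectory by subgoal is what makes the hierarchical credit assignment possible: each (subgoal, actions) pair is a natural unit for computing planner-level and executor-level advantages separately.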