[2602.23008] Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization
Summary
The paper presents EMPO², a hybrid RL framework that uses memory to improve exploration in LLM agents. On ScienceWorld and WebShop, it improves over GRPO by 128.6% and 11.3%, respectively.
Why It Matters
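Exploration remains the key bottleneck for LLM agents trained with reinforcement learning: methods that rely on pretrained knowledge fail in environments that require discovering novel states. EMPO² addresses this by using memory to drive exploration and by combining on- and off-policy updates, so the agent performs well with memory and stays robust without it. In out-of-distribution tests, it adapts to new tasks in only a few trials, with memory and no parameter updates.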
Key Takeaways
- EMPO² improves exploration in LLM agents using memory.
- Achieves 128.6% and 11.3% improvements over GRPO on ScienceWorld and WebShop, respectively.
- Adapts to out-of-distribution tasks in only a few trials, using memory and no parameter updates.
- Combines on- and off-policy updates so the agent performs well with memory and remains robust without it.
- Targets exploration, the key bottleneck for LLM agents trained with reinforcement learning.
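To put the reported gains in perspective, here is a minimal sketch of the arithmetic, assuming the percentages are relative improvements over the GRPO score. The abstract gives neither the absolute scores nor the metric, so the baseline below is normalized to 1.0, and the function names are illustrative only.

```python
def gain_to_multiplier(gain_percent: float) -> float:
    """Convert a relative improvement in percent to a multiple of the baseline score."""
    return 1.0 + gain_percent / 100.0


def relative_gain_percent(new_score: float, baseline_score: float) -> float:
    """Relative improvement of new_score over baseline_score, in percent."""
    if baseline_score == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (new_score - baseline_score) / baseline_score * 100.0


# Baseline normalized to 1.0 (an assumption; absolute scores are not in the abstract).
scienceworld_multiple = gain_to_multiplier(128.6)  # about 2.29x the baseline
webshop_multiple = gain_to_multiplier(11.3)        # about 1.11x the baseline
```

Under this reading, a 128.6% improvement means the score more than doubles, while 11.3% is a modest gain on top of the baseline.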
arXiv:2602.23008 [cs.LG]
Submitted on 26 Feb 2026
Title: Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization
Authors: Zeyuan Liu, Jeonghye Kim, Xufang Luo, Dongsheng Li, Yuqing Yang
Abstract: Exploration remains the key bottleneck for large language model agents trained with reinforcement learning. While prior methods exploit pretrained knowledge, they fail in environments requiring the discovery of novel states. We propose Exploratory Memory-Augmented On- and Off-Policy Optimization (EMPO²), a hybrid RL framework that leverages memory for exploration and combines on- and off-policy updates to make LLMs perform well with memory while also ensuring robustness without it. On ScienceWorld and WebShop, EMPO² achieves 128.6% and 11.3% improvements over GRPO, respectively. Moreover, in out-of-distribution tests, EMPO² demonstrates superior adaptability to new tasks, requiring only a few trials with memory and no parameter updates. These results highlight EMPO² as a promising framework for building more exploratory and generalizable LLM-based agents.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.23008 [cs.LG] (or arXiv:2602.23008v1 [cs.LG] for this version) https://doi.org/10.48...
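For readers unfamiliar with GRPO, the baseline named in the abstract, the sketch below illustrates its core idea as it is commonly described: several rollouts are sampled for the same task, and each rollout's reward is normalized against the group's mean and standard deviation to form an advantage. This is a generic illustration of that baseline, not the EMPO² method. The function name, the `eps` stabilizer, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its rollout group."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = sum(rewards) / len(rewards)
    # Population variance; some implementations use the sample variance instead.
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = variance ** 0.5
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four rollouts of the same task (not from the paper).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no learning signal. That is one way to see why an agent that never reaches a novel rewarding state cannot improve from its own rollouts, which is the exploration bottleneck the abstract describes.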