[2602.23008] Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization


arXiv - AI 3 min read Article

Summary

The paper presents EMPO², a hybrid RL framework that improves exploration in memory-augmented LLM agents, yielding substantial gains over GRPO on ScienceWorld and WebShop.

Why It Matters

Exploration is a critical challenge in reinforcement learning for LLMs. EMPO² addresses this by effectively combining on- and off-policy optimization, making it a significant advancement for developing adaptable AI agents capable of handling new tasks with minimal trials.

Key Takeaways

  • EMPO² improves exploration in LLM agents using memory.
  • Achieves 128.6% and 11.3% gains over GRPO on ScienceWorld and WebShop, respectively.
  • Demonstrates adaptability to new tasks with few trials.
  • Combines on- and off-policy updates for robust performance.
  • Addresses a key bottleneck in reinforcement learning for LLMs.

Computer Science > Machine Learning
arXiv:2602.23008 (cs) [Submitted on 26 Feb 2026]

Title: Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization
Authors: Zeyuan Liu, Jeonghye Kim, Xufang Luo, Dongsheng Li, Yuqing Yang

Abstract: Exploration remains the key bottleneck for large language model agents trained with reinforcement learning. While prior methods exploit pretrained knowledge, they fail in environments requiring the discovery of novel states. We propose Exploratory Memory-Augmented On- and Off-Policy Optimization (EMPO²), a hybrid RL framework that leverages memory for exploration and combines on- and off-policy updates to make LLMs perform well with memory while also ensuring robustness without it. On ScienceWorld and WebShop, EMPO² achieves 128.6% and 11.3% improvements over GRPO, respectively. Moreover, in out-of-distribution tests, EMPO² demonstrates superior adaptability to new tasks, requiring only a few trials with memory and no parameter updates. These results highlight EMPO² as a promising framework for building more exploratory and generalizable LLM-based agents.

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.23008 [cs.LG] (or arXiv:2602.23008v1 [cs.LG] for this version)
https://doi.org/10.48...
