[2602.17931] Memory-Based Advantage Shaping for LLM-Guided Reinforcement Learning
Summary
This article summarizes a paper on memory-based advantage shaping, a reinforcement learning (RL) method that uses large language models (LLMs) for subgoal discovery to improve sample efficiency and early learning speed in environments with sparse rewards.
Why It Matters
The research addresses critical challenges in reinforcement learning, particularly in environments with sparse rewards where traditional methods struggle. By integrating LLMs for subgoal discovery while minimizing reliance on continuous LLM supervision, this approach enhances scalability and efficiency, making it relevant for developers and researchers in AI and machine learning.
Key Takeaways
- Introduces a memory graph for encoding subgoals and trajectories.
- Improves sample efficiency and early learning speed in RL tasks.
- Reduces dependency on frequent LLM interactions while maintaining performance.
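The paper's memory graph is not specified in detail here, but the idea in the takeaways above can be sketched as a directed graph whose edges count how often successful trajectories (from LLM guidance or the agent's own rollouts) traversed a subgoal transition. The class and method names below are illustrative, not from the paper.

```python
from collections import defaultdict

class MemoryGraph:
    """Directed graph over subgoals; edge weights count how often
    successful trajectories traversed each subgoal transition."""

    def __init__(self):
        # (src_subgoal, dst_subgoal) -> accumulated weight
        self.edges = defaultdict(float)

    def add_trajectory(self, subgoals, weight=1.0):
        """Record an ordered list of subgoals from either LLM guidance
        or one of the agent's own successful rollouts."""
        for src, dst in zip(subgoals, subgoals[1:]):
            self.edges[(src, dst)] += weight

    def utility(self, subgoals):
        """Score how closely a trajectory aligns with prior successful
        strategies: the weight of its transitions that appear in the
        graph, normalized by the total edge weight."""
        if len(subgoals) < 2:
            return 0.0
        total = sum(self.edges.values()) or 1.0
        score = sum(self.edges[(s, d)]
                    for s, d in zip(subgoals, subgoals[1:]))
        return score / total
```

A trajectory that retraces well-traveled subgoal edges scores high, while an unseen path scores zero, which is the signal the utility function needs for shaping.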
Computer Science > Machine Learning
arXiv:2602.17931 (cs) [Submitted on 20 Feb 2026]
Title: Memory-Based Advantage Shaping for LLM-Guided Reinforcement Learning
Authors: Narjes Nourzad, Carlee Joe-Wong
Abstract: In environments with sparse or delayed rewards, reinforcement learning (RL) incurs high sample complexity due to the large number of interactions needed for learning. This limitation has motivated the use of large language models (LLMs) for subgoal discovery and trajectory guidance. While LLMs can support exploration, frequent reliance on LLM calls raises concerns about scalability and reliability. We address these challenges by constructing a memory graph that encodes subgoals and trajectories from both LLM guidance and the agent's own successful rollouts. From this graph, we derive a utility function that evaluates how closely the agent's trajectories align with prior successful strategies. This utility shapes the advantage function, providing the critic with additional guidance without altering the reward. Our method relies primarily on offline input and only occasional online queries, avoiding dependence on continuous LLM supervision. Preliminary experiments in benchmark environments show improved sample efficiency and faster early learning compared to baseline RL methods, with final returns compara...
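The abstract's key mechanism, shaping the advantage function with the utility while leaving the reward untouched, can be sketched as a simple additive bonus on the critic's advantage estimates. The function name and the mixing coefficient `beta` are assumptions for illustration; the paper's exact shaping rule may differ.

```python
import numpy as np

def shape_advantages(advantages, utilities, beta=0.5):
    """Add a memory-derived utility bonus to per-timestep advantage
    estimates. The environment reward is never modified; only the
    signal the critic/policy update sees is shifted. `beta` is an
    assumed mixing coefficient controlling the strength of shaping."""
    advantages = np.asarray(advantages, dtype=float)
    utilities = np.asarray(utilities, dtype=float)
    return advantages + beta * utilities

# Usage: utilities come from scoring each trajectory (or step) against
# the memory graph; timesteps aligned with prior successful strategies
# get a boost, others are left unchanged.
shaped = shape_advantages([1.0, -0.5], [0.2, 0.0], beta=0.5)
```

Because the bonus enters only through the advantage, the learned value function still targets the true return, which is why the authors can claim the reward itself is unaltered.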