[2509.03581] Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents
Summary
This paper presents a framework for dynamic planning in large language model (LLM) agents, allowing them to efficiently allocate test-time compute for improved problem-solving on complex, long-horizon tasks.
Why It Matters
As LLMs become increasingly integral to AI applications, optimizing when and how they plan can improve both their efficiency and their effectiveness. This research addresses the computational cost of planning at every step, potentially leading to more capable and more compute-efficient agents.
Key Takeaways
- Dynamic planning improves sample efficiency for LLM agents.
- A two-stage training pipeline enhances planning capabilities.
- Human-written plans can significantly boost agent performance.
Computer Science > Artificial Intelligence
arXiv:2509.03581 (cs)
[Submitted on 3 Sep 2025 (v1), last revised 17 Feb 2026 (this version, v3)]
Title: Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents
Authors: Davide Paglieri, Bartłomiej Cupiał, Jonathan Cook, Ulyana Piterbarg, Jens Tuyls, Edward Grefenstette, Jakob Nicolaus Foerster, Jack Parker-Holder, Tim Rocktäschel
Abstract: Training large language models (LLMs) to reason via reinforcement learning (RL) significantly improves their problem-solving capabilities. In agentic settings, existing methods like ReAct prompt LLMs to explicitly plan before every action; however, we demonstrate that always planning is computationally expensive and degrades performance on long-horizon tasks, while never planning further limits performance. To address this, we introduce a conceptual framework formalizing dynamic planning for LLM agents, enabling them to flexibly decide when to allocate test-time compute for planning. We propose a simple two-stage training pipeline: (1) supervised fine-tuning on diverse synthetic data to prime models for dynamic planning, and (2) RL to refine this capability in long-horizon environments. Experiments on the Crafter environment show that dynamic planning agents trained with this approach are more sampl...
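To make the core idea concrete, here is a minimal sketch of a dynamic-planning agent loop. In the paper, the decision of when to plan is learned via SFT and RL; the `should_plan` gate, the staleness/uncertainty signals, and all names below are illustrative assumptions, not the authors' implementation.

```python
import random

def should_plan(last_plan_age: int, uncertainty: float) -> bool:
    """Hypothetical gating rule: replan when the current plan is stale or the
    agent is uncertain. The paper *learns* this decision; this heuristic only
    illustrates the interface a dynamic planner exposes."""
    return last_plan_age >= 5 or uncertainty > 0.8

def run_episode(env_steps: int = 20, seed: int = 0) -> int:
    """Run a toy episode and return how often the agent chose to plan.
    Planning is the expensive branch (long reasoning trace); acting under
    an existing plan is the cheap branch."""
    rng = random.Random(seed)
    plan = None
    plan_age = 0
    plan_calls = 0
    for step in range(env_steps):
        uncertainty = rng.random()  # stand-in for a model-derived signal
        if plan is None or should_plan(plan_age, uncertainty):
            plan = f"plan@{step}"   # expensive: generate a fresh plan
            plan_age = 0
            plan_calls += 1
        else:
            plan_age += 1           # cheap: keep acting under the old plan
        # act(plan) would execute the next action here
    return plan_calls

calls = run_episode()
```

The point of the sketch is the asymmetry: an always-plan agent (ReAct-style) would pay the expensive branch on all 20 steps, a never-plan agent on none; a dynamic planner spends that compute only on a fraction of steps.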