[2601.02663] When Do Tools and Planning Help Large Language Models Think? A Cost- and Latency-Aware Benchmark
Computer Science > Computation and Language

arXiv:2601.02663 (cs)

[Submitted on 6 Jan 2026 (v1), last revised 5 Mar 2026 (this version, v2)]

Title: When Do Tools and Planning Help Large Language Models Think? A Cost- and Latency-Aware Benchmark

Authors: Subha Ghoshal, Ali Al-Bustami

Abstract: Modern large language models (LLMs) increasingly rely on inference-time planning and external tools to improve reasoning. We benchmark this behavior in two real-world settings: event-centric question answering over graph-structured knowledge (Event-QA) and persuasive response generation on Reddit ChangeMyView (CMV). Using LangChain and LangGraph, we compare a one-shot baseline against a plan-execute-replan agent equipped with task-specific tools (DBpedia SPARQL, lookup, and schema exploration; Wikipedia-focused retrieval; and topical web search). We evaluate GPT-4o and GPT-4o-mini under identical workflows on 60 examples each from Event-QA and CMV (3 splits of 20), reporting accuracy, mean end-to-end latency, and per-example token-cost estimates. On Event-QA, the best tool-augmented configuration improves accuracy (e.g., 47.5\% $\rightarrow$ 67.5\% for GPT-4o) while increasing latency by well over an order of magnitude ($\sim$8s $\rightarrow$ $\sim$317s per example...
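The abstract describes a plan-execute-replan agent built on LangGraph. The sketch below shows how such a loop can be wired with a LangGraph StateGraph; the state fields, node bodies, and stop condition are illustrative assumptions for exposition, not the authors' implementation, and the tool dispatch is reduced to a placeholder.

```python
# Minimal plan-execute-replan loop sketch with LangGraph.
# All node logic here is hypothetical; in the paper's setup the execute step
# would call task-specific tools (DBpedia SPARQL/lookup/schema exploration,
# Wikipedia-focused retrieval, or topical web search).
from typing import List, TypedDict

from langgraph.graph import END, StateGraph


class AgentState(TypedDict):
    question: str       # the Event-QA question or CMV prompt
    plan: List[str]     # remaining steps produced by the planner
    results: List[str]  # tool outputs gathered so far
    answer: str         # final response once the plan is exhausted


def plan(state: AgentState) -> AgentState:
    # Hypothetical planner: in practice an LLM call that drafts a step list.
    state["plan"] = ["look up entity", "run SPARQL query", "synthesize answer"]
    return state


def execute(state: AgentState) -> AgentState:
    # Hypothetical executor: run the next step with a task-specific tool.
    step = state["plan"].pop(0)
    state["results"].append(f"result of: {step}")
    return state


def replan(state: AgentState) -> AgentState:
    # Hypothetical replanner: revise remaining steps, or draft the answer
    # once the plan is exhausted.
    if not state["plan"]:
        state["answer"] = " / ".join(state["results"])
    return state


def should_continue(state: AgentState) -> str:
    # Loop back to execution while steps remain; otherwise terminate.
    return "execute" if state["plan"] else END


graph = StateGraph(AgentState)
graph.add_node("plan", plan)
graph.add_node("execute", execute)
graph.add_node("replan", replan)
graph.set_entry_point("plan")
graph.add_edge("plan", "execute")
graph.add_edge("execute", "replan")
graph.add_conditional_edges("replan", should_continue)
app = graph.compile()

print(app.invoke({"question": "example Event-QA question",
                  "plan": [], "results": [], "answer": ""}))
```

The one-shot baseline the paper compares against would correspond to a single LLM call with no graph at all, which is why the latency gap reported in the abstract ($\sim$8s vs. $\sim$317s per example) tracks the number of planner, tool, and replanner round trips the loop introduces.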