[2602.16902] LLM-WikiRace: Benchmarking Long-term Planning and Reasoning over Real-World Knowledge Graphs
Summary
The paper presents LLM-WikiRace, a benchmark for evaluating long-term planning and reasoning capabilities in large language models (LLMs) by navigating Wikipedia links to reach target pages.
Why It Matters
As LLMs become increasingly integrated into applications requiring reasoning and planning, LLM-WikiRace provides a critical evaluation tool to identify their limitations and areas for improvement. Understanding these capabilities is essential for advancing AI applications that rely on complex reasoning over real-world knowledge.
Key Takeaways
- LLM-WikiRace benchmarks LLMs on their planning and reasoning abilities using Wikipedia navigation.
- Current top models like Gemini-3 and GPT-5 show strong performance on easier tasks but struggle significantly with harder challenges.
- World knowledge is crucial, but after a certain threshold, planning and reasoning capabilities become more important.
- The benchmark reveals that even advanced models often fail to recover from errors, indicating a need for improved replanning strategies.
- LLM-WikiRace serves as an open platform for ongoing evaluation and development of planning-capable LLMs.
Computer Science > Artificial Intelligence
arXiv:2602.16902 (cs)
[Submitted on 18 Feb 2026]
Title: LLM-WikiRace: Benchmarking Long-term Planning and Reasoning over Real-World Knowledge Graphs
Authors: Juliusz Ziomek, William Bankes, Lorenz Wolf, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic
Abstract: We introduce LLM-WikiRace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-WikiRace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, requiring look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5, which achieve the strongest results on the easy level of the task and demonstrate superhuman performance. Despite this, performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point; beyond this threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level ...