[2602.16902] LLM-WikiRace: Benchmarking Long-term Planning and Reasoning over Real-World Knowledge Graphs



Summary

The paper presents LLM-WikiRace, a benchmark for evaluating long-term planning and reasoning capabilities in large language models (LLMs) by navigating Wikipedia links to reach target pages.

Why It Matters

As LLMs become increasingly integrated into applications requiring reasoning and planning, LLM-WikiRace provides a critical evaluation tool to identify their limitations and areas for improvement. Understanding these capabilities is essential for advancing AI applications that rely on complex reasoning over real-world knowledge.

Key Takeaways

  • LLM-WikiRace benchmarks LLMs on their planning and reasoning abilities using Wikipedia navigation.
  • Frontier models such as Gemini-3, GPT-5, and Claude Opus 4.5 reach superhuman performance on the easy level, but performance drops sharply on hard difficulty: the best model, Gemini-3, succeeds in only 23% of hard games.
  • World knowledge is necessary for success, but only up to a point; beyond that threshold, planning and long-horizon reasoning become the dominant factors.
  • The benchmark reveals that even advanced models often fail to recover from errors, indicating a need for improved replanning strategies.
  • LLM-WikiRace serves as an open platform for ongoing evaluation and development of planning-capable LLMs.

Computer Science > Artificial Intelligence
arXiv:2602.16902 (cs)
[Submitted on 18 Feb 2026]

Title: LLM-WikiRace: Benchmarking Long-term Planning and Reasoning over Real-World Knowledge Graphs

Authors: Juliusz Ziomek, William Bankes, Lorenz Wolf, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic

Abstract: We introduce LLM-Wikirace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-Wikirace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, requiring look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5, which achieve the strongest results on the easy level of the task and demonstrate superhuman performance. Despite this, performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point; beyond this threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level ...
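
To make the task format concrete, here is a minimal sketch of a link-navigation game of the kind the abstract describes. It is not the authors' evaluation harness. The toy graph, the step budget `max_steps`, the rule that an invalid move ends the game, and the names `play_game`, `shortest_path_length`, and `first_unvisited_policy` are all assumptions made for illustration; in a real run the policy function would be replaced by a call to an LLM.

```python
from collections import deque
from typing import Callable, Dict, List, Optional

# Toy link graph (page -> outgoing links). Illustrative only, not real Wikipedia data.
TOY_GRAPH: Dict[str, List[str]] = {
    "Coffee": ["Caffeine", "Brazil"],
    "Caffeine": ["Chemistry", "Coffee"],
    "Brazil": ["South America", "Football"],
    "Chemistry": ["Physics"],
    "South America": ["Brazil"],
    "Football": ["Brazil"],
    "Physics": ["Albert Einstein"],
    "Albert Einstein": [],
}

# A policy sees (current page, target page, available links, path so far)
# and returns the link to follow. An LLM call would sit behind this interface.
Policy = Callable[[str, str, List[str], List[str]], str]


def shortest_path_length(graph: Dict[str, List[str]], source: str, target: str) -> Optional[int]:
    """Breadth-first search: minimum number of clicks, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, depth = queue.popleft()
        for nxt in graph.get(page, []):
            if nxt == target:
                return depth + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None


def play_game(graph: Dict[str, List[str]], source: str, target: str,
              choose_link: Policy, max_steps: int = 10) -> List[str]:
    """Play one game step by step and return the pages visited.

    The game stops on reaching the target, at a dead end, on an invalid
    move, or when the (assumed) step budget is used up.
    """
    path = [source]
    while path[-1] != target and len(path) - 1 < max_steps:
        links = graph.get(path[-1], [])
        if not links:
            break  # dead end: no outgoing links
        choice = choose_link(path[-1], target, links, path)
        if choice not in links:
            break  # invalid move counts as a failed game in this sketch
        path.append(choice)
    return path


def first_unvisited_policy(current: str, target: str,
                           links: List[str], history: List[str]) -> str:
    """Naive stand-in for a model: click the target if it is linked,
    otherwise take the first link not yet visited."""
    if target in links:
        return target
    for link in links:
        if link not in history:
            return link
    return links[0]


path = play_game(TOY_GRAPH, "Coffee", "Albert Einstein", first_unvisited_policy)
solved = path[-1] == "Albert Einstein"
optimal = shortest_path_length(TOY_GRAPH, "Coffee", "Albert Einstein")
print(path, solved, optimal)
```

Comparing the number of clicks taken against the breadth-first optimum gives one simple notion of efficiency, and starting the same naive policy from "Brazil" shows the failure mode of looping between already-visited pages until the budget runs out.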


