[2602.14080] Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

arXiv - AI · 4 min read

Summary

The paper examines the limitations of standard factuality evaluations for large language models (LLMs), arguing that many factual errors stem from a failure to recall knowledge the model has already encoded, rather than from the knowledge being absent altogether.

Why It Matters

Understanding the recall limitations in LLMs is crucial for improving their factual accuracy and performance. This research highlights the need for better methodologies that enhance how models utilize existing knowledge, which can lead to more reliable AI applications.

Key Takeaways

  • Recall issues in LLMs often stem from access limitations rather than missing knowledge.
  • A new benchmark, WikiProfile, helps profile factual knowledge in LLMs.
  • Thinking processes can significantly improve recall and reduce errors.
  • Current LLMs encode a high percentage of facts, but accessing them remains challenging.
  • Future improvements may focus more on recall methods than on scaling model size.

Computer Science > Computation and Language

arXiv:2602.14080 (cs) · Submitted on 15 Feb 2026

Title: Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

Authors: Nitay Calderon, Eyal Ben-David, Zorik Gekhman, Eran Ofek, Gal Yona

Abstract: Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95--98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show ...
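The abstract's fact-level profiling can be pictured as a simple decision procedure over per-fact evidence. The sketch below is illustrative only: the field names and category labels are assumptions for exposition, not the paper's actual implementation or terminology beyond the empty-shelves / lost-keys metaphor.

```python
from dataclasses import dataclass

@dataclass
class FactEvidence:
    """Hypothetical per-fact signals; names are illustrative, not from the paper."""
    encoded: bool          # the model demonstrably stores the fact (e.g., via a probe)
    direct_recall: bool    # answers correctly when asked directly, no extra computation
    thinking_recall: bool  # answers correctly only with inference-time thinking

def profile_fact(ev: FactEvidence) -> str:
    """Classify a fact along the paper's empty-shelves / lost-keys distinction."""
    if not ev.encoded:
        return "not encoded (empty shelf)"
    if ev.direct_recall:
        return "directly recallable"
    if ev.thinking_recall:
        return "recallable only with thinking"
    return "encoded but inaccessible (lost key)"

# Example: a fact the model stores but can surface only with thinking.
print(profile_fact(FactEvidence(encoded=True, direct_recall=False, thinking_recall=True)))
# -> recallable only with thinking
```

Ordering matters here: encoding is checked first, so "lost key" is only reached for facts the model stores but cannot surface by any recall route, which mirrors the paper's claim that such errors are often misattributed to missing knowledge.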

