[2602.20122] NanoKnow: How to Know What Your Language Model Knows
Summary
The article discusses NanoKnow, a benchmark dataset designed to probe how large language models (LLMs) acquire knowledge, built on the nanochat family of models, whose pre-training data is fully open and inspectable.
Why It Matters
Understanding how LLMs encode knowledge is crucial for improving their performance and reliability. NanoKnow provides insights into the influence of pre-training data on model accuracy, which can guide future research and development in AI and machine learning.
Key Takeaways
- NanoKnow benchmarks help identify knowledge sources in LLMs.
- Model accuracy is influenced by the frequency of answers in pre-training data.
- External evidence can enhance model performance, but it complements rather than replaces pre-trained knowledge.
- Irrelevant context degrades accuracy.
- The study emphasizes the importance of transparency in LLM training data.
Computer Science > Computation and Language
arXiv:2602.20122 (cs) [Submitted on 23 Feb 2026]
Title: NanoKnow: How to Know What Your Language Model Knows
Authors: Lingwei Gu, Nour Jedidi, Jimmy Lin
Abstract: How do large language models (LLMs) know what they know? Answering this question has been difficult because pre-training data is often a "black box" -- unknown or inaccessible. The recent release of nanochat -- a family of small LLMs with fully open pre-training data -- addresses this as it provides a transparent view into where a model's parametric knowledge comes from. Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre-training corpus. Using these splits, we can now properly disentangle the sources of knowledge that LLMs rely on when producing an output. To demonstrate NanoKnow's utility, we conduct experiments using eight nanochat checkpoints. Our findings show: (1) closed-book accuracy is strongly influenced by answer frequency in the pre-training data, (2) providing external evidence can mitigate this frequency dependence, (3) even with external evidence, models are more accurate when answers were seen during pre-training, demonstrating...
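The abstract's core construction -- partitioning QA pairs by whether their answer string appears in the pre-training corpus -- can be sketched as follows. This is a minimal, hypothetical illustration using simple case-insensitive substring matching on a toy corpus; the function name `build_splits` and the matching criterion are assumptions, not the authors' actual pipeline.

```python
def build_splits(qa_pairs, corpus_docs):
    """Partition (question, answer) pairs into 'seen' and 'unseen'
    splits based on whether the answer string occurs anywhere in the
    corpus (illustrative substring match, not the paper's method)."""
    corpus_text = "\n".join(doc.lower() for doc in corpus_docs)
    seen, unseen = [], []
    for question, answer in qa_pairs:
        if answer.lower() in corpus_text:
            seen.append((question, answer))
        else:
            unseen.append((question, answer))
    return {"seen": seen, "unseen": unseen}

# Toy illustration with a two-document "corpus"
corpus = ["Paris is the capital of France.",
          "The Nile flows through Egypt."]
qa = [("What is the capital of France?", "Paris"),
      ("Who wrote Hamlet?", "Shakespeare")]
splits = build_splits(qa, corpus)
```

A real pipeline would likely need stricter matching (e.g., normalization or answer-alias handling) to avoid spurious substring hits, but the seen/unseen partition is the key idea behind the benchmark's splits.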