[2602.22787] Probing for Knowledge Attribution in Large Language Models
Summary
This article examines knowledge attribution in large language models (LLMs): determining whether an output is grounded in the user-provided prompt or in knowledge stored in the model's weights, a distinction that bears directly on accuracy and reliability.
Why It Matters
Understanding knowledge attribution in LLMs is crucial for improving their reliability and mitigating issues like hallucinations. This research introduces a method to discern whether outputs are based on user prompts or internal knowledge, which can enhance the trustworthiness of AI systems in critical applications.
Key Takeaways
- Probing can effectively identify the knowledge source behind LLM outputs.
- The AttriWiki data pipeline generates labeled examples for training attribution models.
- Attribution mismatches raise error rates by up to 70%.
- Models may still produce incorrect answers even with accurate attribution.
- Improving knowledge attribution is essential for enhancing model reliability.
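The probing approach in the first takeaway is, at its core, a linear classifier fit on model hidden states. As a minimal, self-contained sketch, the snippet below trains a logistic-regression probe with plain NumPy on synthetic vectors standing in for hidden representations; the two Gaussian clusters, dimensions, and learning rate are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for hidden states: one cluster for context-grounded
# answers (label 1), one for memory-grounded answers (label 0).
d, n = 16, 200
X = np.vstack([
    rng.normal(loc=+0.5, scale=1.0, size=(n, d)),  # "context" examples
    rng.normal(loc=-0.5, scale=1.0, size=(n, d)),  # "memory" examples
])
y = np.concatenate([np.ones(n), np.zeros(n)])

def train_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe (weights w, bias b) by gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(label = context)
        w -= lr * (X.T @ (p - y)) / len(y)        # gradient step on weights
        b -= lr * float(np.mean(p - y))           # gradient step on bias
    return w, b

w, b = train_probe(X, y)
preds = (X @ w + b > 0).astype(float)
accuracy = float((preds == y).mean())
```

In the paper, the probe's inputs would be hidden representations extracted from a specific layer of an LLM rather than synthetic Gaussians; the training loop itself is unchanged.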
Computer Science > Computation and Language
arXiv:2602.22787 (cs) [Submitted on 26 Feb 2026]

Title: Probing for Knowledge Attribution in Large Language Models
Authors: Ivo Brink, Alexander Boer, Dennis Ulmer

Abstract: Large language models (LLMs) often generate fluent but unfounded claims, or hallucinations, which fall into two types: (i) faithfulness violations - misusing user context - and (ii) factuality violations - errors from internal knowledge. Proper mitigation depends on knowing whether a model's answer is based on the prompt or its internal weights. This work focuses on the problem of contributive attribution: identifying the dominant knowledge source behind each output. We show that a probe, a simple linear classifier trained on model hidden representations, can reliably predict contributive attribution. For its training, we introduce AttriWiki, a self-supervised data pipeline that prompts models to recall withheld entities from memory or read them from context, generating labelled examples automatically. Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retraining. Attribution mismatches raise error rates by up to 70%, demonstrating…
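The abstract describes AttriWiki as a self-supervised pipeline that either shows a model an entity in context or withholds it, so the attribution label falls out of the prompt construction itself. The sketch below illustrates that labelling idea with a pair of hypothetical prompt templates; the field names, wording, and example fact are assumptions, not the paper's actual pipeline.

```python
def make_examples(fact, question):
    """Build one context-grounded and one memory-grounded prompt for the same
    question. The supporting fact appears only in the 'context' variant, so
    the attribution label is known by construction (self-supervision)."""
    context_prompt = (
        f"Context: {fact}\n"
        f"Question: {question}\nAnswer:"
    )
    memory_prompt = (
        "Answer from memory; no context is given.\n"
        f"Question: {question}\nAnswer:"
    )
    return [
        {"prompt": context_prompt, "label": "context"},  # model can read the fact
        {"prompt": memory_prompt, "label": "memory"},    # model must recall it
    ]

# Illustrative fact/question pair (not from the paper's dataset).
examples = make_examples(
    fact="Marie Curie won the Nobel Prize in Physics in 1903.",
    question="In what year did Marie Curie win the Nobel Prize in Physics?",
)
```

Hidden states collected while a model answers each variant would then serve as the probe's labelled training data.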