[2602.22787] Probing for Knowledge Attribution in Large Language Models

arXiv - AI · 4 min

Summary

This article explores knowledge attribution in large language models (LLMs), focusing on how to identify the source of information that leads to model outputs, addressing issues of accuracy and reliability.

Why It Matters

Understanding knowledge attribution in LLMs is crucial for improving their reliability and mitigating issues like hallucinations. This research introduces a method to discern whether outputs are based on user prompts or internal knowledge, which can enhance the trustworthiness of AI systems in critical applications.

Key Takeaways

  • Probing can effectively identify the knowledge source behind LLM outputs.
  • The AttriWiki data pipeline generates labeled examples for training attribution models.
  • Attribution mismatches can significantly increase error rates in model outputs.
  • Models may still produce incorrect answers even with accurate attribution.
  • Improving knowledge attribution is essential for enhancing model reliability.

Computer Science > Computation and Language
arXiv:2602.22787 (cs) · Submitted on 26 Feb 2026

Title: Probing for Knowledge Attribution in Large Language Models
Authors: Ivo Brink, Alexander Boer, Dennis Ulmer

Abstract: Large language models (LLMs) often generate fluent but unfounded claims, or hallucinations, which fall into two types: (i) faithfulness violations (misusing user context) and (ii) factuality violations (errors from internal knowledge). Proper mitigation depends on knowing whether a model's answer is based on the prompt or on its internal weights. This work focuses on the problem of contributive attribution: identifying the dominant knowledge source behind each output. We show that a probe, a simple linear classifier trained on model hidden representations, can reliably predict contributive attribution. For its training, we introduce AttriWiki, a self-supervised data pipeline that prompts models to recall withheld entities from memory or read them from context, generating labelled examples automatically. Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, and transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retraining. Attribution mismatches raise error rates by up to 70%, demonstrating…
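The probe the abstract describes is a simple linear classifier over hidden states. The minimal sketch below illustrates the idea with logistic regression on synthetic vectors standing in for transformer hidden representations; the cluster means, dimensions, and labels are all made up for illustration, and the real pipeline would extract activations from an actual model layer rather than simulate them.

```python
# Illustrative linear-probe sketch (NOT the paper's implementation).
# Hidden states are simulated; in practice they would be read from a
# transformer layer, and labels would come from a pipeline like AttriWiki.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
d = 64   # toy hidden dimension (real models use thousands)
n = 500  # examples per class

# Two synthetic clusters of "hidden states": answers grounded in the
# prompt/context vs. answers recalled from internal memory.
context_states = rng.normal(loc=0.3, scale=1.0, size=(n, d))
memory_states = rng.normal(loc=-0.3, scale=1.0, size=(n, d))
X = np.vstack([context_states, memory_states])
y = np.array([1] * n + [0] * n)  # 1 = context, 0 = internal memory

# A linear probe is just a linear classifier fit on the representations.
probe = LogisticRegression(max_iter=1000).fit(X[::2], y[::2])
preds = probe.predict(X[1::2])
macro_f1 = f1_score(y[1::2], preds, average="macro")
print(f"Macro-F1: {macro_f1:.2f}")
```

Because the two synthetic clusters are well separated, the probe scores near-perfect Macro-F1 here; the paper's finding is that comparably strong linear separability exists in real model activations.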

Related Articles

I Asked ChatGPT 500 Questions. Here Are the Ads I Saw Most Often | WIRED

Ads are rolling out across the US on ChatGPT’s free tier. I asked OpenAI's bot 500 questions to see what these ads were like and how they...

Wired - AI · 9 min

Abacus.Ai Claw LLM consumes an incredible amount of credit without any usage :(

Three days ago, I clicked the "Deploy OpenClaw In Seconds" button to get an overview of the new service, but I didn't build any automatio...

Reddit - Artificial Intelligence · 1 min
Google’s Gemini AI app debuts in Hong Kong

Tech giant’s chatbot service tops Apple’s app store chart in the city.

AI Tools & Products · 2 min
Google Launches Gemini Import Tools to Poach Users From Rival AI Apps

Anyone looking to switch their AI assistant will find it surprisingly easy, as it only takes a few steps to move from A to B. This is not...

AI Tools & Products · 4 min
