[2602.14869] Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution

arXiv - AI · 4 min read

Summary

The paper introduces Concept Influence, a method to enhance training data attribution by leveraging interpretability, improving performance and efficiency in identifying data influences on model behavior.

Why It Matters

As large language models evolve, understanding the influence of training data on model behavior is crucial for mitigating unintended outcomes. This research offers a scalable and interpretable approach to data attribution, which can enhance model reliability and transparency, addressing key challenges in AI safety and performance.

Key Takeaways

  • Concept Influence attributes model behavior to semantic directions rather than individual examples.
  • The method is significantly faster than traditional influence functions, enhancing scalability.
  • Empirical validation shows comparable performance to classical methods while improving explainability.
  • Incorporating interpretable structures can lead to better control over model behavior.
  • This approach addresses key issues in training data attribution, particularly in large language models.

Computer Science > Artificial Intelligence · arXiv:2602.14869 (cs) · Submitted on 16 Feb 2026

Title: Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution

Authors: Matthew Kowal, Goncalo Paulo, Louis Jaburi, Tom Tseng, Lev E McKinney, Stefan Heimersheim, Aaron David Tucker, Adam Gleave, Kellin Pelrine

Abstract: As large language models are increasingly trained and fine-tuned, practitioners need methods to identify which training data drive specific behaviors, particularly unintended ones. Training Data Attribution (TDA) methods address this by estimating datapoint influence. Existing approaches such as influence functions are computationally expensive and attribute based on single test examples, which can bias results toward syntactic rather than semantic similarity. To address these issues of scalability and attribution to abstract behavior, we leverage interpretable structures within the model during attribution. First, we introduce Concept Influence, which attributes model behavior to semantic directions (such as linear probes or sparse autoencoder features) rather than individual test examples. Second, we show that simple probe-based attribution methods are first-order approximations of Concept Influence ...
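The abstract describes attributing model behavior to a semantic direction rather than a single test example: a first-order version scores each training point by the alignment between its training-loss gradient and the gradient of a concept score (the activation's projection onto a probe direction). The sketch below illustrates that idea on a toy linear model with NumPy; the model, data, and concept direction `v` are all invented for illustration and are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "model": activations h = W @ x, trained with squared loss.
d_in, d_act, n_train = 4, 3, 5
W = rng.normal(size=(d_act, d_in))

X_train = rng.normal(size=(n_train, d_in))
Y_train = rng.normal(size=(n_train, d_act))

# Hypothetical concept direction v in activation space
# (e.g. a linear-probe or sparse-autoencoder feature direction).
v = rng.normal(size=d_act)
v /= np.linalg.norm(v)

def train_grad(W, x, y):
    # Gradient of the training loss 0.5 * ||W @ x - y||^2 w.r.t. W.
    return np.outer(W @ x - y, x)

def concept_grad(W, x, v):
    # Gradient of the concept score v . (W @ x) w.r.t. W.
    return np.outer(v, x)

# First-order concept influence: dot product of each training point's
# loss gradient with the concept-score gradient at a query input.
x_query = rng.normal(size=d_in)
g_concept = concept_grad(W, x_query, v)
scores = np.array([np.sum(train_grad(W, x, y) * g_concept)
                   for x, y in zip(X_train, Y_train)])
print(scores.shape)  # one influence score per training example
```

A positive score suggests a training point whose gradient update pushes activations along the concept direction; replacing the test-example gradient of a classical influence function with a concept-score gradient is what makes the attribution target semantic rather than example-specific.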
