[2602.14869] Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution
Summary
The paper introduces Concept Influence, a training data attribution method that uses interpretability to attribute model behavior to semantic directions rather than individual test examples, improving both the performance and the efficiency of identifying which training data shape model behavior.
Why It Matters
As large language models evolve, understanding the influence of training data on model behavior is crucial for mitigating unintended outcomes. This research offers a scalable and interpretable approach to data attribution, which can enhance model reliability and transparency, addressing key challenges in AI safety and performance.
Key Takeaways
- Concept Influence attributes model behavior to semantic directions rather than individual examples.
- The method is significantly faster than traditional influence functions, enhancing scalability.
- Empirical validation shows comparable performance to classical methods while improving explainability.
- Incorporating interpretable structures can lead to better control over model behavior.
- This approach addresses key issues in training data attribution, particularly in large language models.
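The probe-based, first-order view of these takeaways can be illustrated with a toy sketch: instead of scoring each training example against a single test example, score it by how much a gradient step on that example would move the model along a concept direction (e.g., a linear probe's weight vector). This is a minimal illustration on a toy linear regression, not the paper's actual implementation; all function names and the random "concept direction" are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a linear model with weights w over d features.
# The paper's method operates on large language models; this toy
# regression only illustrates the attribution idea.
d, n_train = 8, 100
w = rng.normal(size=d)
X_train = rng.normal(size=(n_train, d))
y_train = X_train @ rng.normal(size=d) + 0.1 * rng.normal(size=n_train)

def per_example_grads(w, X, y):
    """Gradient of squared error w.r.t. w for each example, shape (n, d)."""
    residuals = X @ w - y                # (n,)
    return 2.0 * residuals[:, None] * X  # (n, d)

def concept_influence_scores(w, X, y, concept_dir):
    """First-order, probe-based attribution: signed alignment of each
    training example's gradient with a concept direction (no Hessian
    inverse, unlike classical influence functions)."""
    grads = per_example_grads(w, X, y)
    v = concept_dir / np.linalg.norm(concept_dir)
    return grads @ v  # (n,)

# A concept direction would normally come from a linear probe or a
# sparse autoencoder feature; here it is random for illustration.
concept = rng.normal(size=d)
scores = concept_influence_scores(w, X_train, y_train, concept)
top = np.argsort(-np.abs(scores))[:5]  # most concept-influential examples
```

Because the score is a single matrix-vector product per training example, it scales linearly in dataset size, which is the intuition behind the speed advantage over Hessian-based influence functions.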
arXiv:2602.14869 [cs.AI] — Submitted on 16 Feb 2026
Authors: Matthew Kowal, Goncalo Paulo, Louis Jaburi, Tom Tseng, Lev E McKinney, Stefan Heimersheim, Aaron David Tucker, Adam Gleave, Kellin Pelrine
Abstract: As large language models are increasingly trained and fine-tuned, practitioners need methods to identify which training data drive specific behaviors, particularly unintended ones. Training Data Attribution (TDA) methods address this by estimating datapoint influence. Existing approaches such as influence functions are computationally expensive and attribute based on single test examples, which can bias results toward syntactic rather than semantic similarity. To address these issues of scalability and attribution to abstract behavior, we leverage interpretable structures within the model during attribution. First, we introduce Concept Influence, which attributes model behavior to semantic directions (such as linear probes or sparse autoencoder features) rather than individual test examples. Second, we show that simple probe-based attribution methods are first-order approximations of Concept Influence ...