[2602.14869] Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution
Summary
The paper introduces Concept Influence, a training data attribution method that uses interpretability to attribute model behavior to semantic directions rather than individual test examples, improving both the performance and the efficiency of identifying which training data shape model behavior.
Why It Matters
As large language models evolve, understanding the influence of training data on model behavior is crucial for mitigating unintended outcomes. This research offers a scalable and interpretable approach to data attribution, which can enhance model reliability and transparency, addressing key challenges in AI safety and performance.
Key Takeaways
- Concept Influence attributes model behavior to semantic directions rather than individual examples.
- The method is significantly faster than traditional influence functions, enhancing scalability.
- Empirical validation shows comparable performance to classical methods while improving explainability.
- Incorporating interpretable structures can lead to better control over model behavior.
- This approach addresses key issues in training data attribution, particularly in large language models.
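The probe-based, first-order view of these takeaways can be illustrated with a toy sketch: instead of scoring each training example against a single test example, score it by how much a gradient step on that example would move the model along a concept direction (e.g., a linear probe's weight vector). This is a minimal illustration on a toy linear regression, not the paper's actual implementation; all function names and the random "concept direction" are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a linear model with weights w over d features.
# The paper's method operates on large language models; this toy
# regression only illustrates the attribution idea.
d, n_train = 8, 100
w = rng.normal(size=d)
X_train = rng.normal(size=(n_train, d))
y_train = X_train @ rng.normal(size=d) + 0.1 * rng.normal(size=n_train)

def per_example_grads(w, X, y):
    """Gradient of squared error w.r.t. w for each example, shape (n, d)."""
    residuals = X @ w - y                # (n,)
    return 2.0 * residuals[:, None] * X  # (n, d)

def concept_influence_scores(w, X, y, concept_dir):
    """First-order, probe-based attribution: signed alignment of each
    training example's gradient with a concept direction (no Hessian
    inverse, unlike classical influence functions)."""
    grads = per_example_grads(w, X, y)
    v = concept_dir / np.linalg.norm(concept_dir)
    return grads @ v  # (n,)

# A concept direction would normally come from a linear probe or a
# sparse autoencoder feature; here it is random for illustration.
concept = rng.normal(size=d)
scores = concept_influence_scores(w, X_train, y_train, concept)
top = np.argsort(-np.abs(scores))[:5]  # most concept-influential examples
```

Because the score is a single matrix-vector product per training example, it scales linearly in dataset size, which is the intuition behind the speed advantage over Hessian-based influence functions.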
arXiv:2602.14869 [cs.AI] — Submitted on 16 Feb 2026
Authors: Matthew Kowal, Goncalo Paulo, Louis Jaburi, Tom Tseng, Lev E McKinney, Stefan Heimersheim, Aaron David Tucker, Adam Gleave, Kellin Pelrine
Abstract: As large language models are increasingly trained and fine-tuned, practitioners need methods to identify which training data drive specific behaviors, particularly unintended ones. Training Data Attribution (TDA) methods address this by estimating datapoint influence. Existing approaches such as influence functions are computationally expensive and attribute based on single test examples, which can bias results toward syntactic rather than semantic similarity. To address these issues of scalability and attribution to abstract behavior, we leverage interpretable structures within the model during attribution. First, we introduce Concept Influence, which attributes model behavior to semantic directions (such as linear probes or sparse autoencoder features) rather than individual test examples. Second, we show that simple probe-based attribution methods are first-order approximations of Concept Influence ...