
[2512.11108] Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution

arXiv - AI February 20, 2026 4 min read Article

Summary

This article explores the biases inherent in post-hoc feature attribution methods used in language models, revealing how lexical and positional preferences can affect the quality of explanations provided to users.

Why It Matters

Understanding the biases in feature attribution methods is crucial for improving trust in AI systems. This research highlights the variability in explanations and emphasizes the need for better evaluation metrics to enhance the reliability of AI outputs.

Key Takeaways

Post-hoc feature attribution methods can exhibit significant biases.
The paper's comparison of two transformer models finds a trade-off between lexical bias and position bias.
Anomalous explanations are more likely to be biased, affecting user trust.

Computer Science > Computation and Language

arXiv:2512.11108 (cs)

[Submitted on 11 Dec 2025 (v1), last revised 19 Feb 2026 (this version, v2)]

Authors: Jonathan Kamp, Roos Bakker, Dominique Blok

Abstract: Good quality explanations strengthen the understanding of language models and data. Feature attribution methods, such as Integrated Gradient, are a type of post-hoc explainer that can provide token-level insights. However, explanations on the same input may vary greatly due to underlying biases of different methods. Users may be aware of this issue and mistrust their utility, while unaware users may trust them inadequately. In this work, we delve beyond the superficial inconsistencies between attribution methods, structuring their biases through a model- and method-agnostic framework of three evaluation metrics. We systematically assess both lexical and position bias (what and where in the input) for two transformers; first, in a controlled, pseudo-random classification task on artificial data; then, in a semi-controlled causal relation detection task on natural data. We find a trade-off between lexical and position biases in our model compar...
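For readers unfamiliar with the attribution method named in the abstract, here is a minimal sketch of Integrated Gradients on a toy model. It is not the paper's code or its evaluation framework. The logistic-regression "model", the weights, the input, and the all-zeros baseline are all assumptions made for illustration. The sketch scores each input feature by averaging the model's gradient along a straight path from the baseline to the input, then scaling by the feature's distance from the baseline.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, b):
    """Toy stand-in for a classifier: logistic regression over feature values."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def model_grad(x, w, b):
    """Analytic gradient of the toy model with respect to each input feature."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # position along the baseline-to-input path
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        g = model_grad(point, w, b)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    # Scale the averaged gradient by each feature's distance from the baseline.
    return [(xi - bi) * gi for xi, bi, gi in zip(x, baseline, avg_grad)]


# Hypothetical weights, input, and baseline (illustration only).
w = [2.0, -1.0, 0.5]
b = 0.1
x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, w, b)

# Completeness check: attributions should sum to f(x) - f(baseline).
completeness_gap = abs(
    sum(attributions) - (model(x, w, b) - model(baseline, w, b))
)
print(attributions, completeness_gap)
```

In this toy setup the attribution signs follow the weight signs, and the attributions sum (up to numerical error) to the change in model output between baseline and input. With a real language model the features would be token embeddings, and the choice of baseline is itself a design decision that can change the resulting explanation.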

