[2512.11108] Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution
Summary
This article examines the biases of post-hoc feature attribution methods applied to language models, showing how lexical and positional preferences (what in the input is highlighted, and where) shape the explanations presented to users.
Why It Matters
Explanations of the same input can vary greatly between attribution methods, so users who notice this may distrust them, while users who do not may trust them more than is warranted. The paper structures these biases through a model- and method-agnostic framework of three evaluation metrics, giving a more systematic basis for judging how reliable an explanation is.
Key Takeaways
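- Post-hoc feature attribution methods can give widely differing explanations for the same input because of method-specific biases.
- The model comparison shows a trade-off between lexical bias (what in the input) and position bias (where in the input).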
- Anomalous explanations are more likely to be biased, affecting user trust.
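To make the two notions of bias concrete, the following Python sketch aggregates token-level attribution scores in two ways: by position (where the attribution mass lands) and by token (what receives it). This is an illustrative assumption, not the paper's three metrics, which are not defined in this excerpt. The function names, the normalization by absolute score, and the toy inputs are all hypothetical.

```python
from collections import defaultdict


def normalize(scores):
    """Scale absolute attribution scores so they sum to 1 for one input."""
    total = sum(abs(s) for s in scores)
    if total == 0:
        return [0.0 for _ in scores]
    return [abs(s) / total for s in scores]


def position_profile(attributions):
    """Mean normalized attribution mass at each token position ("where")."""
    sums, counts = defaultdict(float), defaultdict(int)
    for _tokens, scores in attributions:
        for pos, mass in enumerate(normalize(scores)):
            sums[pos] += mass
            counts[pos] += 1
    return {pos: sums[pos] / counts[pos] for pos in sorted(sums)}


def lexical_profile(attributions):
    """Mean normalized attribution mass per token type ("what")."""
    sums, counts = defaultdict(float), defaultdict(int)
    for tokens, scores in attributions:
        for tok, mass in zip(tokens, normalize(scores)):
            sums[tok] += mass
            counts[tok] += 1
    return {tok: sums[tok] / counts[tok] for tok in sums}


# Hypothetical attributions: (tokens, one score per token) for two inputs.
examples = [
    (["the", "rain", "caused", "floods"], [0.1, 0.2, 0.6, 0.1]),
    (["floods", "caused", "the", "delay"], [0.1, 0.6, 0.1, 0.2]),
]

pos = position_profile(examples)
lex = lexical_profile(examples)
print("by position:", pos)
print("by token:", lex)
```

In this toy data the token "caused" receives the largest share of attribution in both inputs even though its position changes, which is the kind of pattern a lexical preference would produce. A position preference would instead show up as one index dominating the position profile regardless of which token occupies it.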
Computer Science > Computation and Language
arXiv:2512.11108 (cs)
[Submitted on 11 Dec 2025 (v1), last revised 19 Feb 2026 (this version, v2)]
Title: Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution
Authors: Jonathan Kamp, Roos Bakker, Dominique Blok
Abstract: Good quality explanations strengthen the understanding of language models and data. Feature attribution methods, such as Integrated Gradient, are a type of post-hoc explainer that can provide token-level insights. However, explanations on the same input may vary greatly due to underlying biases of different methods. Users may be aware of this issue and mistrust their utility, while unaware users may trust them inadequately. In this work, we delve beyond the superficial inconsistencies between attribution methods, structuring their biases through a model- and method-agnostic framework of three evaluation metrics. We systematically assess both lexical and position bias (what and where in the input) for two transformers; first, in a controlled, pseudo-random classification task on artificial data; then, in a semi-controlled causal relation detection task on natural data. We find a trade-off between lexical and position biases in our model compar...