[2508.00576] MultiSHAP: A Shapley-Based Framework for Explaining Cross-Modal Interactions in Multimodal AI Models
Summary
MultiSHAP is a Shapley-based framework for explaining cross-modal interactions in multimodal AI models, improving their interpretability and trustworthiness in downstream applications.
Why It Matters
As multimodal AI models become integral to a growing range of applications, understanding their decision-making processes is crucial. MultiSHAP addresses this interpretability challenge by quantifying how different modalities interact, which is essential for deploying AI in high-stakes environments.
Key Takeaways
- MultiSHAP offers a model-agnostic framework for interpreting multimodal AI models.
- It quantifies interactions between visual and textual elements, revealing synergistic effects.
- The framework provides both instance-level and dataset-level explanations.
- MultiSHAP is applicable to both open- and closed-source models, enhancing its utility.
- Real-world case studies demonstrate the practical benefits of using MultiSHAP.
Computer Science > Artificial Intelligence
arXiv:2508.00576 (cs). Submitted on 1 Aug 2025 (v1); last revised 16 Feb 2026 (this version, v2).
Authors: Zhanliang Wang, Kai Wang
Abstract
Multimodal AI models have achieved impressive performance in tasks that require integrating information from multiple modalities, such as vision and language. However, their "black-box" nature poses a major barrier to deployment in high-stakes applications where interpretability and trustworthiness are essential. How to explain cross-modal interactions in multimodal AI models remains a major challenge. While existing model explanation methods, such as attention maps and Grad-CAM, offer coarse insights into cross-modal relationships, they cannot precisely quantify the synergistic effects between modalities, and they are limited to open-source models with accessible internal weights. Here we introduce MultiSHAP, a model-agnostic interpretability framework that leverages the Shapley Interaction Index to attribute multimodal predictions to pairwise interactions between fine-grained visual and textual elements (such as image patches and text tokens), while being applicable to both open- and closed-source models.
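To make the core idea concrete, here is a minimal sketch of the pairwise Shapley Interaction Index that the framework builds on. This is not the paper's implementation: the toy `toy_score` value function, the four-player setup (two "image patches" and two "text tokens"), and the brute-force enumeration are all illustrative assumptions; MultiSHAP would query a real multimodal model with masked inputs rather than an analytic score.

```python
from itertools import combinations
from math import factorial

def shapley_interaction_index(value_fn, n, i, j):
    """Exact pairwise Shapley Interaction Index for players i and j,
    computed by brute-force enumeration over all coalitions S of N \\ {i, j}.
    value_fn maps a set of player indices to a scalar model score."""
    others = [p for p in range(n) if p not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        # Shapley Interaction Index weight: |S|! (n - |S| - 2)! / (n - 1)!
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for S in combinations(others, size):
            S = set(S)
            # Second-order difference: the part of the score that only
            # appears when i and j are present together.
            delta = (value_fn(S | {i, j}) - value_fn(S | {i})
                     - value_fn(S | {j}) + value_fn(S))
            total += weight * delta
    return total

# Hypothetical toy "model": players 0 and 1 stand for image patches,
# players 2 and 3 for text tokens. The score has an additive part plus
# a cross-modal synergy between patch 0 and token 2.
def toy_score(coalition):
    score = 0.1 * len(coalition)           # additive contribution per element
    if 0 in coalition and 2 in coalition:  # cross-modal synergy term
        score += 0.5
    return score

sii_02 = shapley_interaction_index(toy_score, 4, 0, 2)  # synergistic pair
sii_13 = shapley_interaction_index(toy_score, 4, 1, 3)  # independent pair
```

In this toy setting the additive terms cancel in the second-order difference, so the index recovers exactly the 0.5 synergy for the (patch 0, token 2) pair and 0 for the independent pair, illustrating how synergistic cross-modal effects are isolated from purely unimodal contributions.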