[2602.20878] Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs
Summary
This article introduces Vision-Language Causal Graphs (VLCGs), a structured representation designed to strengthen causal reasoning in Large Vision-Language Models (LVLMs), which often rely on spurious correlations rather than genuine causal understanding.
Why It Matters
Understanding and improving causal reasoning in LVLMs is crucial for AI systems that must interpret visual and textual data accurately. This research provides a framework for diagnosing and enhancing such reasoning, which matters for applications where safety and reliability depend on models attending to the right evidence.
Key Takeaways
- Current LVLMs often misidentify causally relevant information.
- VLCGs provide a structured representation for better causal reasoning.
- The ViLCaR benchmark improves evaluation of causal attribution and inference.
- Injecting structured relevance information enhances model performance.
- Limitations in LVLMs stem from insufficient structural guidance, not reasoning capacity.
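One takeaway above is that the ViLCaR benchmark evaluates *which* information a model treats as causally relevant, not just whether its final answer is correct. A graph-aligned relevance metric of this kind can be sketched as a set-based F1 score between the nodes a model flags as relevant and a gold-standard set. This is an illustrative sketch, not the paper's actual metric; the function name and inputs are hypothetical.

```python
def relevance_f1(predicted, gold):
    """Set-based F1 between nodes a model marks as causally relevant
    and the gold-standard relevant nodes (illustrative sketch)."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)            # correctly identified nodes
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A model that answers correctly but attends to the wrong objects
# can score high on answer accuracy yet low on relevance F1.
print(relevance_f1({"dog", "leash"}, {"dog", "gate"}))  # → 0.5
```

Measuring relevance separately from answer accuracy is what lets the benchmark distinguish failures of reasoning capacity from failures of relevance identification.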
Computer Science > Artificial Intelligence
arXiv:2602.20878 (cs) [Submitted on 24 Feb 2026]
Title: Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs
Authors: Dhita Putri Pratama, Soyeon Caren Han, Yihao Ding
Abstract: Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning. Existing evaluations primarily assess the correctness of the answers, making it unclear whether failures arise from limited reasoning capability or from misidentifying causally relevant information. We introduce Vision-Language Causal Graphs (VLCGs), a structured, query-conditioned representation that explicitly encodes causally relevant objects, attributes, relations, and scene-grounded assumptions. Building on this representation, we present ViLCaR, a diagnostic benchmark comprising tasks for Causal Attribution, Causal Inference, and Question Answering, along with graph-aligned evaluation metrics that assess relevance identification beyond final answer accuracy. Experiments on state-of-the-art LVLMs show that injecting structured relevance information significantly improves attribution and inference consistency compared to zero-shot and standard in-context learning.
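The abstract describes a VLCG as a query-conditioned graph encoding causally relevant objects, attributes, relations, and scene-grounded assumptions, which is then injected into the model as structured relevance information. A minimal sketch of what such a representation might look like, and how it could be serialized into a prompt, is shown below. All class and field names here are hypothetical illustrations, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class CausalEdge:
    """A directed causal relation between two scene elements."""
    source: str
    target: str
    relation: str  # e.g. "pushes", "blocks", "falls off"

@dataclass
class VLCG:
    """Hypothetical query-conditioned causal graph for one question."""
    query: str
    objects: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)   # object -> [attributes]
    edges: list = field(default_factory=list)
    assumptions: list = field(default_factory=list)  # scene-grounded assumptions

    def to_prompt(self) -> str:
        """Serialize the graph into text an LVLM prompt could include."""
        lines = [f"Question: {self.query}", "Causally relevant context:"]
        for obj in self.objects:
            attrs = ", ".join(self.attributes.get(obj, []))
            lines.append(f"- {obj}" + (f" ({attrs})" if attrs else ""))
        for e in self.edges:
            lines.append(f"- {e.source} {e.relation} {e.target}")
        for a in self.assumptions:
            lines.append(f"- assume: {a}")
        return "\n".join(lines)

g = VLCG(
    query="Why did the glass break?",
    objects=["glass", "table edge", "cat"],
    attributes={"glass": ["fragile"]},
    edges=[CausalEdge("cat", "glass", "pushes"),
           CausalEdge("glass", "table edge", "falls off")],
    assumptions=["the floor is hard"],
)
print(g.to_prompt())
```

Serializing only the causally relevant subgraph, rather than a full scene description, is one plausible reading of how "injecting structured relevance information" constrains the model's attention during attribution and inference.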