[2604.06250] DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs
Computer Science > Computer Vision and Pattern Recognition
arXiv:2604.06250 (cs)
[Submitted on 6 Apr 2026]

Title: DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs
Authors: Dikshant Kukreja, Kshitij Sah, Karan Goyal, Mukesh Mohania, Vikram Goyal

Abstract: When asked to describe a molecular diagram, a Vision-Language Model correctly identifies "a benzene ring with an -OH group." When asked to reason about the same image, it answers incorrectly. The model can see but it cannot think about what it sees. We term this the perception-integration gap: a failure where visual information is successfully extracted but lost during downstream reasoning, invisible to single-configuration benchmarks that conflate perception with integration under one accuracy number. To systematically expose such failures, we introduce DISSECT, a 12,000-question diagnostic benchmark spanning Chemistry (7,000) and Biology (5,000). Every question is evaluated under five input modes -- Vision+Text, Text-Only, Vision-Only, Human Oracle, and a novel Model Oracle in which the VLM first verbalizes the image and then reasons from its own description -- yielding diagnostic gaps that decompose performance into language-prior exploitation, visual extraction, perception fidelity, and integration effecti...
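
The five-mode protocol can be illustrated with a minimal Python sketch. This is not the authors' code. The `Question` fields, the `Model` callable, the `DESCRIBE_PROMPT` string, and the way each mode's input is assembled are all assumptions made for illustration. In particular, the visible text does not say how Vision-Only or Human Oracle inputs are built, and it does not define the diagnostic gaps, so `gap` below is only a generic accuracy difference.

```python
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple


class Question(NamedTuple):
    """Hypothetical record for one benchmark item (field names are assumptions)."""
    text: str               # the question text
    image: str              # placeholder for image data or a file path
    human_description: str  # human-written description of the image
    answer: str             # gold answer


# Hypothetical model interface: (text, image) -> answer; either input may be None.
Model = Callable[[Optional[str], Optional[str]], str]

# Assumed prompt used to make the model verbalize the image.
DESCRIBE_PROMPT = "Describe the image."


def build_inputs(q: Question, model: Model) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each of the five mode names to a (text, image) input pair."""
    # Model Oracle, step 1: the model verbalizes the image itself.
    own_description = model(DESCRIBE_PROMPT, q.image)
    return {
        "vision+text": (q.text, q.image),
        "text_only": (q.text, None),
        # Assumption: the image is passed without the question text.
        "vision_only": (None, q.image),
        # Assumption: a human description stands in for the image.
        "human_oracle": (q.human_description + "\n" + q.text, None),
        # Model Oracle, step 2: reason from the model's own description.
        "model_oracle": (own_description + "\n" + q.text, None),
    }


def accuracy_by_mode(questions: List[Question], model: Model) -> Dict[str, float]:
    """Exact-match accuracy (case-insensitive) for every input mode."""
    correct: Dict[str, int] = {}
    for q in questions:
        for mode, (text, image) in build_inputs(q, model).items():
            prediction = model(text, image)
            hit = prediction.strip().lower() == q.answer.strip().lower()
            correct[mode] = correct.get(mode, 0) + int(hit)
    return {mode: count / len(questions) for mode, count in correct.items()}


def gap(acc: Dict[str, float], mode_a: str, mode_b: str) -> float:
    """Generic diagnostic gap: accuracy under mode_a minus accuracy under mode_b."""
    return acc[mode_a] - acc[mode_b]
```

Under these assumptions, a model that describes the image correctly but answers wrongly when given the image and question together would score higher in the `model_oracle` mode than in `vision+text`, which is the pattern the abstract calls the perception-integration gap.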