[2603.06665] Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine
Computer Science > Computer Vision and Pattern Recognition

arXiv:2603.06665 (cs)

[Submitted on 2 Mar 2026 (v1), last revised 10 Apr 2026 (this version, v2)]

Title: Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

Authors: Yuan Wu, Zongxian Yang, Jiayu Qian, Songpan Gao, Guanxing Chen, Qiankun Li, Yu-An Huang, Zhi-An Huang

Abstract: Large vision-language models (VLMs) often benefit from chain-of-thought (CoT) prompting in general domains, yet its efficacy in medical vision-language tasks remains underexplored. We report a counter-intuitive trend: on medical visual question answering, CoT frequently underperforms direct answering (DirA) across general-purpose and medical-specific models. We attribute this to a \emph{medical perception bottleneck}: subtle, domain-specific cues can weaken visual grounding, and CoT may compound early perceptual uncertainty rather than correct it. To probe this hypothesis, we introduce two training-free, inference-time grounding interventions: (i) \emph{perception anchoring} via region-of-interest cues and (ii) \emph{description grounding} via high-quality textual guidance. Across multiple benchmarks and model families, these interventions improve accuracy, mitigate CoT degradation, and in several settings reverse the CoT--DirA inversion. Our findings sugge...
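The abstract contrasts four prompting regimes: direct answering, chain-of-thought, and the two grounding interventions. A minimal sketch of how such prompts might be assembled is below; the exact templates and the ROI box format are assumptions for illustration, since the abstract does not publish them.

```python
# Hedged sketch: training-free grounding interventions expressed as prompt
# construction. The prompt wording and the (x1, y1, x2, y2) `roi` interface
# are illustrative assumptions, not the paper's published templates.

def direct_answer_prompt(question: str) -> str:
    """Baseline (DirA): request the answer with no reasoning chain."""
    return f"{question}\nAnswer with a single option and no explanation."

def cot_prompt(question: str) -> str:
    """Chain-of-thought: elicit step-by-step reasoning before the answer."""
    return f"{question}\nLet's think step by step, then give the final answer."

def perception_anchored_prompt(question: str,
                               roi: tuple[int, int, int, int]) -> str:
    """Intervention (i): anchor perception on a region-of-interest cue.

    `roi` is an assumed (x1, y1, x2, y2) pixel box supplied at inference
    time; the model is asked to describe that region before reasoning.
    """
    x1, y1, x2, y2 = roi
    return (
        f"Focus on the image region with corners ({x1},{y1}) and ({x2},{y2}).\n"
        f"Describe what you observe in that region, then answer:\n{question}"
    )

def description_grounded_prompt(question: str, description: str) -> str:
    """Intervention (ii): ground reasoning in a high-quality description."""
    return (
        f"Reference description of the image: {description}\n"
        f"Using the description together with the image, answer:\n{question}"
    )
```

Both interventions leave the model weights untouched; they only change the inference-time input, which is what makes them training-free.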