[2603.06665] Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine
Computer Science > Computer Vision and Pattern Recognition

arXiv:2603.06665 (cs)

[Submitted on 2 Mar 2026 (v1), last revised 10 Apr 2026 (this version, v2)]

Title: Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

Authors: Yuan Wu, Zongxian Yang, Jiayu Qian, Songpan Gao, Guanxing Chen, Qiankun Li, Yu-An Huang, Zhi-An Huang

Abstract: Large vision-language models (VLMs) often benefit from chain-of-thought (CoT) prompting in general domains, yet its efficacy in medical vision-language tasks remains underexplored. We report a counter-intuitive trend: on medical visual question answering, CoT frequently underperforms direct answering (DirA) across general-purpose and medical-specific models. We attribute this to a \emph{medical perception bottleneck}: subtle, domain-specific cues can weaken visual grounding, and CoT may compound early perceptual uncertainty rather than correct it. To probe this hypothesis, we introduce two training-free, inference-time grounding interventions: (i) \emph{perception anchoring} via region-of-interest cues and (ii) \emph{description grounding} via high-quality textual guidance. Across multiple benchmarks and model families, these interventions improve accuracy, mitigate CoT degradation, and in several settings reverse the CoT--DirA inversion. Our findings sugge...
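The abstract contrasts four prompting regimes: direct answering, chain-of-thought, and the two grounding interventions. A minimal sketch of how such prompts might be assembled is below; the exact templates and the ROI box format are assumptions for illustration, since the abstract does not publish them.

```python
# Hedged sketch: training-free grounding interventions expressed as prompt
# construction. The prompt wording and the (x1, y1, x2, y2) `roi` interface
# are illustrative assumptions, not the paper's published templates.

def direct_answer_prompt(question: str) -> str:
    """Baseline (DirA): request the answer with no reasoning chain."""
    return f"{question}\nAnswer with a single option and no explanation."

def cot_prompt(question: str) -> str:
    """Chain-of-thought: elicit step-by-step reasoning before the answer."""
    return f"{question}\nLet's think step by step, then give the final answer."

def perception_anchored_prompt(question: str,
                               roi: tuple[int, int, int, int]) -> str:
    """Intervention (i): anchor perception on a region-of-interest cue.

    `roi` is an assumed (x1, y1, x2, y2) pixel box supplied at inference
    time; the model is asked to describe that region before reasoning.
    """
    x1, y1, x2, y2 = roi
    return (
        f"Focus on the image region with corners ({x1},{y1}) and ({x2},{y2}).\n"
        f"Describe what you observe in that region, then answer:\n{question}"
    )

def description_grounded_prompt(question: str, description: str) -> str:
    """Intervention (ii): ground reasoning in a high-quality description."""
    return (
        f"Reference description of the image: {description}\n"
        f"Using the description together with the image, answer:\n{question}"
    )
```

Both interventions leave the model weights untouched; they only change the inference-time input, which is what makes them training-free.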