[2603.02556] Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.02556 (cs)
[Submitted on 3 Mar 2026]

Title: Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs
Authors: Zhiyu Pan, Yizheng Wu, Jiashen Hua, Junyi Feng, Shaotian Yan, Bing Deng, Zhiguo Cao, Jieping Ye

Abstract: Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual...
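
The abstract states that contrastive pairs are curated "according to multi-modal similarity" but does not specify the procedure. Below is a minimal sketch of one plausible reading: score candidate pairs by a weighted combination of image-embedding and question-embedding cosine similarity and keep each sample's nearest non-identical neighbor. The weighting, the top-k selection, and the use of precomputed embeddings are illustrative assumptions, not the paper's stated method.

import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def curate_contrastive_pairs(img_emb: np.ndarray, txt_emb: np.ndarray,
                             alpha: float = 0.5, top_k: int = 1):
    """For each VQA sample, pick the most similar *other* sample, where
    similarity is a weighted sum of image and question embedding similarity.
    (alpha and top_k are illustrative hyperparameters, not from the paper.)"""
    sim = alpha * cosine_sim(img_emb, img_emb) + (1 - alpha) * cosine_sim(txt_emb, txt_emb)
    np.fill_diagonal(sim, -np.inf)  # exclude pairing a sample with itself
    partners = np.argsort(-sim, axis=1)[:, :top_k]
    return [(i, int(j)) for i in range(sim.shape[0]) for j in partners[i]]

# Example with random stand-in embeddings; real use would encode images and
# questions with a pretrained vision-language encoder (e.g., a CLIP-style model).
rng = np.random.default_rng(0)
pairs = curate_contrastive_pairs(rng.normal(size=(8, 512)), rng.normal(size=(8, 512)))
print(pairs[:3])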