[2604.03307] V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
Computer Science > Computer Vision and Pattern Recognition
arXiv:2604.03307 (cs)
[Submitted on 31 Mar 2026]

Title: V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
Authors: Jiazhou Zhou, Yucheng Chen, Hongyang Li, Qing Jiang, Hu Zhou, Ying-Cong Chen, Lei Zhang

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success, yet they remain prone to perception-related hallucinations in fine-grained tasks. This vulnerability arises from a fundamental limitation: their reasoning is largely restricted to the language domain, treating visual input as a static, reasoning-agnostic preamble rather than a dynamic participant. Consequently, current models act as passive observers, unable to re-examine visual details to ground their evolving reasoning states. To overcome this, we propose V-Reflection, a framework that transforms the MLLM into an active interrogator through a "think-then-look" visual reflection mechanism. During reasoning, latent states function as dynamic probes that actively interrogate the visual feature space, grounding each reasoning step in task-critical evidence. Our approach employs a two-stage distillation strategy. First, the Box-Guided Compression Module (BCM) establishes stable pixel-to-latent targets through explicit spatial groundin...
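The abstract gives no equations, but one plausible reading of "latent states function as dynamic probes that actively interrogate the visual feature space" is a scaled dot-product attention step in which the current latent reasoning state is the query and the visual features are keys/values. The sketch below is an illustrative assumption, not the paper's actual implementation; the function name `visual_probe` and all shapes are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def visual_probe(latent, visual_feats):
    """One hypothetical 'think-then-look' step.

    The latent reasoning state (query) attends over visual features
    (keys = values) via scaled dot-product attention, and the attended
    mixture is returned as visual evidence grounding the next step.
    """
    d = len(latent)
    scores = [dot(latent, v) / math.sqrt(d) for v in visual_feats]
    weights = softmax(scores)
    # Weighted sum of visual features = retrieved evidence vector.
    dim = len(visual_feats[0])
    return [sum(w * v[i] for w, v in zip(weights, visual_feats))
            for i in range(dim)]
```

As a usage example, a latent state aligned with the first visual token pulls the evidence vector toward that token: `visual_probe([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])` puts more weight on the first feature than the second.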