[2602.11858] Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Summary
The paper presents Region-to-Image Distillation, an approach that internalizes the benefits of inference-time zooming into a single forward pass, improving fine-grained multimodal perception in MLLMs.
Why It Matters
This research addresses a key limitation of existing multimodal models: fine-grained perception, which is crucial for applications requiring detailed visual understanding. By removing the repeated tool calls and re-encoding of inference-time zooming, it makes MLLMs both faster and more accurate on such tasks, and thus more practical in real-world scenarios.
Key Takeaways
- Introduces Region-to-Image Distillation to improve fine-grained perception.
- Eliminates the need for repeated zooming during inference, reducing latency.
- Presents ZoomBench, a benchmark for evaluating fine-grained perception.
- Demonstrates improved performance on multiple perception benchmarks.
- Discusses the applicability of 'Thinking-with-Images' in various contexts.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.11858 (cs)
[Submitted on 12 Feb 2026 (v1), last revised 16 Feb 2026 (this version, v2)]
Title: Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Authors: Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li, Huijia Zhu, Weiqiang Wang, Linghe Kong, Yue Wang, Zhuosheng Zhang, Weiran Huang
Abstract: Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out of regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model impro...
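The data-generation step the abstract describes can be sketched roughly as follows: query a strong teacher on a zoomed micro-crop, then attach the resulting QA pair to the full image so the student learns to answer without zooming at inference time. This is a minimal illustrative sketch, not the paper's implementation; all names (`Box`, `make_distillation_examples`, the stub teacher) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (left, top, right, bottom) crop coordinates in pixels.
Box = Tuple[int, int, int, int]

@dataclass
class VQAExample:
    image_id: str   # the FULL image the student will be trained on
    question: str
    answer: str     # produced by the teacher from the zoomed crop

def make_distillation_examples(
    image_id: str,
    regions: List[Tuple[Box, str]],           # (crop box, question) pairs
    teacher: Callable[[str, Box, str], str],  # answers from the cropped view
) -> List[VQAExample]:
    """Let the teacher answer each question from its micro-crop, then
    ground the supervision on the full image for student training."""
    examples = []
    for box, question in regions:
        answer = teacher(image_id, box, question)
        examples.append(VQAExample(image_id, question, answer))
    return examples

# Stub standing in for a strong MLLM reading the zoomed crop.
def stub_teacher(image_id: str, box: Box, question: str) -> str:
    return f"answer-for-{box}"

data = make_distillation_examples(
    "street_scene.jpg",
    [((120, 40, 180, 90), "What is written on the sign?")],
    stub_teacher,
)
print(data[0].image_id)  # prints "street_scene.jpg": supervision is on the full image
```

The key design point captured here is that the crop box is consumed only by the teacher; the stored example pairs the question and answer with the full image, which is what lets the student internalize zooming into a single forward pass.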