[2602.11858] Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

Summary

The paper presents Region-to-Image Distillation, a novel approach to enhance fine-grained multimodal perception in MLLMs by internalizing zooming benefits into a single inference pass.

Why It Matters

This research addresses the limitations of existing multimodal models in fine-grained perception tasks, which are crucial for applications requiring detailed visual understanding. By optimizing the inference process, it enhances the efficiency and effectiveness of MLLMs, making them more applicable in real-world scenarios.

Key Takeaways

  • Introduces Region-to-Image Distillation to improve fine-grained perception.
  • Eliminates the need for repeated zooming during inference, reducing latency.
  • Presents ZoomBench, a benchmark for evaluating fine-grained perception.
  • Demonstrates improved performance on multiple perception benchmarks.
  • Discusses the applicability of 'Thinking-with-Images' in various contexts.

Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.11858 (cs)
[Submitted on 12 Feb 2026 (v1), last revised 16 Feb 2026 (this version, v2)]

Title: Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Authors: Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li, Huijia Zhu, Weiqiang Wang, Linghe Kong, Yue Wang, Zhuosheng Zhang, Weiran Huang

Abstract: Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out of regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model impro...
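The abstract's data pipeline can be sketched in a few lines: zoom into a micro-cropped region, let a teacher model write a QA pair about that region, then attach the QA to the *full* image so the student learns to answer without zooming at inference time. This is a minimal illustrative sketch, not the authors' released code; the `Box`, `crop`, and placeholder teacher below are assumptions for demonstration.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned crop region in pixel coordinates (illustrative)."""
    x0: int
    y0: int
    x1: int
    y1: int

def crop(image, box):
    """Zoom in: extract the micro-cropped region from a 2D image (list of rows)."""
    return [row[box.x0:box.x1] for row in image[box.y0:box.y1]]

def placeholder_teacher(region):
    """Stand-in for a strong teacher MLLM that generates a QA pair
    grounded in the cropped region (here, a trivial rule-based probe)."""
    h, w = len(region), len(region[0])
    return ("What value appears at the center of the highlighted region?",
            region[h // 2][w // 2])

def build_distillation_sample(image, box, teacher):
    """Region-to-Image Distillation, as described in the abstract:
    the QA is generated from the crop, but the training sample pairs it
    with the FULL image, so the student internalizes zooming."""
    region = crop(image, box)
    question, answer = teacher(region)
    return {"image": image, "question": question, "answer": answer}

# Toy usage: a 4x4 "image" of integers and one region of interest.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
sample = build_distillation_sample(image, Box(1, 1, 3, 3), placeholder_teacher)
```

At training time the student sees only `sample["image"]` plus the question and answer; the region that justified the answer is never given explicitly, which is what pushes fine-grained evidence into a single forward pass.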

Related Articles

  • [2603.23966] Policy-Guided Threat Hunting: An LLM-enabled Framework with Splunk SOC Triage
  • [2603.16790] InCoder-32B: Code Foundation Model for Industrial Scenarios
  • [2603.16430] EngGPT2: Sovereign, Efficient and Open Intelligence
  • [2603.11066] Exploring Collatz Dynamics with Human-LLM Collaboration