[2604.06912] Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
Computer Science > Computer Vision and Pattern Recognition

arXiv:2604.06912 (cs) [Submitted on 8 Apr 2026]

Title: Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

Authors: Yuheng Shi, Xiaohuan Pei, Linfeng Wen, Minjing Dong, Chang Xu

Abstract: Multimodal Large Language Models (MLLMs) require high-resolution visual inputs for fine-grained tasks such as document understanding and dense scene perception. However, current global resolution-scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN employs a fully self-supervised …
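The page provides only the abstract, so the following is a minimal PyTorch sketch of the coarse-to-fine routing it describes, not the paper's implementation. Every name here is a hypothetical assumption: DynamicGate, propose_roi, answer_tokens, the encoder callable, the keep ratio, and the threshold tau are illustrative, and the toy cosine-similarity proposal merely stands in for the paper's SD-RPN; the consistency-aware routing-label generation is not sketched at all.

import torch
import torch.nn as nn


class DynamicGate(nn.Module):
    """Hypothetical lightweight gating head: scores, from pooled coarse
    features, whether the query can be answered without a high-res pass."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, 1)
        )

    def forward(self, coarse_tokens: torch.Tensor) -> torch.Tensor:
        # coarse_tokens: (B, N, D) -> (B,) probability that coarse suffices
        return torch.sigmoid(self.mlp(coarse_tokens.mean(dim=1))).squeeze(-1)


def propose_roi(patch_feats: torch.Tensor, query_feat: torch.Tensor,
                grid: int, keep: float = 0.1):
    """Toy stand-in for the SD-RPN: rank patches by cosine similarity to the
    query embedding and return the bounding box (in grid cells) covering the
    top-scoring patches."""
    sim = torch.cosine_similarity(patch_feats, query_feat[None, :], dim=-1)
    k = max(1, int(keep * sim.numel()))
    idx = sim.topk(k).indices                     # indices of query-relevant patches
    rows, cols = idx // grid, idx % grid
    return (rows.min().item(), cols.min().item(),
            rows.max().item() + 1, cols.max().item() + 1)


def answer_tokens(image_lr, image_hr, query_feat, encoder, gate, tau=0.5):
    """Coarse-to-fine routing: bypass when the gate is confident, otherwise
    crop the proposed RoI out of the high-res image and encode only that."""
    coarse = encoder(image_lr)                    # (1, N, D) coarse global tokens
    if gate(coarse).item() >= tau:                # gate says coarse view suffices
        return coarse
    grid = int(coarse.shape[1] ** 0.5)            # assume a square patch grid
    r0, c0, r1, c1 = propose_roi(coarse[0], query_feat, grid)
    H, W = image_hr.shape[-2:]                    # map grid cells to pixels
    y0, y1 = r0 * H // grid, r1 * H // grid
    x0, x1 = c0 * W // grid, c1 * W // grid
    fine = encoder(image_hr[..., y0:y1, x0:x1])   # high-res tokens for the RoI only
    return torch.cat([coarse, fine], dim=1)       # fused coarse + fine token stream

The intent mirrors the abstract's argument: the gate keeps the common case to a single low-resolution pass, and the RoI crop bounds how many high-resolution tokens ever reach the quadratic self-attention.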