[2512.03794] AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Computer Science > Computer Vision and Pattern Recognition

arXiv:2512.03794 (cs)

[Submitted on 3 Dec 2025 (v1), last revised 28 Feb 2026 (this version, v2)]

Title: AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Authors: Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye

Abstract: Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Cen...
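To make the coarse-to-fine mechanism concrete, below is a minimal Python sketch of the acquisition loop the abstract describes: the model first answers from compressed low-resolution tokens and, only when it emits a bounding box, receives a cropped key region as additional visual input. Everything here is an assumption for illustration; the function and field names (vlm_generate, VLMResponse, LOW_RES, MAX_TOOL_CALLS) are hypothetical placeholders, not the authors' implementation or API.

from dataclasses import dataclass
from typing import List, Optional, Tuple
from PIL import Image

LOW_RES = (448, 448)    # assumed coarse input resolution
MAX_TOOL_CALLS = 3      # assumed cap on bounding-box tool invocations

@dataclass
class VLMResponse:
    text: str
    bbox: Optional[Tuple[int, int, int, int]] = None  # (x0, y0, x1, y1) crop request, if any

def vlm_generate(views: List[Image.Image], question: str) -> VLMResponse:
    """Placeholder for the underlying VLM: returns either a final answer
    or a bounding-box request for more visual detail."""
    raise NotImplementedError("plug in an actual VLM here")

def answer(image: Image.Image, question: str) -> str:
    # Start from compressed visual tokens of a low-resolution view.
    views = [image.resize(LOW_RES)]
    for _ in range(MAX_TOOL_CALLS):
        response = vlm_generate(views, question)
        if response.bbox is None:
            return response.text  # coarse view was sufficient
        # The model asked for more detail: crop the key region from the
        # full-resolution image and feed it back as extra visual tokens.
        views.append(image.crop(response.bbox).resize(LOW_RES))
    return vlm_generate(views, question).text  # budget exhausted; final answer

A matching RL reward would presumably trade correctness off against the number of visual tokens consumed, something like r = 1[answer correct] - lambda * (tokens used); the paper's exact objective is not given in this excerpt.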