[2603.21701] Rethinking Token Reduction for Large Vision-Language Models
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.21701 (cs)
[Submitted on 23 Mar 2026]
Title: Rethinking Token Reduction for Large Vision-Language Models
Authors: Yi Wang, Haofei Zhang, Qihan Huang, Anda Cao, Gongfan Fang, Wei Wang, Xuan Jin, Jie Song, Mingli Song, Xinchao Wang
Abstract: Large Vision-Language Models (LVLMs) excel at visual understanding and reasoning, but their large number of visual tokens leads to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges: subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; and prompt-agnostic methods, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by ...
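As context for the baseline family the abstract criticizes, a prompt-agnostic heuristic reducer typically ranks visual tokens by some precomputed attention score and keeps only the top-k. A minimal sketch of that idea (the function name, shapes, and scoring choice are illustrative assumptions, not the paper's MetaCompress method):

```python
import numpy as np

def prune_tokens_by_attention(tokens, attn_scores, keep_ratio=0.5):
    """Keep the top-k visual tokens ranked by a heuristic attention score.

    Generic sketch of prompt-agnostic, attention-based token reduction,
    not an implementation from the paper.

    tokens:      (N, D) array of visual token embeddings.
    attn_scores: (N,) array, e.g. CLS-to-patch attention averaged over heads.
    keep_ratio:  fraction of tokens to retain.
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    keep_idx = np.argsort(attn_scores)[-n_keep:]  # indices of the top-k scores
    keep_idx.sort()                               # preserve original token order
    return tokens[keep_idx], keep_idx
```

Because the score is computed from the image alone, the same pruned token set is reused for every question in a multi-turn dialogue; this is precisely why a purely heuristic score can discard regions a later question turns out to need.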