[2512.02700] VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm
Summary
The paper presents VLM-Pruner, a training-free token pruning algorithm that improves the efficiency of vision-language models (VLMs) by balancing inter-token redundancy against spatial sparsity, reducing computational cost while preserving performance.
Why It Matters
As vision-language models become increasingly integral to applications in AI, optimizing their performance while reducing computational demands is crucial for deployment on mobile devices and in real-time applications. VLM-Pruner addresses these challenges by improving token selection processes, which can lead to more efficient AI systems.
Key Takeaways
- VLM-Pruner improves token pruning by balancing redundancy and spatial relationships.
- The algorithm achieves an 88.9% pruning rate while maintaining performance.
- A centrifugal token pruning paradigm enhances the selection process for better detail retention.
- The method includes a Buffering for Spatial Sparsity (BSS) criterion to optimize token selection.
- Comprehensive comparisons show VLM-Pruner outperforms existing methods across multiple VLMs.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2512.02700 (cs)
[Submitted on 2 Dec 2025 (v1), last revised 22 Feb 2026 (this version, v3)]
Title: VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm
Authors: Zhenkai Wu, Xiaowen Ma, Zhenliang Ni, Dengming Zhang, Han Shu, Xin Jiang, Xinghao Chen
Abstract: Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a centrifugal token pruning paradigm that enables near-to-far selection while prioritizing the preservation of fine-grained object details. Moreover, we design a Buffering for Spatial Sparsity (BSS) criterion ...
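The "centrifugal" near-to-far selection described in the abstract can be sketched roughly as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the function name, the cosine-similarity threshold, and the buffer radius are hypothetical, and the exact BSS criterion in the paper may differ.

```python
import numpy as np

def centrifugal_prune(features, positions, importance, keep,
                      sim_thresh=0.9, buffer_radius=1.5):
    """Greedy near-to-far token selection (illustrative sketch).

    Starts from the most important token and expands outward,
    keeping a token only if it is not both feature-redundant with
    an already-kept token and inside that token's spatial buffer.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    seed = int(np.argmax(importance))
    kept = [seed]
    # Visit candidates in order of distance from the seed ("centrifugal").
    order = np.argsort(np.linalg.norm(positions - positions[seed], axis=1))
    for idx in order:
        if len(kept) >= keep:
            break
        if idx in kept:
            continue
        sims = feats[kept] @ feats[idx]                       # redundancy vs. kept set
        dists = np.linalg.norm(positions[kept] - positions[idx], axis=1)
        # Drop tokens that duplicate a kept token AND sit inside its buffer.
        if np.any((sims > sim_thresh) & (dists < buffer_radius)):
            continue
        kept.append(int(idx))
    # If the budget is not filled, fall back to importance ranking.
    if len(kept) < keep:
        rest = [i for i in np.argsort(-importance) if i not in kept]
        kept.extend(int(i) for i in rest[: keep - len(kept)])
    return sorted(kept)
```

Coupling the similarity test with a distance test is one plausible way to keep redundant tokens apart in feature space without scattering the retained tokens so widely that they no longer cover the target object.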