[2603.23914] Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.23914 (cs)
[Submitted on 25 Mar 2026]

Title: Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding
Authors: Fatih Ilhan, Gaowen Liu, Ramana Rao Kompella, Selim Furkan Tekin, Tiansheng Huang, Zachary Yahn, Yichang Xu, Ling Liu

Abstract: Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference-time efficiency remains a significant challenge due to memory overhead during decoding, especially when VLM queries and answers consist of long sequences of visual and text tokens. This paper presents AttentionPack, an adaptive, attention-aware optimization framework that improves the memory efficiency of decoding in large vision-language models, addressing the challenges posed by the high number of visual inputs and interactions, particularly in long-context tasks with multiple high-resolution images or videos. AttentionPack is novel in two aspects: (i) we introduce a multi-head attention compaction method that economically stores key and value matrices by exploiting their implicit low-rank structure, and (ii) we develop a token-specific attention-aware decompression mechanism...
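To make the first idea concrete, the sketch below illustrates the general principle of storing a key/value cache matrix in low-rank factored form via truncated SVD. This is only an illustrative sketch of low-rank KV compression, not AttentionPack's actual compaction algorithm; the rank choice, the factorization scheme, and all function names here are assumptions for illustration.

```python
import numpy as np

def compress_kv(kv: np.ndarray, r: int):
    """Illustrative sketch (not the paper's method): store a
    (seq_len, head_dim) KV matrix as two rank-r factors via truncated SVD."""
    U, s, Vt = np.linalg.svd(kv, full_matrices=False)
    A = U[:, :r] * s[:r]   # (seq_len, r) left factor, scaled by singular values
    B = Vt[:r, :]          # (r, head_dim) right factor
    return A, B

def decompress_kv(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Reconstruct the (approximate) KV matrix from its rank-r factors."""
    return A @ B

rng = np.random.default_rng(0)
# Synthetic KV matrix with an exact rank-8 structure, standing in for the
# "implicit low-rank structure" the abstract refers to.
kv = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 64))

A, B = compress_kv(kv, r=8)
approx = decompress_kv(A, B)

orig_size = kv.size                # 256 * 64 = 16384 floats
packed_size = A.size + B.size      # 256 * 8 + 8 * 64 = 2560 floats
rel_err = np.linalg.norm(kv - approx) / np.linalg.norm(kv)
print(packed_size, orig_size, rel_err)
```

When the cached matrix is close to rank r, the factored form stores far fewer values (here 2560 vs. 16384) at negligible reconstruction error; in practice the trade-off between rank, memory, and attention fidelity is what an adaptive scheme like the one described above must balance.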