[2602.18846] DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
Summary
DUET-VLM introduces a dual-stage token reduction framework for vision-language models that improves efficiency during both training and inference without sacrificing accuracy.
Why It Matters
As vision-language models become increasingly vital in AI applications, optimizing their efficiency is crucial. DUET-VLM addresses the computational challenges by reducing token usage while maintaining high accuracy, making it relevant for developers and researchers focused on improving model performance and resource management.
Key Takeaways
- DUET-VLM achieves significant token reduction (up to 93.4%) while maintaining over 97% accuracy.
- The dual-stage compression framework enhances both vision and language processing efficiency.
- Results indicate that DUET-VLM surpasses existing state-of-the-art methods in visual token reduction.
- The framework is versatile and can be integrated into various models, including Video-LLaVA.
- The approach enables robust adaptation to reduced visual inputs without compromising semantic richness.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.18846 (cs) [Submitted on 21 Feb 2026]
Title: DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
Authors: Aditya Kumar Singh, Hitesh Kandala, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
Abstract: Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet remain computationally expensive due to dense visual tokenization. Existing efficiency approaches either merge redundant visual tokens or drop them progressively in the language backbone, often trading accuracy for speed. In this work, we propose DUET-VLM, a versatile plug-and-play dual compression framework that consists of (a) vision-only, redundancy-aware compression of the vision encoder's output into information-preserving tokens, followed by (b) layer-wise, salient-text-guided dropping of visual tokens within the language backbone to progressively prune less informative tokens. This coordinated token management enables aggressive compression while retaining critical semantics. On LLaVA-1.5-7B, our approach maintains over 99% of baseline accuracy with 67% fewer tokens, and still retains >97% even at 89% reduction. With this dual-stage compression during training, it achieves 99.7% a...
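To make the two-stage idea concrete, here is a minimal NumPy sketch of this style of pipeline. It is illustrative only: the paper does not specify DUET-VLM's actual scoring or merging mechanisms, so the redundancy score (max pairwise cosine similarity) and the text-guided saliency score (mean text-to-vision affinity) below are placeholder heuristics, and the function names and ratios are hypothetical.

```python
import numpy as np

def stage_a_redundancy_compress(vision_tokens, keep_ratio=0.33):
    """Stage (a) sketch: keep the least redundant vision-encoder tokens.

    Scores each token by its maximum cosine similarity to any other token
    (high score = near-duplicate) and keeps the lowest-scoring fraction.
    This is a stand-in for the paper's information-preserving compression.
    """
    normed = vision_tokens / np.linalg.norm(vision_tokens, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)          # ignore self-similarity
    redundancy = sim.max(axis=1)            # how duplicated each token is
    n_keep = max(1, int(len(vision_tokens) * keep_ratio))
    keep = np.sort(np.argsort(redundancy)[:n_keep])  # least redundant, in order
    return vision_tokens[keep]

def stage_b_text_guided_drop(vision_tokens, text_tokens, drop_frac=0.5):
    """Stage (b) sketch: drop visual tokens least relevant to the text.

    Uses mean text-to-vision dot-product affinity as a proxy for the
    layer-wise attention saliency the paper describes; in a real model this
    would run inside the language backbone, repeated across layers.
    """
    scores = (text_tokens @ vision_tokens.T).mean(axis=0)
    n_keep = max(1, int(len(vision_tokens) * (1.0 - drop_frac)))
    keep = np.sort(np.argsort(scores)[-n_keep:])     # most salient, in order
    return vision_tokens[keep]

# Toy usage: 576 vision tokens (the LLaVA-1.5 patch count) and 32 text tokens.
rng = np.random.default_rng(0)
vision = rng.normal(size=(576, 64))
text = rng.normal(size=(32, 64))

compressed = stage_a_redundancy_compress(vision, keep_ratio=0.33)   # 190 tokens
pruned = stage_b_text_guided_drop(compressed, text, drop_frac=0.5)  # 95 tokens
print(vision.shape[0], "->", compressed.shape[0], "->", pruned.shape[0])
```

Chaining the two stages this way is what allows aggressive overall reduction: in the toy run above, roughly 84% of the original visual tokens are removed, while each stage individually makes only a moderate cut.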