[2603.00510] What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.00510 (cs)
[Submitted on 28 Feb 2026]

Title: What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models
Authors: Yingqi Fan, Junlong Tong, Anhao Zhao, Xiaoyu Shen

Abstract: Multimodal large language models (MLLMs) project visual tokens into the embedding space of language models, yet how visual semantics are structured and processed internally remains poorly understood. In this work, we introduce a two-fold analytical framework featuring a novel probing tool, $\textbf{EmbedLens}$, to conduct a fine-grained analysis. We uncover pronounced semantic sparsity at the input level: visual tokens consistently partition into sink, dead, and alive categories. Remarkably, only the alive tokens, comprising $\approx60\%$ of the total input, carry image-specific meaning. Furthermore, using a targeted patch-compression benchmark, we demonstrate that these alive tokens already encode rich, fine-grained cues (e.g., objects, colors, and OCR) before entering the LLM. Internal visual computations (such as visual attention and feed-forward networks) are redundant for most standard tasks. For the small subset of highly vision-centric tasks that actually benefit from internal proc...
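To make the sink/dead/alive partition concrete, below is a minimal sketch of how such a categorization might be probed. The criteria and thresholds (attention received, embedding norm) and all variable names are illustrative assumptions for exposition; the abstract does not specify how EmbedLens defines the three categories, and this is not the paper's method.

```python
import numpy as np

# Hypothetical heuristic: partition visual tokens using two per-token
# statistics. "attn_received" = mean attention a visual token receives
# from other tokens; "embed_norm" = L2 norm of the token's embedding.
# Thresholds below are assumed for illustration, not taken from the paper.

rng = np.random.default_rng(0)
n_tokens = 576  # e.g., a 24x24 patch grid, common in MLLM vision encoders

# Synthetic statistics standing in for real probe measurements.
attn_received = rng.exponential(scale=0.02, size=n_tokens)
embed_norm = rng.normal(loc=1.0, scale=0.3, size=n_tokens).clip(min=0.0)

SINK_ATTN = 0.10  # assumed cutoff: tokens absorbing outsized attention
DEAD_NORM = 0.50  # assumed cutoff: tokens whose embeddings carry little signal

def categorize(attn: float, norm: float) -> str:
    """Assign a token to 'sink', 'dead', or 'alive' (illustrative rule only)."""
    if attn > SINK_ATTN:
        return "sink"   # attention sink: high incoming attention
    if norm < DEAD_NORM:
        return "dead"   # near-zero embedding, carries no image-specific content
    return "alive"      # remaining tokens carry image-specific meaning

labels = [categorize(a, n) for a, n in zip(attn_received, embed_norm)]
for cat in ("sink", "dead", "alive"):
    print(f"{cat:>5}: {labels.count(cat) / n_tokens:.1%}")
```

Under the paper's finding, a probe like this run on real MLLM activations would report roughly 60% of tokens as alive, with the rest split between sinks and dead tokens; the synthetic data above merely demonstrates the bookkeeping.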