[2602.23136] Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs
Summary
This article explores modality collapse in multimodal large language models (LLMs): text-trained decoders are limited in how much information they can extract from non-text inputs, and targeted training interventions can improve what is accessible.
Why It Matters
Understanding modality collapse is crucial for advancing multimodal AI systems. The findings reveal inherent limitations in current LLM architectures and underscore the importance of training objectives in enhancing the accessibility of diverse information types, which could lead to more effective AI applications in various fields.
Key Takeaways
- Multimodal LLMs struggle with extracting non-text information due to decoder limitations.
- Modality-specific variance in the embeddings is noise to the decoder: removing most of it actually improves decoder loss.
- Training objectives significantly influence the accessibility of different types of information.
- Generalized Mutual Information (GMI) bounds the accessible information based on the decoder's scoring rule.
- Interventions like emotion-focused training can enhance specific attribute accessibility without compromising others.
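The linear-probe methodology behind the first takeaway can be sketched numerically. The setup below is invented for illustration (synthetic 64-dim "embeddings" with a binary attribute planted along one direction, standing in for, say, speaker emotion); the paper's actual probes run on real model activations. A least-squares linear probe recovering the attribute well above chance mirrors the finding that the information survives in the representation even when the decoder ignores it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n embeddings of dimension d, with a binary attribute
# linearly decodable along a single random direction (illustrative only).
n, d = 2000, 64
labels = rng.integers(0, 2, size=n)
signal_dir = rng.normal(size=d)
signal_dir /= np.linalg.norm(signal_dir)
X = rng.normal(size=(n, d)) + 1.5 * np.outer(labels - 0.5, signal_dir)

# Train/test split.
X_tr, X_te = X[:1500], X[1500:]
y_tr, y_te = labels[:1500], labels[1500:]

# Linear probe: closed-form least-squares fit (with a bias column),
# thresholded at 0.5 to produce class predictions.
X_tr_b = np.hstack([X_tr, np.ones((len(X_tr), 1))])
X_te_b = np.hstack([X_te, np.ones((len(X_te), 1))])
w, *_ = np.linalg.lstsq(X_tr_b, y_tr.astype(float), rcond=None)
pred = (X_te_b @ w > 0.5).astype(int)

acc = (pred == y_te).mean()
chance = max(y_te.mean(), 1 - y_te.mean())
print(f"probe accuracy: {acc:.2f} (chance: {chance:.2f})")
```

If the probe accuracy sits well above chance while the downstream decoder's loss is unaffected (or improved) by deleting the same directions, the information is present but inaccessible, which is the paper's diagnostic for modality collapse.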
Abstract
Computer Science > Computation and Language. arXiv:2602.23136 [cs]. Submitted on 26 Feb 2026. Author: Jayadev Billa.
Multimodal LLMs can process speech and images, but they cannot hear a speaker's voice or see an object's texture. We show this is not a failure of encoding: speaker identity, emotion, and visual attributes survive through every LLM layer (3--55$\times$ above chance in linear probes), yet removing 64--71% of modality-specific variance improves decoder loss. The decoder has no learned use for these directions; their presence is noise. We formalize this as a mismatched decoder problem: a decoder trained on text can only extract information along text-aligned directions. Accessible information is bounded by the Generalized Mutual Information (GMI), with degradation scaling with distributional distance and decoder sensitivity. The bound is a property of the decoder's scoring rule, not of any particular architecture; it applies whether non-text inputs arrive through a learned projection, a discrete codebook, or no explicit adapter at all. We validate this across five models spanning speech and vision. A controlled experiment (two Prismatic VLMs differing only in encoder text-alignment) confirms the...
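The GMI bound mentioned in the abstract can be illustrated on a toy discrete channel. This is not the paper's setup: the channel and the decoder's mismatched model below are invented numbers, and the formula used is the standard generalized mutual information from the mismatched-decoding literature, GMI(s) = E[log(q(y|x)^s / E_{x'}[q(y|x')^s])], maximized over the scaling parameter s. The point is simply that a decoder scoring with the wrong model q extracts at most GMI nats, which never exceeds the true mutual information I(X;Y).

```python
import numpy as np

# Toy binary channel: X uniform, true channel p(y|x); the decoder scores
# with a mismatched likelihood q(y|x). All values are illustrative.
px = np.array([0.5, 0.5])
p_y_given_x = np.array([[0.9, 0.1],    # true channel
                        [0.2, 0.8]])
q_y_given_x = np.array([[0.7, 0.3],    # decoder's (mismatched) model
                        [0.4, 0.6]])

p_xy = px[:, None] * p_y_given_x       # joint p(x, y)
py = p_xy.sum(axis=0)                  # marginal p(y)

# True mutual information I(X;Y), in nats.
I = (p_xy * np.log(p_xy / (px[:, None] * py[None, :]))).sum()

def gmi(s):
    # GMI(s) = E_{p(x,y)}[ log( q(y|x)^s / E_{x'~p}[ q(y|x')^s ] ) ]
    denom = (px[:, None] * q_y_given_x**s).sum(axis=0)   # per-y normalizer
    return (p_xy * np.log(q_y_given_x**s / denom[None, :])).sum()

# Optimize over s on a grid; the resulting GMI lower-bounds the rate the
# mismatched decoder achieves and upper-bounds its accessible information.
best = max(gmi(s) for s in np.linspace(0.05, 5, 200))
print(f"I(X;Y) = {I:.3f} nats, GMI = {best:.3f} nats")
```

In this example the decoder's model points in the "right direction" (diagonally dominant), so GMI recovers most of I(X;Y); the larger the mismatch between p and q, the larger the gap, which is the sense in which degradation scales with distributional distance.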