[2601.01322] LinMU: Multimodal Understanding Made Linear
Computer Science > Computer Vision and Pattern Recognition

arXiv:2601.01322 (cs)

[Submitted on 4 Jan 2026 (v1), last revised 3 May 2026 (this version, v2)]

Title: LinMU: Multimodal Understanding Made Linear
Authors: Hongjie Wang, Niraj K. Jha

Abstract: Modern Vision-Language Models (VLMs) achieve impressive performance but are limited by the quadratic complexity of self-attention, which prevents their deployment on edge devices and makes their understanding of high-resolution images and long-context videos prohibitively expensive. To address this challenge, we introduce LinMU (Linear-complexity Multimodal Understanding), a VLM design that achieves linear complexity for the language model decoder without using any quadratic-complexity modules, while maintaining the performance of global-attention-based VLMs. LinMU replaces every self-attention layer in the language model decoder with an M-MATE block: a dual-branch module that combines a bidirectional state-space model for global context (the Flex-MA branch) with localized Swin-style window attention (the Local-Swin branch) for adjacent correlations. To transform a pre-trained VLM into the LinMU architecture, we propose a three-stage distillation framework that (i) initializes both branches with self-attention weights and trains the Flex-MA branch alone, (ii) unfreezes the Local-Swin branch and fin...
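
To make the dual-branch design concrete, below is a minimal PyTorch sketch of an M-MATE-style block for 1D token sequences, assembled only from what the abstract states: a bidirectional state-space branch for global context plus Swin-style window attention for local correlations. The class and parameter names (MMateBlock, SimpleBiSSM, LocalWindowAttention, window), the gated linear recurrence used as the state-space model, and the additive fusion of the two branches are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleBiSSM(nn.Module):
    """Toy bidirectional state-space branch (stand-in for Flex-MA):
    a per-channel linear recurrence run forward and backward."""

    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.decay = nn.Parameter(torch.full((dim,), -2.0))  # pre-sigmoid decay
        self.out_proj = nn.Linear(2 * dim, dim)

    def _scan(self, x: torch.Tensor) -> torch.Tensor:
        # Sequential recurrence h_t = a * h_{t-1} + x_t; O(L) time, O(1) state.
        a = torch.sigmoid(self.decay)            # per-channel decay in (0, 1)
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.size(1)):
            h = a * h + x[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u = self.in_proj(x)
        fwd = self._scan(u)                      # left-to-right context
        bwd = self._scan(u.flip(1)).flip(1)      # right-to-left context
        return self.out_proj(torch.cat([fwd, bwd], dim=-1))


class LocalWindowAttention(nn.Module):
    """Local branch: full attention restricted to fixed-size windows,
    so cost grows linearly with sequence length."""

    def __init__(self, dim: int, heads: int, window: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, l, d = x.shape
        pad = (-l) % self.window                 # pad to a multiple of window
        x = F.pad(x, (0, 0, 0, pad))
        w = x.reshape(-1, self.window, d)        # (b * num_windows, window, d)
        out, _ = self.attn(w, w, w)              # attention inside each window
        return out.reshape(b, -1, d)[:, :l]


class MMateBlock(nn.Module):
    """Dual-branch block: global bidirectional SSM + local window attention.
    The residual additive fusion of the branches is an assumption."""

    def __init__(self, dim: int = 256, heads: int = 4, window: int = 16):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.global_branch = SimpleBiSSM(dim)
        self.local_branch = LocalWindowAttention(dim, heads, window)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        return x + self.global_branch(h) + self.local_branch(h)


if __name__ == "__main__":
    tokens = torch.randn(2, 100, 256)            # (batch, seq_len, dim)
    print(MMateBlock()(tokens).shape)            # torch.Size([2, 100, 256])

Because the window size is fixed and the recurrence carries constant-size state, every self-attention layer replaced by such a block drops from quadratic to linear cost in sequence length, which is the property the abstract claims for the full decoder.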
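
The abstract also outlines stage (i) of the distillation framework: initialize both branches from self-attention weights, then train only the Flex-MA branch. A hedged sketch of that stage, building on the MMateBlock defined above, is below. The loss choice (MSE against the teacher layer's output), the optimizer, the residual form of the target, and the random stand-in data are all assumptions; only the local branch inherits the attention weights directly here, since mapping attention weights into a state-space branch is not specified in the visible text.

import torch
import torch.nn as nn

teacher = nn.MultiheadAttention(256, 4, batch_first=True).eval()
student = MMateBlock(dim=256, heads=4, window=16)

# Per the abstract, both branches start from self-attention weights; the
# local branch can copy them directly (same module type and shape).
student.local_branch.attn.load_state_dict(teacher.state_dict())

# Stage (i): freeze everything except the global (Flex-MA-style) branch.
for p in student.parameters():
    p.requires_grad = False
for p in student.global_branch.parameters():
    p.requires_grad = True

opt = torch.optim.AdamW(
    [p for p in student.parameters() if p.requires_grad], lr=1e-4
)

for step in range(100):                          # stand-in for real batches
    x = torch.randn(2, 100, 256)
    with torch.no_grad():
        target, _ = teacher(x, x, x)             # teacher attention output
        target = x + target                      # match the residual form
    loss = nn.functional.mse_loss(student(x), target)
    opt.zero_grad()
    loss.backward()                              # gradients reach only Flex-MA
    opt.step()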