[2601.01322] LinMU: Multimodal Understanding Made Linear
Computer Science > Computer Vision and Pattern Recognition

arXiv:2601.01322 (cs)

[Submitted on 4 Jan 2026 (v1), last revised 3 May 2026 (this version, v2)]

Title: LinMU: Multimodal Understanding Made Linear
Authors: Hongjie Wang, Niraj K. Jha

Abstract: Modern Vision-Language Models (VLMs) achieve impressive performance but are limited by the quadratic complexity of self-attention, which prevents their deployment on edge devices and makes their understanding of high-resolution images and long-context videos prohibitively expensive. To address this challenge, we introduce LinMU (Linear-complexity Multimodal Understanding), a VLM design that achieves linear complexity for the language model decoder without using any quadratic-complexity modules, while maintaining the performance of global-attention-based VLMs. LinMU replaces every self-attention layer in the language model decoder with an M-MATE block: a dual-branch module that combines a bidirectional state-space model for global context (the Flex-MA branch) with localized Swin-style window attention (the Local-Swin branch) for adjacent correlations. To transform a pre-trained VLM into the LinMU architecture, we propose a three-stage distillation framework that (i) initializes both branches with self-attention weights and trains the Flex-MA branch alone, (ii) unfreezes the Local-Swin branch and fin...
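
To make the dual-branch design concrete, below is a minimal PyTorch sketch of an M-MATE-style block for 1D token sequences, assembled only from what the abstract states: a bidirectional state-space branch for global context plus Swin-style window attention for local correlations. The class and parameter names (MMateBlock, SimpleBiSSM, LocalWindowAttention, window), the gated linear recurrence used as the state-space model, and the additive fusion of the two branches are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleBiSSM(nn.Module):
    """Toy bidirectional state-space branch (stand-in for Flex-MA):
    a per-channel linear recurrence run forward and backward."""

    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.decay = nn.Parameter(torch.full((dim,), -2.0))  # pre-sigmoid decay
        self.out_proj = nn.Linear(2 * dim, dim)

    def _scan(self, x: torch.Tensor) -> torch.Tensor:
        # Sequential recurrence h_t = a * h_{t-1} + x_t; O(L) time, O(1) state.
        a = torch.sigmoid(self.decay)            # per-channel decay in (0, 1)
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.size(1)):
            h = a * h + x[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u = self.in_proj(x)
        fwd = self._scan(u)                      # left-to-right context
        bwd = self._scan(u.flip(1)).flip(1)      # right-to-left context
        return self.out_proj(torch.cat([fwd, bwd], dim=-1))


class LocalWindowAttention(nn.Module):
    """Local branch: full attention restricted to fixed-size windows,
    so cost grows linearly with sequence length."""

    def __init__(self, dim: int, heads: int, window: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, l, d = x.shape
        pad = (-l) % self.window                 # pad to a multiple of window
        x = F.pad(x, (0, 0, 0, pad))
        w = x.reshape(-1, self.window, d)        # (b * num_windows, window, d)
        out, _ = self.attn(w, w, w)              # attention inside each window
        return out.reshape(b, -1, d)[:, :l]


class MMateBlock(nn.Module):
    """Dual-branch block: global bidirectional SSM + local window attention.
    The residual additive fusion of the branches is an assumption."""

    def __init__(self, dim: int = 256, heads: int = 4, window: int = 16):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.global_branch = SimpleBiSSM(dim)
        self.local_branch = LocalWindowAttention(dim, heads, window)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        return x + self.global_branch(h) + self.local_branch(h)


if __name__ == "__main__":
    tokens = torch.randn(2, 100, 256)            # (batch, seq_len, dim)
    print(MMateBlock()(tokens).shape)            # torch.Size([2, 100, 256])

Because the window size is fixed and the recurrence carries constant-size state, every self-attention layer replaced by such a block drops from quadratic to linear cost in sequence length, which is the property the abstract claims for the full decoder.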
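
The abstract also outlines stage (i) of the distillation framework: initialize both branches from self-attention weights, then train only the Flex-MA branch. A hedged sketch of that stage, building on the MMateBlock defined above, is below. The loss choice (MSE against the teacher layer's output), the optimizer, the residual form of the target, and the random stand-in data are all assumptions; only the local branch inherits the attention weights directly here, since mapping attention weights into a state-space branch is not specified in the visible text.

import torch
import torch.nn as nn

teacher = nn.MultiheadAttention(256, 4, batch_first=True).eval()
student = MMateBlock(dim=256, heads=4, window=16)

# Per the abstract, both branches start from self-attention weights; the
# local branch can copy them directly (same module type and shape).
student.local_branch.attn.load_state_dict(teacher.state_dict())

# Stage (i): freeze everything except the global (Flex-MA-style) branch.
for p in student.parameters():
    p.requires_grad = False
for p in student.global_branch.parameters():
    p.requires_grad = True

opt = torch.optim.AdamW(
    [p for p in student.parameters() if p.requires_grad], lr=1e-4
)

for step in range(100):                          # stand-in for real batches
    x = torch.randn(2, 100, 256)
    with torch.no_grad():
        target, _ = teacher(x, x, x)             # teacher attention output
        target = x + target                      # match the residual form
    loss = nn.functional.mse_loss(student(x), target)
    opt.zero_grad()
    loss.backward()                              # gradients reach only Flex-MA
    opt.step()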