[2603.00563] Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion
Computer Science > Sound

arXiv:2603.00563 (cs) [Submitted on 28 Feb 2026]

Title: Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion

Authors: Sen Zhang, Jianguo Wei, Wenhuan Lu, Xianghu Yue, Wei Li, Qiang Li, Pengcheng Zhao, Ming Cai, Luo Si

Abstract: The Transformer-based Whisper model has achieved state-of-the-art performance in Automatic Speech Recognition (ASR). However, its Multi-Head Attention (MHA) mechanism incurs significant GPU memory consumption because the Key-Value (KV) cache grows linearly with sequence length, which is problematic for many applications, especially with long-form audio. To address this, we introduce Whisper-MLA, a novel architecture that incorporates Multi-Head Latent Attention (MLA) into the Whisper model. Specifically, we adapt MLA to Whisper's absolute positional embeddings and systematically investigate its application across the encoder self-attention, decoder self-attention, and cross-attention modules. Empirical results indicate that applying MLA exclusively to decoder self-attention yields the desired balance between performance and memory efficiency. Our proposed approach allows conversion of a pretrained Whisper model to Whisper-MLA with minimal fine-tuning. Extensive experiments on the LibriSpeech benchmark validate the effectiveness of...
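The memory saving the abstract describes comes from the core MLA idea: instead of caching full per-head keys and values, cache a single shared low-rank latent per token and up-project it to K and V when attention runs. The sketch below illustrates this with NumPy; all dimensions (`d_model`, `n_heads`, `d_head`, `d_latent`) are hypothetical placeholders, not values from the paper, and the projections are random stand-ins for learned weights.

```python
import numpy as np

# Hypothetical dimensions (not from the paper); roughly Whisper-small-decoder-sized.
d_model, n_heads, d_head, d_latent = 768, 12, 64, 128
T = 1000  # number of cached decoder timesteps

rng = np.random.default_rng(0)
x = rng.standard_normal((T, d_model))  # hidden states of cached tokens

# Standard MHA: the cache holds full per-head keys AND values for every token.
W_k = rng.standard_normal((d_model, n_heads * d_head)) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, n_heads * d_head)) / np.sqrt(d_model)
mha_cache = np.concatenate([x @ W_k, x @ W_v], axis=-1)  # (T, 2 * n_heads * d_head)

# MLA-style caching: keep only a shared low-rank latent per token,
# then up-project it to keys and values on the fly at attention time.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
mla_cache = x @ W_down      # (T, d_latent) -- the only tensor that is stored
k = mla_cache @ W_up_k      # keys reconstructed when attention is computed
v = mla_cache @ W_up_v      # values reconstructed likewise

# Per-token cache shrinks by a factor of 2 * n_heads * d_head / d_latent (here 12x).
print(mha_cache.size // mla_cache.size)
```

With these placeholder dimensions the cache is 12x smaller per token; the actual ratio depends on the latent width chosen, which trades memory against how faithfully the up-projections reproduce the original K/V.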