[2602.12546] Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR

arXiv - AI · 3 min read

Summary

The paper presents a decoder-only Conformer model for automatic speech recognition (ASR) that integrates speech and text processing without external encoders, achieving improved word error rates (WER) through a novel modality-aware sparse mixture of experts approach.

Why It Matters

This research is significant as it proposes a new architecture for ASR that enhances performance while reducing complexity. By eliminating the need for external models and achieving better accuracy with fewer parameters, it opens avenues for more efficient speech recognition systems, which are crucial in various applications such as voice assistants and transcription services.

Key Takeaways

  • Introduces a decoder-only Conformer model for ASR that processes both speech and text.
  • Utilizes modality-aware sparse mixtures of experts for improved efficiency.
  • Achieves lower word error rates than a larger attention-based encoder-decoder (AED) baseline, despite using no external encoder.
  • Demonstrates effectiveness across multiple languages with a single multilingual model.
  • First randomly initialized decoder-only ASR model reported to surpass strong AED baselines with this approach.
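
The routing idea in the takeaways above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dimensions, expert count, and linear "experts" are all illustrative assumptions. The key mechanics match the description, though: speech and text tokens draw from disjoint expert pools, and each token is hard-routed to exactly one (top-1) expert in its own pool.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8          # hidden dimension (illustrative)
N_EXPERTS = 4  # experts per modality pool (illustrative)

def make_pool(n_experts, d, rng):
    """One pool = a router projection plus simple linear 'experts'."""
    return {
        "router": rng.standard_normal((d, n_experts)),
        "experts": [rng.standard_normal((d, d)) for _ in range(n_experts)],
    }

# Disjoint pools: speech tokens never see text experts, and vice versa.
pools = {"speech": make_pool(N_EXPERTS, D, rng),
         "text": make_pool(N_EXPERTS, D, rng)}

def moe_layer(x, modalities):
    """Hard routing with top-1 selection: each token runs through exactly
    one expert, chosen from the pool matching its modality tag."""
    y = np.empty_like(x)
    for i, (token, mod) in enumerate(zip(x, modalities)):
        pool = pools[mod]
        logits = token @ pool["router"]    # router scores over this pool
        k = int(np.argmax(logits))         # top-1 expert index
        y[i] = token @ pool["experts"][k]  # only that expert is evaluated
    return y

# A toy sequence: 3 speech frames followed by 2 text tokens.
x = rng.standard_normal((5, D))
mods = ["speech"] * 3 + ["text"] * 2
out = moe_layer(x, mods)
print(out.shape)  # (5, 8)
```

Because only one expert runs per token, the number of *active* parameters per step is a fraction of the total expert parameters, which is what lets sparse MoE models grow capacity without a matching growth in compute.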

Abstract

Electrical Engineering and Systems Science > Audio and Speech Processing
arXiv:2602.12546 (eess) · Submitted on 13 Feb 2026
Title: Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR
Authors: Jaeyoung Lee, Masato Mimura

We present a decoder-only Conformer for automatic speech recognition (ASR) that processes speech and text in a single stack, without external speech encoders or pretrained large language models (LLMs). The model uses a modality-aware sparse mixture of experts (MoE): disjoint expert pools for speech and text with hard routing and top-1 selection, embedded in hybrid-causality Conformer blocks (bidirectional for speech, causal for text). Training combines CTC on speech positions with label-smoothed cross-entropy for text generation. Our 113M-parameter model consistently improves WER over a 139M AED baseline on LibriSpeech (2.8% vs. 3.2% test-clean; 5.6% vs. 6.0% test-other). On Common Voice 16.1, with a single multilingual model across five languages, our approach reduces average WER from 12.2% to 10.6%. To our knowledge, this is the first randomly initialized decoder-only ASR model that surpasses strong AED baselines via modality-aware routing and sparse MoE, achieving better accuracy with fewer active parameters and without alignment/adaptation modules.
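
The "hybrid-causality" blocks mentioned in the abstract can be pictured as a single attention mask over the concatenated sequence. The sketch below is an assumption about the masking details (the abstract only states "bidirectional for speech, causal for text"); in particular, letting text attend to all speech while speech does not look ahead into text is our illustrative reading, not a confirmed design.

```python
import numpy as np

def hybrid_causality_mask(n_speech, n_text):
    """Boolean attention mask: mask[i, j] == True means position j is
    visible to position i. Speech positions attend bidirectionally among
    themselves; text positions attend causally to earlier text and to all
    speech; speech does not attend ahead into text (assumed)."""
    n = n_speech + n_text
    mask = np.zeros((n, n), dtype=bool)
    # Speech block: full bidirectional attention.
    mask[:n_speech, :n_speech] = True
    # Text rows: all speech is visible...
    mask[n_speech:, :n_speech] = True
    # ...plus lower-triangular (causal) attention within the text block.
    mask[n_speech:, n_speech:] = np.tril(np.ones((n_text, n_text), dtype=bool))
    return mask

m = hybrid_causality_mask(3, 2)
print(m.astype(int))
# [[1 1 1 0 0]
#  [1 1 1 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```

This also clarifies how one stack serves both roles: the same decoder layers behave like an encoder over the speech prefix (full context, suitable for the CTC loss on speech positions) and like a language model over the text suffix (causal, suitable for label-smoothed cross-entropy generation).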

Related Articles

Anthropic Teams Up With Its Rivals to Keep AI From Hacking Everything | WIRED
The AI lab's Project Glasswing will bring together Apple, Google, and more than 45 other organizations. They'll use the new Claude Mythos...
Wired - AI · 7 min · Llms

The public needs to control AI-run infrastructure, labor, education, and governance — NOT private actors
A lot of discussion around AI is becoming siloed, and I think that is dangerous. People in AI-focused spaces often talk as if the only qu...
Reddit - Artificial Intelligence · 1 min · Llms

Agents that write their own code at runtime and vote on capabilities, no human in the loop
hollowOS just hit v4.4 and I added something that I haven’t seen anyone else do. Previous versions gave you an OS for agents: structured ...
Reddit - Artificial Intelligence · 1 min ·

Google Maps can now write captions for your photos using AI | TechCrunch
Gemini can now create captions when users are looking to share a photo or video.
TechCrunch - AI · 4 min · Llms