[2602.12546] Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR
Summary
The paper presents a decoder-only Conformer model for automatic speech recognition (ASR) that integrates speech and text processing without external encoders, achieving improved word error rates (WER) through a novel modality-aware sparse mixture of experts approach.
Why It Matters
This research is significant because it proposes a simpler ASR architecture: by removing external speech encoders and pretrained language models while improving accuracy with fewer active parameters, it points toward more efficient speech recognition systems for applications such as voice assistants and transcription services.
Key Takeaways
- Introduces a decoder-only Conformer model for ASR that processes both speech and text.
- Utilizes a modality-aware sparse mixture of experts: disjoint expert pools for speech and text with hard routing and top-1 selection.
- Achieves lower word error rates compared to existing models without external encoders.
- Demonstrates effectiveness across multiple languages with a single multilingual model.
- Reportedly the first randomly initialized decoder-only ASR model to surpass strong attention-based encoder-decoder (AED) baselines.
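The routing scheme in the takeaways above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the dimensions, pool contents, and names (`make_pool`, `moe_layer`) are assumptions. Each token is first hard-routed to its modality's expert pool, then a learned router picks a single (top-1) expert, so only a fraction of the layer's parameters is active per token.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8          # model dimension (illustrative)
N_EXPERTS = 4  # experts per modality pool (illustrative)

def make_pool(n, d, rng):
    """One modality's pool: a router matrix plus n linear 'experts'."""
    return {
        "router": rng.standard_normal((d, n)) * 0.1,
        "experts": [rng.standard_normal((d, d)) * 0.1 for _ in range(n)],
    }

speech_pool = make_pool(N_EXPERTS, D, rng)
text_pool = make_pool(N_EXPERTS, D, rng)

def moe_layer(x, is_speech):
    """Modality-aware sparse MoE: hard routing by modality, then top-1 expert.

    x: (T, D) token features; is_speech: (T,) boolean modality flags.
    """
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        # Hard routing: speech tokens never see text experts, and vice versa.
        pool = speech_pool if is_speech[t] else text_pool
        logits = x[t] @ pool["router"]
        k = int(np.argmax(logits))        # top-1 expert selection
        out[t] = x[t] @ pool["experts"][k]
    return out

x = rng.standard_normal((6, D))
flags = np.array([True, True, True, False, False, False])  # 3 speech, 3 text
y = moe_layer(x, flags)
print(y.shape)  # (6, 8)
```

In the paper, these MoE layers sit inside hybrid-causality Conformer blocks (bidirectional attention over speech positions, causal over text); that part is omitted here.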
Electrical Engineering and Systems Science > Audio and Speech Processing
arXiv:2602.12546 (eess)
Submitted on 13 Feb 2026
Title: Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR
Authors: Jaeyoung Lee, Masato Mimura
Abstract: We present a decoder-only Conformer for automatic speech recognition (ASR) that processes speech and text in a single stack without external speech encoders or pretrained large language models (LLM). The model uses a modality-aware sparse mixture of experts (MoE): disjoint expert pools for speech and text with hard routing and top-1 selection, embedded in hybrid-causality Conformer blocks (bidirectional for speech, causal for text). Training combines CTC on speech positions with label-smoothed cross-entropy for text generation. Our 113M-parameter model consistently improves WER over a 139M AED baseline on Librispeech (2.8% vs. 3.2% test-clean; 5.6% vs. 6.0% test-other). On Common Voice 16.1 with a single multilingual model across five languages, our approach reduces average WER from 12.2% to 10.6%. To our knowledge, this is the first randomly initialized decoder-only ASR that surpasses strong AED baselines via modality-aware routing and sparse MoE, achieving better accuracy with fewer active parameters and without alignment/adaptation modules.
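The training objective combines CTC on speech positions with label-smoothed cross-entropy on text positions. The text-side term can be sketched as below; this is a minimal NumPy version under the common convention of spreading the smoothing mass eps uniformly over the vocabulary (the paper does not specify its exact smoothing scheme, and the CTC term is omitted for brevity).

```python
import numpy as np

def label_smoothed_ce(logits, targets, eps=0.1):
    """Label-smoothed cross-entropy over text positions.

    logits: (T, V) unnormalized scores; targets: (T,) int class ids;
    eps: smoothing mass spread uniformly over the V vocabulary entries.
    """
    T, V = logits.shape
    # Numerically stable log-softmax.
    logp = logits - logits.max(axis=1, keepdims=True)
    logp = logp - np.log(np.exp(logp).sum(axis=1, keepdims=True))
    # Smoothed target distribution: (1 - eps) on the label, eps/V elsewhere.
    smooth = np.full((T, V), eps / V)
    smooth[np.arange(T), targets] += 1.0 - eps
    return float(-(smooth * logp).sum(axis=1).mean())

logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0]])
targets = np.array([0, 1])
print(label_smoothed_ce(logits, targets, eps=0.1))
```

With eps=0 this reduces to plain cross-entropy; a small positive eps penalizes overconfident predictions, which is the usual motivation for smoothing the generation loss.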