[2602.12546] Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR
Summary
The paper presents a decoder-only Conformer model for automatic speech recognition (ASR) that integrates speech and text processing without external encoders, achieving improved word error rates (WER) through a novel modality-aware sparse mixture of experts approach.
Why It Matters
This research is significant because it proposes a simpler ASR architecture: by removing external speech encoders and pretrained language models while improving accuracy with fewer active parameters, it points toward more efficient speech recognition systems for applications such as voice assistants and transcription services.
Key Takeaways
- Introduces a decoder-only Conformer model for ASR that processes both speech and text.
- Utilizes a modality-aware sparse mixture of experts: disjoint expert pools for speech and text with hard routing and top-1 selection.
- Achieves lower word error rates compared to existing models without external encoders.
- Demonstrates effectiveness across multiple languages with a single multilingual model.
- Reportedly the first randomly initialized decoder-only ASR model to surpass strong attention-based encoder-decoder (AED) baselines.
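The routing scheme in the takeaways above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the dimensions, pool contents, and names (`make_pool`, `moe_layer`) are assumptions. Each token is first hard-routed to its modality's expert pool, then a learned router picks a single (top-1) expert, so only a fraction of the layer's parameters is active per token.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8          # model dimension (illustrative)
N_EXPERTS = 4  # experts per modality pool (illustrative)

def make_pool(n, d, rng):
    """One modality's pool: a router matrix plus n linear 'experts'."""
    return {
        "router": rng.standard_normal((d, n)) * 0.1,
        "experts": [rng.standard_normal((d, d)) * 0.1 for _ in range(n)],
    }

speech_pool = make_pool(N_EXPERTS, D, rng)
text_pool = make_pool(N_EXPERTS, D, rng)

def moe_layer(x, is_speech):
    """Modality-aware sparse MoE: hard routing by modality, then top-1 expert.

    x: (T, D) token features; is_speech: (T,) boolean modality flags.
    """
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        # Hard routing: speech tokens never see text experts, and vice versa.
        pool = speech_pool if is_speech[t] else text_pool
        logits = x[t] @ pool["router"]
        k = int(np.argmax(logits))        # top-1 expert selection
        out[t] = x[t] @ pool["experts"][k]
    return out

x = rng.standard_normal((6, D))
flags = np.array([True, True, True, False, False, False])  # 3 speech, 3 text
y = moe_layer(x, flags)
print(y.shape)  # (6, 8)
```

In the paper, these MoE layers sit inside hybrid-causality Conformer blocks (bidirectional attention over speech positions, causal over text); that part is omitted here.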
Electrical Engineering and Systems Science > Audio and Speech Processing
arXiv:2602.12546 (eess)
Submitted on 13 Feb 2026
Title: Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR
Authors: Jaeyoung Lee, Masato Mimura
Abstract: We present a decoder-only Conformer for automatic speech recognition (ASR) that processes speech and text in a single stack without external speech encoders or pretrained large language models (LLM). The model uses a modality-aware sparse mixture of experts (MoE): disjoint expert pools for speech and text with hard routing and top-1 selection, embedded in hybrid-causality Conformer blocks (bidirectional for speech, causal for text). Training combines CTC on speech positions with label-smoothed cross-entropy for text generation. Our 113M-parameter model consistently improves WER over a 139M AED baseline on Librispeech (2.8% vs. 3.2% test-clean; 5.6% vs. 6.0% test-other). On Common Voice 16.1 with a single multilingual model across five languages, our approach reduces average WER from 12.2% to 10.6%. To our knowledge, this is the first randomly initialized decoder-only ASR that surpasses strong AED baselines via modality-aware routing and sparse MoE, achieving better accuracy with fewer active parameters and without alignment/adaptation modules.
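The training objective combines CTC on speech positions with label-smoothed cross-entropy on text positions. The text-side term can be sketched as below; this is a minimal NumPy version under the common convention of spreading the smoothing mass eps uniformly over the vocabulary (the paper does not specify its exact smoothing scheme, and the CTC term is omitted for brevity).

```python
import numpy as np

def label_smoothed_ce(logits, targets, eps=0.1):
    """Label-smoothed cross-entropy over text positions.

    logits: (T, V) unnormalized scores; targets: (T,) int class ids;
    eps: smoothing mass spread uniformly over the V vocabulary entries.
    """
    T, V = logits.shape
    # Numerically stable log-softmax.
    logp = logits - logits.max(axis=1, keepdims=True)
    logp = logp - np.log(np.exp(logp).sum(axis=1, keepdims=True))
    # Smoothed target distribution: (1 - eps) on the label, eps/V elsewhere.
    smooth = np.full((T, V), eps / V)
    smooth[np.arange(T), targets] += 1.0 - eps
    return float(-(smooth * logp).sum(axis=1).mean())

logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0]])
targets = np.array([0, 1])
print(label_smoothed_ce(logits, targets, eps=0.1))
```

With eps=0 this reduces to plain cross-entropy; a small positive eps penalizes overconfident predictions, which is the usual motivation for smoothing the generation loss.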