[2510.13632] Closing the Gap Between Text and Speech Understanding in LLMs
Summary
This paper addresses the performance gap between text and speech understanding in large language models (LLMs), proposing SALAD, a data-efficient method that improves cross-modal alignment during speech adaptation.
Why It Matters
As LLMs increasingly integrate speech processing capabilities, understanding and improving their performance in this area is crucial for applications in AI-driven communication tools. The proposed SALAD method offers a potential solution to enhance model efficiency and effectiveness, which could lead to broader adoption and improved user experiences in speech-related tasks.
Key Takeaways
- The text-speech understanding gap highlights performance issues in speech-adapted LLMs compared to their text-based counterparts.
- Existing approaches rely on either costly large-scale speech synthesis or non-reproducible proprietary datasets, motivating more data-efficient alternatives.
- SALAD combines cross-modal distillation with targeted synthetic data to improve model alignment while reducing data requirements.
- SALAD demonstrates competitive performance with significantly less speech data, making it a promising approach for future LLM developments.
- Understanding the factors driving the text-speech gap can inform better training strategies for LLMs.
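As a loose illustration of the cross-modal distillation idea mentioned above (a generic sketch using PyTorch, not the paper's actual implementation; function and variable names are invented for this example), the speech-adapted student can be trained to match the output distribution a frozen text teacher produces on the matching transcript:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions.

    student_logits: speech-adapted model outputs for a spoken input
    teacher_logits: frozen text model outputs for the matching transcript
    """
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # batchmean KL, scaled by t^2 to keep gradient magnitudes comparable
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t ** 2)

# Toy example: a batch of 4 token positions over a 10-token vocabulary.
teacher = torch.randn(4, 10)
student = torch.randn(4, 10)
loss = distillation_loss(student, teacher)
```

Minimizing this loss pulls the speech pathway's predictions toward the text model's, directly targeting the misalignment factor of the gap; the temperature softens both distributions so the student also learns from the teacher's non-argmax probability mass.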
Computer Science > Computation and Language
arXiv:2510.13632 (cs)
[Submitted on 15 Oct 2025 (v1), last revised 23 Feb 2026 (this version, v2)]
Title: Closing the Gap Between Text and Speech Understanding in LLMs
Authors: Santiago Cuervo, Skyler Seto, Maureen de Seyssel, Richard He Bai, Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly, Zakaria Aldeneh
Abstract: Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts--and even cascaded pipelines--on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech...