[2602.17598] The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR→LLM Pipelines?
Summary
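The paper tests the Cascade Equivalence Hypothesis: on tasks solvable from a transcript, do speech LLMs behave like ASR→LLM pipelines? Using matched-backbone comparisons across four speech LLMs and six tasks, it finds that Ultravox is statistically indistinguishable from its matched Whisper→LLM cascade (κ = 0.93), that Qwen2-Audio genuinely diverges, and that under noise the speech LLMs fall behind cascades.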
Why It Matters
If a speech LLM behaves like an ASR→LLM cascade on transcript-solvable tasks, it adds little beyond the cascade while costing more to run. The abstract argues that this is the situation for most deployed use cases, and that under noise the speech LLMs are worse than cascades. Developers choosing between the two architectures therefore need to weigh cost and noise robustness, particularly for noisy environments.
Key Takeaways
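- On tasks solvable from a transcript, current speech LLMs largely perform implicit ASR and behave like Whisper→LLM cascades.
- Matched-backbone testing, which controls for the LLM backbone, finds Ultravox statistically indistinguishable from its matched cascade (κ = 0.93).
- Under noise, speech LLMs do worse than cascades: clean-condition advantages reverse by up to 7.6% at 0 dB.
- Cascade equivalence is architecture-dependent, not universal: Qwen2-Audio genuinely diverges.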
- The findings can guide future research and development in speech processing.
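The agreement figure quoted above (κ = 0.93) can be illustrated with a small sketch. It assumes κ is Cohen's kappa computed over per-item answers from the two systems; the abstract does not spell out the exact procedure, and the function name and toy labels below are hypothetical, not data from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two label sequences, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Fraction of items on which the two systems give the same answer.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each system's own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both systems always emit the same single label
    return (observed - expected) / (1.0 - expected)


# Hypothetical answers from a speech LLM and its matched cascade on 8 items.
speech_llm = ["A", "B", "A", "C", "B", "A", "C", "B"]
cascade = ["A", "B", "A", "C", "B", "B", "C", "B"]

kappa = cohen_kappa(speech_llm, cascade)
print(f"{kappa:.3f}")  # 0.810
```

Values near 1 mean the two systems give the same answers far more often than chance would predict, which is the sense in which a speech LLM and its cascade can be called behaviorally equivalent.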
arXiv:2602.17598 [cs.CL], submitted on 19 Feb 2026
Authors: Jayadev Billa
Abstract
Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper→LLM cascades. We show this through matched-backbone testing across four speech LLMs and six tasks, controlling for the LLM backbone for the first time. Ultravox is statistically indistinguishable from its matched cascade (κ = 0.93); logit lens reveals literal text emerging in hidden states; LEACE concept erasure confirms text representations are causally necessary in both architectures tested, collapsing accuracy to near-zero. Qwen2-Audio genuinely diverges, revealing cascade equivalence is architecture-dependent, not universal. For most deployed use cases, current speech LLMs are expensive cascades, and under noise, they are worse ones, with clean-condition advantages reversing by up to 7.6% at 0 dB.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Cite as: arXiv:2602.17598 [cs.CL] (or arXiv:2602.17598v1 [cs.CL] for this version) https://doi.org/10....
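To make the "0 dB" noise condition concrete, here is a minimal sketch of mixing noise into a signal at a target signal-to-noise ratio. It assumes that "0 dB" refers to SNR; the abstract does not state the noise type or mixing procedure, so the synthetic tone, the Gaussian noise, and all function names are illustrative assumptions.

```python
import math
import random


def mean_power(samples):
    """Average squared amplitude of a sample sequence."""
    return sum(x * x for x in samples) / len(samples)


def scale_noise_to_snr(speech, noise, snr_db):
    """Rescale noise so that the speech-to-noise power ratio equals snr_db."""
    if len(speech) != len(noise) or not speech:
        raise ValueError("speech and noise must be non-empty and equally long")
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_speech == 0.0 or p_noise == 0.0:
        raise ValueError("speech and noise must have non-zero power")
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = p_speech / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / p_noise)
    return [gain * x for x in noise]


def measured_snr_db(speech, noise):
    """SNR in decibels between a clean signal and a noise sequence."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))


# Stand-ins for real audio: one second of a 220 Hz tone and Gaussian noise at 16 kHz.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

scaled_noise = scale_noise_to_snr(speech, noise, snr_db=0.0)
noisy = [s + n for s, n in zip(speech, scaled_noise)]
```

At 0 dB the noise carries as much power as the speech itself, which gives a sense of how severe the condition is.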