[2512.16378] Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Computer Science > Computation and Language

arXiv:2512.16378 (cs)

[Submitted on 18 Dec 2025 (v1), last revised 27 Mar 2026 (this version, v3)]

Title: Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

Authors: Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I. Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, Maike Züfle

Abstract: As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which directly process spoken language and enable speech-to-text translation (ST) and other downstream tasks, bypassing traditional transcription-based pipelines. Whether this integration improves ST quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 6 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable solution overall, but most recent SpeechLLMs…