[2511.20974] RosettaSpeech: Zero-Shot Speech-to-Speech Translation without Parallel Speech
Summary
RosettaSpeech introduces a zero-shot framework for speech-to-speech translation, overcoming the need for parallel speech data by using monolingual speech-text data and machine translation supervision.
Why It Matters
This research addresses a significant limitation of speech-to-speech translation systems: their reliance on scarce parallel speech corpora. By using text as a semantic bridge between monolingual speech-text data and machine translation supervision, RosettaSpeech enhances translation capabilities, particularly for languages with limited resources, broadening accessibility and usability in multilingual contexts.
Key Takeaways
- RosettaSpeech achieves state-of-the-art zero-shot performance in speech-to-speech translation.
- The model maintains the source speaker's voice without needing paired speech data.
- It effectively scales to many-to-one translation, benefiting 'text-rich, speech-poor' languages.
- Empirical evaluations show significant performance gains over leading baselines.
- By removing the need for parallel speech, the approach eases the data bottleneck that constrains current translation systems.
Electrical Engineering and Systems Science > Audio and Speech Processing
arXiv:2511.20974 (eess)
[Submitted on 26 Nov 2025 (v1), last revised 15 Feb 2026 (this version, v2)]
Title: RosettaSpeech: Zero-Shot Speech-to-Speech Translation without Parallel Speech
Authors: Zhisheng Zheng, Xiaohang Sun, Tuan Dinh, Abhishek Yanamandra, Abhinav Jain, Zhu Liu, Sunil Hadap, Vimal Bhat, Manoj Aggarwal, Gerard Medioni, David Harwath
Abstract: End-to-end speech-to-speech translation (S2ST) systems typically struggle with a critical data bottleneck: the scarcity of parallel speech-to-speech corpora. To overcome this, we introduce RosettaSpeech, a novel zero-shot framework trained exclusively on monolingual speech-text data augmented by machine translation supervision. Unlike prior works that rely on complex cascaded pseudo-labeling, our approach strategically utilizes text as a semantic bridge during training to synthesize translation targets, thereby eliminating the need for parallel speech pairs while maintaining a direct, end-to-end inference pipeline. Empirical evaluations on the CVSS-C benchmark demonstrate that RosettaSpeech achieves state-of-the-art zero-shot performance, surpassing leading baselines by significant margins - achieving ASR-BLEU scores of 25.17 for German-to-English (+27% relative gain) and 29....
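The "text as a semantic bridge" idea from the abstract can be sketched as a simple data recipe: each monolingual (speech, transcript) pair is turned into a supervised translation example by machine-translating its transcript, so no parallel speech is ever needed. The sketch below is a toy illustration under stated assumptions, not the authors' implementation; `translate_de_en` is a hypothetical stand-in for a real MT model, and speech is represented by a file path.

```python
# Toy sketch of the text-as-bridge training-data recipe (illustrative only).
from dataclasses import dataclass


@dataclass
class TrainingExample:
    source_speech: str  # stand-in for an audio waveform or feature tensor
    source_text: str    # monolingual transcript paired with the audio
    target_text: str    # MT output, used as the translation target


def translate_de_en(text: str) -> str:
    """Stub MT step (hypothetical): a real system would call an NMT model."""
    toy_lexicon = {"hallo welt": "hello world", "guten morgen": "good morning"}
    return toy_lexicon[text]


def build_examples(monolingual_pairs):
    """Turn monolingual (speech, transcript) pairs into S2ST supervision."""
    return [
        TrainingExample(speech, text, translate_de_en(text))
        for speech, text in monolingual_pairs
    ]


examples = build_examples([("de_audio_001.wav", "hallo welt")])
print(examples[0].target_text)  # -> hello world
```

The point of the sketch is that supervision comes entirely from text-side machine translation, while inference (per the abstract) remains a direct speech-to-speech pipeline with no intermediate text step.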