[2511.20974] RosettaSpeech: Zero-Shot Speech-to-Speech Translation without Parallel Speech
Summary
RosettaSpeech introduces a zero-shot framework for speech-to-speech translation, overcoming the need for parallel speech data by using monolingual speech-text data and machine translation supervision.
Why It Matters
This research addresses a significant limitation of speech-to-speech translation systems: their reliance on scarce parallel speech corpora. By using text as a semantic bridge between monolingual speech-text data and machine translation supervision, RosettaSpeech enhances translation capabilities, particularly for languages with limited resources, broadening accessibility and usability in multilingual contexts.
Key Takeaways
- RosettaSpeech achieves state-of-the-art zero-shot performance in speech-to-speech translation.
- The model maintains the source speaker's voice without needing paired speech data.
- It effectively scales to many-to-one translation, benefiting 'text-rich, speech-poor' languages.
- Empirical evaluations show significant performance gains over leading baselines.
- By removing the need for parallel speech, the approach eases the data bottleneck that constrains current translation systems.
Electrical Engineering and Systems Science > Audio and Speech Processing
arXiv:2511.20974 (eess)
[Submitted on 26 Nov 2025 (v1), last revised 15 Feb 2026 (this version, v2)]
Title: RosettaSpeech: Zero-Shot Speech-to-Speech Translation without Parallel Speech
Authors: Zhisheng Zheng, Xiaohang Sun, Tuan Dinh, Abhishek Yanamandra, Abhinav Jain, Zhu Liu, Sunil Hadap, Vimal Bhat, Manoj Aggarwal, Gerard Medioni, David Harwath
Abstract: End-to-end speech-to-speech translation (S2ST) systems typically struggle with a critical data bottleneck: the scarcity of parallel speech-to-speech corpora. To overcome this, we introduce RosettaSpeech, a novel zero-shot framework trained exclusively on monolingual speech-text data augmented by machine translation supervision. Unlike prior works that rely on complex cascaded pseudo-labeling, our approach strategically utilizes text as a semantic bridge during training to synthesize translation targets, thereby eliminating the need for parallel speech pairs while maintaining a direct, end-to-end inference pipeline. Empirical evaluations on the CVSS-C benchmark demonstrate that RosettaSpeech achieves state-of-the-art zero-shot performance, surpassing leading baselines by significant margins - achieving ASR-BLEU scores of 25.17 for German-to-English (+27% relative gain) and 29....
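The "text as a semantic bridge" idea from the abstract can be sketched as a simple data recipe: each monolingual (speech, transcript) pair is turned into a supervised translation example by machine-translating its transcript, so no parallel speech is ever needed. The sketch below is a toy illustration under stated assumptions, not the authors' implementation; `translate_de_en` is a hypothetical stand-in for a real MT model, and speech is represented by a file path.

```python
# Toy sketch of the text-as-bridge training-data recipe (illustrative only).
from dataclasses import dataclass


@dataclass
class TrainingExample:
    source_speech: str  # stand-in for an audio waveform or feature tensor
    source_text: str    # monolingual transcript paired with the audio
    target_text: str    # MT output, used as the translation target


def translate_de_en(text: str) -> str:
    """Stub MT step (hypothetical): a real system would call an NMT model."""
    toy_lexicon = {"hallo welt": "hello world", "guten morgen": "good morning"}
    return toy_lexicon[text]


def build_examples(monolingual_pairs):
    """Turn monolingual (speech, transcript) pairs into S2ST supervision."""
    return [
        TrainingExample(speech, text, translate_de_en(text))
        for speech, text in monolingual_pairs
    ]


examples = build_examples([("de_audio_001.wav", "hallo welt")])
print(examples[0].target_text)  # -> hello world
```

The point of the sketch is that supervision comes entirely from text-side machine translation, while inference (per the abstract) remains a direct speech-to-speech pipeline with no intermediate text step.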