[2511.20974] RosettaSpeech: Zero-Shot Speech-to-Speech Translation without Parallel Speech

Summary

RosettaSpeech introduces a zero-shot framework for speech-to-speech translation, overcoming the need for parallel speech data by using monolingual speech-text data and machine translation supervision.
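The data-side idea in this summary can be illustrated with a minimal sketch: start from monolingual (speech, transcript) pairs and let a machine translation system supply target-language text, so no parallel speech is required. Every name here (`MonolingualExample`, `BridgedExample`, `build_bridged_examples`, `toy_translate`) is hypothetical and chosen for illustration, not taken from the paper. The toy lexicon stands in for a real MT model, and the sketch does not cover how target speech is produced or how the model is trained.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MonolingualExample:
    """One item of monolingual speech-text data (hypothetical layout)."""
    audio_path: str   # source-language recording
    transcript: str   # source-language transcript of that recording


@dataclass
class BridgedExample:
    """Source speech paired with MT-produced target text (hypothetical layout)."""
    audio_path: str
    source_text: str
    target_text: str  # comes from machine translation, not from parallel speech


def build_bridged_examples(
    data: List[MonolingualExample],
    translate: Callable[[str], str],
) -> List[BridgedExample]:
    """Use text as the bridge: translate each transcript to get a target."""
    return [
        BridgedExample(ex.audio_path, ex.transcript, translate(ex.transcript))
        for ex in data
    ]


def toy_translate(text: str) -> str:
    """Stand-in for an MT system: word-by-word lookup in a tiny lexicon."""
    lexicon = {"guten": "good", "morgen": "morning"}
    return " ".join(lexicon.get(word, word) for word in text.lower().split())


if __name__ == "__main__":
    corpus = [MonolingualExample("clip_0001.wav", "Guten Morgen")]
    for item in build_bridged_examples(corpus, toy_translate):
        print(item.audio_path, "|", item.source_text, "->", item.target_text)
```

In a real pipeline `translate` would wrap an actual MT model. Passing it in as a callable keeps the pairing logic independent of that choice.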

Why It Matters

Speech-to-speech translation systems usually depend on parallel speech corpora, which are scarce. RosettaSpeech instead uses text as a semantic bridge during training, so it needs only monolingual speech-text data plus machine translation supervision. This makes the approach applicable to languages that have plenty of text but little speech data.

Key Takeaways

  • RosettaSpeech achieves state-of-the-art zero-shot performance in speech-to-speech translation.
  • The model maintains the source speaker's voice without needing paired speech data.
  • It effectively scales to many-to-one translation, benefiting 'text-rich, speech-poor' languages.
  • On the CVSS-C benchmark, it reaches an ASR-BLEU score of 25.17 for German-to-English, a 27% relative gain over leading baselines.
  • The approach reduces the parallel-data bottleneck that limits end-to-end speech-to-speech translation.

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2511.20974 (eess)

[Submitted on 26 Nov 2025 (v1), last revised 15 Feb 2026 (this version, v2)]

Title: RosettaSpeech: Zero-Shot Speech-to-Speech Translation without Parallel Speech

Authors: Zhisheng Zheng, Xiaohang Sun, Tuan Dinh, Abhishek Yanamandra, Abhinav Jain, Zhu Liu, Sunil Hadap, Vimal Bhat, Manoj Aggarwal, Gerard Medioni, David Harwath

Abstract: End-to-end speech-to-speech translation (S2ST) systems typically struggle with a critical data bottleneck: the scarcity of parallel speech-to-speech corpora. To overcome this, we introduce RosettaSpeech, a novel zero-shot framework trained exclusively on monolingual speech-text data augmented by machine translation supervision. Unlike prior works that rely on complex cascaded pseudo-labeling, our approach strategically utilizes text as a semantic bridge during training to synthesize translation targets, thereby eliminating the need for parallel speech pairs while maintaining a direct, end-to-end inference pipeline. Empirical evaluations on the CVSS-C benchmark demonstrate that RosettaSpeech achieves state-of-the-art zero-shot performance, surpassing leading baselines by significant margins, achieving ASR-BLEU scores of 25.17 for German-to-English (+27% relative gain) and 29....
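To put the German-to-English figure in context, the short sketch below shows how a relative gain relates a new score to a baseline, and what baseline the reported numbers would imply. The function names are illustrative. The implied baseline is only an approximation derived from the rounded "+27%" figure, not a number stated in the abstract.

```python
def relative_gain(new_score: float, baseline: float) -> float:
    """Relative gain as a fraction: (new - baseline) / baseline."""
    return (new_score - baseline) / baseline


def implied_baseline(new_score: float, gain: float) -> float:
    """Invert the relative-gain formula: baseline = new / (1 + gain)."""
    return new_score / (1.0 + gain)


if __name__ == "__main__":
    reported_score = 25.17   # ASR-BLEU for German-to-English, from the abstract
    reported_gain = 0.27     # "+27% relative gain", a rounded figure

    baseline = implied_baseline(reported_score, reported_gain)
    print(f"implied baseline ASR-BLEU: about {baseline:.1f}")
    print(f"check: relative gain = {relative_gain(reported_score, baseline):.2%}")
```

Because the percentage is rounded, the implied baseline (roughly 19.8 ASR-BLEU) should be treated as an estimate with a few tenths of a point of slack.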
