[2505.04382] Discrete Optimal Transport and Voice Conversion


Summary

This paper introduces kDOT, a discrete optimal transport (OT) framework for voice conversion that maps source to target speaker embeddings via the barycentric projection of the discrete OT plan, improving distribution alignment and often outperforming averaging-based methods on WER, MOS, and FAD.

Why It Matters

The research addresses the growing need for effective voice conversion techniques in applications ranging from accessibility to security. By improving the alignment of speaker embedding distributions, the study advances audio processing and exposes a concrete vulnerability: applying discrete OT as a post-processing step can make spoofed speech evade a state-of-the-art spoofing detector.

Key Takeaways

  • kDOT improves voice conversion using discrete optimal transport via barycentric projection of the OT plan.
  • Often outperforms averaging-based approaches (kNN-VC, SinkVC) on WER, MOS, and FAD.
  • Demonstrates strong domain adaptation capabilities in embedding space.
  • Discrete OT post-processing can make spoofed speech pass a state-of-the-art spoofing detector, with implications for security.
  • Comprehensive ablation study covers the number of transported embeddings and the impact of source and target utterance duration.

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2505.04382 (eess) [Submitted on 7 May 2025 (v1), last revised 25 Feb 2026 (this version, v4)]

Title: Discrete Optimal Transport and Voice Conversion
Authors: Anton Selitskiy, Maitreya Kocharekar

Abstract: We propose kDOT, a discrete optimal transport (OT) framework for voice conversion (VC) operating in a pretrained speech embedding space. In contrast to the averaging strategies used in kNN-VC and SinkVC, and the independence assumption adopted in MKL, our method employs the barycentric projection of the discrete OT plan to construct a transport map between source and target speaker embedding distributions. We conduct a comprehensive ablation study over the number of transported embeddings and systematically analyze the impact of source and target utterance duration. Experiments on LibriSpeech demonstrate that OT with barycentric projection consistently improves distribution alignment and often outperforms averaging-based approaches in terms of WER, MOS, and FAD. Furthermore, we show that applying discrete OT as a post-processing step can transform spoofed speech into samples that are misclassified as bona fide by a state-of-the-art spoofing detector. This demonstrates the strong domain adaptation capability of OT in embedding space, while also reve...
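The barycentric projection the abstract describes can be sketched in a few lines of NumPy. This is a generic toy illustration, not the paper's kDOT implementation: the entropic (Sinkhorn) solver, the regularization value, the uniform marginals, and all variable names here are assumptions. Given a discrete OT plan P between source points X and target points Y, each source point is mapped to its P-weighted average of target points.

```python
import numpy as np

def sinkhorn_plan(C, reg=0.05, n_iters=500):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = C.shape
    a = np.full(n, 1.0 / n)           # uniform source weights
    b = np.full(m, 1.0 / m)           # uniform target weights
    K = np.exp(-C / (reg * C.max()))  # Gibbs kernel on a normalized cost
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def barycentric_projection(P, Y):
    """Map each source point to its plan-weighted average of target points."""
    return (P @ Y) / P.sum(axis=1, keepdims=True)

# Toy stand-in for speaker embeddings: two 2-D Gaussian clouds.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(50, 2))   # "source speaker" embeddings
Y = rng.normal(3.0, 1.0, size=(40, 2))   # "target speaker" embeddings
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared-Euclidean cost
P = sinkhorn_plan(C)
X_mapped = barycentric_projection(P, Y)
print(X_mapped.mean(axis=0))  # mapped sources now sit near the target cloud
```

Because each mapped point is a convex combination of target points, the projected source distribution lands inside the target cloud, which is the distribution-alignment property the abstract reports.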
