[2505.04382] Discrete Optimal Transport and Voice Conversion
Summary
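This paper introduces kDOT, a discrete optimal transport (OT) framework for voice conversion that operates in a pretrained speech embedding space. On LibriSpeech, it consistently improves distribution alignment and often outperforms averaging-based approaches in terms of WER, MOS, and FAD.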
Why It Matters
The research addresses the growing need for effective voice conversion techniques in applications such as security and accessibility. By improving the alignment between source and target speaker embedding distributions, the study advances voice conversion. It also shows that OT applied as a post-processing step can make spoofed speech pass as bona fide, which has direct implications for spoof detection systems.
Key Takeaways
- kDOT performs voice conversion using the barycentric projection of a discrete optimal transport plan.
- Often outperforms averaging-based approaches (kNN-VC, SinkVC) in WER, MOS, and FAD on LibriSpeech.
- Demonstrates strong domain adaptation capability in embedding space.
- As a post-processing step, discrete OT can cause spoofed speech to be misclassified as bona fide by a state-of-the-art spoofing detector.
- The ablation study covers the number of transported embeddings and the duration of source and target utterances.
Electrical Engineering and Systems Science > Audio and Speech Processing
arXiv:2505.04382 (eess)
[Submitted on 7 May 2025 (v1), last revised 25 Feb 2026 (this version, v4)]
Title: Discrete Optimal Transport and Voice Conversion
Authors: Anton Selitskiy, Maitreya Kocharekar
Abstract: We propose kDOT, a discrete optimal transport (OT) framework for voice conversion (VC) operating in a pretrained speech embedding space. In contrast to the averaging strategies used in kNN-VC and SinkVC, and the independence assumption adopted in MKL, our method employs the barycentric projection of the discrete OT plan to construct a transport map between source and target speaker embedding distributions. We conduct a comprehensive ablation study over the number of transported embeddings and systematically analyze the impact of source and target utterance duration. Experiments on LibriSpeech demonstrate that OT with barycentric projection consistently improves distribution alignment and often outperforms averaging-based approaches in terms of WER, MOS, and FAD. Furthermore, we show that applying discrete OT as a post-processing step can transform spoofed speech into samples that are misclassified as bona fide by a state-of-the-art spoofing detector. This demonstrates the strong domain adaptation capability of OT in embedding space, while also reve...
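To make the barycentric projection mentioned in the abstract concrete, here is a minimal sketch, not the authors' implementation. Given a transport plan P between source embeddings x_i and target embeddings y_j, the barycentric projection maps each source embedding to the plan-weighted average of the targets, T(x_i) = sum_j P_ij y_j / sum_j P_ij. Several things here are assumptions made for illustration: both point clouds carry uniform weights, the cost is squared Euclidean distance, and the plan comes from entropic Sinkhorn iterations used as a stand-in for an exact discrete OT solver. The function names (`sinkhorn_plan`, `barycentric_projection`) and the parameters `reg` and `n_iters` are hypothetical, and kDOT-specific choices such as the number of transported embeddings are not reproduced.

```python
import math


def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def sinkhorn_plan(source, target, reg=0.1, n_iters=200):
    """Approximate OT plan between two uniformly weighted point clouds.

    Assumption: entropic regularization (Sinkhorn) stands in for an exact
    discrete OT solver; `reg` is relative to the largest pairwise cost.
    """
    n, m = len(source), len(target)
    a = [1.0 / n] * n  # uniform source weights (assumed)
    b = [1.0 / m] * m  # uniform target weights (assumed)
    cost = [[sq_dist(x, y) for y in target] for x in source]
    scale = max(max(row) for row in cost) or 1.0
    kernel = [[math.exp(-c / (reg * scale)) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def barycentric_projection(plan, target):
    """Map each source point to the plan-weighted mean of the target points."""
    dim = len(target[0])
    mapped = []
    for row in plan:
        mass = sum(row)  # total mass leaving this source point
        mapped.append(
            [sum(p * y[d] for p, y in zip(row, target)) / mass for d in range(dim)]
        )
    return mapped


if __name__ == "__main__":
    # Toy 2-D "embeddings"; real speech embeddings are far higher-dimensional.
    source = [[0.0, 0.0], [4.0, 0.0]]
    target = [[0.0, 1.0], [4.0, 1.0]]
    plan = sinkhorn_plan(source, target)
    print(barycentric_projection(plan, target))
```

Because each mapped vector is a convex combination of target embeddings, the output always lies inside the convex hull of the target set. Unlike a plain nearest-neighbour average, the weights come from a plan that must also respect the target marginal, so the targets are shared across all source embeddings rather than picked independently for each one.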