[2509.14959] Discrete optimal transport is a strong audio adversarial attack
Summary
The paper introduces discrete optimal transport voice conversion (kDOT-VC) and shows that it is an effective black-box adversarial attack against audio anti-spoofing countermeasures (CMs).
Why It Matters
The work exposes vulnerabilities in audio anti-spoofing technologies, which are critical to the security of voice recognition systems. Understanding these weaknesses can lead to improved defenses and more robust AI systems.
Key Takeaways
- kDOT-VC shows stronger domain adaptation than kNN-VC, SinkVC, and Gaussian optimal transport (MKL).
- The method serves as a black-box adversarial attack against audio anti-spoofing countermeasures.
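- The attack is a post-processing step: frame-level WavLM embeddings of generated speech are aligned to an unpaired bona fide pool via entropic OT and a top-k barycentric projection, then decoded with a neural vocoder.
- The attack builds on the probabilistic nature of optimal transport (OT).
- Ablation analysis indicates that distribution-level alignment is a powerful and stable attack against deployed CMs.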
arXiv:2509.14959 (eess)
[Submitted on 18 Sep 2025 (v1), last revised 18 Feb 2026 (this version, v2)]
Title: Discrete optimal transport is a strong audio adversarial attack
Authors: Anton Selitskiy, Akib Shahriyar, Jishnuraj Prakasan
Abstract: In this paper, we introduce the discrete optimal transport voice conversion ($k$DOT-VC) method. Comparison with $k$NN-VC, SinkVC, and Gaussian optimal transport (MKL) demonstrates stronger domain adaptation abilities of our method. We use the probabilistic nature of optimal transport (OT) and show that $k$DOT-VC is an effective black-box adversarial attack against modern audio anti-spoofing countermeasures (CMs). Our attack operates as a post-processing, distribution-alignment step: frame-level WavLM embeddings of generated speech are aligned to an unpaired bona fide pool via entropic OT and a top-$k$ barycentric projection, then decoded with a neural vocoder. Ablation analysis indicates that distribution-level alignment is a powerful and stable attack for deployed CMs.
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Cite as: arXiv:2509.14959 [eess.AS] (or arXiv:2509.14959v2 [eess.AS] for this version)
https://doi.org/10.48550/arXiv.2509.14959
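The alignment step described in the abstract (entropic OT followed by a top-k barycentric projection) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the cosine cost, the uniform marginals, and the values of `eps`, `n_iters`, and `k` are all assumptions made for illustration, and random arrays stand in for frame-level WavLM embeddings. Embedding extraction and vocoder decoding are omitted.

```python
import numpy as np


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations.

    Hypothetical helper: eps and n_iters are illustrative defaults.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # uniform mass on source frames (assumption)
    b = np.full(m, 1.0 / m)          # uniform mass on target frames (assumption)
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows toward marginal a
        v = b / (K.T @ u)            # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]


def topk_barycentric_projection(plan, target, k=4):
    """Map each source frame to the weighted mean of its k heaviest targets."""
    # Indices of the k largest plan entries in every row (order within k is arbitrary).
    idx = np.argpartition(-plan, k - 1, axis=1)[:, :k]
    w = np.take_along_axis(plan, idx, axis=1)
    w = w / w.sum(axis=1, keepdims=True)         # renormalise the kept mass
    return np.einsum("nk,nkd->nd", w, target[idx])


def align_frames(source, target, k=4, eps=0.1):
    """Align source frames to a target pool; cosine cost is an assumption."""
    s = source / np.linalg.norm(source, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    cost = 1.0 - s @ t.T
    plan = sinkhorn_plan(cost, eps=eps)
    return topk_barycentric_projection(plan, target, k=k), plan


# Random stand-ins for generated-speech frames and the bona fide pool.
rng = np.random.default_rng(0)
generated = rng.normal(size=(50, 16))
bona_fide_pool = rng.normal(size=(80, 16))
aligned, plan = align_frames(generated, bona_fide_pool, k=4)
```

Each row of `aligned` is a convex combination of `k` frames from the pool, so the converted frames stay inside the region spanned by the bona fide embeddings. With `k=1` the projection reduces to picking the single pool frame that receives the most mass from each source frame.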