[2509.19668] Selective Classifier-free Guidance for Zero-shot Text-to-speech
Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2509.19668 (eess)

[Submitted on 24 Sep 2025 (v1), last revised 24 Mar 2026 (this version, v2)]

Title: Selective Classifier-free Guidance for Zero-shot Text-to-speech

Authors: John Zheng, Farhad Maleki

Abstract: In zero-shot text-to-speech, achieving a balance between fidelity to the target speaker and adherence to text content remains a challenge. While classifier-free guidance (CFG) strategies have shown promising results in image generation, their application to speech synthesis is underexplored. Separating the conditions used for CFG enables trade-offs between different desired characteristics in speech synthesis. In this paper, we evaluate the adaptability of CFG strategies originally developed for image generation to speech synthesis and extend separated-condition CFG approaches for this domain. Our results show that CFG strategies effective in image generation generally fail to improve speech synthesis. We also find that we can improve speaker similarity while limiting degradation of text adherence by applying standard CFG during early timesteps and switching to selective CFG only in later timesteps. Surprisingly, we observe that the effectiveness of a selective CFG strategy is highly text-representation dependent, as differences betwe...
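The timestep-gated scheme described in the abstract (standard CFG early, selective CFG late) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the `t_switch` parameter, and the choice of the speaker embedding as the "selected" condition are assumptions made for the example.

```python
def cfg_step(pred_uncond, pred_cond, scale):
    """Standard classifier-free guidance: extrapolate from the fully
    unconditional prediction toward the fully conditional one."""
    return pred_uncond + scale * (pred_cond - pred_uncond)


def selective_cfg_step(pred_uncond, pred_text_only, pred_full,
                       scale, t, t_switch):
    """Hypothetical timestep-gated guidance. For early timesteps
    (t < t_switch) apply standard CFG; afterwards, apply selective CFG
    in which only the speaker condition is dropped in the negative
    branch (text is kept in both), so guidance acts on speaker identity
    while limiting degradation of text adherence."""
    if t < t_switch:
        return cfg_step(pred_uncond, pred_full, scale)
    # Selective CFG: the "negative" branch keeps the text condition,
    # so the guidance direction isolates the speaker condition.
    return pred_text_only + scale * (pred_full - pred_text_only)
```

In a diffusion- or flow-based TTS sampler, `selective_cfg_step` would replace the usual CFG combination at each denoising step, with `t_switch` tuned to trade speaker similarity against text adherence.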