[2506.05688] Voice Impression Control in Zero-Shot TTS
Summary
This paper presents a method for controlling voice impressions in zero-shot text-to-speech (TTS) systems. A low-dimensional vector sets the intensities of voice impression pairs (e.g., dark-bright), which modulates the para-/non-linguistic features of the synthesized speech.
Why It Matters
Zero-shot TTS already achieves high speaker fidelity, but modulating the subtle para-/non-linguistic information that shapes a listener's impression remains challenging. Controlling those impressions could make synthesized speech more expressive and more personal in applications such as virtual assistants and audiobooks, without manual tuning of the control vector.
Key Takeaways
- Introduces a method for voice impression control in zero-shot TTS.
- Represents the intensities of voice impression pairs (e.g., dark-bright) with a low-dimensional vector.
- Demonstrates effectiveness through objective and subjective evaluations.
- Generates impression vectors from natural language descriptions using large language models.
- Removes the need to manually optimize the impression vector when it is generated by the language model.
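
To make the idea of an impression vector concrete, here is a minimal Python sketch of one way such a vector could be stored and filled from a language model's reply. It is not the authors' implementation. The pair names other than dark-bright, the [-1, 1] intensity range, the JSON reply format, and every identifier (`IMPRESSION_PAIRS`, `ImpressionVector`, `build_prompt`) are assumptions made for illustration; the sketch does not call any real LLM or TTS API.

```python
import json
from dataclasses import dataclass

# Hypothetical impression pairs; only "dark-bright" appears in the abstract.
IMPRESSION_PAIRS = ("dark-bright", "weak-strong", "cold-warm")


@dataclass(frozen=True)
class ImpressionVector:
    """One intensity per impression pair, on an assumed scale of [-1, 1]."""

    values: tuple

    @classmethod
    def from_dict(cls, intensities):
        # Reject names that are not in the (assumed) pair list.
        unknown = set(intensities) - set(IMPRESSION_PAIRS)
        if unknown:
            raise ValueError(f"unknown impression pairs: {sorted(unknown)}")
        # Missing pairs default to 0.0 (neutral); values are clamped to [-1, 1].
        clamped = tuple(
            max(-1.0, min(1.0, float(intensities.get(pair, 0.0))))
            for pair in IMPRESSION_PAIRS
        )
        return cls(clamped)

    @classmethod
    def from_llm_reply(cls, reply_text):
        # Assumes the model answered with a JSON object: pair name -> number.
        return cls.from_dict(json.loads(reply_text))


def build_prompt(description):
    """Build a prompt asking a language model for intensities (illustrative wording)."""
    pairs = ", ".join(IMPRESSION_PAIRS)
    return (
        f"Rate the voice described as '{description}' on these impression pairs: "
        f"{pairs}. Reply with a JSON object mapping each pair to a number "
        "between -1 and 1."
    )


if __name__ == "__main__":
    print(build_prompt("a cheerful, gentle narrator"))
    # A hand-written stand-in for a model reply, not real model output.
    reply = '{"dark-bright": 0.8, "cold-warm": 0.5}'
    print(ImpressionVector.from_llm_reply(reply))
```

In a setup like this, the resulting vector would be passed to the TTS model as a conditioning input alongside the reference speech; how the paper's model consumes it is not described in the abstract.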
arXiv:2506.05688 (cs)
[Submitted on 6 Jun 2025 (v1), last revised 18 Feb 2026 (this version, v3)]
Title: Voice Impression Control in Zero-Shot TTS
Authors: Kenichi Fujita, Shota Horiguchi, Yusuke Ijima
Abstract: Para-/non-linguistic information in speech is pivotal in shaping the listeners' impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization. Audio examples are available on our demo page.
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as: arXiv:2506.05688 [cs.SD] (or arXiv:2506.05688v3 [cs.SD] for this versi...