[2506.05688] Voice Impression Control in Zero-Shot TTS

arXiv - Machine Learning, February 19, 2026

Summary

This paper presents a method for controlling voice impressions in zero-shot text-to-speech (TTS). A low-dimensional vector sets the intensity of each voice impression pair (e.g., dark-bright), which lets the system modulate para-/non-linguistic features of the synthesized speech.

Why It Matters

Controlling voice impressions lets TTS applications such as virtual assistants and audiobooks tailor how a synthesized voice is perceived. Zero-shot TTS already achieves high speaker fidelity, but adjusting subtle para-/non-linguistic information has remained challenging. This work targets that gap and, by generating the control vector from a natural language description, avoids tuning the vector by hand.

Key Takeaways

Introduces a method for voice impression control in zero-shot TTS.
Uses a low-dimensional vector to represent the intensities of voice impression pairs (e.g., dark-bright).
Reports that both objective and subjective evaluations show effective impression control.
Generates the impression vector from a natural language description using a large language model.
Removes the need to manually optimize the impression vector for a target impression.
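To make the idea of an impression vector concrete, here is a minimal sketch of how such a vector might be represented and filled in from a structured reply of the kind an LLM could be prompted to return. It is an illustration under stated assumptions, not the authors' implementation: only the dark-bright pair is named in the abstract, so the other pair names, the ordering, the [-1, 1] intensity range, the neutral default of 0.0, and the function names are all hypothetical.

```python
import json
from typing import Dict, List

# Hypothetical impression pairs. Only "dark-bright" is named in the abstract;
# the other two, and the ordering, are illustrative assumptions.
IMPRESSION_PAIRS = ("dark-bright", "weak-powerful", "cold-warm")

# Assumed intensity range; the abstract does not state the scale.
LOW, HIGH = -1.0, 1.0


def make_impression_vector(intensities: Dict[str, float]) -> List[float]:
    """Build a fixed-order vector with one clamped intensity per pair."""
    unknown = sorted(set(intensities) - set(IMPRESSION_PAIRS))
    if unknown:
        raise ValueError(f"unknown impression pairs: {unknown}")
    # Pairs that are not mentioned default to 0.0 (treated here as neutral).
    return [
        min(HIGH, max(LOW, float(intensities.get(pair, 0.0))))
        for pair in IMPRESSION_PAIRS
    ]


def vector_from_llm_json(reply: str) -> List[float]:
    """Parse a JSON object, such as an LLM might be prompted to return."""
    parsed = json.loads(reply)
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object mapping pair names to numbers")
    return make_impression_vector(parsed)


if __name__ == "__main__":
    # Out-of-range values are clamped; missing pairs stay neutral.
    print(make_impression_vector({"dark-bright": 2.5, "cold-warm": -0.4}))
    print(vector_from_llm_json('{"dark-bright": 0.8}'))
```

In a full system the resulting vector would be passed to the TTS model as a conditioning input alongside the text and the reference speech. The abstract does not describe how that conditioning is done, so it is left out of this sketch.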

arXiv:2506.05688 (cs)
[Submitted on 6 Jun 2025 (v1), last revised 18 Feb 2026 (this version, v3)]

Title: Voice Impression Control in Zero-Shot TTS

Authors: Kenichi Fujita, Shota Horiguchi, Yusuke Ijima

Abstract: Para-/non-linguistic information in speech is pivotal in shaping the listeners' impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization. Audio examples are available on our demo page.

Subjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Cite as: arXiv:2506.05688 [cs.SD] (or arXiv:2506.05688v3 [cs.SD] for this versi...
