[2506.05688] Voice Impression Control in Zero-Shot TTS


Summary

This paper presents a method for controlling voice impressions in zero-shot text-to-speech (TTS) systems. A low-dimensional vector sets the intensity of each voice impression pair (e.g., dark-bright), which modulates the para-/non-linguistic features of the synthesized speech.

Why It Matters

The ability to control voice impressions in TTS systems enhances user experience and personalization in applications like virtual assistants and audiobooks. This research addresses a significant gap in TTS technology, enabling more nuanced and expressive speech synthesis without extensive manual tuning.

Key Takeaways

  • Introduces a method for voice impression control in zero-shot TTS.
  • Uses a low-dimensional vector to represent the intensities of voice impression pairs (e.g., dark-bright).
  • Demonstrates effectiveness through objective and subjective evaluations.
  • Generates impression vectors from natural language descriptions using large language models.
  • Removes the need to manually optimize the impression vector, since the LLM produces it from the description.
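
To make the idea of an impression vector concrete, here is a minimal sketch of how such a vector might be represented and attached to a speaker embedding. It is an illustration under stated assumptions, not the authors' implementation: only the "dark-bright" pair appears in the paper's abstract, while the other pair names, the [-1, 1] intensity range, the `ImpressionVector` class, and the concatenation in `condition` are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical impression pairs. Only "dark-bright" is named in the abstract;
# the other two are placeholders for illustration.
IMPRESSION_PAIRS = ("dark-bright", "weak-strong", "cold-warm")


@dataclass
class ImpressionVector:
    """Maps each impression pair to an intensity (assumed range: -1 to 1)."""

    intensities: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject pair names that are not part of the assumed vocabulary.
        unknown = set(self.intensities) - set(IMPRESSION_PAIRS)
        if unknown:
            raise ValueError(f"unknown impression pairs: {sorted(unknown)}")

    def as_list(self, low=-1.0, high=1.0):
        """Return intensities in a fixed order, clamped to [low, high].

        Pairs that were not specified default to 0.0 (neutral).
        """
        values = []
        for pair in IMPRESSION_PAIRS:
            raw = float(self.intensities.get(pair, 0.0))
            values.append(max(low, min(high, raw)))
        return values


def condition(speaker_embedding, impression):
    """Concatenate a speaker embedding with the impression vector.

    Concatenation is an assumed conditioning scheme, chosen for simplicity.
    """
    return list(speaker_embedding) + impression.as_list()


# Usage: ask for a brighter, colder voice; the out-of-range value is clamped.
vector = ImpressionVector({"dark-bright": 0.8, "cold-warm": -2.0})
conditioned = condition([0.1, 0.2], vector)
```

In a setup like this, the natural-language route described in the paper would amount to having a large language model fill in the intensity values from a text description instead of setting them by hand.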

arXiv:2506.05688 (cs)
[Submitted on 6 Jun 2025 (v1), last revised 18 Feb 2026 (this version, v3)]

Title: Voice Impression Control in Zero-Shot TTS
Authors: Kenichi Fujita, Shota Horiguchi, Yusuke Ijima

Abstract: Para-/non-linguistic information in speech is pivotal in shaping the listeners' impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization. Audio examples are available on our demo page (this https URL).

Subjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as: arXiv:2506.05688 [cs.SD] (or arXiv:2506.05688v3 [cs.SD] for this versi...
