[2603.28086] MOSS-VoiceGenerator: Create Realistic Voices with Natural Language Descriptions
Computer Science > Sound
arXiv:2603.28086 (cs)
[Submitted on 30 Mar 2026]

Title: MOSS-VoiceGenerator: Create Realistic Voices with Natural Language Descriptions

Authors: Kexin Huang, Liwei Fan, Botian Jiang, Yaozhou Jiang, Qian Tu, Jie Zhu, Yuqian Zhang, Yiwei Zhao, Chenchen Yang, Zhaoye Fei, Shimin Li, Xiaogui Yang, Qinyuan Cheng, Xipeng Qiu

Abstract: Voice design from natural language aims to generate speaker timbres directly from free-form textual descriptions, allowing users to create voices tailored to specific roles, personalities, and emotions. Such controllable voice creation benefits a wide range of downstream applications, including storytelling, game dubbing, role-play agents, and conversational assistants, making it a significant task for modern Text-to-Speech models. However, existing models are largely trained on carefully recorded studio data, which produces speech that is clean and well-articulated, yet lacks the lived-in qualities of real human voices. To address these limitations, we present MOSS-VoiceGenerator, an open-source instruction-driven voice generation model that creates new timbres directly from natural language prompts. Motivated by the hypothesis that exposure to real-world acoustic variation produces more perceptually natural voices, we train on large-scale expressive spe...