[2509.06027] DreamAudio: Customized Text-to-Audio Generation with Diffusion Models
Computer Science > Sound

arXiv:2509.06027 (cs) [Submitted on 7 Sep 2025 (v1), last revised 24 Mar 2026 (this version, v2)]

Title: DreamAudio: Customized Text-to-Audio Generation with Diffusion Models

Authors: Yi Yuan, Xubo Liu, Haohe Liu, Xiyuan Kang, Zhuo Chen, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang

Abstract: With the development of large-scale diffusion-based and language-modeling-based generative models, impressive progress has been achieved in text-to-audio generation. Despite producing high-quality outputs, existing text-to-audio models mainly aim to generate semantically aligned sound and fall short of controlling the fine-grained acoustic characteristics of specific sounds. As a result, users who need specific sound content may find it difficult to generate the desired audio clips. In this paper, we present DreamAudio for customized text-to-audio generation (CTTA). Specifically, we introduce a new framework designed to enable the model to identify auditory information from user-provided reference concepts for audio generation. Given a few reference audio samples containing personalized audio events, our system can generate new audio samples that include these specific events. In addition, two types of datasets are developed for training and testing the proposed systems. The experiments show that DreamAu...