[2503.10522] AudioX: A Unified Framework for Anything-to-Audio Generation
Summary
AudioX presents a unified framework for generating audio from diverse multimodal inputs (text, video, and audio), improving the quality and flexibility of audio generation through a novel Multimodal Adaptive Fusion module.
Why It Matters
This research addresses key challenges in audio generation by integrating multiple input modalities within a single model, with potential applications in music production, sound design, and AI-driven audio tools. The creation of the large-scale IF-caps dataset further supports the development of robust models in this area.
Key Takeaways
- AudioX integrates diverse multimodal inputs for audio generation.
- The framework includes a Multimodal Adaptive Fusion module for improved cross-modal alignment.
- A new dataset, IF-caps, contains over 7 million samples for training.
- AudioX outperforms existing methods in text-to-audio and text-to-music tasks.
- The research highlights the potential for powerful instruction-following capabilities in audio generation.
Computer Science > Multimedia
arXiv:2503.10522 (cs)
[Submitted on 13 Mar 2025 (v1), last revised 14 Feb 2026 (this version, v3)]
Title: AudioX: A Unified Framework for Anything-to-Audio Generation
Authors: Zeyue Tian, Zhaoyang Liu, Yizhu Jin, Ruibin Yuan, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo
Abstract: Audio and music generation based on flexible multimodal control signals is a widely applicable topic, with two key challenges: 1) a unified multimodal modeling framework, and 2) large-scale, high-quality training data. To address these, we propose AudioX, a unified framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). The core design in this framework is a Multimodal Adaptive Fusion module, which enables the effective fusion of diverse multimodal inputs, enhancing cross-modal alignment and improving overall generation quality. To train this unified model, we construct a large-scale, high-quality dataset, IF-caps, comprising over 7 million samples curated through a structured data annotation pipeline. This dataset provides comprehensive supervision for multimodal-conditioned audio generation. We benchmark AudioX against state-of-the-art methods across a wide range of tasks, finding that our model achieves superior performance, especially ...
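The abstract describes the Multimodal Adaptive Fusion module only at a high level, as a component that merges text, video, and audio conditions into one conditioning signal. The paper's actual architecture is not detailed here, so the following is a minimal, hypothetical sketch of one common way such a module could work: project each modality's features to a shared width, then combine them with softmax-normalized per-modality gates. All names (`AdaptiveFusion`, `gate`, the dimensions) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class AdaptiveFusion:
    """Hypothetical sketch of a multimodal adaptive fusion step:
    project each modality into a shared space, then take a gated
    (softmax-weighted) combination. Not the paper's architecture."""

    def __init__(self, dims, d_model):
        # one projection per modality; random stand-ins for trained weights
        self.proj = {m: rng.standard_normal((d, d_model)) / np.sqrt(d)
                     for m, d in dims.items()}
        # learnable scalar gate logits (initialized uniform)
        self.gate = {m: 0.0 for m in dims}

    def __call__(self, feats):
        # tolerate missing modalities, matching "anything-to-audio" usage
        names = [m for m in self.proj if m in feats]
        z = np.stack([feats[m] @ self.proj[m] for m in names])  # (M, T, d_model)
        w = softmax(np.array([self.gate[m] for m in names]))    # (M,)
        return np.tensordot(w, z, axes=1)                       # (T, d_model)

fusion = AdaptiveFusion({"text": 768, "video": 512, "audio": 128}, d_model=256)
cond = fusion({"text": rng.standard_normal((10, 768)),
               "video": rng.standard_normal((10, 512))})
print(cond.shape)  # (10, 256)
```

In this sketch the fused tensor `cond` would serve as the conditioning input to the audio generator; a real system would likely use learned cross-attention rather than scalar gates, but the gated form keeps the modality-weighting idea visible in a few lines.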