[2509.24773] VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
arXiv:2509.24773 (eess) — Electrical Engineering and Systems Science > Audio and Speech Processing
[Submitted on 29 Sep 2025 (v1), last revised 20 Mar 2026 (this version, v4)]

Title: VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
Authors: Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua Song

Abstract: Video-conditioned audio generation, which includes Video-to-Sound (V2S) and Visual Text-to-Speech (VisualTTS), has traditionally been split into distinct tasks, leaving the potential for a unified generative framework largely underexplored. In this paper, we bridge this gap with VSSFlow, a unified flow-matching framework that seamlessly solves both problems. To handle multiple input signals effectively within a Diffusion Transformer (DiT) architecture, we propose a disentangled condition aggregation mechanism that leverages the distinct intrinsic properties of attention layers: cross-attention for semantic conditions and self-attention for temporally intensive conditions. Moreover, contrary to the prevailing belief that joint training on the two tasks leads to performance degradation, we demonstrate that VSSFlow maintains superior performance throughout the end-to-end joint learning process. Furthermore, we use a straightforward...
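To make the disentangled condition aggregation idea concrete, below is a minimal PyTorch sketch of a single DiT-style block that injects temporally aligned (frame-level) conditions through self-attention over the concatenated sequence and semantic conditions through a separate cross-attention layer. The class name, dimensions, ordering of operations, and conditioning inputs are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class DisentangledDiTBlock(nn.Module):
    """Hypothetical block sketching disentangled condition aggregation:
    temporal conditions are mixed via self-attention over the concatenated
    sequence; semantic conditions are injected via cross-attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_self = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_cross = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_ff = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, temporal_cond, semantic_cond):
        # x:             (B, T_audio, D) noisy audio latents
        # temporal_cond: (B, T_video, D) frame-aligned features (e.g. video frames)
        # semantic_cond: (B, T_sem,   D) global features (e.g. text / semantic tokens)

        # 1) Self-attention over latents concatenated with temporal conditions,
        #    so each audio latent can attend to temporally local visual evidence.
        seq = torch.cat([x, temporal_cond], dim=1)
        h = self.norm_self(seq)
        h, _ = self.self_attn(h, h, h)
        x = x + h[:, : x.shape[1]]  # keep only the audio-latent positions

        # 2) Cross-attention from audio latents to semantic conditions.
        h = self.norm_cross(x)
        h, _ = self.cross_attn(h, semantic_cond, semantic_cond)
        x = x + h

        # 3) Position-wise feed-forward.
        x = x + self.ff(self.norm_ff(x))
        return x


if __name__ == "__main__":
    block = DisentangledDiTBlock(dim=256)
    x = torch.randn(2, 100, 256)   # audio latents
    vid = torch.randn(2, 50, 256)  # frame-level video features
    sem = torch.randn(2, 16, 256)  # semantic/text tokens
    print(block(x, vid, sem).shape)  # torch.Size([2, 100, 256])
```

In this reading, the split follows the conditions' structure: temporally dense signals benefit from sharing the self-attention sequence with the audio latents, while global semantic signals are more naturally queried through cross-attention.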