[2509.24773] VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
arXiv:2509.24773 (eess) — Electrical Engineering and Systems Science > Audio and Speech Processing
[Submitted on 29 Sep 2025 (v1), last revised 20 Mar 2026 (this version, v4)]

Title: VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
Authors: Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua Song

Abstract: Video-conditioned audio generation, which includes Video-to-Sound (V2S) and Visual Text-to-Speech (VisualTTS), has traditionally been split into distinct tasks, leaving the potential for a unified generative framework largely underexplored. In this paper, we bridge this gap with VSSFlow, a unified flow-matching framework that seamlessly solves both problems. To handle multiple input signals effectively within a Diffusion Transformer (DiT) architecture, we propose a disentangled condition aggregation mechanism that leverages the distinct intrinsic properties of attention layers: cross-attention for semantic conditions and self-attention for temporally intensive conditions. Moreover, contrary to the prevailing belief that joint training on the two tasks leads to performance degradation, we demonstrate that VSSFlow maintains superior performance throughout the end-to-end joint learning process. Furthermore, we use a straightforward...
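To make the disentangled condition aggregation idea concrete, below is a minimal PyTorch sketch of a single DiT-style block that injects temporally aligned (frame-level) conditions through self-attention over the concatenated sequence and semantic conditions through a separate cross-attention layer. The class name, dimensions, ordering of operations, and conditioning inputs are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class DisentangledDiTBlock(nn.Module):
    """Hypothetical block sketching disentangled condition aggregation:
    temporal conditions are mixed via self-attention over the concatenated
    sequence; semantic conditions are injected via cross-attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_self = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_cross = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_ff = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, temporal_cond, semantic_cond):
        # x:             (B, T_audio, D) noisy audio latents
        # temporal_cond: (B, T_video, D) frame-aligned features (e.g. video frames)
        # semantic_cond: (B, T_sem,   D) global features (e.g. text / semantic tokens)

        # 1) Self-attention over latents concatenated with temporal conditions,
        #    so each audio latent can attend to temporally local visual evidence.
        seq = torch.cat([x, temporal_cond], dim=1)
        h = self.norm_self(seq)
        h, _ = self.self_attn(h, h, h)
        x = x + h[:, : x.shape[1]]  # keep only the audio-latent positions

        # 2) Cross-attention from audio latents to semantic conditions.
        h = self.norm_cross(x)
        h, _ = self.cross_attn(h, semantic_cond, semantic_cond)
        x = x + h

        # 3) Position-wise feed-forward.
        x = x + self.ff(self.norm_ff(x))
        return x


if __name__ == "__main__":
    block = DisentangledDiTBlock(dim=256)
    x = torch.randn(2, 100, 256)   # audio latents
    vid = torch.randn(2, 50, 256)  # frame-level video features
    sem = torch.randn(2, 16, 256)  # semantic/text tokens
    print(block(x, vid, sem).shape)  # torch.Size([2, 100, 256])
```

In this reading, the split follows the conditions' structure: temporally dense signals benefit from sharing the self-attention sequence with the audio latents, while global semantic signals are more naturally queried through cross-attention.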