[2512.02650] Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
Computer Science > Computer Vision and Pattern Recognition
arXiv:2512.02650 (cs)
[Submitted on 2 Dec 2025 (v1), last revised 27 Mar 2026 (this version, v2)]

Title: Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
Authors: Junwon Lee, Juhan Nam, Jiyoung Lee

Abstract: This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where the audio track for each sound source is handled individually for precise editing, mixing, and creative control. We propose SELVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector, extracting prompt-relevant sound-source visual features from the video encoder. To suppress text-irrelevant activations while finetuning the video encoder efficiently, the proposed supplementary tokens promote cross-attention that yields robust semantic and temporal grounding. SELVA further employs an autonomous video-mixing scheme, trained in a self-supervised manner, to overcome the lack of mono audio track supervision. We evaluate SELVA on VGG-MONOAUDIO, a curated benchmark of clean single-source videos for this task. Extensive experiments and ablations consistently verify its effectiveness...
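The abstract names two mechanisms: text-as-selector cross-attention with supplementary tokens, and a self-supervised video-mixing scheme. The sketch below illustrates one plausible reading of the first mechanism in PyTorch. It is a minimal sketch under stated assumptions, not the authors' implementation: the module name, dimensions, number of supplementary tokens, and the residual/normalization layout are all illustrative guesses based only on the abstract.

```python
# Minimal sketch (assumption): the text prompt acts as an explicit selector
# over video tokens via cross-attention, with learnable "supplementary"
# tokens appended to the text keys/values as a sink for text-irrelevant
# activations. Names and hyperparameters are hypothetical.

import torch
import torch.nn as nn


class SelectiveCrossAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, num_supp: int = 4):
        super().__init__()
        # Learnable supplementary tokens, shared across the batch.
        self.supp_tokens = nn.Parameter(torch.randn(1, num_supp, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # video_tokens: (B, T_v, D) features from a (frozen or lightly
        # finetuned) video encoder; text_tokens: (B, T_t, D) prompt embeddings.
        batch = video_tokens.size(0)
        supp = self.supp_tokens.expand(batch, -1, -1)
        # Video tokens query the prompt (plus supplementary tokens), so the
        # output emphasizes prompt-relevant sound-source features while
        # irrelevant content can attend to the supplementary slots instead.
        kv = torch.cat([text_tokens, supp], dim=1)
        selected, _ = self.attn(query=video_tokens, key=kv, value=kv)
        return self.norm(video_tokens + selected)


if __name__ == "__main__":
    layer = SelectiveCrossAttention()
    video = torch.randn(2, 32, 512)  # e.g. 32 frame/patch tokens
    text = torch.randn(2, 12, 512)   # e.g. 12 prompt tokens
    print(layer(video, text).shape)  # torch.Size([2, 32, 512])
```

The video-mixing scheme is described only at a high level. One plausible reading, sketched below purely as an assumption, is that two single-source clips are combined into a synthetic multi-object video while the clean mono track of the prompted source serves as the training target, which would supply exactly the per-source supervision the abstract says is missing.

```python
# Hypothetical sketch of self-supervised mixing: combine two single-source
# clips (here, side by side) and keep source A's clean track as the target
# for a prompt describing source A. Spatial stacking and audio summing are
# guesses; the paper's actual mixing strategy may differ.

import torch


def mix_for_selective_supervision(video_a, audio_a, video_b, audio_b):
    # video_*: (T, C, H, W) frame tensors with matching T, C, H.
    # audio_*: (L,) mono waveforms with matching length L.
    mixed_video = torch.cat([video_a, video_b], dim=-1)  # (T, C, H, 2W)
    mixed_audio = audio_a + audio_b                      # summed soundtrack
    target = audio_a  # clean mono track of the prompted source
    return mixed_video, mixed_audio, target
```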