[2601.08133] How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?
Computer Science > Computer Vision and Pattern Recognition
arXiv:2601.08133 (cs)
[Submitted on 13 Jan 2026 (v1), last revised 2 Mar 2026 (this version, v2)]

Title: How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?
Authors: Yujian Lee, Peng Gao, Yongqi Xu, Wentao Fan

Abstract: Audio-visual semantic segmentation (AVSS) extends the audio-visual segmentation (AVS) task: beyond merely identifying sound-emitting objects at the visual pixel level, it requires a semantic understanding of audio-visual scenes. Unlike a previous methodology that decomposes the AVSS task into two discrete subtasks, first producing a prompted segmentation mask and then performing semantic analysis on it, our approach builds on and improves this foundational strategy. We introduce a novel collaborative framework, Stepping Stone Plus (SSP), which integrates optical flow and textual prompts to assist the segmentation process. In scenarios where sound sources frequently coexist with moving objects, our pre-mask technique leverages optical flow to capture motion dynamics, providing essential temporal context for precise segmentation. To address the challenge posed by stationary sound-emitting objects, such as alarm clocks...
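The pre-mask idea sketched in the abstract, deriving a candidate motion mask from the magnitude of a dense optical-flow field, can be illustrated roughly as follows. This is a minimal NumPy sketch under assumptions of our own: the function name `motion_pre_mask`, the thresholding rule, and the toy flow field are illustrative, not the paper's actual implementation.

```python
import numpy as np

def motion_pre_mask(flow, threshold=1.0):
    """Binary pre-mask from a dense optical-flow field.

    flow: (H, W, 2) array of per-pixel displacement (dx, dy).
    Pixels whose flow magnitude exceeds `threshold` are marked
    as candidate moving (potentially sound-emitting) regions.
    """
    magnitude = np.linalg.norm(flow, axis=-1)  # (H, W) per-pixel speed
    return magnitude > threshold               # boolean pre-mask

# Toy example: a 4x4 frame where only the top-left 2x2 block moves.
flow = np.zeros((4, 4, 2))
flow[:2, :2] = [3.0, 0.0]  # strong horizontal motion in that block
mask = motion_pre_mask(flow, threshold=1.0)
print(mask.sum())  # 4 moving pixels flagged
```

In practice the flow field would come from an optical-flow estimator applied to consecutive video frames; the resulting mask then serves as temporal context for the downstream segmentation, which the paper complements with textual prompts for stationary sound sources.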