[2505.01448] OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models
Computer Science > Machine Learning

arXiv:2505.01448 (cs)

[Submitted on 30 Apr 2025 (v1), last revised 30 Mar 2026 (this version, v2)]

Title: OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models

Authors: Shengkai Chen, Yifang Yin, Jinming Cao, Shili Xiang, Zhenguang Liu, Roger Zimmermann

Abstract: Audio-visual segmentation aims to separate sounding objects from videos by predicting pixel-level masks based on audio signals. Existing methods primarily concentrate on closed-set scenarios and on direct audio-visual alignment and fusion, which limits their ability to generalize to new, unseen situations. In this paper, we propose OpenAVS, a novel training-free language-based approach that, for the first time, effectively aligns audio and visual modalities using text as a proxy for open-vocabulary Audio-Visual Segmentation (AVS). Equipped with multimedia foundation models, OpenAVS directly infers masks through 1) audio-to-text prompt generation, 2) LLM-guided prompt translation, and 3) text-to-visual sounding object segmentation. The objective of OpenAVS is to establish a simple yet flexible architecture that relies on the most appropriate foundation models by fully leveraging their capabilities to enable more effective knowledge transfer to the downstream AVS ...
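As a rough illustration of the three-step, training-free pipeline the abstract describes, the following Python sketch wires together placeholder components. The function names (`audio_to_text`, `llm_translate_prompt`, `text_to_mask`) and the stub behavior are assumptions for illustration only, not the paper's actual implementation; in a real system each step would wrap an off-the-shelf foundation model (an audio captioner, an LLM, and a text-promptable segmenter).

```python
# Minimal sketch of an OpenAVS-style training-free AVS pipeline, following the
# three steps named in the abstract. All three components here are stubs with
# assumed interfaces; the paper's actual foundation-model choices are not
# specified in this excerpt.
from dataclasses import dataclass

import numpy as np


@dataclass
class Frame:
    """A single video frame as an HxWx3 uint8 array."""
    pixels: np.ndarray


def audio_to_text(audio: np.ndarray) -> str:
    """Step 1 (assumed interface): caption the audio clip.

    A real system would call an audio-language foundation model here;
    this stub returns a fixed caption for illustration.
    """
    return "a dog is barking"


def llm_translate_prompt(audio_caption: str) -> str:
    """Step 2 (assumed interface): use an LLM to translate the audio
    caption into a visual object prompt (the sounding object's name)."""
    # A real system would prompt an LLM; here the mapping is hard-coded.
    return "dog"


def text_to_mask(frame: Frame, object_prompt: str) -> np.ndarray:
    """Step 3 (assumed interface): segment the named object in the frame.

    A real system would use a text-promptable segmentation model;
    this stub returns an all-background mask of the right shape.
    """
    h, w, _ = frame.pixels.shape
    return np.zeros((h, w), dtype=bool)


def open_avs(frames: list[Frame], audio: np.ndarray) -> list[np.ndarray]:
    """Training-free AVS: audio -> caption -> object prompt -> per-frame masks."""
    caption = audio_to_text(audio)          # 1) audio-to-text prompt generation
    prompt = llm_translate_prompt(caption)  # 2) LLM-guided prompt translation
    return [text_to_mask(f, prompt) for f in frames]  # 3) text-to-visual segmentation


if __name__ == "__main__":
    dummy_frames = [Frame(np.zeros((240, 320, 3), dtype=np.uint8))]
    dummy_audio = np.zeros(16000, dtype=np.float32)  # 1 s of silence at 16 kHz
    masks = open_avs(dummy_frames, dummy_audio)
    print(masks[0].shape)  # (240, 320)
```

Because text is the only bridge between the modalities, each stub can be swapped for a stronger foundation model independently, which is the flexibility the abstract claims for this architecture.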