[2603.04128] Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

arXiv - AI March 05, 2026 4 min read

Computer Science > Computer Vision and Pattern Recognition

arXiv:2603.04128 (cs) [Submitted on 4 Mar 2026]

Title: Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Authors: Dongnuan Cai, Henghui Du, Chang Zhou, Xi Chen, Dan Guo, Hongyuan Zhang, Xuelong Li, Di Hu

Abstract: Developing Audio-Visual Large Language Models (AV-LLMs) for unified scene understanding is pivotal in multimodal intelligence. While instruction tuning enables pre-trained models with multi-task abilities, we observe that conventional multi-task unification methods often suffer from severe negative transfer, where nearly 55% of tasks degrade compared to single-task training. We attribute this phenomenon to audio-visual task heterogeneity, characterized by disparate task granularity and divergent capability demands, which lead to negative interference under joint training. To tackle this, we present Crab$^{+}$, a scalable and unified audio-visual scene understanding model that addresses task heterogeneity through explicit cooperation from both data and model perspectives. On the data side, we introduce AV-UIE v2, a comprehensive Audio-Visual Unified Instruction-tuning dataset with Explicit reasoning processes. It contains approximately 222K samples spanning 17 datase...
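The abstract's negative-transfer figure (the share of tasks that degrade under joint training relative to single-task training) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name, the task names, and the scores below are hypothetical, and the sketch assumes every task is scored with a higher-is-better metric.

```python
from typing import Dict


def negative_transfer_rate(single: Dict[str, float], multi: Dict[str, float]) -> float:
    """Return the fraction of shared tasks whose multi-task score is
    strictly lower than the single-task score (higher is better)."""
    shared = sorted(set(single) & set(multi))
    if not shared:
        raise ValueError("no tasks in common")
    degraded = [task for task in shared if multi[task] < single[task]]
    return len(degraded) / len(shared)


# Hypothetical scores for illustration only; these are NOT results from the paper.
single_task = {"task_a": 0.80, "task_b": 0.65, "task_c": 0.72, "task_d": 0.90}
multi_task = {"task_a": 0.78, "task_b": 0.70, "task_c": 0.69, "task_d": 0.91}

rate = negative_transfer_rate(single_task, multi_task)
print(f"{rate:.0%} of tasks degrade under joint training")
```

With these made-up numbers, two of the four tasks score lower under joint training, so the rate is 0.5. A real comparison would also need per-task metrics normalized to a common direction and some tolerance for run-to-run noise.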

Originally published on March 05, 2026. Curated by AI News.
