[2603.04128] Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

arXiv:2603.04128 (cs), Computer Science > Computer Vision and Pattern Recognition
Submitted on 4 Mar 2026

Authors: Dongnuan Cai, Henghui Du, Chang Zhou, Xi Chen, Dan Guo, Hongyuan Zhang, Xuelong Li, Di Hu

Abstract: Developing Audio-Visual Large Language Models (AV-LLMs) for unified scene understanding is pivotal in multimodal intelligence. While instruction tuning enables pre-trained models with multi-task abilities, we observe that conventional multi-task unification methods often suffer from severe negative transfer, where nearly 55% of tasks degrade compared to single-task training. We attribute this phenomenon to audio-visual task heterogeneity, characterized by disparate task granularity and divergent capability demands, which lead to negative interference under joint training. To tackle this, we present Crab$^{+}$, a scalable and unified audio-visual scene understanding model that addresses task heterogeneity through explicit cooperation from both data and model perspectives. On the data side, we introduce AV-UIE v2, a comprehensive Audio-Visual Unified Instruction-tuning dataset with Explicit reasoning processes. It contains approximately 222K samples spanning 17 datase...

Originally published on March 05, 2026. Curated by AI News.
