[2502.04326] WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
Computer Science > Computer Vision and Pattern Recognition
arXiv:2502.04326 (cs)
[Submitted on 6 Feb 2025 (v1), last revised 1 Mar 2026 (this version, v3)]
Title: WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
Authors: Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie
Abstract: We introduce WorldSense, the first benchmark to assess multi-modal video understanding that simultaneously encompasses visual, audio, and text inputs. In contrast to existing benchmarks, WorldSense has several features: (i) collaboration of omni-modality: the evaluation tasks are designed with a strong coupling of audio and video, requiring models to effectively exploit the synergistic perception of omni-modality; (ii) diversity of videos and tasks: WorldSense encompasses a diverse collection of 1,662 audio-visually synchronised videos, systematically categorized into 8 primary domains and 67 fine-grained subcategories to cover broad scenarios, and 3,172 multiple-choice QA pairs across 26 distinct tasks to enable comprehensive evaluation; (iii) high-quality annotations: all QA pairs are manually labeled by 80 expert annotators with multiple rounds of correction to ensure quality. Based on WorldSense, we extensively evaluate various state-of-the-art models. The expe...
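For illustration, below is a minimal sketch of how a WorldSense-style audio-visual multiple-choice QA sample might be represented and scored. The field names, the `AVQASample` class, and the `accuracy` helper are assumptions made for this sketch, not the benchmark's released data format or evaluation code.

```python
from dataclasses import dataclass

# Hypothetical record layout for one WorldSense-style sample; field names are
# illustrative assumptions, not the benchmark's released schema.
@dataclass
class AVQASample:
    video_path: str        # audio-visually synchronised clip
    domain: str            # one of the 8 primary domains
    subcategory: str       # one of the 67 fine-grained subcategories
    task: str              # one of the 26 task types
    question: str
    options: list[str]     # multiple-choice candidates
    answer: str            # ground-truth option label, e.g. "B"

def accuracy(samples: list[AVQASample], predict) -> float:
    """Fraction of samples where the model's chosen option matches the label."""
    correct = sum(predict(s) == s.answer for s in samples)
    return correct / len(samples) if samples else 0.0
```

Under this sketch, `predict` would wrap a multimodal LLM that consumes the video (frames plus audio track) and the question with its options, and returns a single option label; per-domain or per-task scores would follow by filtering `samples` before calling `accuracy`.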