[2510.10689] OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
Summary
The paper introduces OmniVideoBench, a benchmark designed to evaluate audio-visual understanding in multimodal large language models (MLLMs), addressing gaps in current evaluation methods.
Why It Matters
As MLLMs advance, effective evaluation of their audio-visual reasoning capabilities becomes crucial. OmniVideoBench aims to fill the gap left by existing benchmarks with a comprehensive assessment framework that emphasizes logical consistency and modality complementarity, both essential for developing more capable models.
Key Takeaways
- OmniVideoBench provides a rigorous framework for evaluating audio-visual understanding in MLLMs.
- The benchmark includes 1000 QA pairs spanning 13 question types, covering diverse reasoning challenges.
- Evaluation results highlight a significant performance gap between open-source and closed-source models.
- The benchmark aims to foster advancements in MLLMs with better reasoning capabilities.
- The public release of OmniVideoBench is intended to encourage further research and development in multimodal AI.
Computer Science > Artificial Intelligence
arXiv:2510.10689 (cs)
[Submitted on 12 Oct 2025 (v1), last revised 14 Feb 2026 (this version, v2)]
Title: OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
Authors: Caorui Li, Yu Chen, Yiyan Ji, Jin Xu, Zhenyu Cui, Shihao Li, Yuanxing Zhang, Wentao Wang, Zhenghao Song, Dingling Zhang, Ying He, Haoxiang Liu, Yuxuan Wang, Qiufeng Wang, Jiafu Tang, Zhenhe Wu, Jiehui Luo, Zhiyu Pan, Weihao Xie, Chenchen Zhang, Zhaohui Wang, Jiayi Tian, Yanghai Wang, Zhe Cao, Minxin Dai, Ke Wang, Runzhe Wen, Yinghao Ma, Yaning Pan, Sungkyun Chang, Termeh Taheri, Haiwen Xia, Christos Plachouras, Emmanouil Benetos, Yizhi Li, Ge Zhang, Jian Yang, Tianhao Peng, Zili Wang, Minghao Liu, Junran Peng, Zhaoxiang Zhang, Jiaheng Liu
Abstract: Recent advances in multimodal large language models (MLLMs) have demonstrated substantial potential in video understanding. However, existing benchmarks fail to comprehensively evaluate synergistic reasoning capabilities across audio and visual modalities, often neglecting either one of the modalities or integrating them in a logically inconsistent manner. To bridge this gap, we introduce OmniVideoBench, a large-scale and rigorously designed benchmark dedicated to assessing synergistic audio-visual understanding, with a strong empha...