[2510.10689] OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

arXiv - AI · 4 min read

Summary

The paper introduces OmniVideoBench, a benchmark designed to evaluate audio-visual understanding in multimodal large language models (MLLMs), addressing gaps in current evaluation methods.

Why It Matters

As MLLMs advance, rigorous evaluation of their audio-visual reasoning capabilities becomes crucial. OmniVideoBench fills a gap in existing benchmarks by providing a comprehensive assessment framework that emphasizes logical consistency and modality complementarity, both of which are essential for developing more capable models.

Key Takeaways

  • OmniVideoBench provides a rigorous framework for evaluating audio-visual understanding in MLLMs.
  • The benchmark comprises 1,000 question-answer (QA) pairs spanning 13 question types, covering diverse reasoning challenges.
  • Evaluation results highlight a significant performance gap between open-source and closed-source models.
  • The benchmark aims to foster advancements in MLLMs with better reasoning capabilities.
  • Releasing OmniVideoBench will encourage further research and development in multimodal AI.
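
The takeaways above describe a multiple-choice QA benchmark bucketed by question type. As a rough illustration of how such a benchmark is typically scored, here is a minimal sketch that computes overall and per-type accuracy. The record fields (`qtype`, `answer`, `prediction`) and the example question types are hypothetical, not the paper's actual data format.

```python
from collections import defaultdict

def score_benchmark(records):
    """Compute overall and per-question-type accuracy.

    Each record is a dict with (hypothetical) keys:
    'qtype' (question type), 'answer' (gold choice),
    'prediction' (model's choice).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["qtype"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["qtype"]] += 1
    per_type = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_type

# Toy run with two illustrative question types
records = [
    {"qtype": "audio-visual causal", "answer": "B", "prediction": "B"},
    {"qtype": "audio-visual causal", "answer": "A", "prediction": "C"},
    {"qtype": "temporal order", "answer": "D", "prediction": "D"},
]
overall, per_type = score_benchmark(records)
```

Per-type breakdowns like this are what expose the open-source vs. closed-source gap the summary mentions: two models with similar overall accuracy can differ sharply on specific reasoning categories.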

Computer Science > Artificial Intelligence
arXiv:2510.10689 (cs)
[Submitted on 12 Oct 2025 (v1), last revised 14 Feb 2026 (this version, v2)]

Title: OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

Authors: Caorui Li, Yu Chen, Yiyan Ji, Jin Xu, Zhenyu Cui, Shihao Li, Yuanxing Zhang, Wentao Wang, Zhenghao Song, Dingling Zhang, Ying He, Haoxiang Liu, Yuxuan Wang, Qiufeng Wang, Jiafu Tang, Zhenhe Wu, Jiehui Luo, Zhiyu Pan, Weihao Xie, Chenchen Zhang, Zhaohui Wang, Jiayi Tian, Yanghai Wang, Zhe Cao, Minxin Dai, Ke Wang, Runzhe Wen, Yinghao Ma, Yaning Pan, Sungkyun Chang, Termeh Taheri, Haiwen Xia, Christos Plachouras, Emmanouil Benetos, Yizhi Li, Ge Zhang, Jian Yang, Tianhao Peng, Zili Wang, Minghao Liu, Junran Peng, Zhaoxiang Zhang, Jiaheng Liu

Abstract: Recent advances in multimodal large language models (MLLMs) have demonstrated substantial potential in video understanding. However, existing benchmarks fail to comprehensively evaluate synergistic reasoning capabilities across audio and visual modalities, often neglecting either one of the modalities or integrating them in a logically inconsistent manner. To bridge this gap, we introduce OmniVideoBench, a large-scale and rigorously designed benchmark dedicated to assessing synergistic audio-visual understanding, with a strong empha...

