[2510.15148] XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
Computer Science > Computer Vision and Pattern Recognition
arXiv:2510.15148 (cs)
[Submitted on 16 Oct 2025 (v1), last revised 4 Apr 2026 (this version, v2)]

Title: XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
Authors: Xingrui Wang, Jiang Liu, Chao Huang, Xiaodong Yu, Ze Wang, Ximeng Sun, Jialian Wu, Alan Yuille, Emad Barsoum, Zicheng Liu

Abstract: Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks primarily evaluate general cross-modal question-answering ability, it remains unclear whether OLLMs achieve modality-invariant reasoning or exhibit modality-specific biases. We introduce XModBench, a large-scale tri-modal benchmark explicitly designed to measure cross-modal consistency. XModBench comprises 60,828 multiple-choice questions spanning five task families and systematically covers all six modality compositions in question-answer pairs, enabling fine-grained diagnosis of an OLLM's modality-invariant reasoning, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, achieving less than 60% accuracy, (ii) reveals persistent modality disparities, with pe...
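
To make the modality-composition structure concrete, below is a minimal Python sketch. It assumes the six compositions are the ordered (question modality, answer modality) pairs over {text, vision, audio} with distinct modalities, and the consistency_gap helper is a hypothetical disparity score for illustration; neither is taken from the paper's released evaluation code.

```python
from itertools import permutations

# Assumption: the six compositions are ordered (question, answer) pairs
# drawn from three modalities, excluding same-modality pairs.
MODALITIES = ("text", "vision", "audio")
COMPOSITIONS = list(permutations(MODALITIES, 2))  # 6 ordered pairs
assert len(COMPOSITIONS) == 6

def consistency_gap(accuracy_by_composition: dict) -> float:
    """Hypothetical disparity score: spread between the best- and
    worst-performing modality composition for one task family."""
    scores = list(accuracy_by_composition.values())
    return max(scores) - min(scores)

# Example: a model with modality-invariant reasoning would have a gap
# near zero; modality-specific biases inflate it (numbers are made up).
example = {("text", "vision"): 0.72, ("vision", "text"): 0.58,
           ("audio", "text"): 0.49, ("text", "audio"): 0.55,
           ("vision", "audio"): 0.47, ("audio", "vision"): 0.51}
print(consistency_gap(example))  # 0.25
```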