[2603.00546] Advancing Multimodal Judge Models through a Capability-Oriented Benchmark and MCTS-Driven Data Generation
Computer Science > Artificial Intelligence
arXiv:2603.00546 (cs)
[Submitted on 28 Feb 2026]

Title: Advancing Multimodal Judge Models through a Capability-Oriented Benchmark and MCTS-Driven Data Generation
Authors: Zeyu Chen, Huanjin Yao, Ziwang Zhao, Min Yang

Abstract: Using Multimodal Large Language Models (MLLMs) as judges to deliver precise and consistent evaluations is an emerging paradigm across many domains. Evaluating the capability and reliability of MLLM-as-a-judge systems is therefore essential for trustworthy assessment. Existing judge benchmarks categorize samples by task type but fail to capture the fundamental judgment capabilities required for reliable evaluation. In this work, we introduce M-JudgeBench, a ten-dimensional capability-oriented benchmark designed to comprehensively assess the judgment abilities of MLLMs. The benchmark decomposes evaluation into pairwise Chain-of-Thought (CoT) comparison, length-bias avoidance, and process error detection tasks, jointly covering ten fine-grained subtasks. This design enables diagnosis of model reliability across reasoning styles, response lengths, and cross-model variations. Our evaluation uncovers systematic weaknesses in existing MLLM-as-a-judge systems. To address this issue...
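To make the pairwise-comparison protocol described in the abstract concrete, below is a minimal sketch (not the authors' code, and not M-JudgeBench's actual interface) of scoring a judge's pairwise verdicts against human preference labels. The PairwiseSample schema and the length_biased_judge baseline are hypothetical illustrations; the baseline's built-in preference for longer responses is exactly the failure mode a length-bias-avoidance subtask is meant to expose, and the order-swap check guards against position bias.

    # Sketch of pairwise judge evaluation with an order-swap consistency check.
    # The sample schema and judge functions are hypothetical placeholders.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class PairwiseSample:
        question: str     # the original (possibly multimodal) query
        response_a: str   # candidate answer A, including its chain of thought
        response_b: str   # candidate answer B, including its chain of thought
        human_label: str  # ground-truth preference: "A" or "B"

    def length_biased_judge(question: str, a: str, b: str) -> str:
        """Naive baseline standing in for a real MLLM judge call.

        It simply prefers the longer response, illustrating the length
        bias that a capability-oriented benchmark would penalize.
        """
        return "A" if len(a) >= len(b) else "B"

    def pairwise_accuracy(samples: list[PairwiseSample],
                          judge: Callable[[str, str, str], str]) -> float:
        """Fraction of pairs where the judge matches the human label.

        Each pair is judged twice with the response order swapped; the
        judge is credited only when both verdicts are consistent and
        correct, so position bias is not rewarded.
        """
        correct = 0
        for s in samples:
            forward = judge(s.question, s.response_a, s.response_b)
            swapped = judge(s.question, s.response_b, s.response_a)
            swapped_back = "A" if swapped == "B" else "B"
            if forward == swapped_back == s.human_label:
                correct += 1
        return correct / len(samples) if samples else 0.0

    if __name__ == "__main__":
        demo = [PairwiseSample("2+2?", "4", "It is 4 because 2+2=4.", "A")]
        print(pairwise_accuracy(demo, length_biased_judge))  # 0.0: length bias misjudges

Swapping any real judge model in for length_biased_judge and comparing the two accuracies gives a simple diagnostic of whether the judge rewards verbosity rather than correctness.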