[2603.07990] MJ1: Multimodal Judgment via Grounded Verification
Computer Science > Machine Learning
arXiv:2603.07990 (cs)
[Submitted on 9 Mar 2026 (v1), last revised 24 Mar 2026 (this version, v2)]

Title: MJ1: Multimodal Judgment via Grounded Verification
Authors: Bhavesh Kumar, Dylan Feng, Leonard Tang

Abstract: Multimodal judges struggle to ground decisions in visual evidence. We present MJ1, a multimodal judge trained with reinforcement learning that enforces visual grounding through a structured grounded verification chain (observations $\rightarrow$ claims $\rightarrow$ verification $\rightarrow$ evaluation $\rightarrow$ scoring) and a counterfactual consistency reward that penalizes position bias. Even without training, our mechanism improves base-model accuracy on MMRB2 by +3.8 points on Image Editing and +1.7 on Multimodal Reasoning. After training, MJ1, with only 3B active parameters, achieves 77.0% accuracy on MMRB2 and surpasses orders-of-magnitude larger models like Gemini-3-Pro. These results show that grounded verification and consistency-based training substantially improve multimodal judgment without increasing model scale.

Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2603.07990 [cs.LG] (or arXiv:2603.07990v2 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2603.07990
Submission history: From Leonard Tang
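The counterfactual consistency reward mentioned in the abstract can be sketched as follows. This is a minimal illustration, assuming the judge emits a win probability for the first-listed candidate and scores each pair in both presentation orders; the function name and the exact functional form are hypothetical, not taken from the paper.

```python
def counterfactual_consistency_reward(pref_ab: float, pref_ba: float) -> float:
    """Penalize position bias in a pairwise judge (illustrative sketch).

    pref_ab: judge's probability that the first candidate wins when the
             pair is shown in order (A, B), i.e. P(A beats B | order A,B).
    pref_ba: the same probability when the pair is shown in order (B, A),
             i.e. P(B beats A | order B,A).

    A position-unbiased judge satisfies pref_ab + pref_ba = 1, so the
    reward is zero at perfect consistency and grows more negative as the
    verdict depends on presentation order.
    """
    return -abs(pref_ab - (1.0 - pref_ba))


if __name__ == "__main__":
    # Consistent judge: A wins regardless of order, so no penalty.
    print(counterfactual_consistency_reward(0.9, 0.1))
    # Biased judge: always prefers whichever candidate is listed first,
    # so the reward is strongly negative.
    print(counterfactual_consistency_reward(0.9, 0.9))
```

In a reinforcement-learning setup like the one the abstract describes, a term of this shape could be added to the accuracy reward so the policy is penalized whenever swapping candidate order flips its verdict.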