[2603.03241] UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?
Computer Science > Computer Vision and Pattern Recognition

arXiv:2603.03241 (cs)

[Submitted on 3 Mar 2026]

Title: UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

Authors: Zimo Wen, Boxiu Li, Wanbo Zhang, Junxiang Lei, Xiaoyu Chen, Yijia Fan, Qi Zhang, Yujiang Wang, Lili Qiu, Bo Li, Ziwei Liu, Caihua Shan, Yifan Yang, Yifei Shen

Abstract: Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark that organizes generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, requiring varying degrees of implicit or explicit visual transformation. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent gains emerge in spatial intelligence, visual-illusion, and multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3)...
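The abstract contrasts direct inference with Generate-then-Answer (GtA) inference, in which a unified model first synthesizes an intermediate image and then answers from that image. The sketch below is only an illustration of that distinction; the `UnifiedModel` interface, its method names, and the example prompts are hypothetical placeholders, not the benchmark's actual evaluation harness.

```python
# Minimal sketch of direct inference vs. Generate-then-Answer (GtA) inference.
# The UnifiedModel below is a hypothetical stand-in; real unified models and
# the UniG2U-Bench harness will differ.

from dataclasses import dataclass


@dataclass
class UnifiedModel:
    """Stand-in for a unified understanding + generation model (hypothetical)."""

    def generate_image(self, image: str, instruction: str) -> str:
        # Hypothetical: render an intermediate image that applies the requested
        # visual transformation (e.g., a rotation, occlusion removal, overlay).
        return f"<edited:{image}|{instruction}>"

    def answer(self, image: str, question: str) -> str:
        # Hypothetical: answer a question about the given image.
        return f"<answer to '{question}' given {image}>"


def direct_inference(model: UnifiedModel, image: str, question: str) -> str:
    """Answer the question from the original image in a single step."""
    return model.answer(image, question)


def generate_then_answer(model: UnifiedModel, image: str, question: str,
                         transform_instruction: str) -> str:
    """GtA: first generate the transformed scene, then answer from it."""
    intermediate = model.generate_image(image, transform_instruction)
    return model.answer(intermediate, question)


if __name__ == "__main__":
    model = UnifiedModel()
    img, q = "scene.png", "Which object is now on the left?"
    print(direct_inference(model, img, q))
    print(generate_then_answer(model, img, q, "rotate the scene 90 degrees"))
```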