[2603.19274] CURE: A Multimodal Benchmark for Clinical Understanding and Retrieval Evaluation
Computer Science > Computation and Language

arXiv:2603.19274 (cs)

[Submitted on 28 Feb 2026]

Title: CURE: A Multimodal Benchmark for Clinical Understanding and Retrieval Evaluation

Authors: Yannian Gu, Zhongzhen Huang, Linjie Mu, Xizhuo Zhang, Shaoting Zhang, Xiaofan Zhang

Abstract: Multimodal large language models (MLLMs) demonstrate considerable potential in clinical diagnostics, a domain that inherently requires synthesizing complex visual and textual data while consulting authoritative medical literature. However, existing benchmarks evaluate MLLMs primarily in end-to-end answering scenarios, which makes it difficult to disentangle a model's foundational multimodal reasoning from its proficiency in retrieving and applying evidence. We introduce the Clinical Understanding and Retrieval Evaluation (CURE) benchmark. Comprising 500 multimodal clinical cases mapped to physician-cited reference literature, CURE evaluates reasoning and retrieval under controlled evidence settings to disentangle their respective contributions. We evaluate state-of-the-art MLLMs across distinct evidence-gathering paradigms in both closed-ended and open-ended diagnosis tasks. Evaluations reveal a stark dichotomy: while advanced models demonstrate clini...
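To make the abstract's "controlled evidence settings" concrete, here is a minimal Python sketch of how such an evaluation harness might be organized. It assumes three hypothetical settings (no evidence, the gold physician-cited references, and model-retrieved evidence); the `ClinicalCase` schema and the `run_mllm`/`retrieve` interfaces are placeholders invented for illustration, not the authors' released code, and the paper's actual protocol and metrics are not specified here.

    # Sketch of a controlled-evidence evaluation over CURE-style cases.
    # All names below (ClinicalCase, run_mllm, retrieve, the setting
    # labels) are hypothetical; only the overall structure is implied
    # by the abstract.
    from dataclasses import dataclass

    @dataclass
    class ClinicalCase:
        case_id: str
        images: list[str]      # paths to the case's medical images
        narrative: str         # textual presentation of the case
        references: list[str]  # physician-cited reference literature
        answer: str            # gold diagnosis

    def run_mllm(images: list[str], prompt: str) -> str:
        """Placeholder for a call to a multimodal LLM."""
        raise NotImplementedError

    def retrieve(case: ClinicalCase) -> list[str]:
        """Placeholder retriever over a literature corpus."""
        raise NotImplementedError

    def build_prompt(case: ClinicalCase, setting: str) -> str:
        base = f"Case: {case.narrative}\nGive the most likely diagnosis."
        if setting == "no_evidence":          # reasoning alone
            return base
        if setting == "gold_evidence":        # reasoning + cited literature
            return base + "\nReference literature:\n" + "\n".join(case.references)
        if setting == "retrieved_evidence":   # reasoning + retrieval
            return base + "\nRetrieved literature:\n" + "\n".join(retrieve(case))
        raise ValueError(f"unknown setting: {setting}")

    def evaluate(cases: list[ClinicalCase],
                 settings=("no_evidence", "gold_evidence", "retrieved_evidence")):
        # Exact-match accuracy per setting; a closed-ended-style metric.
        # Open-ended diagnosis would need a softer scorer (e.g. a judge model).
        scores = {s: 0 for s in settings}
        for case in cases:
            for s in settings:
                pred = run_mllm(case.images, build_prompt(case, s))
                scores[s] += int(pred.strip().lower() == case.answer.strip().lower())
        return {s: n / len(cases) for s, n in scores.items()}

Comparing accuracy across the three settings is what lets a harness like this attribute a model's errors to reasoning (fails even with gold evidence) versus retrieval (fails only when it must gather evidence itself), which is the disentanglement the abstract describes.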