[2602.13650] KorMedMCQA-V: A Multimodal Benchmark for Evaluating Vision-Language Models on the Korean Medical Licensing Examination
Summary
The article presents KorMedMCQA-V, a benchmark dataset for evaluating vision-language models on the Korean Medical Licensing Examination, featuring 1,534 questions and 2,043 images spanning clinical imaging modalities such as X-ray, CT, ECG, ultrasound, and endoscopy.
Why It Matters
This benchmark advances the evaluation of vision-language models in medicine, particularly for Korean clinical practice. It provides a structured way to assess model performance on multimodal exam questions, including cases that require integrating evidence across several images, which is an essential step toward reliable AI applications in healthcare.
Key Takeaways
- KorMedMCQA-V includes 1,534 questions and 2,043 images from Korean medical exams.
- The authors evaluate over 50 vision-language models on the benchmark under a unified zero-shot protocol.
- Performance varies significantly across imaging modalities and multi-image questions.
- Reasoning-oriented models outperform instruction-tuned variants by up to 20 percentage points.
- The dataset complements existing text-only benchmarks for a comprehensive evaluation suite.
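The zero-shot protocol and per-modality accuracy reporting described above can be sketched as a simple scoring loop. This is a minimal illustration, not the paper's actual harness: the item fields (`question`, `images`, `options`, `answer`, `modality`) and the `model_answer` callable are hypothetical stand-ins for whatever interface a given VLM exposes.

```python
# Minimal sketch of zero-shot multiple-choice evaluation with
# per-modality accuracy. All field names and the model_answer
# callable are hypothetical, not the benchmark's real API.
from collections import defaultdict

def evaluate(items, model_answer):
    """Score a model on MCQA items; return overall and per-modality accuracy."""
    correct = 0
    by_modality = defaultdict(lambda: [0, 0])  # modality -> [correct, total]
    for item in items:
        pred = model_answer(item["question"], item["images"], item["options"])
        hit = pred == item["answer"]
        correct += hit
        by_modality[item["modality"]][0] += hit
        by_modality[item["modality"]][1] += 1
    overall = correct / len(items)
    per_modality = {m: c / t for m, (c, t) in by_modality.items()}
    return overall, per_modality

# Toy usage: a degenerate "model" that always answers "A".
items = [
    {"question": "Q1", "images": ["xray1.png"], "options": "ABCDE",
     "answer": "A", "modality": "X-ray"},
    {"question": "Q2", "images": ["ct1.png", "ct2.png"], "options": "ABCDE",
     "answer": "C", "modality": "CT"},
]
overall, per_modality = evaluate(items, lambda q, imgs, opts: "A")
print(overall, per_modality)  # 0.5 {'X-ray': 1.0, 'CT': 0.0}
```

Splitting accuracy by modality (and by single- versus multi-image items) is what surfaces the performance gaps the paper reports, since an aggregate score would mask them.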
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.13650 (cs)
[Submitted on 14 Feb 2026]
Title: KorMedMCQA-V: A Multimodal Benchmark for Evaluating Vision-Language Models on the Korean Medical Licensing Examination
Authors: Byungjin Choi, Seongsu Bae, Sunjun Kweon, Edward Choi
Abstract: We introduce KorMedMCQA-V, a Korean medical licensing-exam-style multimodal multiple-choice question answering benchmark for evaluating vision-language models (VLMs). The dataset consists of 1,534 questions with 2,043 associated images from Korean Medical Licensing Examinations (2012-2023), with about 30% containing multiple images requiring cross-image evidence integration. Images cover clinical modalities including X-ray, computed tomography (CT), electrocardiography (ECG), ultrasound, endoscopy, and other medical visuals. We benchmark over 50 VLMs across proprietary and open-source categories, spanning general-purpose, medical-specialized, and Korean-specialized families, under a unified zero-shot evaluation protocol. The best proprietary model (Gemini-3.0-Pro) achieves 96.9% accuracy, the best open-source model (Qwen3-VL-32B-Thinking) 83.7%, and the best Korean-specialized model (VARCO-VISION-2.0-14B) only 43.2%. We further find that reasoning-oriented model v...