[2602.23652] 3D Modality-Aware Pre-training for Vision-Language Model in MRI Multi-organ Abnormality Detection
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.23652 (cs)
[Submitted on 27 Feb 2026]

Title: 3D Modality-Aware Pre-training for Vision-Language Model in MRI Multi-organ Abnormality Detection
Authors: Haowen Zhu, Ning Yin, Xiaogen Zhou

Abstract: Vision-language models (VLMs) show strong potential for complex diagnostic tasks in medical imaging. However, applying VLMs to multi-organ medical imaging introduces two principal challenges: (1) modality-specific vision-language alignment and (2) cross-modal feature fusion. In this work, we propose MedMAP, a Medical Modality-Aware Pretraining framework that enhances vision-language representation learning in 3D MRI. MedMAP comprises a modality-aware vision-language alignment stage and a fine-tuning stage for multi-organ abnormality detection. During pre-training, the modality-aware encoders implicitly capture the joint modality distribution and improve the alignment between visual and textual representations. We then fine-tune the pre-trained vision encoders, keeping the text encoder frozen, for downstream tasks. To this end, we curated MedMoM-MRI3D, a dataset of 7,392 3D MRI volume-report pairs spanning twelve MRI modalities and nine abnormality types, tailored for a range of 3D medical analysis tasks. Extensive experiments ...
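The abstract gives no implementation details, but a CLIP-style two-stage recipe is one plausible reading of the pipeline it describes. The PyTorch sketch below is a hypothetical illustration, not the authors' method: the module names, the learned per-modality embedding used to make the vision encoder "modality-aware", and the symmetric contrastive loss are all assumptions. It shows alignment pre-training over volume-report pairs followed by fine-tuning with the text encoder frozen, mirroring the two stages described above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAwareVLM(nn.Module):
    """Hypothetical sketch of a modality-aware vision-language model:
    a 3D vision encoder conditioned on the MRI modality (e.g. T1, T2,
    FLAIR) and a report encoder, aligned with a CLIP-style loss."""
    def __init__(self, vision_encoder, text_encoder, num_modalities=12, dim=512):
        super().__init__()
        self.vision_encoder = vision_encoder   # 3D backbone -> (B, dim)
        self.text_encoder = text_encoder       # report encoder -> (B, dim)
        # One crude way to be "modality-aware": a learned embedding per
        # MRI modality, added to the visual feature (an assumption).
        self.modality_embed = nn.Embedding(num_modalities, dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # log(1/0.07)

    def forward(self, volumes, modality_ids, report_tokens):
        v = self.vision_encoder(volumes) + self.modality_embed(modality_ids)
        t = self.text_encoder(report_tokens)
        v, t = F.normalize(v, dim=-1), F.normalize(t, dim=-1)
        return v, t, self.logit_scale.exp()

def clip_loss(v, t, scale):
    # Symmetric InfoNCE over a batch of paired volumes and reports.
    logits = scale * v @ t.t()
    labels = torch.arange(v.size(0), device=v.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

def prepare_for_finetune(model, num_abnormalities=9):
    # Fine-tuning stage: freeze the text encoder, keep the vision
    # encoder trainable, and attach a detection head for the nine
    # abnormality types (head shape is an assumption).
    for p in model.text_encoder.parameters():
        p.requires_grad = False
    return nn.Linear(model.modality_embed.embedding_dim, num_abnormalities)

Freezing the text encoder after alignment is a common design choice in medical VLM fine-tuning: it preserves the report-grounded embedding space while only the visual pathway adapts to the detection task.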