[2602.17689] Robust Pre-Training of Medical Vision-and-Language Models with Domain-Invariant Multi-Modal Masked Reconstruction
Summary
This article presents Robust Multi-Modal Masked Reconstruction (Robust-MMR), a self-supervised pre-training framework for medical vision-language models that improves robustness to domain shift, yielding higher cross-domain accuracy on medical VQA, image-text classification, and retrieval benchmarks.
Why It Matters
The research addresses a critical gap in the robustness of medical vision-language models, which are increasingly used to support clinical reasoning. By maintaining performance across imaging devices, acquisition protocols, and reporting styles, this work has significant implications for real-world medical applications, potentially improving diagnostic accuracy and patient outcomes.
Key Takeaways
- Robust-MMR incorporates robustness objectives into pre-training for medical models.
- The framework shows improved accuracy in cross-domain medical tasks.
- Domain-invariant representations enhance model reliability for clinical applications.
- Robust-MMR outperforms existing methods in various medical benchmarks.
- The study highlights the importance of robustness in AI for healthcare.
Computer Science > Machine Learning
arXiv:2602.17689 (cs)
[Submitted on 6 Feb 2026]
Title: Robust Pre-Training of Medical Vision-and-Language Models with Domain-Invariant Multi-Modal Masked Reconstruction
Authors: Melika Filvantorkaman, Mohsen Piri
Abstract: Medical vision-language models show strong potential for joint reasoning over medical images and clinical text, but their performance often degrades under domain shift caused by variations in imaging devices, acquisition protocols, and reporting styles. Existing multi-modal pre-training methods largely overlook robustness, treating it as a downstream adaptation problem. In this work, we propose Robust Multi-Modal Masked Reconstruction (Robust-MMR), a self-supervised pre-training framework that explicitly incorporates robustness objectives into masked vision-language learning. Robust-MMR integrates asymmetric perturbation-aware masking, domain-consistency regularization, and modality-resilience constraints to encourage domain-invariant representations. We evaluate Robust-MMR on multiple medical vision-language benchmarks, including medical visual question answering (VQA-RAD, SLAKE, VQA-2019), cross-domain image-text classification (MELINDA), and robust image-text retrieval (ROCO). Robust-MMR achieves 78.9% cross-domain accu...
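The abstract names three objectives (masked reconstruction, domain-consistency regularization, and modality-resilience constraints) but the page gives no implementation details. As a rough illustration only, the toy numpy sketch below shows how a masked-reconstruction loss could be combined with a domain-consistency term that penalizes representation drift between a clean view and a perturbed view; the function names, the zero-token masking scheme, the linear stand-in encoder/decoder, and the 0.5 weighting are all assumptions, not the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_tokens(tokens, ratio, rng):
    """Randomly zero out a fraction of token embeddings (hypothetical masking scheme)."""
    mask = rng.random(tokens.shape[0]) < ratio
    corrupted = tokens.copy()
    corrupted[mask] = 0.0
    return corrupted, mask

def reconstruction_loss(pred, target, mask):
    """MSE over masked positions only, as in masked reconstruction objectives."""
    if not mask.any():
        return 0.0
    diff = pred[mask] - target[mask]
    return float(np.mean(diff ** 2))

def domain_consistency_loss(z_clean, z_perturbed):
    """Penalize distance between representations of clean and domain-shifted views."""
    return float(np.mean((z_clean - z_perturbed) ** 2))

# Toy "encoder": a fixed random linear map standing in for the real model.
W = rng.normal(size=(16, 8))
encode = lambda x: x @ W

tokens = rng.normal(size=(10, 16))                 # e.g. image patches or text tokens
corrupted, mask = mask_tokens(tokens, 0.4, rng)
z_clean = encode(tokens)
z_pert = encode(tokens + 0.05 * rng.normal(size=tokens.shape))  # simulated domain shift

recon = corrupted @ W @ W.T / W.shape[1]           # stand-in decoder
loss = reconstruction_loss(recon, tokens, mask) + 0.5 * domain_consistency_loss(z_clean, z_pert)
print(f"total loss: {loss:.4f}")
```

In an actual pre-training setup the encoder, decoder, and perturbation model would be learned networks, and the loss weighting would be a tuned hyperparameter; this sketch only makes the shape of the combined objective concrete.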