[2602.16019] MedProbCLIP: Probabilistic Adaptation of Vision-Language Foundation Model for Reliable Radiograph-Report Retrieval
Summary
The paper presents MedProbCLIP, a probabilistic framework that improves the reliability of radiograph-report retrieval with vision-language models; it outperforms existing deterministic retrieval methods.
Why It Matters
This research addresses the critical need for reliable image-text retrieval systems in healthcare, particularly in radiology, where accuracy and trustworthiness are paramount. By introducing a probabilistic approach, the authors aim to improve clinical outcomes and reduce risks associated with misinterpretations.
Key Takeaways
- MedProbCLIP utilizes probabilistic embeddings to enhance reliability in radiology report retrieval.
- The framework outperforms existing deterministic models in accuracy and robustness.
- Incorporates multi-view and multi-section encoding for improved clinical alignment.
- Demonstrates superior calibration and risk-coverage behavior.
- Addresses the need for trustworthy AI applications in high-stakes biomedical contexts.
arXiv Listing
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.16019 (cs)
Submitted on 17 Feb 2026
Title: MedProbCLIP: Probabilistic Adaptation of Vision-Language Foundation Model for Reliable Radiograph-Report Retrieval
Authors: Ahmad Elallaf, Yu Zhang, Yuktha Priya Masupalli, Jeong Yang, Young Lee, Zechun Cao, Gongbo Liang
Abstract: Vision-language foundation models have emerged as powerful general-purpose representation learners with strong potential for multimodal understanding, but their deterministic embeddings often fail to provide the reliability required for high-stakes biomedical applications. This work introduces MedProbCLIP, a probabilistic vision-language learning framework for chest X-ray and radiology report representation learning and bidirectional retrieval. MedProbCLIP models image and text representations as Gaussian embeddings through a probabilistic contrastive objective that explicitly captures uncertainty and many-to-many correspondences between radiographs and clinical narratives. A variational information bottleneck mitigates overconfident predictions, while MedProbCLIP employs multi-view radiograph encoding and multi-section report encoding during training to provide fine-grained supervision for clinically aligned correspondence, ye...
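The abstract's core ingredients, Gaussian image/text embeddings trained with a contrastive objective and regularized by a variational information bottleneck, can be illustrated with a generic sketch. This is not the authors' implementation: the reparameterized sampling, InfoNCE loss, and KL-to-standard-normal regularizer below are standard components assumed for illustration, and all function names and the toy batch are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gaussian(mu, log_var, rng):
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def info_nce(z_img, z_txt, temperature=0.07):
    # Cosine-similarity InfoNCE over matched image/report pairs:
    # each image should score highest against its own report.
    z_img = z_img / np.linalg.norm(z_img, axis=1, keepdims=True)
    z_txt = z_txt / np.linalg.norm(z_txt, axis=1, keepdims=True)
    logits = z_img @ z_txt.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def kl_to_standard_normal(mu, log_var):
    # VIB-style regularizer: KL(N(mu, sigma^2) || N(0, I)), mean over batch.
    return 0.5 * np.mean(np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1))

# Toy batch: 4 image/report pairs, each encoded as an 8-dim Gaussian
# (mean plus log-variance), standing in for the encoders' outputs.
mu_img, lv_img = rng.standard_normal((4, 8)), rng.standard_normal((4, 8)) * 0.1
mu_txt, lv_txt = rng.standard_normal((4, 8)), rng.standard_normal((4, 8)) * 0.1

z_i = sample_gaussian(mu_img, lv_img, rng)
z_t = sample_gaussian(mu_txt, lv_txt, rng)
loss = info_nce(z_i, z_t) + 1e-3 * (
    kl_to_standard_normal(mu_img, lv_img) + kl_to_standard_normal(mu_txt, lv_txt)
)
print("total loss:", float(loss))
```

The KL term pulls each predicted distribution toward a broad prior, which is one standard way to discourage the overconfident (near-deterministic) embeddings the abstract warns about; the paper's exact objective and weighting may differ.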