[2603.17246] On the Cone Effect and Modality Gap in Medical Vision-Language Embeddings
Computer Science > Machine Learning
arXiv:2603.17246 (cs)
[Submitted on 18 Mar 2026 (v1), last revised 20 Mar 2026 (this version, v2)]

Title: On the Cone Effect and Modality Gap in Medical Vision-Language Embeddings
Authors: David Restrepo, Miguel L Martins, Chenwei Wu, Luis Filipe Nakayama, Diego M Lopez, Stergios Christodoulidis, Maria Vakalopoulou, Enzo Ferrante

Abstract: Vision-Language Models (VLMs) exhibit a characteristic "cone effect": nonlinear encoders map embeddings into highly concentrated regions of the representation space, contributing to the cross-modal separation known as the modality gap. While this phenomenon has been widely observed, its practical impact on supervised multimodal learning, particularly in medical domains, remains unclear. In this work, we introduce a lightweight post-hoc mechanism that keeps pretrained VLM encoders frozen while continuously controlling cross-modal separation through a single hyperparameter λ. This enables systematic analysis of how the modality gap affects downstream multimodal performance without expensive retraining. We evaluate generalist (CLIP, SigLIP) and medically specialized (BioMedCLIP, MedSigLIP) models across diverse medical and natural datasets in a supervised multimodal setting. Results consistently show that reducing excessi...
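The abstract does not specify the exact form of the post-hoc mechanism, but a common way to modulate the modality gap without retraining is to shift each modality's (frozen) embeddings along the inter-modal centroid-difference direction and renormalize. The sketch below illustrates that idea with a single interpolation parameter `lam` standing in for the paper's λ; the function name and parameterization are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def control_modality_gap(img_emb, txt_emb, lam):
    """Hypothetical post-hoc gap control for frozen VLM embeddings.

    img_emb, txt_emb: arrays of shape (N, d), L2-normalized rows
    (as produced by CLIP/SigLIP-style encoders).
    lam = 0.0 leaves embeddings unchanged; lam = 1.0 removes the
    centroid offset between modalities before renormalizing.
    """
    # Gap vector: difference between the two modality centroids.
    gap = img_emb.mean(axis=0) - txt_emb.mean(axis=0)

    # Move each modality halfway toward the other, scaled by lam.
    img_shifted = img_emb - 0.5 * lam * gap
    txt_shifted = txt_emb + 0.5 * lam * gap

    # Re-project onto the unit hypersphere, matching the encoders'
    # output convention.
    img_shifted /= np.linalg.norm(img_shifted, axis=1, keepdims=True)
    txt_shifted /= np.linalg.norm(txt_shifted, axis=1, keepdims=True)
    return img_shifted, txt_shifted
```

Because the encoders stay frozen and only this cheap vector arithmetic changes, λ can be swept to study how gap size affects downstream performance, which is the kind of systematic analysis the abstract describes.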