[2507.22264] SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
Computer Science > Computer Vision and Pattern Recognition
arXiv:2507.22264 (cs)
[Submitted on 29 Jul 2025 (v1), last revised 3 Apr 2026 (this version, v2)]

Title: SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
Authors: Shaoan Xie, Lingjing Kong, Yujia Zheng, Yu Yao, Zeyu Tang, Eric P. Xing, Guangyi Chen, Kun Zhang

Abstract: Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has emerged as a pivotal model in computer vision and multimodal learning, achieving state-of-the-art performance at aligning visual and textual representations through contrastive learning. However, CLIP struggles with potential information misalignment in many image-text datasets and suffers from entangled representations. On the one hand, short captions for a single image in datasets such as MSCOCO may describe disjoint regions of the image, leaving the model uncertain about which visual features to retain or discard. On the other hand, directly aligning long captions with images can lead to the retention of entangled details, preventing the model from learning disentangled, atomic concepts -- ultimately limiting its generalization on certain downstream tasks involving short prompts. In this paper, we establish theoretical conditions that enable flexible alignment between textual and visual representatio...
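For context on the objective the abstract refers to, the following is a minimal sketch of the standard symmetric contrastive (InfoNCE) loss used by CLIP, written in PyTorch. The function name `clip_contrastive_loss`, the temperature value, and the toy batch sizes are illustrative assumptions; this shows the baseline CLIP objective the paper critiques, not SmartCLIP's modular alignment method.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Each image embedding is pulled toward its paired caption and pushed
    away from every other caption in the batch (and vice versa). Because
    one image gets a single target embedding, an image with multiple
    disjoint captions receives conflicting signals -- the misalignment
    issue the abstract describes.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # The matching caption for image i sits at column i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image->text and text->image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    # Toy example: a batch of 8 paired embeddings of dimension 512.
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_contrastive_loss(img, txt).item())
```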