[2502.17028] Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
Summary
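The paper presents CS-Aligner, a framework for vision-language alignment that combines Cauchy-Schwarz (CS) divergence with mutual information. It targets two limitations of InfoNCE-based methods such as CLIP: they align paired samples while overlooking distributional differences between modalities, and they suffer from a conflict between alignment and uniformity.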
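For reference, the CS divergence named above is usually defined as follows in the information-theoretic learning literature. This is the textbook definition, not a formula taken from this page; reading p and q as the densities of image and text embeddings is an assumption made here for illustration, and the paper's exact objective may differ.

```latex
% Cauchy-Schwarz divergence between two densities p and q
% (p, q: assumed here to be the image- and text-embedding distributions)
D_{\mathrm{CS}}(p;\,q)
  = -\log \frac{\left(\int p(x)\,q(x)\,dx\right)^{2}}
               {\int p(x)^{2}\,dx \,\int q(x)^{2}\,dx}
```

By the Cauchy-Schwarz inequality the ratio lies in (0, 1], so the divergence is non-negative, symmetric in p and q, and zero exactly when p = q.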
Why It Matters
The paper targets the alignment-uniformity conflict that InfoNCE exhibits in multimodal learning, which the authors identify as a cause of suboptimal alignment and modality gaps. Tighter vision-language alignment benefits downstream tasks such as text-to-image generation and cross-modal retrieval.
Key Takeaways
- CS-Aligner improves vision-language alignment by integrating Cauchy-Schwarz divergence with mutual information.
- The framework captures both the global distribution of each modality and pairwise semantic relationships.
- CS divergence addresses InfoNCE's alignment-uniformity conflict and plays a role complementary to InfoNCE, yielding tighter and more precise alignment.
- Experiments demonstrate its effectiveness in text-to-image generation and cross-modality retrieval.
- The approach allows for the incorporation of unpaired data, enhancing flexibility in alignment.
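The takeaways above can be made concrete with a small sketch. It is not the authors' implementation: it uses the standard plug-in kernel estimator of CS divergence, a plain symmetric InfoNCE, and a hypothetical weight `lam` to combine them. The Gaussian kernel, the bandwidth `sigma`, the `temperature`, and all function names are assumptions chosen for illustration.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm (left unchanged if the norm is zero)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma=1.0):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def cs_divergence(xs, ys, sigma=1.0):
    """Plug-in estimate: log mean k(x,x') + log mean k(y,y') - 2 log mean k(x,y).

    Only the two sample sets are compared, so xs and ys need not be
    paired or even have the same size.
    """
    return (math.log(mean_kernel(xs, xs, sigma))
            + math.log(mean_kernel(ys, ys, sigma))
            - 2.0 * math.log(mean_kernel(xs, ys, sigma)))


def info_nce(img, txt, temperature=0.1):
    """Symmetric InfoNCE; row i of img is assumed to be paired with row i of txt."""
    n = len(img)
    logits = [[sum(a * b for a, b in zip(u, v)) / temperature for v in txt]
              for u in img]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], using a stable log-sum-exp.
        loss = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            lse = m + math.log(sum(math.exp(z - m) for z in row))
            loss += lse - row[i]
        return loss / n

    columns = [list(c) for c in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


# Toy data: text embeddings are a shifted copy of the image embeddings,
# which creates a synthetic gap between the two distributions.
random.seed(0)
dim, n = 4, 16
img = [normalize([random.gauss(0.0, 1.0) for _ in range(dim)]) for _ in range(n)]
txt = [normalize([a + 0.5 for a in u]) for u in img]

lam = 1.0  # hypothetical weight on the distributional term
loss = info_nce(img, txt) + lam * cs_divergence(img, txt)
print("InfoNCE:", info_nce(img, txt))
print("CS divergence:", cs_divergence(img, txt))
print("combined:", loss)
```

Because `cs_divergence` never indexes matching rows, it can also be evaluated on image and text batches that are not paired, which is consistent with the takeaway about unpaired data. The pure-Python loops are quadratic in batch size and only suitable for toy inputs.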
arXiv:2502.17028 (cs), Computer Science > Machine Learning
Submitted on 24 Feb 2025 (v1), last revised 24 Feb 2026 (this version, v3)
Title: Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
Authors: Wenzhe Yin, Zehao Xiao, Pan Zhou, Shujian Yu, Jiayi Shen, Jan-Jakob Sonke, Efstratios Gavves
Abstract: Vision-language alignment is crucial for various downstream tasks such as cross-modal generation and retrieval. Previous multimodal approaches like CLIP utilize InfoNCE to maximize mutual information, primarily aligning pairwise samples across modalities while overlooking distributional differences. In addition, InfoNCE has inherent conflict in terms of alignment and uniformity in multimodality, leading to suboptimal alignment with modality gaps. To overcome the limitations, we propose CS-Aligner, a novel framework that performs distributional vision-language alignment by integrating Cauchy-Schwarz (CS) divergence with mutual information. CS-Aligner captures both the global distribution information of each modality and the pairwise semantic relationships. We find that the CS divergence seamlessly addresses the InfoNCE's alignment-uniformity conflict and serves complementary roles with InfoNCE, yielding tighter and more precise alignment. Moreover, by introducing distributional alignment, CS-Align...