[2510.10889] Topological Alignment of Shared Vision-Language Embedding Space
Computer Science > Computer Vision and Pattern Recognition

arXiv:2510.10889 (cs)

[Submitted on 13 Oct 2025 (v1), last revised 4 Mar 2026 (this version, v2)]

Title: Topological Alignment of Shared Vision-Language Embedding Space
Authors: Junwon You, Dasol Kang, Jae-Hun Jung

Abstract: Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot capabilities; however, their cross-modal alignment remains biased toward English due to limited multilingual multimodal data. Recent multilingual extensions alleviate this gap, but they enforce only instance-level alignment and neglect the global geometry of the shared embedding space. We address this problem by introducing ToMCLIP (Topological Alignment for Multilingual CLIP), a topology-aware framework that aligns embedding spaces under topology-preserving constraints. The proposed method applies persistent homology to define a topological alignment loss and approximates the persistence diagram, with theoretical error bounds, using a graph sparsification strategy. This work validates the proposed approach, showing enhanced structural coherence of multilingual representations, higher zero-shot accuracy on CIFAR-100, and stronger multilingual retrieval performance on xFlickr&CO. Beyond VLMs, the proposed approach provides a general method for incorporating topological a...
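As a rough, self-contained illustration (not the paper's implementation), the core idea of a topological alignment loss can be sketched in pure Python: the 0-dimensional persistence diagram of a point cloud under the Vietoris-Rips filtration has all births at 0 and deaths equal to the minimum-spanning-tree edge weights, so comparing two embedding clouds topologically reduces, in this toy case, to comparing their sorted MST edge weights. The function names `h0_deaths` and `topo_loss` are illustrative, not from the paper.

```python
import math
from itertools import combinations

def h0_deaths(points):
    """Finite H0 death times of a point cloud = MST edge weights.

    Uses Kruskal's algorithm with union-find: whenever an edge merges two
    connected components, one 0-dimensional homology class dies at that
    edge's length. Births are all 0 in the Vietoris-Rips filtration.
    """
    n = len(points)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # components merge: one H0 bar dies here
            parent[ri] = rj
            deaths.append(w)
    return sorted(deaths)     # n - 1 finite bars

def topo_loss(pts_a, pts_b):
    """Toy topological alignment loss between two equal-size clouds:
    L2 distance between their sorted H0 death vectors (a simplification
    of diagram matching when all births coincide at 0)."""
    da, db = h0_deaths(pts_a), h0_deaths(pts_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)))
```

In practice one would use a differentiable persistent-homology layer and a Wasserstein distance on diagrams, and the paper's graph sparsification would serve to thin the O(n^2) candidate edges before computing persistence; this sketch only shows why identically shaped embedding clouds incur zero loss.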