[2603.01096] Unified Vision-Language Modeling via Concept Space Alignment
Computer Science > Computer Vision and Pattern Recognition

arXiv:2603.01096 (cs)
[Submitted on 1 Mar 2026]

Title: Unified Vision-Language Modeling via Concept Space Alignment
Authors: Yifu Qiu, Paul-Ambroise Duquenne, Holger Schwenk

Abstract: We introduce V-SONAR, a vision-language embedding space extended from the text-only embedding space SONAR (Omnilingual Embeddings Team et al., 2026), which supports 1500 text languages and 177 speech languages. To construct V-SONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space. We thoroughly evaluate V-SONAR and show that its embeddings achieve competitive performance on text-to-video retrieval. Equipped with the OMNISONAR text decoder, V-SONAR further surpasses state-of-the-art vision-language models on video captioning tasks, including DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0). Leveraging V-SONAR, we first demonstrate that the Large Concept Model (LCM; LCM Team et al., 2024), operating in SONAR and trained on English text only, can perform both single- and multi-visual concept understanding in a zero-shot manner. Finally, we introduce V-LCM, which extends the LCM with vision-language instruction tuning. V-LCM encodes vision and language inputs into a unified sequence of latent embedding...
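The abstract only names the post-hoc alignment pipeline without specifying it. As a rough illustration of what such post-hoc alignment can look like, the sketch below trains a small projection head that maps frozen vision-encoder features toward precomputed SONAR embeddings of paired captions. Everything here is an assumption for illustration: the `AlignmentHead` name, the feature dimensions, and the cosine objective are not taken from the paper, and random tensors stand in for both encoders.

```python
# Minimal sketch of post-hoc alignment into a text embedding space.
# Hypothetical: AlignmentHead, the 768/1024 dims, and the cosine loss
# are illustrative assumptions, not the paper's actual recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    """Projects pooled vision-encoder features into a (hypothetical) 1024-d SONAR space."""
    def __init__(self, vision_dim: int = 768, sonar_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, sonar_dim),
            nn.GELU(),
            nn.Linear(sonar_dim, sonar_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_feats)

head = AlignmentHead()
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)

# Stand-ins: frozen vision-encoder outputs and SONAR caption embeddings
# (the latter would be computed offline with the SONAR text encoder).
vision_feats = torch.randn(32, 768)
sonar_targets = torch.randn(32, 1024)

# One training step: 1 - cosine similarity pulls each projected
# visual feature toward its paired caption embedding.
opt.zero_grad()
pred = head(vision_feats)
loss = (1 - F.cosine_similarity(pred, sonar_targets, dim=-1)).mean()
loss.backward()
opt.step()
print(f"alignment loss: {loss.item():.4f}")
```

Because only the projection head is trained while both encoders stay frozen, this kind of pipeline leaves the original SONAR space intact, which is consistent with the abstract's claim that a text-only LCM can consume the aligned visual embeddings zero-shot.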