[2602.23353] SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport
Summary
The paper introduces SOTAlign, a semi-supervised framework for aligning unimodal vision and language models using minimal paired data and large unpaired datasets, outperforming existing methods.
Why It Matters
This research addresses the challenge of aligning vision and language models with limited supervision, which is crucial for improving AI systems that rely on multimodal data. The findings could lead to more efficient model training and better performance in applications such as image captioning and visual question answering.
Key Takeaways
- SOTAlign utilizes a two-stage framework for model alignment.
- It effectively leverages unpaired data to enhance model performance.
- The method significantly outperforms both supervised and semi-supervised baselines.
Computer Science > Machine Learning
arXiv:2602.23353 (cs)
[Submitted on 26 Feb 2026]

Title: SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport
Authors: Simon Roschmann, Paul Krzakala, Sonia Mazelet, Quentin Bouniot, Zeynep Akata

Abstract: The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers, but typically relies on contrastive losses and millions of paired samples. In this work, we ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image-text pairs together with large amounts of unpaired data. To address this challenge, we propose SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, then refines the alignment on unpaired samples via an optimal-transport-based divergence that transfers relational structure without overconstraining the target space. Unlike existing semi-supervised methods, SOTAl...