[2602.23353] SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

arXiv - AI

Summary

The paper introduces SOTAlign, a semi-supervised framework that aligns frozen unimodal vision and language models using a small set of image-text pairs together with large unpaired datasets, outperforming both supervised and semi-supervised baselines.

Why It Matters

This research addresses the challenge of aligning vision and language models with limited supervision, which is crucial for improving AI systems that rely on multimodal data. The findings could lead to more efficient model training and better performance in applications such as image captioning and visual question answering.

Key Takeaways

  • SOTAlign uses a two-stage framework: a linear teacher fit on limited paired data, followed by an optimal-transport-based refinement on unpaired data.
  • It effectively leverages large amounts of unpaired data to enhance alignment quality.
  • The method significantly outperforms both supervised and semi-supervised baselines.
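
The first stage described above (a linear teacher fit on a small number of pairs) can be sketched as an ordinary least-squares problem. Everything below is an illustrative assumption — the dimensions, toy data, and fitting procedure are not taken from the paper:

```python
import numpy as np

# Toy sketch of a "linear teacher": fit a linear map from frozen image
# embeddings to frozen text embeddings using only a few paired samples.
# All sizes and data here are made up for illustration.
rng = np.random.default_rng(0)
d_img, d_txt, n_pairs = 16, 12, 32

X = rng.normal(size=(n_pairs, d_img))                      # image embeddings
W_true = rng.normal(size=(d_img, d_txt))                   # hidden ground-truth map
Y = X @ W_true + 0.01 * rng.normal(size=(n_pairs, d_txt))  # paired text embeddings

# Least-squares fit of the alignment layer: min_W ||X W - Y||^2
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

residual = np.linalg.norm(X @ W - Y) / np.linalg.norm(Y)
print(f"relative residual: {residual:.4f}")
```

With only a handful of pairs, such a map can recover a coarse shared geometry; the paper's second stage then refines it on unpaired data.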

Computer Science > Machine Learning
arXiv:2602.23353 (cs) · Submitted on 26 Feb 2026

Title: SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport
Authors: Simon Roschmann, Paul Krzakala, Sonia Mazelet, Quentin Bouniot, Zeynep Akata

Abstract: The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers, but typically relies on contrastive losses and millions of paired samples. In this work, we ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image-text pairs together with large amounts of unpaired data. To address this challenge, we propose SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, then refines the alignment on unpaired samples via an optimal-transport-based divergence that transfers relational structure without overconstraining the target space. Unlike existing semi-supervised methods, SOTAl...
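
The second stage uses an optimal-transport-based divergence on unpaired samples. The abstract does not specify the exact divergence, so the sketch below substitutes a standard entropic OT (Sinkhorn) loss between two unpaired embedding batches; the regularization value, cost function, and batch sizes are all illustrative assumptions:

```python
import numpy as np

def sinkhorn_plan(C, reg=0.1, n_iter=200):
    """Entropic OT plan between uniform marginals for cost matrix C (Sinkhorn iterations)."""
    n, m = C.shape
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform source/target marginals
    K = np.exp(-C / reg)                    # Gibbs kernel
    v = np.ones(m) / m
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]      # transport plan P

rng = np.random.default_rng(1)
n = 8
Z_img = rng.normal(size=(n, 6))   # image embeddings after the linear teacher (toy)
Z_txt = rng.normal(size=(n, 6))   # unpaired text embeddings (toy)

# Pairwise squared-Euclidean cost between the two unpaired batches,
# normalized so the entropic kernel stays well-conditioned.
C = ((Z_img[:, None, :] - Z_txt[None, :, :]) ** 2).sum(-1)
C = C / C.max()

P = sinkhorn_plan(C)
ot_divergence = (P * C).sum()     # transport cost, usable as an alignment loss
print(f"OT divergence: {ot_divergence:.4f}")
```

In a training loop this scalar would be differentiated through the alignment layer; how SOTAlign transfers relational structure without overconstraining the target space is detailed in the paper itself, not reproduced here.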
