[2603.02767] ITO: Images and Texts as One via Synergizing Multiple

[2603.02767] ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion

arXiv - AI March 04, 2026 3 min read

About this article

Abstract page for arXiv paper 2603.02767: ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion

Computer Science > Computer Vision and Pattern Recognition arXiv:2603.02767 (cs) [Submitted on 3 Mar 2026] Title:ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion Authors:HanZpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zonglin Zhao, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He View a PDF of the paper titled ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion, by HanZpeng Liu and 8 other authors View PDF HTML (experimental) Abstract:Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer -- eliminating the modality gap and stabilizing training dynamics to prevent the early...

Originally published on March 04, 2026. Curated by AI News.

Llms

CLI for Google AI Search (gai.google) — run AI-powered code/tech searches headlessly from your terminal

Google AI (gai.google) gives Gemini-powered answers for technical queries — think AI-enhanced search with code understanding. I built a C...

Reddit - Artificial Intelligence · 1 min · about 1 hour ago

Machine Learning

Big increase in the amount of people using AI to write their replies with AI

I find it interesting that we’ve all randomly decided to use the “-“ more often recently on reddit, and everyone’s grammar has drasticall...

Reddit - Artificial Intelligence · 1 min · about 1 hour ago

Machine Learning

[D] MXFP8 GEMM: Up to 99% of cuBLAS performance using CUDA + PTX

New blog post by Daniel Vega-Myhre (Meta/PyTorch) illustrating GEMM design for FP8, including deep-dives into all the constraints and des...

Reddit - Machine Learning · 1 min · about 2 hours ago

Machine Learning

IIT Delhi launches 8th batch of Advanced AI, ML, and DL online programme: Check who is eligible, applicat

News News: The Continuing Education Programme (CEP) at IIT Delhi has announced the launch of the 8th batch of its Advanced Certificate Pr...

AI News - General · 9 min · about 2 hours ago

[2603.02767] ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion

About this article

Related Articles

CLI for Google AI Search (gai.google) — run AI-powered code/tech searches headlessly from your terminal

Big increase in the amount of people using AI to write their replies with AI

[D] MXFP8 GEMM: Up to 99% of cuBLAS performance using CUDA + PTX

IIT Delhi launches 8th batch of Advanced AI, ML, and DL online programme: Check who is eligible, applicat

No comments

Stay updated with AI News