[2511.21331] The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
Computer Science > Computer Vision and Pattern Recognition
arXiv:2511.21331 (cs)
[Submitted on 26 Nov 2025 (v1), last revised 3 Apr 2026 (this version, v2)]

Title: The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

Authors: Stefanos Koutoupis, Michaela Areti Zervou, Konstantinos Kontras, Maarten De Vos, Panagiotis Tsakalides, Grigorios Tsagkatakis

Abstract: Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies...
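The objective described in the abstract, pairwise contrastive terms plus a fused-modality term that aligns a fused pair with a third modality, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes three modalities, a symmetric InfoNCE loss for every term, and a placeholder fusion operator (a normalized sum); the names `info_nce`, `fuse`, and `confu_style_loss`, as well as the `temperature` and `fused_weight` parameters, are hypothetical and do not come from the paper, which may use a learned fusion module and a different weighting.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE between two batches; row i of `a` matches row i of `b`."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy_on_diagonal(l):
        # Numerically stable log-softmax over each row, then pick the matching pair.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the a->b and b->a directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def fuse(a, b):
    """Placeholder fusion (assumption): normalized sum of two embeddings."""
    return l2_normalize(a + b)


def confu_style_loss(z1, z2, z3, fused_weight=1.0, temperature=0.1):
    """Pairwise contrastive terms plus fused-pair-vs-third-modality terms."""
    pairwise = (info_nce(z1, z2, temperature)
                + info_nce(z1, z3, temperature)
                + info_nce(z2, z3, temperature))
    fused = (info_nce(fuse(z1, z2), z3, temperature)
             + info_nce(fuse(z1, z3), z2, temperature)
             + info_nce(fuse(z2, z3), z1, temperature))
    return pairwise + fused_weight * fused


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical batch of 8 samples with 16-dimensional embeddings per modality.
    z1, z2, z3 = (rng.normal(size=(8, 16)) for _ in range(3))
    print("loss on unrelated embeddings:", confu_style_loss(z1, z2, z3))
    print("loss on identical embeddings:", confu_style_loss(z1, z1, z1))
```

Setting `fused_weight=0.0` reduces the sketch to a purely pairwise objective, which makes it easy to see what the extra fused term contributes in this simplified setting.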