[2602.14983] Orthogonalized Multimodal Contrastive Learning with Asymmetric Masking for Structured Representations
Summary
The paper presents COrAL, a novel framework for multimodal contrastive learning that effectively separates redundant, unique, and synergistic information, enhancing representation quality.
Why It Matters
This research addresses key challenges in multimodal learning by improving how models capture and utilize different types of cross-modal information. By explicitly modeling synergistic interactions and reducing redundancy, the approach can improve downstream performance on tasks that depend on modality-specific or interaction-driven signals, rather than only on information shared across modalities.
Key Takeaways
- COrAL framework improves multimodal representation by disentangling information types.
- Asymmetric masking enhances the model's ability to infer cross-modal dependencies.
- The framework consistently outperforms state-of-the-art methods with lower performance variance.
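The asymmetric-masking idea in the takeaways above can be illustrated with a minimal sketch: mask a large fraction of one modality's tokens while leaving the other modality intact, so the model must reconstruct or align the masked modality from its counterpart. The function name, the 0.75 ratio, and the use of a zero vector as the mask token are illustrative assumptions, not details from the paper.

```python
import numpy as np

def asymmetric_mask(tokens_a, tokens_b, mask_ratio_a=0.75, rng=None):
    """Hypothetical asymmetric masking sketch: hide most of modality A's
    tokens while modality B stays fully visible, forcing cross-modal
    inference. Mask ratio and zero mask token are assumed, not from COrAL.

    tokens_a: (n_a, d) token embeddings of modality A (will be masked)
    tokens_b: (n_b, d) token embeddings of modality B (left intact)
    Returns (masked_a, masked_indices, tokens_b).
    """
    rng = rng or np.random.default_rng(0)
    n = tokens_a.shape[0]
    n_mask = int(round(mask_ratio_a * n))
    idx = rng.choice(n, size=n_mask, replace=False)  # positions to hide
    masked = tokens_a.copy()
    masked[idx] = 0.0  # replace selected tokens with a zero "mask token"
    return masked, idx, tokens_b
```

In practice the masked positions would be filled with a learned mask embedding and the two modalities would swap roles across training steps, but the asymmetry (one modality heavily masked, the other visible) is the key mechanism.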
Computer Science > Machine Learning
arXiv:2602.14983 (cs) [Submitted on 16 Feb 2026]
Title: Orthogonalized Multimodal Contrastive Learning with Asymmetric Masking for Structured Representations
Authors: Carolin Cissee, Raneen Younis, Zahra Ahmadi
Abstract: Multimodal learning seeks to integrate information from heterogeneous sources, where signals may be shared across modalities, specific to individual modalities, or emerge only through their interaction. While self-supervised multimodal contrastive learning has achieved remarkable progress, most existing methods predominantly capture redundant cross-modal signals, often neglecting modality-specific (unique) and interaction-driven (synergistic) information. Recent extensions broaden this perspective, yet they either fail to explicitly model synergistic interactions or learn different information components in an entangled manner, leading to incomplete representations and potential information leakage. We introduce COrAL, a principled framework that explicitly and simultaneously preserves redundant, unique, and synergistic information within multimodal representations. COrAL employs a dual-path architecture with orthogonality constraints to disentangle shared and modality-specific features, ensuring a clean separation of...
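The abstract's orthogonality constraint between shared and modality-specific features can be sketched as a simple penalty on the per-sample inner products of the two embedding paths: driving these inner products toward zero pushes the shared and unique representations into orthogonal directions. This is a generic orthogonality loss, an assumption for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def orthogonality_penalty(shared: np.ndarray, unique: np.ndarray) -> float:
    """Generic orthogonality penalty (assumed form, not COrAL's exact loss).

    shared, unique: (batch, dim) embeddings from the two paths of a
    dual-path encoder. Penalizes the squared per-sample inner product,
    which is zero exactly when each sample's shared and unique vectors
    are orthogonal.
    """
    dots = np.sum(shared * unique, axis=1)  # per-sample <s_i, u_i>
    return float(np.mean(dots ** 2))
```

Such a term would be added to the contrastive objective with a weighting coefficient, so the encoder is rewarded for routing redundant signal into the shared path and modality-specific signal into the unique path.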