[2602.19605] CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning
Summary
The paper presents CLCR, a novel approach for multimodal learning that organizes features into a three-level semantic hierarchy to enhance representation quality and reduce semantic misalignment.
Why It Matters
As multimodal learning becomes increasingly central to AI applications, semantic misalignment and degraded feature representations remain key obstacles. By constraining how modalities interact at each semantic level, CLCR's framework could improve performance on a range of multimodal tasks, making it relevant for researchers and practitioners in the field.
Key Takeaways
- CLCR introduces a three-level semantic hierarchy for multimodal data.
- The model enhances feature alignment and reduces error propagation.
- Intra-Level and Inter-Level mechanisms ensure effective cross-modal interactions.
- Empirical results show strong performance across multiple benchmarks.
- The approach is applicable to diverse tasks such as emotion recognition and sentiment analysis.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.19605 (cs) [Submitted on 23 Feb 2026]
Title: CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning
Authors: Chunlei Meng, Guanhong Huang, Rong Fu, Runmin Jian, Zhongxue Gan, Chun Ouyang
Abstract: Multimodal learning aims to capture both shared and private information from multiple modalities. However, existing methods that project all modalities into a single latent space for fusion often overlook the asynchronous, multi-level semantic structure of multimodal data. This oversight induces semantic misalignment and error propagation, thereby degrading representation quality. To address this issue, we propose Cross-Level Co-Representation (CLCR), which explicitly organizes each modality's features into a three-level semantic hierarchy and specifies level-wise constraints for cross-modal interactions. First, a semantic hierarchy encoder aligns shallow, mid, and deep features across modalities, establishing a common basis for interaction. Then, at each level, an Intra-Level Co-Exchange Domain (IntraCED) factorizes features into shared and private subspaces and restricts cross-modal attention to the shared subspace via a learnable token budget. This design ensures that only shared semantics are exchanged and p...
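The IntraCED idea described in the abstract, factorizing each modality's features into shared and private subspaces and letting cross-modal attention see only a budgeted set of shared tokens, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the projection matrices `W_shared`/`W_private` stand in for the learned factorization, and the norm-based top-k selection is a hypothetical stand-in for the learnable token budget.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # feature dimension

# Toy token features for two modalities: (num_tokens, dim)
text = rng.standard_normal((10, d))
audio = rng.standard_normal((8, d))

# Hypothetical factorization into shared vs. private subspaces
# (in the paper this would be learned; here: fixed random projections)
W_shared = rng.standard_normal((d, d)) / np.sqrt(d)
W_private = rng.standard_normal((d, d)) / np.sqrt(d)

def factorize(x):
    """Split features into shared and private components."""
    return x @ W_shared, x @ W_private

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

text_sh, text_pr = factorize(text)
audio_sh, audio_pr = factorize(audio)

# Token budget: expose only k shared audio tokens to the other modality
# (top-k by norm is an illustrative proxy for a learnable budget)
k = 4
keep = np.argsort(np.linalg.norm(audio_sh, axis=1))[-k:]
audio_budget = audio_sh[keep]                      # (k, d)

# Cross-modal attention restricted to the shared subspace:
# text queries attend only to the budgeted shared audio tokens
attn = softmax(text_sh @ audio_budget.T / np.sqrt(d))  # (10, k)
exchanged = attn @ audio_budget                        # (10, d)

# Private features bypass the exchange and are fused afterwards,
# so modality-specific information is preserved
fused_text = text_pr + exchanged                       # (10, d)
```

The key property the sketch captures is that only shared-subspace tokens participate in the exchange, so private features cannot propagate misaligned semantics across modalities.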