[2603.01471] Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality
Computer Science > Information Retrieval
arXiv:2603.01471 (cs) [Submitted on 2 Mar 2026]

Title: Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality
Authors: Jiahan Chen, Da Li, Hengran Zhang, Yinqiong Cai, Lixin Su, Jiafeng Guo, Daiting Shi, Dawei Yin, Keping Bi

Abstract: Multimodal embedding models, rooted in multimodal large language models (MLLMs), have yielded significant performance improvements across diverse tasks such as retrieval and classification. However, most existing approaches rely heavily on large-scale contrastive learning, with limited exploration of how the architectural and training paradigms of MLLMs affect embedding quality. While effective for generation, the causal attention and next-token prediction paradigm of MLLMs does not explicitly encourage the formation of globally compact representations, limiting their effectiveness as multimodal embedding backbones. To address this, we propose CoCoA, a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization. Specifically, we restructure the attention flow and introduce an EOS-based reconstruction task, encouraging the model to reconstruct the input from the corresponding <EOS> embeddings. This drives the multimodal model to compres...
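The core idea described in the abstract — forcing a sequence model to compress its input into the final <EOS> embedding by reconstructing the input from that single vector — can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation: the module sizes, the bidirectional encoder standing in for the restructured attention flow, and the linear reconstruction head are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EOSReconstructionSketch(nn.Module):
    """Toy sketch of an EOS-based reconstruction objective.

    A bidirectional Transformer encoder (standing in for the restructured,
    non-causal attention flow) produces hidden states; the last position is
    treated as the <EOS> embedding, and a decoder head must reconstruct the
    full input embedding sequence from that single vector. Minimizing the
    reconstruction loss pressures the <EOS> vector to summarize the input.
    """

    def __init__(self, vocab_size: int = 100, d_model: int = 32, seq_len: int = 8):
        super().__init__()
        self.seq_len = seq_len
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Hypothetical reconstruction head: one EOS vector -> full sequence.
        self.decoder = nn.Linear(d_model, seq_len * d_model)

    def forward(self, tokens: torch.Tensor):
        x = self.embed(tokens)                     # (B, L, D) input embeddings
        h = self.encoder(x)                        # bidirectional attention
        eos = h[:, -1, :]                          # <EOS> assumed at last slot
        recon = self.decoder(eos).view(-1, self.seq_len, x.size(-1))
        # Reconstruct the (detached) input embeddings from <EOS> alone.
        loss = F.mse_loss(recon, x.detach())
        return eos, loss

model = EOSReconstructionSketch()
tokens = torch.randint(0, 100, (2, 8))            # batch of 2 toy sequences
eos_emb, recon_loss = model(tokens)               # eos_emb: (2, 32) embedding
```

In a full setup, this reconstruction loss would be combined with the usual contrastive objective, so the <EOS> embedding is shaped both by compression pressure and by retrieval alignment.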