[2503.04812] LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning
Computer Science > Computer Vision and Pattern Recognition
arXiv:2503.04812 (cs)
[Submitted on 4 Mar 2025 (v1), last revised 1 Mar 2026 (this version, v2)]

Title: LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning
Authors: Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su

Abstract: Universal multimodal embedding models play a critical role in tasks such as interleaved image-text retrieval, multimodal RAG, and multimodal clustering. However, our empirical results indicate that existing LMM-based embedding models trained with the standard InfoNCE loss exhibit a high degree of overlap between the similarity distributions of positive and negative pairs, making it difficult to distinguish hard negative pairs effectively. To address this issue, we propose a simple yet effective framework that dynamically improves the embedding model's representation learning for negative pairs based on their discriminative difficulty. Within this framework, we train a series of models, named LLaVE, and evaluate them on the MMEB benchmark, which covers 4 meta-tasks and 36 datasets. Experimental results show that LLaVE establishes stronger baselines that achieve state-of-the-art (SOTA) performance while demonstrating strong scalability and efficiency. Specifically...
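The abstract's core idea is to re-weight negative pairs in the contrastive loss according to how hard they are to distinguish from positives. The sketch below is a minimal, illustrative NumPy version of such a hardness-weighted InfoNCE, not the paper's exact formulation: the weighting scheme (weights proportional to exp(beta * similarity), normalized to mean 1 over negatives) and the hyperparameters `tau` and `beta` are assumptions for illustration.

```python
import numpy as np

def hardness_weighted_infonce(sim, tau=0.07, beta=2.0):
    """Illustrative hardness-weighted InfoNCE (not the paper's exact loss).

    sim : (B, B) query-candidate cosine similarities; the diagonal entries
          are the positive pairs, off-diagonal entries are negatives.
    tau : temperature (illustrative value).
    beta: hardness sharpness; beta = 0 recovers the standard InfoNCE loss.
    """
    B = sim.shape[0]
    eye = np.eye(B, dtype=bool)
    # Hardness weights over negatives: more similar (harder) negatives get
    # larger weight; normalized to mean 1 so the loss scale stays comparable
    # to the uniform-weight (plain InfoNCE) case.
    w = np.where(~eye, np.exp(beta * sim), 0.0)
    w = w * (B - 1) / w.sum(axis=1, keepdims=True)
    exp_logits = np.exp(sim / tau)
    pos = exp_logits[eye]                      # positive term per query
    denom = pos + (w * exp_logits).sum(axis=1)  # weighted negatives
    return float(np.mean(-np.log(pos / denom)))
```

Because the weights increase with similarity, hard negatives contribute more to the denominator (and thus to the gradient) than under uniform weighting, which is the qualitative behavior the abstract describes.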