[2510.00523] VIRTUE: Visual-Interactive Text-Image Universal Embedder
Summary
The paper presents VIRTUE, a Visual-Interactive Text-Image Universal Embedder that adds visual-interactive capabilities (points, bounding boxes, masks) to multimodal embedding models, letting users target specific image regions while also improving performance on conventional embedding tasks.
Why It Matters
VIRTUE addresses a gap in existing embedding models, which, unlike generative models, cannot accept visual interactions. Supporting region-level prompts makes it easier for users to specify localized intent and for models to ground complex scenes, which matters as demand grows for more intuitive AI systems.
Key Takeaways
- VIRTUE integrates visual-interactive capabilities into embedding models.
- The model improves user interaction by letting users target specific regions within an image.
- It achieves state-of-the-art performance on multiple multimodal tasks.
- A new benchmark, SCaR, is introduced to evaluate its capabilities.
- This advancement opens new applications in AI that require localized user intent.
Computer Science > Artificial Intelligence
arXiv:2510.00523 (cs)
[Submitted on 1 Oct 2025 (v1), last revised 23 Feb 2026 (this version, v2)]
Title: VIRTUE: Visual-Interactive Text-Image Universal Embedder
Authors: Wei-Yao Wang, Kazuya Tateishi, Qiyu Wu, Shusuke Takahashi, Yuki Mitsufuji
Abstract: Multimodal representation learning models have demonstrated successful operation across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack visual-interactive capabilities to specify regions of interest from users (e.g., point, bounding box, mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions not only would unlock new applications with localized grounding of user intent, which remains unexplored, but also enable the models to learn entity-level information within images to complement their global representations for conventional embedding tasks. In this paper, we propose a novel Visual-InteRactive Text-Image Universal Embedder (VIRTUE) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation learning. In VIRTUE, the segmentation model can pr...
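To make the idea of region-level visual prompts concrete, the sketch below shows what a visual-interactive embedding call might look like. All names (`VisualPrompt`, `embed`) and the placeholder embedding logic are illustrative assumptions, not the paper's actual API: a real implementation would run a segmentation model on the prompt, pool entity-level features, and fuse them with the VLM's global features.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical sketch; names and signatures are assumptions, not VIRTUE's API.

@dataclass
class VisualPrompt:
    """A region-of-interest hint: set exactly one of point / box / mask."""
    point: Optional[Tuple[int, int]] = None          # (x, y) pixel coordinate
    box: Optional[Tuple[int, int, int, int]] = None  # (x1, y1, x2, y2)
    mask: Optional[List[List[int]]] = None           # binary mask, H x W

def embed(image_id: str, text: str,
          prompt: Optional[VisualPrompt] = None) -> List[float]:
    """Return a joint text-image embedding; a prompt localizes user intent.

    Placeholder body: derives a deterministic dummy vector so the sketch is
    runnable end-to-end without model weights.
    """
    seed = hash((image_id, text, prompt.box if prompt else None))
    # Unpack four bytes of the seed into values in [0, 1].
    return [((seed >> i) & 0xFF) / 255.0 for i in range(0, 32, 8)]

# Usage: same image and text, but intent localized to a bounding box.
vec = embed("img_001.jpg", "the dog on the left",
            VisualPrompt(box=(10, 20, 120, 200)))
```

The design point is that the prompt is optional: without one, the call degrades to a conventional global text-image embedding, which mirrors how the paper frames visual interaction as complementing, not replacing, standard embedding tasks.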