[2510.00523] VIRTUE: Visual-Interactive Text-Image Universal Embedder


Summary

The paper presents VIRTUE, a Visual-Interactive Text-Image Universal Embedder that augments multimodal representation learning with visual-interactive prompts (points, bounding boxes, masks), letting users target specific image regions and improving performance across a range of embedding tasks.

Why It Matters

VIRTUE addresses a critical gap in existing embedding models by enabling visual interactions, which can significantly enhance user engagement and application versatility in AI. This advancement is particularly relevant as the demand for more intuitive AI systems grows, making it easier for users to specify their needs and for models to understand complex scenarios.

Key Takeaways

  • VIRTUE integrates visual-interactive capabilities into embedding models.
  • The model improves user interaction by allowing specific region targeting in images.
  • It achieves state-of-the-art performance on multiple multimodal tasks.
  • A new benchmark, SCaR, is introduced to evaluate its capabilities.
  • This advancement opens new applications in AI that require localized user intent.

Computer Science > Artificial Intelligence
arXiv:2510.00523 (cs)
[Submitted on 1 Oct 2025 (v1), last revised 23 Feb 2026 (this version, v2)]

Title: VIRTUE: Visual-Interactive Text-Image Universal Embedder
Authors: Wei-Yao Wang, Kazuya Tateishi, Qiyu Wu, Shusuke Takahashi, Yuki Mitsufuji

Abstract: Multimodal representation learning models have demonstrated successful operation across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack visual-interactive capabilities to specify regions of interest from users (e.g., point, bounding box, mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions not only would unlock new applications with localized grounding of user intent, which remains unexplored, but also enable the models to learn entity-level information within images to complement their global representations for conventional embedding tasks. In this paper, we propose a novel Visual-InteRactive Text-Image Universal Embedder (VIRTUE) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation learning. In VIRTUE, the segmentation model can pr...
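The abstract's core idea is that entity-level representations (derived from a user-specified region such as a bounding box) can complement an image's global embedding. The sketch below is a minimal, hypothetical illustration of that fusion step, not the paper's actual method: `fuse_embeddings`, the toy vectors, and the blending weight `alpha` are all assumptions made for the example. In a real system, both vectors would come from the segmentation model and the VLM described in the paper.

```python
import numpy as np

def fuse_embeddings(global_emb: np.ndarray,
                    entity_emb: np.ndarray,
                    alpha: float = 0.5) -> np.ndarray:
    """Blend a global image embedding with an entity-level (region) embedding.

    alpha=1.0 keeps only the whole-image view; alpha=0.0 keeps only the
    user-selected region. The result is L2-normalized, as is conventional
    for retrieval-oriented embeddings.
    """
    fused = alpha * global_emb + (1.0 - alpha) * entity_emb
    return fused / np.linalg.norm(fused)

# Toy 4-d vectors standing in for real model outputs.
global_emb = np.array([1.0, 0.0, 0.0, 0.0])  # whole-image representation
entity_emb = np.array([0.0, 1.0, 0.0, 0.0])  # representation of a selected region
fused = fuse_embeddings(global_emb, entity_emb)
print(fused)  # unit-norm vector carrying both global and entity-level signal
```

With equal weighting, the fused vector sits halfway between the two views; sliding `alpha` trades off global context against localized user intent, which is the kind of knob a visual-interactive embedder exposes implicitly through its architecture.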
