[2602.02185] Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
Computer Science > Computer Vision and Pattern Recognition

arXiv:2602.02185 (cs)

[Submitted on 2 Feb 2026 (v1), last revised 28 Feb 2026 (this version, v2)]

Title: Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Authors: Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen, Yishuo Cai, Xiaoman Wang, Zhenfei Yin, Lin Chen, Zehui Chen, Shiting Huang, Yiming Zhao, Xu Tang, Yao Hu, Philip Torr, Wanli Ouyang, Shaosheng Cao

Abstract: Multimodal Large Language Models (MLLMs) have advanced visual question answering (VQA) and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities remains difficult, and existing benchmarks have two major limitations. First, they are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from the prior world knowledge of current MLLMs. Second, their evaluation scenarios are overly idealized: on the image-search side, the required information can often be obtained via near-exact matching against the full image, while the text-search side is overly direct and insufficiently challenging. To address these issues, we ...