[2602.12876] BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
Summary
BrowseComp-$V^3$ introduces a new benchmark for evaluating multimodal browsing agents, focusing on complex reasoning across visual and textual data, revealing significant gaps in current AI capabilities.
Why It Matters
This benchmark addresses the limitations of existing multimodal browsing assessments, promoting fairness and reproducibility in AI evaluations. It highlights the challenges faced by state-of-the-art models, emphasizing the need for improved integration of multimodal information in real-world applications.
Key Takeaways
- BrowseComp-$V^3$ features 300 challenging questions for multimodal agents.
- The benchmark emphasizes multi-hop reasoning across text and visuals.
- Current state-of-the-art models achieve only 36% accuracy on this benchmark.
- The evaluation includes a subgoal-driven process for detailed analysis.
- It highlights critical gaps in multimodal information integration.
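The takeaways mention a subgoal-driven evaluation process. As a rough illustration only, the sketch below shows one way such scoring could work: credit an agent for each intermediate subgoal it resolves correctly, in addition to the final answer. The data model, field names, and exact-match scoring rule are assumptions for illustration, not the paper's actual protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    """A hypothetical benchmark item: a final answer plus intermediate subgoals."""
    final_answer: str
    subgoal_answers: list[str] = field(default_factory=list)


def evaluate(question: Question, agent_final: str, agent_subgoals: list[str]) -> dict:
    """Score both the end-to-end answer and per-subgoal progress.

    Uses case-insensitive exact match as a stand-in for whatever
    verification the benchmark actually applies.
    """
    hits = sum(
        1
        for gold, pred in zip(question.subgoal_answers, agent_subgoals)
        if gold.strip().lower() == pred.strip().lower()
    )
    total = len(question.subgoal_answers)
    return {
        "final_correct": agent_final.strip().lower()
        == question.final_answer.strip().lower(),
        "subgoal_score": hits / total if total else 0.0,
    }


# Toy usage: the agent gets the final answer and one of two subgoals right.
q = Question(final_answer="Eiffel Tower", subgoal_answers=["Paris", "1889"])
result = evaluate(q, "eiffel tower", ["paris", "1887"])
```

Reporting subgoal-level credit alongside final accuracy is what makes this kind of evaluation diagnostic: it localizes where in a multi-hop chain an agent fails rather than only recording the end result.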
Computer Science > Artificial Intelligence
arXiv:2602.12876 (cs) [Submitted on 13 Feb 2026]
Title: BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
Authors: Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Dan, Haishan Lu, Zhiyong Cao, Jiaoyang Chen, Yuqian Han, Zinan Sheng, Zhengwei Tao, Hao Liang, Jialong Wu, Yang Shi, Yuanpeng He, Jiaye Lin, Qintong Zhang, Guochen Yan, Runhao Zhao, Zhengpin Li, Xiaohan Yu, Lang Mei, Chong Chen, Wentao Zhang, Bin Cui
Abstract: Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments. However, existing benchmarks for multimodal browsing remain limited in task complexity, evidence accessibility, and evaluation granularity, hindering comprehensive and reproducible assessments of deep search capabilities. To address these limitations, we introduce BrowseComp-$V^3$, a novel benchmark consisting of 300 carefully curated and challenging questions spanning diverse domains. The benchmark emphasizes deep, multi-level, and cross-modal multi-hop reasoning, where critical evidence is interleaved across textual and visual modalities wit...