[2603.01050] MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline
Computer Science > Computer Vision and Pattern Recognition

arXiv:2603.01050 (cs)

[Submitted on 1 Mar 2026]

Title: MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline

Authors: Huanjin Yao, Qixiang Yin, Min Yang, Ziwang Zhao, Yibo Wang, Haotian Luo, Jingyi Zhang, Jiaxing Huang

Abstract: We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis, enabling it to conduct deep research tasks. However, we observe three main challenges in developing such agents: (1) the scarcity of search-intensive multimodal QA data, (2) the lack of effective search trajectories, and (3) the prohibitive cost of training with online search APIs. To tackle these, we first propose Hyper-Search, a hypergraph-based QA generation method that models and connects visual and textual nodes within and across modalities, enabling the generation of search-intensive multimodal QA pairs that require invoking various search tools to solve. Second, we introduce DR-TTS, which first decomposes search-involved tasks into several categories according to search tool type and optimizes a specialized search tool expert for each tool. It then recomposes the tool experts to jointly explore search trajectories via tree search, producing trajectories that ...
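The Hyper-Search idea from the abstract can be illustrated with a minimal sketch: visual and textual facts become nodes, related nodes are joined by hyperedges within and across modalities, and a chain of hyperedges is folded into a question that needs one search hop per edge. All names here (`Node`, `Hypergraph`, `compose_qa`) and the composition rule are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    modality: str   # "visual" or "textual"
    content: str    # entity or fact this node carries

@dataclass
class Hypergraph:
    nodes: list = field(default_factory=list)
    hyperedges: list = field(default_factory=list)  # each edge links >= 2 nodes

    def add_edge(self, relation: str, members: tuple):
        self.hyperedges.append((relation, members))

def compose_qa(graph: Hypergraph, start: Node) -> dict:
    """Chain hyperedges reachable from `start` and fold each relation into
    the question, so answering requires resolving every hop via search."""
    hops, frontier = [], {start}
    for relation, members in graph.hyperedges:
        if frontier & set(members):
            hops.append(relation)
            frontier |= set(members)
    question = (f"About the entity shown in the image ({start.content}), "
                + ", then ".join(f"find {r}" for r in hops) + "?")
    return {"question": question, "required_hops": len(hops)}

# Toy example: a landmark photo linked to a textual fact, which links onward,
# yielding a QA pair that needs an image-search hop plus a text-search hop.
img = Hypergraph().nodes or Node("visual", "Eiffel Tower photo")
g = Hypergraph(nodes=[img])
g.add_edge("its architect", (img, Node("textual", "Gustave Eiffel")))
g.add_edge("that person's birth year",
           (Node("textual", "Gustave Eiffel"), Node("textual", "1832")))
qa = compose_qa(g, img)
print(qa["required_hops"])  # 2
```

Because `Node` is a frozen dataclass, nodes compare by value, so the second hyperedge connects to the first through the shared "Gustave Eiffel" node even though the objects are distinct instances.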
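The DR-TTS recomposition step, as described, can likewise be sketched: one specialized expert per search tool type, recombined by a tree search that expands candidate trajectories and keeps only the highest-scoring paths. The expert set, the beam-style pruning, and the toy scoring rule below are stand-in assumptions, not the paper's trained models or actual search procedure.

```python
import itertools

# Stand-in tool experts, one per search tool type per the abstract.
EXPERTS = ["text_search", "image_search", "web_browse"]

def expand(trajectory):
    """Branch the current trajectory once per tool expert."""
    return [trajectory + [tool] for tool in EXPERTS]

def score(trajectory):
    """Toy score: reward tool diversity, lightly penalize length."""
    return len(set(trajectory)) - 0.1 * len(trajectory)

def tree_search(depth=3, beam=2):
    """Explore trajectories level by level, pruning to `beam` survivors."""
    frontier = [[]]
    for _ in range(depth):
        candidates = itertools.chain.from_iterable(expand(t) for t in frontier)
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return frontier[0]

best = tree_search()
print(best)
```

With this scoring rule the search converges on a trajectory that invokes each tool type once, mirroring the abstract's goal of trajectories that exercise multiple search tools jointly.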