[2603.01152] DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent
Computer Science > Artificial Intelligence
arXiv:2603.01152 (cs)
[Submitted on 1 Mar 2026]

Title: DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent
Authors: Tongzhou Wu, Yuhao Wang, Xinyu Ma, Xiuqiang He, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao

Abstract: Deep-research agents are capable of executing multi-step web exploration, targeted retrieval, and sophisticated question answering. Despite their powerful capabilities, deep-research agents face two critical bottlenecks: (1) the lack of large-scale, challenging datasets with real-world difficulty, and (2) the absence of accessible, open-source frameworks for data synthesis and agent training. To bridge these gaps, we first construct DeepResearch-9K, a large-scale, challenging dataset specifically designed for deep-research scenarios and built from open-source multi-hop question-answering (QA) datasets via a low-cost autonomous pipeline. Notably, it consists of (1) 9000 questions spanning three difficulty levels from L1 to L3, (2) high-quality search trajectories with reasoning chains from Tongyi-DeepResearch-30B-A3B, a state-of-the-art deep-research agent, and (3) verifiable answers. Furthermore, we develop an open-source training framework, DeepResearch-R1, that supports (1) multi-turn web interactions, (2) different reinforcement learning (RL) ...