[2601.21654] ScholarGym: Benchmarking Large Language Model Capabilities in the Information-Gathering Stage of Deep Research
Summary
The paper introduces ScholarGym, an evaluation environment for benchmarking large language models on the information-gathering stage of deep research. It decomposes the workflow into query planning, tool invocation, and relevance assessment, and evaluates each stage separately under a unified, deterministic setup.
Why It Matters
As large language models evolve from single-turn question answering toward multi-round research agents, understanding their capabilities on complex research tasks becomes crucial. Because ScholarGym evaluates each stage of the workflow in isolation, it can pinpoint which component limits overall performance rather than only scoring final reports holistically. This is relevant for developers and researchers aiming to improve AI-driven information retrieval.
Key Takeaways
- ScholarGym decomposes the research process into three stages: Query Planning, Tool Invocation, and Relevance Assessment.
- Iterative query decomposition significantly improves performance, yielding 2.9–3.3x F1 gains over single-query retrieval.
- The study identifies Query Planning quality and Relevance Assessment as critical bottlenecks affecting model performance.
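The three-stage loop named above can be sketched as a minimal information-gathering routine. The stage structure (Query Planning, Tool Invocation, Relevance Assessment) follows the paper, but every function body below is a toy stand-in for illustration, not ScholarGym's actual API or workflow:

```python
# Toy sketch of the three-stage information-gathering loop that
# ScholarGym isolates. All implementations here are hypothetical
# keyword-based stand-ins, not the paper's actual components.

def plan_queries(question, found):
    # Stage 1: Query Planning -- decompose the question into keyword
    # sub-queries, skipping terms already covered by retrieved papers.
    covered = {w for title in found for w in title.lower().split()}
    return [w for w in question.lower().split() if w not in covered]

def search(corpus, query, top_k=3):
    # Stage 2: Tool Invocation -- deterministic keyword retrieval
    # over a static corpus, as in ScholarGym's fixed environment.
    return [title for title in corpus if query in title.lower()][:top_k]

def judge_relevance(question, title):
    # Stage 3: Relevance Assessment -- toy judge: any word overlap counts.
    return bool(set(question.lower().split()) & set(title.lower().split()))

def gather(question, corpus, rounds=2):
    # Iterate planning -> retrieval -> assessment for several rounds,
    # letting earlier findings inform later query plans.
    relevant = set()
    for _ in range(rounds):
        for q in plan_queries(question, relevant):
            for paper in search(corpus, q):
                if judge_relevance(question, paper):
                    relevant.add(paper)
    return relevant

corpus = ["graph neural networks survey",
          "retrieval augmented generation",
          "cooking with cast iron"]
print(sorted(gather("graph retrieval networks", corpus)))
# -> ['graph neural networks survey', 'retrieval augmented generation']
```

Iterating the loop lets the second round plan around what the first round already found, which is the intuition behind the reported gains from iterative query decomposition.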
Computer Science > Artificial Intelligence
arXiv:2601.21654 (cs)
[Submitted on 29 Jan 2026 (v1), last revised 14 Feb 2026 (this version, v2)]
Title: ScholarGym: Benchmarking Large Language Model Capabilities in the Information-Gathering Stage of Deep Research
Authors: Hao Shen, Hang Yang, Zhouhong Gu
Abstract: Large language models have advanced from single-turn question answering to deep research systems that iteratively decompose research questions, invoke retrieval tools, and synthesize information across multiple rounds. Evaluating such systems typically involves scoring their final research reports holistically, but this end-to-end paradigm tightly couples the language model's decision-making, workflow design, and environmental feedback, precluding decomposable analysis of individual components. We introduce ScholarGym, an evaluation environment that isolates the information-gathering stage of deep research on academic literature. Under a unified workflow, ScholarGym decomposes the research process into three explicit stages -- Query Planning, Tool Invocation, and Relevance Assessment -- and evaluates each against 2,536 expert-annotated queries over a static corpus of 570K papers with deterministic retrieval. Systematic experiments reveal that iterative query decompositi...
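Retrieval against expert-annotated gold sets, as in the abstract above, is commonly scored with set-based F1. The helper below uses the standard precision/recall/F1 definitions; whether ScholarGym computes its metric in exactly this form is an assumption, and the paper IDs are made up for illustration:

```python
# Set-based F1 for retrieval: score a gathered set of papers against an
# expert-annotated gold set. Standard definition; the exact metric
# details in ScholarGym may differ. Paper IDs below are hypothetical.

def retrieval_f1(retrieved, gold):
    retrieved, gold = set(retrieved), set(gold)
    if not retrieved or not gold:
        return 0.0
    tp = len(retrieved & gold)          # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)     # fraction of retrieved that are gold
    recall = tp / len(gold)             # fraction of gold that were retrieved
    return 2 * precision * recall / (precision + recall)

gold = {"p1", "p2", "p3", "p4"}
# A single broad query finds one gold paper plus noise (P=0.5, R=0.25):
single = retrieval_f1({"p1", "p9"}, gold)
# Iterative decomposition covers more of the gold set (P=0.75, R=0.75):
iterative = retrieval_f1({"p1", "p2", "p3", "p9"}, gold)
print(round(single, 3), round(iterative, 3))
# -> 0.333 0.75
```

Higher recall from covering more sub-aspects of the question is exactly how iterative decomposition can multiply F1 relative to a single query.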