[2602.08543] GISA: A Benchmark for General Information-Seeking Assistant
Summary
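The paper introduces GISA, a benchmark for evaluating general information-seeking assistants. It targets two limitations of existing benchmarks: queries constructed backward from answers, which produce unnatural tasks, and a narrow focus on either locating specific information or aggregating it from multiple sources. GISA responds with human-crafted queries and structured answer formats.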
Why It Matters
As large language models improve, search agents that gather information through multi-turn web interactions need benchmarks that reflect what real users actually ask. GISA aims to align evaluation with real-world information-seeking scenarios, so that benchmark results say more about how an agent would perform in practice.
Key Takeaways
- GISA includes 373 human-crafted queries reflecting real information-seeking scenarios.
- It features four structured answer formats (item, set, list, and table) for deterministic evaluation.
- The benchmark integrates deep reasoning and broad information aggregation.
- GISA provides complete human search trajectories for process-level supervision.
- Current LLMs show limited performance, indicating significant room for improvement.
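
To make the idea of deterministic evaluation over structured answers concrete, here is a minimal sketch in Python. It is not the scoring method from the paper, whose exact metrics are not given in this summary. The function names, the normalization rule, the choice of exact match for items, F1 for sets and tables, and position-wise match for lists are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical table layout (an assumption): row key -> {column name -> cell value}.
Table = Dict[str, Dict[str, str]]


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(str(value).lower().split())


def _f1(pred: Set, gold: Set) -> float:
    """F1 overlap between two sets; two empty sets count as a perfect match."""
    if not pred and not gold:
        return 1.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def score_item(pred: str, gold: str) -> float:
    """Item: a single value, scored by exact match after normalization."""
    return 1.0 if normalize(pred) == normalize(gold) else 0.0


def score_set(pred: List[str], gold: List[str]) -> float:
    """Set: order is irrelevant and duplicates collapse, scored by F1."""
    return _f1({normalize(x) for x in pred}, {normalize(x) for x in gold})


def score_list(pred: List[str], gold: List[str]) -> float:
    """List: order matters, scored as the fraction of positions that agree."""
    if not pred and not gold:
        return 1.0
    matches = sum(1 for p, g in zip(pred, gold) if normalize(p) == normalize(g))
    return matches / max(len(pred), len(gold))


def _cells(table: Table) -> Set[Tuple[str, str, str]]:
    """Flatten a table into a set of (row, column, value) triples."""
    return {
        (normalize(row), normalize(col), normalize(val))
        for row, columns in table.items()
        for col, val in columns.items()
    }


def score_table(pred: Table, gold: Table) -> float:
    """Table: scored by F1 over (row, column, value) cells."""
    return _f1(_cells(pred), _cells(gold))


if __name__ == "__main__":
    print(score_item("  Paris ", "paris"))
    print(score_set(["a", "b"], ["B", "c"]))
    print(score_list(["a", "b", "c"], ["a", "c", "b"]))
    print(score_table({"r1": {"x": "1", "y": "2"}}, {"r1": {"x": "1", "y": "3"}}))
```

Because every scorer is a pure function of the predicted and reference answers, the same inputs always yield the same score, with no judge model involved. That is the property the structured formats are meant to provide.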
arXiv:2602.08543 (cs), Computer Science > Computation and Language
Submitted on 9 Feb 2026 (v1), last revised 13 Feb 2026 (this version, v2)
Title: GISA: A Benchmark for General Information-Seeking Assistant
Authors: Yutao Zhu, Xingshuo Zhang, Maosen Zhang, Jiajie Jin, Liancheng Zhang, Xiaoshuai Song, Kangzhi Zhao, Wencong Zeng, Ruiming Tang, Han Li, Ji-Rong Wen, Zhicheng Dou
Abstract: The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi-turn web interactions. Various benchmarks have been proposed to evaluate such agents. However, existing benchmarks often construct queries backward from answers, producing unnatural tasks misaligned with real-world needs. Moreover, these benchmarks tend to focus on either locating specific information or aggregating information from multiple sources, while relying on static answer sets prone to data contamination. To bridge these gaps, we introduce GISA, a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries that reflect authentic information-seeking scenarios. GISA features four structured answer formats (item, set, list, and table), enabling deterministic evaluation. It integrates both deep reasoning and broad information aggregation w...