[2506.01062] SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models
Computer Science > Computation and Language
arXiv:2506.01062 (cs)
[Submitted on 1 Jun 2025 (v1), last revised 5 Mar 2026 (this version, v3)]

Title: SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models
Authors: Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng, Tu Vu

Abstract: We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models: even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools, such as o3 and o4-mini, achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does...