[2602.18940] DREAM: Deep Research Evaluation with Agentic Metrics
Summary
The paper presents DREAM, a framework that uses agentic metrics to evaluate Deep Research Agents, addressing the challenge of assessing multidimensional research quality without a single ground truth.
Why It Matters
As AI-generated content becomes prevalent, ensuring the quality and accuracy of research outputs is crucial. DREAM proposes a novel evaluation method that enhances the reliability of assessments, which is vital for academic integrity and informed decision-making in AI applications.
Key Takeaways
- DREAM introduces a framework for evaluating AI-generated research reports.
- It addresses the limitations of existing evaluation methods by focusing on factual correctness and temporal validity.
- The framework employs agentic metrics to enhance assessment accuracy.
- Controlled evaluations show DREAM's superiority over traditional benchmarks.
- DREAM aims to provide a scalable, reference-free evaluation paradigm.
Computer Science > Artificial Intelligence, arXiv:2602.18940 (cs)
[Submitted on 21 Feb 2026]
Title: DREAM: Deep Research Evaluation with Agentic Metrics
Authors: Elad Ben Avraham, Changhao Li, Ron Dorfman, Roy Ganz, Oren Nuriel, Amir Dudai, Aviad Aberdam, Noah Flynn, Elman Mansimov, Adi Kalyanpur, Ron Litman
Abstract: Deep Research Agents generate analyst-grade reports, yet evaluating them remains challenging due to the absence of a single ground truth and the multidimensional nature of research quality. Recent benchmarks propose distinct methodologies, yet they suffer from the Mirage of Synthesis, where strong surface-level fluency and citation alignment can obscure underlying factual and reasoning defects. We characterize this gap by introducing a taxonomy across four verticals that exposes a critical capability mismatch: static evaluators inherently lack the tool-use capabilities required to assess temporal validity and factual correctness. To address this, we propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that instantiates the principle of capability parity by making evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent, enabling temporally aware coverage, grounded verificat...
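The protocol described in the abstract, combining fixed query-agnostic metrics with adaptive metrics produced by a tool-calling agent, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the metric names, the scoring heuristics, and the keyword-based stand-in for the tool-calling agent are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Metric:
    """A named check mapping a report to a score in [0, 1]."""
    name: str
    score_fn: Callable[[str], float]

def query_agnostic_metrics() -> list[Metric]:
    # Fixed checks applied to every report regardless of the query
    # (the specific checks here are illustrative placeholders).
    return [
        Metric("cites_sources", lambda r: 1.0 if "[" in r and "]" in r else 0.0),
        Metric("non_trivial_length", lambda r: min(len(r.split()) / 50, 1.0)),
    ]

def adaptive_metrics(query: str) -> list[Metric]:
    # Stand-in for the tool-calling agent: in DREAM this step would use
    # tools (e.g. live search) to derive query-specific checks such as
    # temporal validity; here we just derive coverage checks from keywords.
    keywords = [w for w in query.lower().split() if len(w) > 4]
    return [
        Metric(f"covers:{w}", lambda r, w=w: 1.0 if w in r.lower() else 0.0)
        for w in keywords
    ]

def evaluate(query: str, report: str) -> dict[str, float]:
    """Run the combined evaluation protocol and return per-metric scores."""
    metrics = query_agnostic_metrics() + adaptive_metrics(query)
    return {m.name: m.score_fn(report) for m in metrics}

scores = evaluate(
    "Summarize recent transformer efficiency research",
    "Recent transformer efficiency research focuses on sparse attention [1].",
)
print(sum(scores.values()) / len(scores))
```

The key design point mirrored here is that the adaptive half of the metric set is generated per query at evaluation time, so the evaluator's checks can track what the report was actually asked to do rather than applying only static, one-size-fits-all criteria.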