[2603.22767] Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases
Computer Science > Artificial Intelligence
arXiv:2603.22767 (cs)
[Submitted on 24 Mar 2026]

Title: Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases
Authors: Dubai Li, Yuxiang He, Yan Hu, Yu Tian, Jingsong Li

Abstract: Observational studies can yield clinically actionable evidence at scale, but executing them on real-world databases is open-ended and requires coherent decisions across cohort construction, analysis, and reporting. Prior evaluations of LLM agents emphasize isolated steps or single answers, missing the integrity and internal structure of the resulting evidence bundle. To address this gap, we introduce RWE-bench, a benchmark grounded in MIMIC-IV and derived from peer-reviewed observational studies. Each task provides the corresponding study protocol as the reference standard, requiring agents to execute experiments in a real database and iteratively generate tree-structured evidence bundles. We evaluate six LLMs (three open-source, three closed-source) under three agent scaffolds, using both question-level correctness and end-to-end task metrics. Across 162 tasks, task success is low: the best agent reaches 39.9%, and the best open-source model reaches 30.4%. Agent scaffolds also matter substantially, causing over 30% variation...
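The abstract's notion of a tree-structured evidence bundle, graded both per decision and end to end, can be made concrete with a minimal sketch. The class and function names below are illustrative assumptions, not the paper's actual schema or scoring code: they show one plausible way an agent's cohort, analysis, and reporting decisions could be organized as a tree and checked against a reference protocol.

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceNode:
    """One decision in the study pipeline (hypothetical schema).

    `kind` might be 'cohort', 'analysis', or 'reporting'; `children`
    hold follow-on decisions that depend on this one, giving the
    bundle its tree structure.
    """
    kind: str
    claim: str                      # e.g. "exclude patients under 18"
    correct: bool = False           # graded against the reference protocol
    children: list["EvidenceNode"] = field(default_factory=list)

def question_level_accuracy(root: EvidenceNode) -> float:
    """Fraction of nodes graded correct (a question-level metric)."""
    nodes, stack = [], [root]
    while stack:
        node = stack.pop()
        nodes.append(node)
        stack.extend(node.children)
    return sum(n.correct for n in nodes) / len(nodes)

def task_success(root: EvidenceNode) -> bool:
    """End-to-end metric: succeed only if every decision in the tree is correct."""
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.correct:
            return False
        stack.extend(node.children)
    return True
```

Under these assumptions, the gap the abstract reports between per-question scores and low end-to-end success rates falls out naturally: `question_level_accuracy` rewards partially correct bundles, while `task_success` requires the whole tree of decisions to hold together at once.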