[2510.00436] Automated Evaluation can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization
Computer Science > Artificial Intelligence
arXiv:2510.00436 (cs)
[Submitted on 1 Oct 2025 (v1), last revised 7 May 2026 (this version, v2)]

Title: Automated Evaluation can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization
Authors: Sarvesh Soni, Dina Demner-Fushman

Abstract: Automated approaches to answering patient-posed health questions are on the rise, but selecting among systems requires reliable evaluation. The current gold standard for evaluating free-text artificial intelligence (AI) responses, human expert review, is labor-intensive and slow, limiting scalability. Automated metrics are promising yet variably aligned with human judgments and often context-dependent. To assess the feasibility of automating the evaluation of AI responses to hospitalization-related questions posed by patients, we conducted a large systematic study of evaluation approaches. Across 100 patient cases, we collected responses from 28 AI systems (2,800 total) and assessed them along three dimensions: whether a system response (1) answers the question, (2) appropriately uses clinical note evidence, and (3) uses general medical knowledge. Using clinician-authored reference answers to anchor metrics, automated rankings closely matched human ratings. Our findings su...
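The abstract's central claim, that automated rankings of the 28 systems closely matched human ratings, is typically quantified with a rank correlation between per-system scores from the automated metric and from human reviewers. The sketch below illustrates this comparison with a pure-Python Spearman correlation; all system scores are illustrative placeholders, not data from the paper, and the paper's exact metric and correlation statistic are not specified in this abstract.

```python
# Hypothetical sketch: checking whether an automated metric's ranking of
# AI systems agrees with human expert ratings, in the spirit of the
# paper's setup. Scores below are made up for illustration.

def rank(scores):
    """Map each score to its rank (1 = highest score)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(x, y):
    """Spearman rank correlation (simple formula, no tie correction)."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Mean automated-metric score and mean human rating per AI system
# (five systems shown; the study evaluated 28).
auto_scores  = [0.91, 0.78, 0.65, 0.52, 0.40]
human_scores = [4.6,  4.1,  3.5,  3.0,  2.2]

# Perfectly concordant rankings yield a correlation of 1.0.
print(spearman(auto_scores, human_scores))  # → 1.0
```

A correlation near 1.0 indicates the automated metric orders systems the same way human experts do, which is what makes it a viable substitute for slow expert review when selecting among candidate systems.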