arXiv:2603.11413 (cs)
Computer Science > Human-Computer Interaction
[Submitted on 12 Mar 2026 (v1), last revised 26 Mar 2026 (this version, v3)]

Title: Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI
Authors: David Fraile Navarro, Farah Magrabi, Enrico Coiera

Abstract: Ramaswamy et al. reported in Nature Medicine that ChatGPT Health under-triages 51.6% of emergencies, concluding that consumer-facing AI triage poses safety risks. However, their evaluation used an exam-style protocol -- forced A/B/C/D output, knowledge suppression, and suppression of clarifying questions -- that differs fundamentally from how consumers use health chatbots. We tested five frontier LLMs (GPT-5.2, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3 Flash, Gemini 3.1 Pro) on a 17-scenario partial replication bank under constrained (exam-style, 1,275 trials) and naturalistic (patient-style messages, 850 trials) conditions, with targeted ablations and prompt-faithful checks using the authors' released prompts. Naturalistic interaction improved triage accuracy by 6.4 percentage points ($p = 0.015$). Diabetic ketoacidosis was correctly triaged in 100% of trials across all models and conditions. Asthma triage improved from 48% to 80%. The forced A/B/C/...
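To make the contrast between the two evaluation conditions concrete, here is a minimal Python sketch of how such a harness might be framed. This is not the authors' released prompt text: the scenario wording, the prompt templates, and the query_model stub are illustrative assumptions, and the per-condition repeat counts are inferred from the reported totals (17 scenarios x 5 models x 15 repeats = 1,275 constrained trials; x 10 repeats = 850 naturalistic trials).

```python
# Hypothetical sketch of the two evaluation formats described in the abstract.
# Prompt wording and the scenario text are assumptions, not the released prompts.

SCENARIO = (  # hypothetical case text for one of the 17 scenarios
    "19-year-old with type 1 diabetes, vomiting all day, fruity-smelling "
    "breath, increasingly drowsy."
)

# Constrained condition: forced A/B/C/D output, no explanation,
# no clarifying questions (exam-style protocol).
CONSTRAINED_PROMPT = (
    "Answer with a single letter only. Do not explain your reasoning and "
    "do not ask any questions.\n"
    f"Case: {SCENARIO}\n"
    "What should this person do?\n"
    "A) Call emergency services now\n"
    "B) See a doctor within 24 hours\n"
    "C) Manage with self-care at home\n"
    "D) No action needed"
)

# Naturalistic condition: a free-text, patient-style message; the model
# may reason, hedge, and ask clarifying questions before recommending.
NATURALISTIC_PROMPT = f"Hi, I'm really worried. {SCENARIO} What should I do?"


def query_model(model: str, prompt: str) -> str:
    """Stub for an LLM call; swap in a real API client here."""
    raise NotImplementedError


MODELS = [
    "GPT-5.2",
    "Claude Sonnet 4.6",
    "Claude Opus 4.6",
    "Gemini 3 Flash",
    "Gemini 3.1 Pro",
]
N_SCENARIOS = 17

# Sanity check: the assumed repeat counts reproduce the reported trial totals.
assert N_SCENARIOS * len(MODELS) * 15 == 1275  # constrained condition
assert N_SCENARIOS * len(MODELS) * 10 == 850   # naturalistic condition
```

The design point the sketch highlights is that the two conditions differ only in the prompt wrapper around an identical clinical vignette, which is what lets the study attribute the 6.4 percentage-point accuracy gap to evaluation format rather than model capability.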