[2512.20983] Automatic Replication of LLM Mistakes in Medical Conversations
Computer Science > Computation and Language
arXiv:2512.20983 (cs)
[Submitted on 24 Dec 2025 (v1), last revised 7 Apr 2026 (this version, v2)]

Title: Automatic Replication of LLM Mistakes in Medical Conversations
Authors: Oleksii Proniakin, Diego Fajardo, Ruslan Nazarenko, Razvan Marinescu

Abstract: Large language models (LLMs) are increasingly evaluated in clinical settings using multi-dimensional rubrics which quantify reasoning quality, safety, and patient-centeredness. Yet, replicating specific mistakes in other LLM models is not straightforward and often requires manual effort. We introduce MedMistake, an automatic pipeline that extracts mistakes LLMs make in patient-doctor conversations and converts them into a benchmark of single-shot QA pairs. Our pipeline (1) creates complex, conversational data between an LLM patient and LLM doctor, (2) runs an evaluation with a committee of 2 LLM judges across a variety of dimensions and (3) creates simplified single-shot QA scenarios from those mistakes. We release MedMistake-All, a dataset of 3,390 single-shot QA pairs where GPT-5 and Gemini 2.5 Pro are currently failing to answer correctly, as judged by two LLM judges. We used medical experts to validate a subset of 211/3390 questions (MedMistake-Bench), which we used to run a f...
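
The three pipeline stages described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: every name here (`Turn`, `Mistake`, `QAPair`, `simulate_conversation`, `committee_mistakes`, `run_pipeline`) is hypothetical, the patient, doctor, judges, and rewriter are plain callables standing in for LLM calls, and the rule that a mistake counts only when every judge flags it is an assumption, since the abstract does not say how the two judges are combined.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Turn:
    role: str  # "patient" or "doctor"
    text: str


@dataclass(frozen=True)
class Mistake:
    dimension: str    # rubric dimension, e.g. "safety"
    description: str  # what the doctor model got wrong


@dataclass(frozen=True)
class QAPair:
    question: str
    expected: str
    dimension: str


# Hypothetical stand-ins for LLM calls.
Speaker = Callable[[List[Turn]], str]
Judge = Callable[[List[Turn]], List[Mistake]]
Rewriter = Callable[[Mistake, List[Turn]], QAPair]


def simulate_conversation(patient: Speaker, doctor: Speaker, rounds: int) -> List[Turn]:
    """Stage 1: alternate patient and doctor turns for a fixed number of rounds."""
    history: List[Turn] = []
    for _ in range(rounds):
        history.append(Turn("patient", patient(history)))
        history.append(Turn("doctor", doctor(history)))
    return history


def committee_mistakes(history: List[Turn], judges: List[Judge]) -> List[Mistake]:
    """Stage 2: keep mistakes flagged by every judge (assumed agreement rule)."""
    if not judges:
        return []
    flagged = [set(judge(history)) for judge in judges]
    agreed = set.intersection(*flagged)
    return sorted(agreed, key=lambda m: (m.dimension, m.description))


def run_pipeline(patient: Speaker, doctor: Speaker, judges: List[Judge],
                 rewriter: Rewriter, rounds: int = 1) -> List[QAPair]:
    """Stage 3: turn each agreed mistake into a single-shot QA pair."""
    history = simulate_conversation(patient, doctor, rounds)
    return [rewriter(m, history) for m in committee_mistakes(history, judges)]


# Toy stubs so the sketch runs without any model access.
def toy_patient(history: List[Turn]) -> str:
    return "I take warfarin and my knee hurts. What can I take?"


def toy_doctor(history: List[Turn]) -> str:
    return "Take ibuprofen twice a day."


SAFETY = Mistake("safety", "Recommended an NSAID to a patient on warfarin.")
TONE = Mistake("patient-centeredness", "Did not acknowledge the patient's concern.")


def judge_a(history: List[Turn]) -> List[Mistake]:
    return [SAFETY, TONE]


def judge_b(history: List[Turn]) -> List[Mistake]:
    return [SAFETY]


def toy_rewriter(mistake: Mistake, history: List[Turn]) -> QAPair:
    # Reuse the first patient message as the standalone question.
    question = next(t.text for t in history if t.role == "patient")
    return QAPair(question, f"Avoid this error: {mistake.description}", mistake.dimension)


pairs = run_pipeline(toy_patient, toy_doctor, [judge_a, judge_b], toy_rewriter)
```

With these stubs, only the mistake both toy judges agree on survives into `pairs`. Passing the models in as callables keeps the sketch independent of any particular LLM client.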