[2603.09986] Quantifying Hallucinations in Large Language Models on Medical Textbooks
Computer Science > Computation and Language
arXiv:2603.09986 (cs)
[Submitted on 12 Feb 2026 (v1), last revised 7 May 2026 (this version, v2)]

Title: Quantifying Hallucinations in Large Language Models on Medical Textbooks
Authors: Brandon C. Colelough, Davis Bartels, Dina Demner-Fushman

Abstract: Hallucinations, the tendency of large language models to produce responses with factually incorrect and unsupported claims, are a serious problem in natural language processing for which we do not yet have an effective mitigation. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur in textbook-grounded QA and how responses to medical QA prompts vary across models. We conduct two experiments: the first determines the prevalence of hallucinations for a prominent open-source large language model (LLaMA-70B-Instruct) on medical QA given closed-source zero-shot prompts, and the second determines the prevalence of hallucinations and clinician preference for model responses. In experiment one, we observed that, with the passages provided, LLaMA-70B-Instruct hallucinated in 19.7% of answers (95% CI 18.6 to 20.7) even though 98.8% of prompt responses received maximal plausibility, and ...
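
As a side note on the interval reported above, the following is a minimal sketch (not from the paper) of how a 95% Wilson score confidence interval for a hallucination rate could be computed; the counts used are hypothetical placeholders, since the excerpt does not report the raw number of graded responses.

```python
# Minimal sketch: 95% Wilson score interval for a binomial proportion
# (e.g., a hallucination rate). Counts below are HYPOTHETICAL, not from the paper.
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the (lower, upper) bounds of the Wilson score interval."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical example: 197 hallucinated answers out of 1000 graded responses.
lo, hi = wilson_ci(197, 1000)
print(f"rate = {197 / 1000:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```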