[2603.09986] Quantifying Hallucinations in Large Language Models on Medical Textbooks
Computer Science > Computation and Language
arXiv:2603.09986 (cs)
[Submitted on 12 Feb 2026 (v1), last revised 7 May 2026 (this version, v2)]

Title: Quantifying Hallucinations in Large Language Models on Medical Textbooks
Authors: Brandon C. Colelough, Davis Bartels, Dina Demner-Fushman

Abstract: Hallucinations, the tendency of large language models to produce responses with factually incorrect and unsupported claims, are a serious problem in natural language processing for which we do not yet have an effective mitigation. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur in textbook-grounded QA and how responses to medical QA prompts vary across models. We conduct two experiments: the first determines the prevalence of hallucinations for a prominent open-source large language model (LLaMA-70B-Instruct) on medical QA given closed-source zero-shot prompts, and the second determines the prevalence of hallucinations and clinician preference for model responses. In experiment one, we observed that, with the passages provided, LLaMA-70B-Instruct hallucinated in 19.7% of answers (95% CI 18.6 to 20.7) even though 98.8% of prompt responses received maximal plausibility, and ...
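
As a side note on the interval reported above, the following is a minimal sketch (not from the paper) of how a 95% Wilson score confidence interval for a hallucination rate could be computed; the counts used are hypothetical placeholders, since the excerpt does not report the raw number of graded responses.

```python
# Minimal sketch: 95% Wilson score interval for a binomial proportion
# (e.g., a hallucination rate). Counts below are HYPOTHETICAL, not from the paper.
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the (lower, upper) bounds of the Wilson score interval."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical example: 197 hallucinated answers out of 1000 graded responses.
lo, hi = wilson_ci(197, 1000)
print(f"rate = {197 / 1000:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```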