[2511.03441] CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field
Computer Science > Computation and Language
arXiv:2511.03441 (cs)
[Submitted on 5 Nov 2025 (v1), last revised 4 Mar 2026 (this version, v3)]

Title: CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field
Authors: Doria Bonzi, Alexandre Guiggi, Frédéric Béchet, Carlos Ramisch, Benoit Favre

Abstract: Critical appraisal of scientific literature is an essential skill in the biomedical field. While large language models (LLMs) offer promising support for this task, their reliability remains limited, particularly for critical reasoning in specialized domains. We introduce CareMedEval, an original dataset designed to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Derived from authentic exams taken by French medical students, the dataset contains 534 questions based on 37 scientific articles. Unlike existing benchmarks, CareMedEval explicitly evaluates critical reading and reasoning grounded in scientific papers. Benchmarking state-of-the-art generalist and biomedical-specialized LLMs under various context conditions reveals the difficulty of the task: open and commercial models fail to exceed an Exact Match Rate of 0.5, even though generating intermediate reasoning tokens considerably improves results. Yet, models remain challenged especially...
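The Exact Match Rate reported above can be illustrated with a minimal sketch. This is not the paper's evaluation code; it assumes each question's gold answer is a set of option labels (as in multi-answer MCQ exams), and a prediction scores only when it reproduces that set exactly, with no partial credit.

```python
def exact_match_rate(predictions, references):
    """Fraction of questions whose predicted answer set equals the gold set exactly.

    Illustrative sketch only (not the CareMedEval evaluation code): answers are
    assumed to be sets of option labels, e.g. {"A", "C"}; partial overlap with
    the gold set earns no credit.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    # Count questions where the predicted set matches the gold set exactly.
    matches = sum(set(p) == set(r) for p, r in zip(predictions, references))
    return matches / len(references)


# Example: one exact match out of two questions -> rate of 0.5
rate = exact_match_rate([{"A", "C"}, {"B"}], [{"A", "C"}, {"B", "D"}])
```

Under this all-or-nothing scoring, a model that reliably identifies some but not all correct options still scores zero on those questions, which is part of what makes the 0.5 ceiling a demanding bar.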