[2509.11517] PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation
Summary
The PeruMedQA study evaluates large language models (LLMs) on Peruvian medical exams, introducing a specialized Spanish-language dataset and comparing the performance of several models in a Latin American context.
Why It Matters
As LLMs gain traction in medical applications, understanding their effectiveness in specific cultural and linguistic contexts is crucial. This research highlights the performance of LLMs on Spanish-language medical exams, providing insights for future AI applications in Latin America.
Key Takeaways
- The PeruMedQA dataset contains 8,380 multiple-choice medical questions across 12 specialties (2018-2025).
- Fine-tuning LLMs significantly improves their performance on specialized medical questions.
- Medgemma-27b achieved the highest accuracy, particularly in Psychiatry, showcasing the potential of tailored LLMs.
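The per-specialty accuracy comparisons above can be sketched with a small helper. This is an illustrative outline only; the records below are hypothetical stand-ins, not PeruMedQA data or the paper's evaluation code:

```python
from collections import defaultdict

def accuracy_by_specialty(records):
    """records: iterable of (specialty, predicted, gold) tuples.
    Returns {specialty: fraction of correct predictions}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for specialty, pred, gold in records:
        totals[specialty] += 1
        hits[specialty] += int(pred == gold)
    return {s: hits[s] / totals[s] for s in totals}

# Hypothetical model outputs, for illustration only.
demo = [
    ("Psychiatry", "A", "A"),
    ("Psychiatry", "B", "B"),
    ("Pediatrics", "C", "D"),
    ("Pediatrics", "D", "D"),
]
print(accuracy_by_specialty(demo))  # {'Psychiatry': 1.0, 'Pediatrics': 0.5}
```

Grouping by specialty before averaging is what surfaces findings like the Psychiatry result, which a single overall accuracy figure would hide.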
Computer Science > Computation and Language
arXiv:2509.11517 (cs)
[Submitted on 15 Sep 2025 (v1), last revised 25 Feb 2026 (this version, v2)]
Authors: Rodrigo M. Carrillo-Larco, Jesus Lovón Melgarejo, Manuel Castillo-Cara, Gusseppe Bravo-Rocca
Abstract: BACKGROUND: Medical large language models (LLMs) have demonstrated remarkable performance in answering medical examinations. However, the extent to which this high performance transfers to medical questions in Spanish and from a Latin American country remains unexplored. This knowledge is crucial as LLM-based medical applications gain traction in Latin America. AIMS: To build a dataset of questions from medical examinations taken by Peruvian physicians pursuing specialty training; to fine-tune an LLM on this dataset; and to evaluate and compare the accuracy of vanilla LLMs and the fine-tuned LLM. METHODS: We curated PeruMedQA, a multiple-choice question-answering (MCQA) dataset containing 8,380 questions spanning 12 specialties (2018-2025). We selected ten medical LLMs, including medgemma-4b-it and medgemma-27b-text-it, and developed zero-shot task-specific prompts to answer...
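The abstract mentions zero-shot, task-specific prompts for the MCQA items. The paper's exact prompt wording is not reproduced here, so the sketch below uses a generic multiple-choice template and a simple parser that extracts the model's answer letter; the question, options, and instruction text are all hypothetical:

```python
import re

def build_mcqa_prompt(question: str, options: dict) -> str:
    """Format one MCQA item as a zero-shot prompt (illustrative template,
    not the paper's actual prompt)."""
    lines = [
        "You are answering a Peruvian medical residency exam question.",
        "Reply with the single letter of the best option.",
        "",
        f"Question: {question}",
    ]
    for letter, text in sorted(options.items()):
        lines.append(f"{letter}) {text}")
    lines.append("Answer:")
    return "\n".join(lines)

def parse_answer(completion: str, valid: str = "ABCD"):
    """Return the first standalone option letter in the model output,
    or None if no valid letter is found."""
    match = re.search(rf"\b([{valid}])\b", completion.upper())
    return match.group(1) if match else None

prompt = build_mcqa_prompt(
    "Which vitamin deficiency causes scurvy?",  # hypothetical item
    {"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D", "D": "Vitamin K"},
)
print(parse_answer("The answer is B) Vitamin C."))  # B
```

A regex with word boundaries avoids matching letters embedded inside words, though a production harness would also need to handle outputs that restate the option text without a letter.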