[2509.11517] PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation
Summary
The PeruMedQA study evaluates large language models (LLMs) on Peruvian medical exams, introducing a specialized Spanish-language dataset and comparing the performance of several models in a Latin American context.
Why It Matters
As LLMs gain traction in medical applications, understanding their effectiveness in specific cultural and linguistic contexts is crucial. This research highlights the performance of LLMs on Spanish-language medical exams, providing insights for future AI applications in Latin America.
Key Takeaways
- The PeruMedQA dataset contains 8,380 multiple-choice medical questions across 12 specialties (2018-2025).
- Fine-tuning LLMs significantly improves their performance on specialized medical questions.
- Medgemma-27b achieved the highest accuracy, particularly in Psychiatry, showcasing the potential of tailored LLMs.
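The per-specialty accuracy comparisons above can be sketched with a small helper. This is an illustrative outline only; the records below are hypothetical stand-ins, not PeruMedQA data or the paper's evaluation code:

```python
from collections import defaultdict

def accuracy_by_specialty(records):
    """records: iterable of (specialty, predicted, gold) tuples.
    Returns {specialty: fraction of correct predictions}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for specialty, pred, gold in records:
        totals[specialty] += 1
        hits[specialty] += int(pred == gold)
    return {s: hits[s] / totals[s] for s in totals}

# Hypothetical model outputs, for illustration only.
demo = [
    ("Psychiatry", "A", "A"),
    ("Psychiatry", "B", "B"),
    ("Pediatrics", "C", "D"),
    ("Pediatrics", "D", "D"),
]
print(accuracy_by_specialty(demo))  # {'Psychiatry': 1.0, 'Pediatrics': 0.5}
```

Grouping by specialty before averaging is what surfaces findings like the Psychiatry result, which a single overall accuracy figure would hide.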
Computer Science > Computation and Language
arXiv:2509.11517 (cs)
[Submitted on 15 Sep 2025 (v1), last revised 25 Feb 2026 (this version, v2)]
Authors: Rodrigo M. Carrillo-Larco, Jesus Lovón Melgarejo, Manuel Castillo-Cara, Gusseppe Bravo-Rocca
Abstract: BACKGROUND: Medical large language models (LLMs) have demonstrated remarkable performance in answering medical examinations. However, the extent to which this high performance transfers to medical questions in Spanish and from a Latin American country remains unexplored. This knowledge is crucial as LLM-based medical applications gain traction in Latin America. AIMS: To build a dataset of questions from medical examinations taken by Peruvian physicians pursuing specialty training; to fine-tune an LLM on this dataset; and to evaluate and compare the accuracy of vanilla LLMs and the fine-tuned LLM. METHODS: We curated PeruMedQA, a multiple-choice question-answering (MCQA) dataset containing 8,380 questions spanning 12 specialties (2018-2025). We selected ten medical LLMs, including medgemma-4b-it and medgemma-27b-text-it, and developed zero-shot task-specific prompts to answer...
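The abstract mentions zero-shot, task-specific prompts for the MCQA items. The paper's exact prompt wording is not reproduced here, so the sketch below uses a generic multiple-choice template and a simple parser that extracts the model's answer letter; the question, options, and instruction text are all hypothetical:

```python
import re

def build_mcqa_prompt(question: str, options: dict) -> str:
    """Format one MCQA item as a zero-shot prompt (illustrative template,
    not the paper's actual prompt)."""
    lines = [
        "You are answering a Peruvian medical residency exam question.",
        "Reply with the single letter of the best option.",
        "",
        f"Question: {question}",
    ]
    for letter, text in sorted(options.items()):
        lines.append(f"{letter}) {text}")
    lines.append("Answer:")
    return "\n".join(lines)

def parse_answer(completion: str, valid: str = "ABCD"):
    """Return the first standalone option letter in the model output,
    or None if no valid letter is found."""
    match = re.search(rf"\b([{valid}])\b", completion.upper())
    return match.group(1) if match else None

prompt = build_mcqa_prompt(
    "Which vitamin deficiency causes scurvy?",  # hypothetical item
    {"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D", "D": "Vitamin K"},
)
print(parse_answer("The answer is B) Vitamin C."))  # B
```

A regex with word boundaries avoids matching letters embedded inside words, though a production harness would also need to handle outputs that restate the option text without a letter.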