[2603.00221] A medical coding language model trained on clinical narratives from a population-wide cohort of 1.8 million patients
Computer Science > Machine Learning
arXiv:2603.00221 (cs)
[Submitted on 27 Feb 2026]

Title: A medical coding language model trained on clinical narratives from a population-wide cohort of 1.8 million patients
Authors: Joakim Edin, Sedrah Butt Balaganeshan, Annike Kjølby Kristensen, Lars Maaløe, Ioannis Louloudis, Søren Brunak

Abstract: Medical coding translates clinical documentation into standardized codes for billing, research, and public health, but manual coding is time-consuming and error-prone. Existing automation efforts rely on small datasets that poorly represent real-world patient heterogeneity. We trained a language model on 5.8 million electronic health records from 1.8 million patients across nearly all specialties in Eastern Denmark (2006--2016) to predict ICD-10 codes from clinical notes, medications, and laboratory results. Evaluated on 270,000 held-out patients, the model achieved a micro F1 of 71.8% and a top-10 recall of 95.5%. Performance varied by specialty (F1: 53--91%), with higher scores in specialties with well-defined diagnostic criteria. Codes appearing predominantly as secondary diagnoses had markedly lower F1 scores. For three such codes (suicide-related behaviors, weight disorders, and hypertension), the model identified thousands of uncoded cases, of which...
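The abstract reports micro F1 and top-10 recall over a large multi-label ICD-10 code space. As an illustrative sketch only (not the authors' evaluation code; function and variable names are hypothetical), the snippet below shows how these two metrics are typically computed from binary label matrices and per-code model scores.

```python
import numpy as np

def micro_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Micro-averaged F1: pool true/false positives and false negatives across all codes."""
    tp = np.logical_and(y_true == 1, y_pred == 1).sum()
    fp = np.logical_and(y_true == 0, y_pred == 1).sum()
    fn = np.logical_and(y_true == 1, y_pred == 0).sum()
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def top_k_recall(y_true: np.ndarray, scores: np.ndarray, k: int = 10) -> float:
    """Fraction of true codes that appear among each record's k highest-scoring codes."""
    top_k = np.argsort(-scores, axis=1)[:, :k]  # indices of the k largest scores per record
    hits, total = 0, 0
    for row_true, row_top in zip(y_true, top_k):
        true_codes = set(np.flatnonzero(row_true))
        hits += len(true_codes & set(row_top))
        total += len(true_codes)
    return hits / total if total else 0.0

# Toy example: 3 records, 6 possible codes (values are hypothetical).
y_true = np.array([[1, 0, 1, 0, 0, 0],
                   [0, 1, 0, 0, 1, 0],
                   [0, 0, 0, 1, 0, 0]])
scores = np.random.rand(3, 6)          # per-code model scores
y_pred = (scores >= 0.5).astype(int)   # threshold at 0.5 for the F1 example

print(f"micro F1:     {micro_f1(y_true, y_pred):.3f}")
print(f"top-3 recall: {top_k_recall(y_true, scores, k=3):.3f}")
```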