[2602.21374] Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages
Summary
This study evaluates small language models for extracting clinical information from transcripts in a low-resource language (Persian). It uses a privacy-preserving two-step pipeline: translation into English, followed by few-shot feature extraction with open-source models.
Why It Matters
Extracting clinical data from text in low-resource languages remains a significant challenge for healthcare natural language processing. By showing that small open-source models can extract clinical features from Persian transcripts without fine-tuning, the study supports healthcare analytics in multilingual settings, which is essential for equitable healthcare delivery.
Key Takeaways
- The study evaluates a two-step pipeline: Aya-expanse-8B translates Persian transcripts into English, then one of five open-source small language models extracts 13 binary clinical features.
- Larger models consistently outperform smaller ones in extracting clinical features, highlighting the importance of model scale.
- Translating transcripts from Persian to English enhances sensitivity and reduces missing outputs, despite some trade-offs in precision.
- Reliable extraction of physiological symptoms was achieved, but challenges remain for psychological complaints and complex features.
- The research provides a practical framework for deploying language models in low-resource healthcare environments.
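The two-step pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `translate` and `generate` are hypothetical callables standing in for a locally hosted translation model and a locally hosted small language model, the prompt wording is invented, and the feature names in the usage note are placeholders (the source does not list the 13 features).

```python
from typing import Callable, Optional


def build_prompt(feature: str, examples: list[tuple[str, str]], transcript: str) -> str:
    """Assemble a few-shot prompt for one binary feature (wording is an assumption)."""
    lines = [
        f"Does the transcript mention the clinical feature '{feature}'? Answer yes or no.",
        "",
    ]
    # Each worked example is a (transcript, answer) pair shown before the real query.
    for text, label in examples:
        lines += [f"Transcript: {text}", f"Answer: {label}", ""]
    lines += [f"Transcript: {transcript}", "Answer:"]
    return "\n".join(lines)


def parse_binary(output: str) -> Optional[int]:
    """Map model text to 1 (present), 0 (absent), or None (missing output)."""
    words = output.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,:;!\"'")
    if first == "yes":
        return 1
    if first == "no":
        return 0
    return None  # unparseable answers are counted as missing, not guessed


def extract_features(
    transcript_fa: str,
    features: list[str],
    examples: dict[str, list[tuple[str, str]]],
    translate: Callable[[str], str],   # hypothetical: Persian -> English
    generate: Callable[[str], str],    # hypothetical: prompt -> model text
) -> dict[str, Optional[int]]:
    """Step 1: translate once. Step 2: query the model once per feature."""
    transcript_en = translate(transcript_fa)
    results: dict[str, Optional[int]] = {}
    for feature in features:
        prompt = build_prompt(feature, examples.get(feature, []), transcript_en)
        results[feature] = parse_binary(generate(prompt))
    return results
```

Calling `extract_features(text, ["pain", "anxiety"], shots, translate, generate)` would return one entry per feature. Keeping `None` distinct from `0` makes it possible to count missing outputs separately, which the takeaways report as one of the effects of translation.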
Computer Science > Computation and Language
arXiv:2602.21374 (cs)
[Submitted on 24 Feb 2026]
Title: Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages
Authors: Mohammadreza Ghaffarzadeh-Esfahani, Nahid Yousefian, Ebrahim Heidari-Farsani, Ali Akbar Omidvarian, Sepehr Ghahraei, Atena Farangi, AmirBahador Boroumand
Abstract: Extracting clinical information from medical transcripts in low-resource languages remains a significant challenge in healthcare natural language processing (NLP). This study evaluates a two-step pipeline combining Aya-expanse-8B as a Persian-to-English translation model with five open-source small language models (SLMs) (Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, Llama-3.2-3B-Instruct, Qwen2.5-1.5B-Instruct, and Gemma-3-1B-it) for binary extraction of 13 clinical features from 1,221 anonymized Persian transcripts collected at a cancer palliative care call center. Using a few-shot prompting strategy without fine-tuning, models were assessed on macro-averaged F1-score, Matthews Correlation Coefficient (MCC), sensitivity, and specificity to account for class imbalance. Qwen2.5-7B-Instruct achieved the highest overall performance (median macro-F1: 0.899; MCC: 0.797), while Gemma-3-1B-it ...
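The four metrics named in the abstract can all be derived from a binary confusion matrix. The sketch below is illustrative only: it assumes macro-F1 is the unweighted mean of the F1-scores for the "present" and "absent" classes of a single feature, which the source does not spell out, and the function names are hypothetical.

```python
from math import sqrt


def confusion_counts(y_true: list[int], y_pred: list[int]) -> tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for labels coded 1 = present, 0 = absent."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def safe_div(num: float, den: float) -> float:
    """Return 0.0 when the denominator is zero (a common convention, assumed here)."""
    return num / den if den else 0.0


def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    # F1 computed once per class, then averaged without class weights.
    f1_present = safe_div(2 * tp, 2 * tp + fp + fn)
    f1_absent = safe_div(2 * tn, 2 * tn + fp + fn)
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": safe_div(tp, tp + fn),  # recall on the "present" class
        "specificity": safe_div(tn, tn + fp),  # recall on the "absent" class
        "macro_f1": (f1_present + f1_absent) / 2,
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }
```

Macro-F1 and MCC both penalize a model that ignores the rare class, which is why they suit imbalanced features better than plain accuracy. For example, a model that always predicts "absent" scores an MCC of 0 regardless of how common "absent" is.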