
[2602.21374] Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages

arXiv - Machine Learning, February 26, 2026

Summary

This study evaluates small language models for privacy-preserving clinical information extraction from Persian medical transcripts. It uses a two-step pipeline: Persian-to-English translation, followed by few-shot extraction with open-source small language models.

Why It Matters

As healthcare increasingly relies on natural language processing, this research addresses the challenge of extracting clinical data from text written in low-resource languages. By evaluating privacy-preserving extraction with small open-source models, it supports healthcare analytics in multilingual settings, which matters for equitable healthcare delivery.

Key Takeaways

The study evaluates a two-step pipeline combining translation and small language models for clinical information extraction.
Larger models consistently outperform smaller ones in extracting clinical features, highlighting the importance of model scale.
Translating transcripts from Persian to English enhances sensitivity and reduces missing outputs, despite some trade-offs in precision.
Reliable extraction of physiological symptoms was achieved, but challenges remain for psychological complaints and complex features.
The research provides a practical framework for deploying language models in low-resource healthcare environments.
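The two-step pipeline in the takeaways above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `translate` and `generate` are hypothetical callables standing in for the translation model and the small language model, the few-shot examples and the feature name are invented placeholders, and the prompt wording is an assumption. Unparseable model replies are returned as `None`, which corresponds to the "missing outputs" mentioned above.

```python
from typing import Callable, Optional

# Invented placeholder examples; the paper's actual few-shot examples are not given here.
FEW_SHOT_EXAMPLES = [
    ("Patient reports severe nausea since yesterday.", "yes"),
    ("Caller asks about the clinic's opening hours.", "no"),
]


def build_prompt(feature: str, transcript: str) -> str:
    """Assemble a few-shot yes/no prompt for one clinical feature (assumed wording)."""
    lines = [f"Does the transcript mention '{feature}'? Answer yes or no.", ""]
    for text, label in FEW_SHOT_EXAMPLES:
        lines += [f"Transcript: {text}", f"Answer: {label}", ""]
    lines += [f"Transcript: {transcript}", "Answer:"]
    return "\n".join(lines)


def parse_answer(raw: str) -> Optional[int]:
    """Map a model reply to 1 (present), 0 (absent), or None (missing output)."""
    words = raw.strip().lower().split()
    first = words[0].strip(".,!:;\"'") if words else ""
    if first == "yes":
        return 1
    if first == "no":
        return 0
    return None


def extract_feature(
    transcript_fa: str,
    feature: str,
    translate: Callable[[str], str],  # hypothetical: Persian text -> English text
    generate: Callable[[str], str],   # hypothetical: prompt -> model reply
) -> Optional[int]:
    """Step 1: translate the transcript. Step 2: few-shot binary extraction."""
    english = translate(transcript_fa)
    return parse_answer(generate(build_prompt(feature, english)))
```

In practice both callables would wrap locally hosted models, so no transcript has to leave the deployment environment; that reading of "privacy-preserving" is an assumption, not something stated in the summary.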

Computer Science > Computation and Language
arXiv:2602.21374 (cs) [Submitted on 24 Feb 2026]

Title: Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages

Authors: Mohammadreza Ghaffarzadeh-Esfahani, Nahid Yousefian, Ebrahim Heidari-Farsani, Ali Akbar Omidvarian, Sepehr Ghahraei, Atena Farangi, AmirBahador Boroumand

Abstract: Extracting clinical information from medical transcripts in low-resource languages remains a significant challenge in healthcare natural language processing (NLP). This study evaluates a two-step pipeline combining Aya-expanse-8B as a Persian-to-English translation model with five open-source small language models (SLMs) -- Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, Llama-3.2-3B-Instruct, Qwen2.5-1.5B-Instruct, and Gemma-3-1B-it -- for binary extraction of 13 clinical features from 1,221 anonymized Persian transcripts collected at a cancer palliative care call center. Using a few-shot prompting strategy without fine-tuning, models were assessed on macro-averaged F1-score, Matthews Correlation Coefficient (MCC), sensitivity, and specificity to account for class imbalance. Qwen2.5-7B-Instruct achieved the highest overall performance (median macro-F1: 0.899; MCC: 0.797), while Gemma-3-1B-it ...
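For readers unfamiliar with the metrics named in the abstract, the sketch below computes them for a single binary feature from gold labels and predictions. It is an illustration under assumptions, not the paper's evaluation code: macro-F1 is taken here as the unweighted mean of the F1 scores for the positive and negative classes, and undefined ratios (zero denominators) are reported as 0.0. The function names are made up for this example.

```python
from math import sqrt


def safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Sensitivity, specificity, macro-F1, and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # F1 per class, then the unweighted (macro) average over both classes.
    f1_pos = safe_div(2 * tp, 2 * tp + fp + fn)
    f1_neg = safe_div(2 * tn, 2 * tn + fn + fp)

    # MCC uses all four confusion-matrix cells, so it stays informative under class imbalance.
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    return {
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "macro_f1": (f1_pos + f1_neg) / 2,
        "mcc": safe_div(tp * tn - fp * fn, mcc_denominator),
    }


# Toy labels for illustration only; not data from the paper.
scores = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 1])
```

Accuracy alone can look high when a feature is rare simply because the model answers "absent" every time; sensitivity and MCC expose that failure, which is presumably why the abstract reports them alongside macro-F1.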


