[2601.03266] Benchmarking and Adapting On-Device LLMs for Clinical Decision Support
Computer Science > Computation and Language

arXiv:2601.03266 (cs)

[Submitted on 18 Dec 2025 (v1), last revised 27 Apr 2026 (this version, v2)]

Title: Benchmarking and Adapting On-Device LLMs for Clinical Decision Support

Authors: Alif Munim, Jun Ma, Omar Ibrahim, Alhusain Abdalla, Shuolin Yin, Leo Chen, Bo Wang

Abstract: Large language models (LLMs) have rapidly advanced in clinical decision-making, yet the deployment of proprietary systems is hindered by privacy concerns and reliance on cloud-based infrastructure. Open-source alternatives allow local inference but often have large model sizes that limit their use in resource-constrained clinical settings. Here, we benchmark on-device LLMs from the gpt-oss (20B, 120B), Qwen3.5 (9B, 27B, 35B), and Gemma 4 (31B) families across three representative clinical tasks: general disease diagnosis, specialty-specific (ophthalmology) diagnosis and management, and simulation of human expert grading and evaluation. We compare their performance with state-of-the-art proprietary models (GPT-5.1, GPT-5-mini, and Gemini 3.1 Pro) and a leading open-source model (DeepSeek-R1), and we further evaluate the adaptability of on-device systems by fine-tuning gpt-oss-20b and Qwen3.5-35B on general diagnostic data. Across tasks, on-device models achieve performance comparable to or exceedi...
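The abstract describes benchmarking models on a general disease-diagnosis task. As a minimal illustration only (not the paper's actual evaluation harness, whose metrics and answer format are not given here), a lenient exact-match accuracy scorer for predicted versus gold diagnoses might look like:

```python
# Illustrative sketch, not the authors' code: score a diagnostic
# benchmark by normalized exact-match accuracy. The normalization
# rule (lowercasing, stripping periods/whitespace) is an assumption.

def normalize(diagnosis: str) -> str:
    """Lowercase, drop periods, and collapse whitespace for lenient matching."""
    return " ".join(diagnosis.lower().replace(".", "").split())

def diagnostic_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of cases where the predicted diagnosis matches the gold label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be the same length")
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

if __name__ == "__main__":
    preds = ["Acute appendicitis", "migraine", "Type 2 diabetes"]
    golds = ["acute appendicitis", "Tension headache", "type 2 diabetes."]
    print(diagnostic_accuracy(preds, golds))  # → 0.6666666666666666
```

A real harness for the paper's tasks would also need specialty-specific rubrics and an LLM-as-grader setup for the expert-grading simulation, neither of which is specified in the abstract.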