[2602.14158] A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing
Summary
This article summarizes a paper that introduces a multi-agent framework for medical AI, one that strengthens clinical query processing by combining fine-tuned language models with evidence retrieval and safety checks.
Why It Matters
Clinical adoption of AI hinges on answers that clinicians can trust and verify. This framework targets key limitations of current medical AI systems, namely weak answer verification, poor evidence grounding, and unchecked bias, making it a valuable contribution to the field. By improving the reliability of AI-generated answers, it can support better decision-making in clinical settings.
Key Takeaways
- The framework combines multiple LLMs to improve clinical query processing.
- Fine-tuning on MedQuAD data enhances the quality of medical QA.
- Incorporates evidence retrieval and uncertainty estimation for reliable answers.
- Achieves 87% accuracy and reduces uncertainty through evidence augmentation.
- Includes safety mechanisms to detect bias and ensure factual consistency.
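The takeaways above cite ROUGE-1 and ROUGE-2 scores for benchmarking generation quality. As a quick illustration of what those numbers measure, here is a minimal sketch of ROUGE-N recall (the fraction of the reference's n-grams that also appear in the model's answer); this is a simplified stand-in, not the paper's evaluation code, which likely uses a standard metrics library.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """ROUGE-N recall: overlapping n-grams / n-grams in the reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(count, ref[g]) for g, count in cand.items() if g in ref)
    return overlap / sum(ref.values())

# Toy example: ROUGE-1 counts shared words, ROUGE-2 shared word pairs.
r1 = rouge_n_recall("the cat sat on the mat", "the cat is on the mat", 1)
r2 = rouge_n_recall("the cat sat on the mat", "the cat is on the mat", 2)
```

Higher-order ROUGE (like the ROUGE-2 figure reported for DeepSeek R1) is stricter than ROUGE-1 because it rewards preserved word order, which is why ROUGE-2 scores are typically much lower.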
Computer Science > Computation and Language
arXiv:2602.14158 (cs) [Submitted on 15 Feb 2026]
Title: A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing
Authors: Naeimeh Nourmohammadi, Md Meem Hossain, The Anh Han, Safina Showkat Ara, Zia Ush Shamszaman
Abstract: Large language models (LLMs) show promise for healthcare question answering, but clinical use is limited by weak verification, insufficient evidence grounding, and unreliable confidence signalling. We propose a multi-agent medical QA framework that combines complementary LLMs with evidence retrieval, uncertainty estimation, and bias checks to improve answer reliability. Our approach has two phases. First, we fine-tune three representative LLM families (GPT, LLaMA, and DeepSeek R1) on MedQuAD-derived medical QA data (20k+ question-answer pairs across multiple NIH domains) and benchmark generation quality. DeepSeek R1 achieves the strongest scores (ROUGE-1 0.536 ± 0.04; ROUGE-2 0.226 ± 0.03; BLEU 0.098 ± 0.018) and substantially outperforms the specialised biomedical baseline BioGPT in zero-shot evaluation. Second, we implement a modular multi-agent pipeline in which a C...
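The abstract describes a multi-agent pipeline combining complementary LLMs with uncertainty estimation. The abstract is truncated before the pipeline's details, so the following is only a hypothetical skeleton of one common pattern for such systems: query several agents, use pairwise answer agreement as a crude uncertainty proxy, and flag low-agreement answers for review. The agent callables, the Jaccard agreement measure, and the threshold are all illustrative assumptions, not the paper's method.

```python
def jaccard(a, b):
    """Word-overlap similarity between two answer strings (illustrative proxy)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def answer_with_uncertainty(query, agents, threshold=0.5):
    # Each agent is a callable query -> answer string; stand-ins for the
    # paper's fine-tuned GPT / LLaMA / DeepSeek R1 models (hypothetical).
    answers = [agent(query) for agent in agents]
    # Average pairwise agreement across all agent pairs.
    pairs = [(i, j) for i in range(len(answers)) for j in range(i + 1, len(answers))]
    agreement = sum(jaccard(answers[i], answers[j]) for i, j in pairs) / len(pairs)
    # Low agreement -> escalate for evidence retrieval or human review.
    return {
        "answers": answers,
        "agreement": round(agreement, 3),
        "needs_review": agreement < threshold,
    }

# Toy usage with stub agents standing in for real models.
agents = [
    lambda q: "aspirin reduces fever",
    lambda q: "aspirin lowers fever",
    lambda q: "aspirin reduces fever and pain",
]
result = answer_with_uncertainty("Does aspirin reduce fever?", agents)
```

The design point this sketch illustrates is that disagreement among complementary models is itself a usable confidence signal, which the paper pairs with evidence retrieval and bias checks for reliability.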