[2512.13742] DL$^3$M: A Vision-to-Language Framework for Expert-Level Medical Reasoning through Deep Learning and Large Language Models
Summary
The DL$^3$M framework integrates deep learning and large language models to enhance medical reasoning from images, addressing limitations in current AI models for clinical applications.
Why It Matters
This research highlights the potential of combining deep learning with language models to improve clinical reasoning in medical diagnostics. It underscores the importance of reliable AI in high-stakes medical environments, where accurate explanations are crucial for decision-making.
Key Takeaways
- DL$^3$M links image classification with structured clinical reasoning.
- MobileCoAtNet achieves high accuracy in classifying gastrointestinal diseases.
- Current LLMs struggle with stability and reliability in medical reasoning.
- Expert-verified benchmarks were created to evaluate LLM reasoning.
- The framework provides insights into the limitations of AI in medical contexts.
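The core idea in the takeaways, passing a classifier's prediction to a language model wrapped in a structured clinical prompt, can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the function name `build_reasoning_prompt`, the class label, and the prompt wording are all assumptions.

```python
def build_reasoning_prompt(label: str, confidence: float) -> str:
    """Wrap a classifier prediction in a structured clinical-reasoning prompt.

    Hypothetical sketch: the actual DL^3M prompting scheme is not specified here.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return (
        "You are assisting with endoscopic image interpretation.\n"
        f"Predicted finding: {label} (confidence: {confidence:.2f}).\n"
        "Explain step by step which visual features support this finding,\n"
        "list plausible differential diagnoses, and state what evidence\n"
        "would change the assessment."
    )

# Example with an illustrative label and confidence score.
prompt = build_reasoning_prompt("gastric ulcer", 0.91)
print(prompt)
```

The prompt text produced this way would then be sent to an LLM; the paper's contribution is evaluating how reliably such models reason over these predictions, which the authors found to be unstable.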
Computer Science > Computer Vision and Pattern Recognition
arXiv:2512.13742 (cs)
This paper has been withdrawn by Md. Najib Hasan.
[Submitted on 14 Dec 2025 (v1), last revised 22 Feb 2026 (this version, v2)]
Title: DL$^3$M: A Vision-to-Language Framework for Expert-Level Medical Reasoning through Deep Learning and Large Language Models
Authors: Md. Najib Hasan (1), Imran Ahmad (1), Sourav Basak Shuvo (2), Md. Mahadi Hasan Ankon (2), Sunanda Das (3), Nazmul Siddique (4), Hui Wang (5) ((1) Wichita State University, USA, (2) Khulna University of Engineering and Technology, Bangladesh, (3) University of Arkansas, USA, (4) Ulster University, UK, (5) Queen's University Belfast, UK)
Abstract: Medical image classifiers detect gastrointestinal diseases well, but they do not explain their decisions. Large language models can generate clinical text, yet they struggle with visual reasoning and often produce unstable or incorrect explanations. This leaves a gap between what a model sees and the type of reasoning a clinician expects. We introduce a framework that links image classification with structured clinical reasoning. A new hybrid model, MobileCoAtNet, is designed for endoscopic images and achieves high accuracy across eight stomach-related cl...