[2406.06512] Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset
About this article
Abstract page for arXiv paper 2406.06512: Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset
Computer Science > Computer Vision and Pattern Recognition arXiv:2406.06512 (cs) [Submitted on 10 Jun 2024 (v1), last revised 4 Mar 2026 (this version, v2)] Title:Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset Authors:Louis Blankemeier, Ashwin Kumar, Joseph Paul Cohen, Jiaming Liu, Longchao Liu, Dave Van Veen, Syed Jamal Safdar Gardezi, Hongkun Yu, Magdalini Paschali, Zhihong Chen, Jean-Benoit Delbrouck, Eduardo Reis, Robbie Holland, Cesar Truyts, Christian Bluethgen, Yufu Wu, Long Lian, Malte Engmann Kjeldskov Jensen, Sophie Ostmeier, Maya Varma, Jeya Maria Jose Valanarasu, Zhongnan Fang, Zepeng Huo, Zaid Nabulsi, Diego Ardila, Wei-Hung Weng, Edson Amaro Junior, Neera Ahuja, Jason Fries, Nigam H. Shah, Greg Zaharchuk, Marc Willis, Adam Yala, Andrew Johnston, Robert D. Boutin, Andrew Wentland, Curtis P. Langlotz, Jason Hom, Sergios Gatidis, Akshay S. Chaudhari View a PDF of the paper titled Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset, by Louis Blankemeier and 39 other authors View PDF HTML (experimental) Abstract:The large volume of abdominal computed tomography (CT) scans coupled with the shortage of radiologists have intensified the need for automated medical image analysis tools. Previous state-of-the-art approaches for automated analysis leverage vision-language models (VLMs) that jointly model images and radiology reports. However, current medical VLMs are generally limited to 2D images and short reports. Here...