[2604.04133] Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks
Computer Science > Computer Vision and Pattern Recognition
arXiv:2604.04133 (cs)
[Submitted on 5 Apr 2026]

Title: Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks
Authors: Rubén Moreno-Aguado, Alba Magallón, Victor Moreno, Yingying Fang, Guang Yang

Abstract: There is substantial interest in developing artificial intelligence systems to support radiologists across tasks ranging from segmentation to report generation. Existing computed tomography (CT) foundation models have largely focused on building generalist vision-language systems capable of tasks such as question answering and report generation. However, training reliable vision-language systems requires paired image-text data at a scale that remains unavailable in CT. Moreover, adapting the underlying visual representations to downstream tasks typically requires partial or full backbone fine-tuning, a computationally demanding process inaccessible to many research groups. Instead, foundation models should prioritise learning robust visual representations that enable efficient transfer to new tasks with minimal labelled data and without backbone fine-tuning. We present VoxelFM, a 3D CT foundation model trained with self-distillation using the DINO fr...
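The abstract describes training with DINO-style self-distillation. As background, the core of that recipe is a student network matched against an exponential-moving-average (EMA) teacher via a cross-entropy loss on temperature-sharpened, centered teacher outputs. The sketch below illustrates that loss and the EMA update in NumPy; it is a minimal illustration of the general DINO objective, not the paper's actual implementation, and all function names, temperatures, and the momentum value are illustrative assumptions.

```python
import numpy as np

def softmax(x, temp):
    """Temperature-scaled softmax along the last axis (numerically stable)."""
    z = (x - x.max(axis=-1, keepdims=True)) / temp
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between sharpened/centered teacher targets and student outputs.

    In real DINO training the teacher branch carries no gradient; centering and
    a low teacher temperature prevent collapse to a trivial solution.
    Illustrative temperatures, not the paper's settings.
    """
    p_teacher = softmax(teacher_logits - center, t_teacher)   # targets (stop-grad)
    log_p_student = np.log(softmax(student_logits, t_student))
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

def ema_update(teacher_w, student_w, momentum=0.996):
    """Teacher weights track the student as an exponential moving average."""
    return momentum * teacher_w + (1.0 - momentum) * student_w

# Toy usage: two augmented views would normally feed the two branches.
rng = np.random.default_rng(0)
student_out = rng.normal(size=(8, 32))   # batch of 8, 32-dim projection head
teacher_out = rng.normal(size=(8, 32))
center = teacher_out.mean(axis=0)        # running center in real training
loss = dino_loss(student_out, teacher_out, center)
```

The momentum value of 0.996 mirrors the slow teacher update typical of self-distillation schedules; because the teacher only ever moves a small step toward the student, its targets change smoothly across iterations.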