[2602.16320] RefineFormer3D: Efficient 3D Medical Image Segmentation via Adaptive Multi-Scale Transformer with Cross Attention Fusion
Summary
RefineFormer3D is a lightweight hierarchical transformer architecture for 3D medical image segmentation. It reports average Dice scores of 93.44% on ACDC and 85.9% on BraTS with only 2.94M parameters, substantially fewer than contemporary transformer-based methods.
Why It Matters
Accurate and efficient 3D medical image segmentation remains a critical challenge in clinical workflows. Transformer-based architectures model global context well, but their parameter counts and memory demands restrict clinical deployment. By balancing accuracy against computational cost, RefineFormer3D makes deployment in resource-constrained medical settings more feasible.
Key Takeaways
- RefineFormer3D achieves 93.44% and 85.9% average Dice scores on ACDC and BraTS benchmarks, respectively.
- The model has only 2.94M parameters, substantially fewer than contemporary transformer-based methods.
- Fast inference time of 8.35 ms per volume on GPU supports its use in clinical environments.
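- Key components are a GhostConv3D-based patch embedding, a MixFFN3D module with low-rank projections and depthwise convolutions, and a cross-attention fusion decoder for adaptive multi-scale skip connection integration.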
- The architecture is designed for practical deployment in resource-limited settings.
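For readers unfamiliar with the metric behind the reported scores, the Dice coefficient for one class is 2|P ∩ T| / (|P| + |T|), where P and T are the predicted and ground-truth voxel sets. The sketch below is illustrative only: the function names `dice_score` and `mean_dice` are hypothetical, the volumes are flattened to plain label sequences, and the empty-class convention (score 1.0 when a class is absent from both masks) is an assumption. The paper's exact evaluation protocol is not described in this summary.

```python
def dice_score(pred, target, label):
    """Dice coefficient for one class over two flattened label sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of voxels")
    pred_mask = [v == label for v in pred]
    target_mask = [v == label for v in target]
    intersection = sum(p and t for p, t in zip(pred_mask, target_mask))
    denominator = sum(pred_mask) + sum(target_mask)
    if denominator == 0:
        # Assumed convention: class absent from both masks counts as a perfect match.
        return 1.0
    return 2.0 * intersection / denominator


def mean_dice(pred, target, labels):
    """Average the per-class Dice over the given foreground labels."""
    return sum(dice_score(pred, target, label) for label in labels) / len(labels)


# Toy example: six voxels, background 0 and two foreground classes.
pred = [0, 1, 1, 2, 2, 0]
target = [0, 1, 2, 2, 2, 0]
print(dice_score(pred, target, 1))   # class 1: 2*1 / (2+1)
print(dice_score(pred, target, 2))   # class 2: 2*2 / (2+3)
print(mean_dice(pred, target, [1, 2]))
```

An "average Dice" figure such as those above is typically this per-class score averaged over the foreground classes and test cases.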
arXiv:2602.16320 (eess), submitted on 18 Feb 2026
Subject: Electrical Engineering and Systems Science > Image and Video Processing
Authors: Kavyansh Tyagi, Vishwas Rathi, Puneet Goyal
Abstract
Accurate and computationally efficient 3D medical image segmentation remains a critical challenge in clinical workflows. Transformer-based architectures often demonstrate superior global contextual modeling but at the expense of excessive parameter counts and memory demands, restricting their clinical deployment. We propose RefineFormer3D, a lightweight hierarchical transformer architecture that balances segmentation accuracy and computational efficiency for volumetric medical imaging. The architecture integrates three key components: (i) GhostConv3D-based patch embedding for efficient feature extraction with minimal redundancy, (ii) MixFFN3D module with low-rank projections and depthwise convolutions for parameter-efficient feature extraction, and (iii) a cross-attention fusion decoder enabling adaptive multi-scale skip connection integration. RefineFormer3D contains only 2.94M parameters, substantially fewer than contemporary transformer-based methods. Extensive experim...
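To give a rough sense of why ghost-style convolutions and low-rank projections reduce parameter counts, the following sketch does the arithmetic by hand. It rests on assumptions that go beyond this page: that GhostConv3D follows the general ghost-module pattern (a dense convolution producing a fraction of the output channels, with the rest generated by cheap depthwise convolutions), and that a low-rank projection factorizes one dense weight matrix into two thin ones. All function names, channel widths, the `ratio`, and the `rank` are hypothetical and are not taken from the paper.

```python
def conv3d_params(c_in, c_out, k, groups=1):
    """Weight count of a bias-free 3D convolution with a k*k*k kernel."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    return c_out * (c_in // groups) * k ** 3


def ghost_conv3d_params(c_in, c_out, k=3, ratio=2, cheap_k=3):
    """Weight count of an assumed ghost-style block.

    A dense conv makes c_out / ratio 'intrinsic' channels; a depthwise conv
    (groups == intrinsic channels) generates the remaining channels.
    """
    if c_out % ratio:
        raise ValueError("c_out must be divisible by ratio")
    intrinsic = c_out // ratio
    primary = conv3d_params(c_in, intrinsic, k)
    cheap = conv3d_params(intrinsic, intrinsic * (ratio - 1), cheap_k, groups=intrinsic)
    return primary + cheap


def low_rank_linear_params(d_in, d_out, rank):
    """Weight count of a d_in -> rank -> d_out factorized projection (no biases)."""
    return d_in * rank + rank * d_out


# Hypothetical layer sizes, chosen only to illustrate the ratios.
print(conv3d_params(32, 64, 3))               # 55296
print(ghost_conv3d_params(32, 64))            # 28512
print(256 * 1024)                             # dense projection: 262144
print(low_rank_linear_params(256, 1024, 32))  # 40960
```

With these illustrative sizes the ghost-style block uses roughly half the weights of the dense convolution, and the factorized projection roughly one sixth of the dense one. The actual savings in RefineFormer3D depend on its real layer configuration.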