[2602.14615] VariViT: A Vision Transformer for Variable Image Sizes

arXiv - AI · 4 min read

Summary

The paper introduces VariViT, a Vision Transformer designed to effectively handle variable image sizes, improving feature representation in medical imaging tasks.

Why It Matters

VariViT addresses significant challenges in medical imaging, where fixed-size input constraints can lead to information loss and inefficiencies. By allowing variable image sizes, it enhances diagnostic capabilities and computational efficiency, making it relevant for researchers and practitioners in computer vision and healthcare.

Key Takeaways

  • VariViT improves representation learning by accommodating variable image sizes.
  • The model employs a novel positional embedding resizing scheme for better feature extraction.
  • A new batching strategy reduces computational complexity, enhancing training and inference speed.
  • VariViT outperforms traditional ViTs and ResNet in glioma genotype prediction and brain tumor classification.
  • VariViT reports strong F1-scores on these tasks, demonstrating its effectiveness in medical imaging applications.
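The summary above mentions a new batching strategy for variable-sized inputs but does not detail it. A common way to batch variable-sized images without padding is to bucket samples by shape so every batch is shape-uniform; the sketch below illustrates that general idea (the function name and the 3-D size tuples are illustrative assumptions, not taken from the paper):

```python
from collections import defaultdict

def bucket_batches(sizes, batch_size):
    """Group sample indices by image size so each batch is shape-uniform.

    sizes: list of (H, W, D) tuples, one per sample (hypothetical 3-D crops).
    Returns a list of index lists; every batch holds samples of one size,
    so no padding is needed and attention cost matches the actual token count.
    """
    buckets = defaultdict(list)
    for idx, size in enumerate(sizes):
        buckets[size].append(idx)  # same-shape samples share a bucket
    batches = []
    for indices in buckets.values():
        # split each bucket into batches of at most batch_size samples
        for i in range(0, len(indices), batch_size):
            batches.append(indices[i:i + batch_size])
    return batches

# Example: three small crops and one larger crop
sizes = [(32, 32, 32), (48, 48, 32), (32, 32, 32), (32, 32, 32)]
print(bucket_batches(sizes, batch_size=2))  # → [[0, 2], [3], [1]]
```

Because each batch contains only one spatial size, the token sequence length is constant within a batch, which is one way the computational overhead of variable sizes can be kept low.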

Computer Science > Computer Vision and Pattern Recognition

arXiv:2602.14615 (cs) · Submitted on 16 Feb 2026

Title: VariViT: A Vision Transformer for Variable Image Sizes

Authors: Aswathi Varma, Suprosanna Shit, Chinmay Prabhakar, Daniel Scholz, Hongwei Bran Li, Bjoern Menze, Daniel Rueckert, Benedikt Wiestler

Abstract: Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning, leveraging self-attention mechanisms to excel in various tasks. ViTs split images into fixed-size patches, constraining them to a predefined size and necessitating pre-processing steps like resizing, padding, or cropping. This poses challenges in medical imaging, particularly with irregularly shaped structures like tumors: a fixed bounding-box crop size produces input images with highly variable foreground-to-background ratios, and resizing medical images can degrade information and introduce artefacts, impacting diagnosis. Hence, tailoring variable-sized crops to regions of interest can enhance feature representation capabilities. Moreover, large images are computationally expensive, while smaller sizes risk information loss, presenting a computation-accuracy tradeoff. We propose VariViT, an improved ViT model crafted to handle variable image sizes while maintaining a consistent patch size. VariViT employs a nove...
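The abstract is truncated before it describes the positional-embedding resizing scheme. A standard approach in ViTs (the paper's actual scheme may differ) is to interpolate the learned positional-embedding grid to match the token grid of each input size. A dependency-free bilinear sketch, assuming embeddings stored as a 2-D grid of vectors:

```python
def resize_pos_embed(grid, new_h, new_w):
    """Bilinearly resize a 2-D grid of positional embeddings.

    grid: nested lists of shape (H, W, dim) holding the learned embeddings.
    Returns a (new_h, new_w, dim) grid aligned at the corners, so inputs
    with a different number of patches still get a position per token.
    """
    old_h, old_w = len(grid), len(grid[0])
    dim = len(grid[0][0])
    out = []
    for i in range(new_h):
        # map the output row back into the source grid (align corners)
        y = i * (old_h - 1) / max(new_h - 1, 1)
        y0, y1 = int(y), min(int(y) + 1, old_h - 1)
        wy = y - y0
        row = []
        for j in range(new_w):
            x = j * (old_w - 1) / max(new_w - 1, 1)
            x0, x1 = int(x), min(int(x) + 1, old_w - 1)
            wx = x - x0
            # blend the four surrounding embedding vectors channel-wise
            vec = [
                (1 - wy) * ((1 - wx) * grid[y0][x0][d] + wx * grid[y0][x1][d])
                + wy * ((1 - wx) * grid[y1][x0][d] + wx * grid[y1][x1][d])
                for d in range(dim)
            ]
            row.append(vec)
        out.append(row)
    return out

# A 2x2 grid of 1-D embeddings resized to 3x3: corners are preserved
# and the new center is the average of all four.
resized = resize_pos_embed([[[0.0], [1.0]], [[2.0], [3.0]]], 3, 3)
print(resized[1][1])  # → [1.5]
```

In practice a deep-learning framework's built-in bicubic or bilinear interpolation would be used instead; the point is that one set of learned embeddings can serve every input size.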

