[2503.04121] Simple Self Organizing Map with Vision Transformers
Summary
This paper explores the integration of Self-Organizing Maps (SOMs) with Vision Transformers (ViTs) to enhance performance on small datasets, demonstrating their synergistic capabilities in both unsupervised and supervised tasks.
Why It Matters
The study addresses a significant gap in the application of ViTs, which often struggle on smaller datasets because they lack the inductive biases of architectures such as CNNs. By combining SOMs with ViTs, the research proposes a novel approach that could improve machine learning outcomes in scenarios where training data is limited, with implications across computer vision and artificial intelligence.
Key Takeaways
- ViTs can underperform on small datasets due to limited inductive biases.
- SOMs provide a structured approach to preserve topology and spatial organization.
- Combining SOMs with ViTs can enhance performance in both unsupervised and supervised tasks.
- The research fills a critical gap in existing literature regarding the integration of these architectures.
- Code for the proposed methods is publicly available, promoting further exploration.
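For readers unfamiliar with SOMs, the topology preservation mentioned in the takeaways comes from the classic Kohonen update rule: each training sample pulls not only its best-matching grid node but also that node's grid neighbors, so nearby nodes end up with similar weights and the grid forms a spatially organized map of the data. Below is a minimal NumPy sketch of this standard rule, not the paper's specific ViT-SOM architecture; the function name and hyperparameters are illustrative.

```python
import numpy as np

def train_som(data, grid=(5, 5), n_steps=500, lr0=0.5, sigma0=2.0, seed=0):
    """Train a classic 2-D Kohonen SOM on `data` of shape (n_samples, dim).

    Returns the node weights, shape (grid[0], grid[1], dim).
    """
    rng = np.random.default_rng(seed)
    h, w = grid
    dim = data.shape[1]
    weights = rng.random((h, w, dim))
    # Grid coordinates of each node, used for neighborhood distances.
    coords = np.stack(
        np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1
    )
    for t in range(n_steps):
        x = data[rng.integers(len(data))]
        # Best-matching unit (BMU): node whose weight vector is closest to x.
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Linearly decay learning rate and neighborhood radius over time.
        frac = t / n_steps
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 1e-3
        # Gaussian neighborhood on the *grid* is what preserves topology:
        # nodes near the BMU on the grid move toward x more than far nodes.
        grid_d2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
        influence = np.exp(-grid_d2 / (2.0 * sigma**2))
        weights += lr * influence[..., None] * (x - weights)
    return weights
```

Because each update is a convex blend of a node's current weight and the sample, the learned weights stay inside the data's range, and grid-adjacent nodes converge to nearby regions of input space.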
Paper Details
Computer Science > Computer Vision and Pattern Recognition, arXiv:2503.04121 (cs)
Submitted on 6 Mar 2025 (v1); last revised 18 Feb 2026 (this version, v4)
Title: Simple Self Organizing Map with Vision Transformers
Authors: Alan Luo, Kaiwen Yuan
Abstract: Vision Transformers (ViTs) have demonstrated exceptional performance in various vision tasks. However, they tend to underperform on smaller datasets due to their inherent lack of inductive biases. Current approaches address this limitation implicitly, often by pairing ViTs with pretext tasks or by distilling knowledge from convolutional neural networks (CNNs) to strengthen the prior. In contrast, Self-Organizing Maps (SOMs), a widely adopted self-supervised framework, are inherently structured to preserve topology and spatial organization, making them a promising candidate to directly address the limitations of ViTs on limited or small training datasets. Despite this potential, equipping SOMs with modern deep learning architectures remains largely unexplored. In this study, we conduct a novel exploration of how Vision Transformers (ViTs) and Self-Organizing Maps (SOMs) can empower each other, aiming to bridge this critical research gap. Our findings demonstrate that these architectures can synergistically enhance each other, leading to significantly improved performance in both unsupervised and supervised tasks.
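The abstract does not specify how a trained SOM grid would be consumed downstream, so the following is only an assumed illustration of the general idea, not the paper's pipeline: treat the SOM as a topological index over ViT embeddings by assigning each image's feature vector (e.g. a [CLS] token embedding) to its best-matching grid node. The function `bmu_assign` and all shapes here are hypothetical.

```python
import numpy as np

def bmu_assign(embeddings, som_weights):
    """Map each embedding to the grid coordinates of its best-matching node.

    embeddings:  (n, d) feature vectors (e.g. from a frozen ViT).
    som_weights: (h, w, d) trained SOM node weights.
    Returns an (n, 2) array of (row, col) grid coordinates.
    """
    h, w, d = som_weights.shape
    flat = som_weights.reshape(-1, d)
    # Distance from every embedding to every SOM node.
    dists = np.linalg.norm(embeddings[:, None, :] - flat[None, :, :], axis=-1)
    idx = np.argmin(dists, axis=1)
    # Convert flat node indices back to 2-D grid coordinates.
    return np.stack(np.unravel_index(idx, (h, w)), axis=-1)
```

Because the SOM preserves topology, images whose embeddings are similar land on nearby grid cells, which is what makes the map useful for visualization or as a structured clustering head.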