[2504.00037] ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models
Summary
The paper introduces ViT-Linearizer, a framework that distills knowledge from Vision Transformers (ViTs) into efficient linear-time models, addressing the challenges of quadratic complexity in high-resolution vision tasks.
Why It Matters
As computer vision models grow in scale, the quadratic complexity of self-attention in ViTs poses significant challenges for real-world deployment, especially at high resolution. ViT-Linearizer addresses this by enabling faster inference without sacrificing performance, making advanced vision models more practical to deploy.
Key Takeaways
- ViT-Linearizer distills knowledge from ViTs to linear-time models.
- The framework uses activation matching and masked prediction for effective distillation.
- It significantly improves inference speed for high-resolution tasks.
- Achieves competitive performance on ImageNet with 84.3% top-1 accuracy.
- Bridges theoretical efficiency with practical applications in large-scale visual tasks.
Paper Details
Computer Science > Computer Vision and Pattern Recognition
arXiv:2504.00037 (cs)
[Submitted on 30 Mar 2025 (v1), last revised 26 Feb 2026 (this version, v2)]
Title: ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models
Authors: Guoyizhe Wei, Rama Chellappa
Abstract: Vision Transformers (ViTs) have delivered remarkable progress through global self-attention, yet their quadratic complexity can become prohibitive for high-resolution inputs. In this work, we present ViT-Linearizer, a cross-architecture distillation framework that transfers rich ViT representations into a linear-time, recurrent-style model. Our approach leverages 1) activation matching, an intermediate constraint that encourages the student to align its token-wise dependencies with those produced by the teacher, and 2) masked prediction, a contextual reconstruction objective that requires the student to predict the teacher's representations for unseen (masked) tokens, to effectively distill the quadratic self-attention knowledge into the student while maintaining efficient complexity. Empirically, our method provides notable speedups particularly for high-resolution tasks, significantly addressing the hardware challenges in inference. Additionally, it also elevates Mamba-based architectures' performance on standard vision ...
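To make the two distillation objectives concrete, here is a minimal PyTorch sketch of how activation matching and masked prediction could be implemented as losses over token features. This is an illustrative interpretation, not the paper's code: the function names, the use of cosine-similarity Gram matrices as "token-wise dependencies," and the masking scheme are all assumptions.

```python
import torch
import torch.nn.functional as F

def activation_matching_loss(student_feats, teacher_feats):
    """Align the student's token-wise dependency structure with the teacher's.

    Hypothetical formulation: dependencies are modeled as token-token
    cosine-similarity (Gram) matrices over the feature dimension.
    """
    s = F.normalize(student_feats, dim=-1)        # (B, N, D)
    t = F.normalize(teacher_feats, dim=-1)        # (B, N, D)
    s_dep = s @ s.transpose(-2, -1)               # (B, N, N) student dependencies
    t_dep = t @ t.transpose(-2, -1)               # (B, N, N) teacher dependencies
    return F.mse_loss(s_dep, t_dep)

def masked_prediction_loss(student_pred, teacher_feats, mask):
    """Student predicts the teacher's representations at masked token positions.

    `mask` is 1.0 for masked (unseen) tokens; loss is averaged over them only.
    """
    per_token = ((student_pred - teacher_feats) ** 2).mean(dim=-1)  # (B, N)
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)

# Toy usage: batch of 2 images, 16 tokens, 64-dim features.
B, N, D = 2, 16, 64
student_feats = torch.randn(B, N, D)   # stand-in for linear-time student outputs
teacher_feats = torch.randn(B, N, D)   # stand-in for frozen ViT teacher outputs
mask = (torch.rand(B, N) < 0.5).float()  # 1 = masked token

loss = (activation_matching_loss(student_feats, teacher_feats)
        + masked_prediction_loss(student_feats, teacher_feats, mask))
```

In a full pipeline the combined loss would be backpropagated through the student only, with the teacher frozen; the relative weighting of the two terms is a tunable hyperparameter not specified here.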