[2504.00037] ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models
Summary
The paper introduces ViT-Linearizer, a framework that distills knowledge from Vision Transformers (ViTs) into efficient linear-time models, addressing the challenges of quadratic complexity in high-resolution vision tasks.
Why It Matters
As computer vision models grow in scale, the quadratic complexity of self-attention in ViTs poses significant challenges for real-world deployment, especially at high resolution. ViT-Linearizer addresses this by enabling faster inference without sacrificing performance, making advanced vision models more practical to deploy.
Key Takeaways
- ViT-Linearizer distills knowledge from ViTs to linear-time models.
- The framework uses activation matching and masked prediction for effective distillation.
- It significantly improves inference speed for high-resolution tasks.
- Achieves competitive performance on ImageNet with 84.3% top-1 accuracy.
- Bridges theoretical efficiency with practical applications in large-scale visual tasks.
Paper Details
Computer Science > Computer Vision and Pattern Recognition
arXiv:2504.00037 (cs)
[Submitted on 30 Mar 2025 (v1), last revised 26 Feb 2026 (this version, v2)]
Title: ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models
Authors: Guoyizhe Wei, Rama Chellappa
Abstract: Vision Transformers (ViTs) have delivered remarkable progress through global self-attention, yet their quadratic complexity can become prohibitive for high-resolution inputs. In this work, we present ViT-Linearizer, a cross-architecture distillation framework that transfers rich ViT representations into a linear-time, recurrent-style model. Our approach leverages 1) activation matching, an intermediate constraint that encourages the student to align its token-wise dependencies with those produced by the teacher, and 2) masked prediction, a contextual reconstruction objective that requires the student to predict the teacher's representations for unseen (masked) tokens, to effectively distill the quadratic self-attention knowledge into the student while maintaining efficient complexity. Empirically, our method provides notable speedups particularly for high-resolution tasks, significantly addressing the hardware challenges in inference. Additionally, it also elevates Mamba-based architectures' performance on standard vision ...
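To make the two distillation objectives concrete, here is a minimal PyTorch sketch of how activation matching and masked prediction could be implemented as losses over token features. This is an illustrative interpretation, not the paper's code: the function names, the use of cosine-similarity Gram matrices as "token-wise dependencies," and the masking scheme are all assumptions.

```python
import torch
import torch.nn.functional as F

def activation_matching_loss(student_feats, teacher_feats):
    """Align the student's token-wise dependency structure with the teacher's.

    Hypothetical formulation: dependencies are modeled as token-token
    cosine-similarity (Gram) matrices over the feature dimension.
    """
    s = F.normalize(student_feats, dim=-1)        # (B, N, D)
    t = F.normalize(teacher_feats, dim=-1)        # (B, N, D)
    s_dep = s @ s.transpose(-2, -1)               # (B, N, N) student dependencies
    t_dep = t @ t.transpose(-2, -1)               # (B, N, N) teacher dependencies
    return F.mse_loss(s_dep, t_dep)

def masked_prediction_loss(student_pred, teacher_feats, mask):
    """Student predicts the teacher's representations at masked token positions.

    `mask` is 1.0 for masked (unseen) tokens; loss is averaged over them only.
    """
    per_token = ((student_pred - teacher_feats) ** 2).mean(dim=-1)  # (B, N)
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)

# Toy usage: batch of 2 images, 16 tokens, 64-dim features.
B, N, D = 2, 16, 64
student_feats = torch.randn(B, N, D)   # stand-in for linear-time student outputs
teacher_feats = torch.randn(B, N, D)   # stand-in for frozen ViT teacher outputs
mask = (torch.rand(B, N) < 0.5).float()  # 1 = masked token

loss = (activation_matching_loss(student_feats, teacher_feats)
        + masked_prediction_loss(student_feats, teacher_feats, mask))
```

In a full pipeline the combined loss would be backpropagated through the student only, with the teacher frozen; the relative weighting of the two terms is a tunable hyperparameter not specified here.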