[2602.19982] A Computationally Efficient Multidimensional Vision Transformer
Summary
This paper presents a novel tensor-based framework for Vision Transformers, enhancing computational efficiency while maintaining competitive accuracy in computer vision tasks.
Why It Matters
As Vision Transformers are increasingly used in computer vision, their high computational and memory demands pose challenges for practical applications. This research addresses these limitations by introducing a more efficient architecture, potentially broadening the accessibility and deployment of advanced AI models in various fields.
Key Takeaways
- Introduces a tensor-based framework for Vision Transformers.
- Achieves a uniform 1/C reduction in parameter count, where C is the number of channels.
- Maintains competitive accuracy on standard benchmarks.
- Explores the algebraic properties of the tensor cosine product.
- Enhances attention mechanisms and structured feature representations.
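The paper's exact definition of the tensor cosine product is not reproduced here, but cosine-transform tensor products in the literature typically mirror the t-product: transform both tensors along the third mode with an orthonormal DCT, multiply matching frontal slices, and transform back. A minimal sketch under that assumption (the function name `c_product` and all shapes are illustrative, not the authors' code):

```python
import numpy as np
from scipy.fft import dct, idct

def c_product(A, B):
    """Sketch of a tensor cosine product, assuming it follows the
    DCT-based analog of the t-product: facewise multiplication in
    an orthonormal DCT domain along the third mode."""
    # Transform both tensors along the third mode with an orthonormal DCT
    A_hat = dct(A, axis=2, norm="ortho")
    B_hat = dct(B, axis=2, norm="ortho")
    # Multiply matching frontal slices: (m, p, n) x (p, q, n) -> (m, q, n)
    C_hat = np.einsum("ipk,pqk->iqk", A_hat, B_hat)
    # Return to the original domain with the inverse DCT
    return idct(C_hat, axis=2, norm="ortho")

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5, 3))
B = rng.standard_normal((5, 6, 3))
C = c_product(A, B)
print(C.shape)  # (4, 6, 3)
```

Because the DCT is orthonormal and the slicewise multiplications compose like ordinary matrix products, a product defined this way is associative, which is the kind of algebraic property the paper analyzes.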
Paper Details
Computer Science > Machine Learning
arXiv:2602.19982 (cs) [Submitted on 23 Feb 2026]
Title: A Computationally Efficient Multidimensional Vision Transformer
Authors: Alaa El Ichi, Khalide Jbilou
Abstract: Vision Transformers have achieved state-of-the-art performance in a wide range of computer vision tasks, but their practical deployment is limited by high computational and memory costs. In this paper, we introduce a novel tensor-based framework for Vision Transformers built upon the Tensor Cosine Product (C-product). By exploiting multilinear structures inherent in image data and the orthogonality of cosine transforms, the proposed approach enables efficient attention mechanisms and structured feature representations. We develop the theoretical foundations of the tensor cosine product, analyze its algebraic properties, and integrate it into a new C-product-based Vision Transformer architecture (TCP-ViT). Numerical experiments on standard classification and segmentation benchmarks demonstrate that the proposed method achieves a uniform 1/C parameter reduction (where C is the number of channels) while maintaining competitive accuracy.
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Cite as: arXiv:2602.19982 [cs.LG]
DOI: https://doi.org/10.48550/arXiv.2602.19...
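The abstract does not spell out where the 1/C saving comes from, but a common source in tensor-structured layers is replacing one dense d x d projection (with d = k * C) by a k x k matrix per channel face. A back-of-envelope illustration under that assumption (all parameter names and sizes here are hypothetical):

```python
# Hypothetical illustration of a 1/C parameter reduction, assuming the
# saving comes from swapping a dense d x d projection (d = k * C) for a
# tensor operator with one k x k slice per channel face.
def dense_params(d: int) -> int:
    # Standard ViT-style linear projection: a full d x d weight matrix
    return d * d

def tensor_params(k: int, C: int) -> int:
    # Tensor operator: one k x k slice for each of the C channel faces
    return k * k * C

k, C = 64, 3          # per-channel width and channel count (illustrative)
d = k * C             # flattened embedding width
print(dense_params(d) // tensor_params(k, C))  # prints 3, i.e. a factor of C
```

The ratio d*d / (k*k*C) simplifies to exactly C for any k, which matches the "uniform 1/C" phrasing: the factor does not depend on the layer width.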