[2510.25166] A Study on Inference Latency for Vision Transformers on Mobile Devices
Summary
This study quantitatively analyzes the inference latency of 190 vision transformers (ViTs) on mobile devices, compares them against 102 convolutional neural networks (CNNs), and identifies the architectural factors that drive ViT latency.
Why It Matters
As mobile devices increasingly run advanced machine learning workloads, understanding the on-device performance of vision transformers is crucial for optimizing real-world applications. This research offers concrete measurement data that can inform developers and researchers about the efficiency of ViTs relative to traditional CNNs, guiding future mobile AI deployments.
Key Takeaways
- The study evaluates the latency of 190 ViTs on mobile devices.
- It compares ViTs with 102 CNNs to highlight performance differences.
- A dataset of measured latencies for 1000 synthetic ViTs was built to enable accurate inference-latency prediction.
- Insights from this research can guide the design of more efficient mobile AI applications.
- Understanding latency factors is essential for optimizing real-world applications.
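The study's raw material is on-device latency measurements. The paper does not publish its measurement harness, but the standard methodology it implies (warm-up iterations before timing, then a robust statistic over many runs) can be sketched as follows; the function name and the stand-in workload are illustrative, and in practice `fn` would invoke a mobile inference framework's interpreter:

```python
import time

def measure_latency(fn, warmup=10, runs=50):
    """Median wall-clock latency of fn() in milliseconds.

    Warm-up iterations let caches, lazy initialization, and DVFS
    governors settle before timing begins; the median is robust
    to scheduling outliers.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    return samples[len(samples) // 2]

# Stand-in workload; on a real device fn would run one forward pass
# through a deployed model (e.g. via a mobile ML framework).
latency_ms = measure_latency(lambda: sum(i * i for i in range(10_000)))
print(f"median latency: {latency_ms:.3f} ms")
```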
Computer Science > Computer Vision and Pattern Recognition
arXiv:2510.25166 (cs)
[Submitted on 29 Oct 2025 (v1), last revised 18 Feb 2026 (this version, v2)]

Title: A Study on Inference Latency for Vision Transformers on Mobile Devices
Authors: Zhuojin Li, Marco Paolieri, Leana Golubchik

Abstract: Given the significant advances in machine learning techniques on mobile devices, particularly in the domain of computer vision, in this work we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we develop a dataset including measured latencies of 1000 synthetic ViTs with representative building blocks and state-of-the-art architectures from two machine learning frameworks and six mobile platforms. Using this dataset, we show that inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Performance (cs.PF)
Cite as: arXiv:2510.25166 [cs.CV] (or arXiv:2510.25166v2 [cs.CV] for this version) https://doi.org/10.48550/arXiv.25...
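The abstract's final claim is that latency for unseen ViTs can be predicted from a dataset of measured configurations. The paper's actual predictor and feature set are not reproduced here; the sketch below illustrates the general idea with a simple least-squares fit over hypothetical architectural features (depth, embedding dimension, heads, token count) and simulated latencies standing in for on-device measurements:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical architectural features for synthetic ViTs:
# [depth, embedding dim, attention heads, token count].
# These names and the linear model are illustrative only.
n = 200
X = np.column_stack([
    rng.integers(4, 25, n),      # encoder depth
    rng.integers(128, 1025, n),  # embedding dimension
    rng.integers(2, 17, n),      # attention heads
    rng.integers(49, 785, n),    # tokens (patches + class token)
]).astype(float)

# Simulated "measured" latencies (ms): a known linear ground truth
# plus noise, standing in for real on-device measurements.
true_w = np.array([1.5, 0.02, 0.3, 0.01])
y = X @ true_w + 5.0 + rng.normal(0.0, 1.0, n)

# Fit an affine model by least squares: latency ~ X @ w + b.
A = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict latency for a new, unseen configuration
# (shape loosely resembling a small ViT).
new_vit = np.array([12.0, 384.0, 6.0, 197.0, 1.0])
print(f"predicted latency: {new_vit @ coef:.1f} ms")
```

In practice, published latency predictors often use richer nonlinear models (e.g. gradient-boosted trees or per-operator lookup tables) because latency scales nonlinearly with some dimensions; the linear fit here only demonstrates the train-on-measurements, predict-for-new-architectures workflow.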