[2510.21081] Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution

arXiv - Machine Learning

Summary

This paper presents an approach that accelerates mobile inference through fine-grained CPU-GPU co-execution, addressing the two main obstacles to collaborative execution: synchronization overhead and the prediction of per-processor execution times.

Why It Matters

As mobile devices increasingly rely on deep learning models, optimizing their performance is crucial. This research proposes solutions to overcome significant barriers in mobile inference, potentially leading to faster and more efficient applications in real-world scenarios.

Key Takeaways

  • Proposes a lightweight synchronization mechanism using OpenCL SVM.
  • Introduces machine learning models to predict CPU-GPU execution times.
  • Achieves up to 1.89x speedup for linear layers and 1.75x for convolutional layers.
  • Evaluated on multiple mobile platforms, demonstrating practical applicability.
  • Addresses critical challenges in deploying deep neural networks on mobile devices.
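The takeaways above can be made concrete with a small sketch. Assuming (hypothetically) that per-row CPU and GPU latencies and a fixed synchronization cost have already been predicted, co-execution reduces to choosing the partition that minimizes the slower of the two sides. This is an illustrative sketch of the general idea, not the paper's actual algorithm, and all parameter names and numbers are assumptions:

```python
def best_split(total_rows, cpu_ms_per_row, gpu_ms_per_row, sync_ms):
    """Return (cpu_rows, latency_ms) minimizing the co-execution makespan.

    Co-execution latency = max(CPU part, GPU part) + synchronization cost;
    running everything on one processor avoids the sync cost entirely.
    """
    # Single-processor baselines (no synchronization needed).
    best_rows, best_ms = 0, total_rows * gpu_ms_per_row  # GPU-only
    cpu_only = total_rows * cpu_ms_per_row               # CPU-only
    if cpu_only < best_ms:
        best_rows, best_ms = total_rows, cpu_only
    # Try every split; both sides run in parallel, then pay the sync cost.
    for cpu_rows in range(1, total_rows):
        ms = max(cpu_rows * cpu_ms_per_row,
                 (total_rows - cpu_rows) * gpu_ms_per_row) + sync_ms
        if ms < best_ms:
            best_rows, best_ms = cpu_rows, ms
    return best_rows, best_ms


# With a cheap sync, splitting the work wins over GPU-only execution:
print(best_split(100, cpu_ms_per_row=0.2, gpu_ms_per_row=0.1, sync_ms=1.0))
# With an expensive sync, the search falls back to GPU-only execution:
print(best_split(100, cpu_ms_per_row=0.2, gpu_ms_per_row=0.1, sync_ms=5.0))
```

This also shows why a lightweight synchronization mechanism matters: the larger `sync_ms` is, the fewer layers benefit from splitting at all.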

Computer Science > Machine Learning · arXiv:2510.21081 (cs)

[Submitted on 24 Oct 2025 (v1), last revised 18 Feb 2026 (this version, v2)]

Title: Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution
Authors: Zhuojin Li, Marco Paolieri, Leana Golubchik

Abstract: Deploying deep neural networks on mobile devices is increasingly important but remains challenging due to limited computing resources. On the other hand, their unified memory architecture and narrower gap between CPU and GPU performance provide an opportunity to reduce inference latency by assigning tasks to both CPU and GPU. The main obstacles for such collaborative execution are the significant synchronization overhead required to combine partial results, and the difficulty of predicting execution times of tasks assigned to CPU and GPU (due to the dynamic selection of implementations and parallelism level). To overcome these obstacles, we propose both a lightweight synchronization mechanism based on OpenCL fine-grained shared virtual memory (SVM) and machine learning models to accurately predict execution times. Notably, these models capture the performance characteristics of GPU kernels and account for their dispatch times. A comprehensive evaluation on four mobile platforms shows that our approach can quickly select CPU-GPU co-execution...
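The abstract mentions machine-learning models that predict CPU and GPU execution times and account for GPU kernel dispatch times, but this summary gives no model details. As a hedged illustration of the general shape such a predictor might take, the sketch below fits kernel latency against workload size by ordinary least squares, with the intercept playing the role of a fixed dispatch cost. Everything here (function names, features, numbers) is an assumption for illustration, not the paper's model:

```python
def fit_latency_model(sizes, times_ms):
    """Fit times_ms ~ a * size + b by closed-form ordinary least squares.

    The slope a is a per-unit compute cost; the intercept b absorbs
    fixed overheads such as kernel dispatch time.
    """
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times_ms) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times_ms))
    var = sum((x - mean_x) ** 2 for x in sizes)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b


def predict(model, size):
    """Predict latency (ms) for a workload of the given size."""
    a, b = model
    return a * size + b


# Synthetic measurements: 0.5 ms per unit of work plus 2 ms of dispatch.
model = fit_latency_model([1, 2, 4, 8], [2.5, 3.0, 4.0, 6.0])
print(model)              # slope and intercept
print(predict(model, 16))  # extrapolated latency for a larger workload
```

A real predictor for this problem would need richer features (implementation choice, parallelism level, layer shape), since the abstract notes that execution times vary with exactly those factors.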
