[2508.08540] Biased Local SGD for Efficient Deep Learning on Heterogeneous Systems
Summary
This article summarizes a biased variant of local Stochastic Gradient Descent (SGD) for deep learning on heterogeneous systems, which achieves up to a 32x training speedup over synchronous SGD while maintaining comparable accuracy.
Why It Matters
As deep learning increasingly relies on diverse computing resources, optimizing training methods for heterogeneous systems is crucial. This research addresses the common challenge of synchronization overhead in parallel training, offering a solution that enhances efficiency and resource utilization.
Key Takeaways
- Introduces biased local SGD to improve parallel training efficiency.
- Demonstrates up to a 32x training speedup over synchronous SGD.
- Maintains comparable accuracy while utilizing slower CPUs alongside faster GPUs.
- Provides practical insights for optimizing diverse computing resources.
- Addresses a significant challenge in deep learning training methodologies.
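The core idea, per the abstract, is to deliberately bias local SGD so that slow CPUs and fast GPUs contribute on different terms. A minimal sketch of one communication round is below; the toy quadratic objective, the per-worker step counts, and the steps-taken weighting rule are all illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_sgd_round(global_w, worker_steps, lr=0.1):
    """One round of local SGD with (hypothetical) biased aggregation.

    Each worker copies the global model, runs as many local steps as its
    speed allows, and the server averages the results with weights biased
    toward workers that took more steps.
    """
    local_models, weights = [], []
    for steps in worker_steps:          # steps per round ~ device speed
        w = global_w.copy()
        for _ in range(steps):
            noise = rng.standard_normal(w.shape)
            grad = 2 * (w - 1.0) + 0.01 * noise  # grad of ||w - 1||^2 + noise
            w -= lr * grad
        local_models.append(w)
        weights.append(steps)           # bias: weight by work performed
    weights = np.array(weights, dtype=float)
    weights /= weights.sum()
    return sum(wt * m for wt, m in zip(weights, local_models))

w = np.zeros(4)
worker_steps = [8, 8, 1]  # e.g., two fast GPUs and one slow CPU
for _ in range(20):
    w = local_sgd_round(w, worker_steps)
print(w)  # converges near the optimum at 1.0
```

Unbiased local SGD would average all workers equally, letting the single-step CPU model drag the average back each round; weighting by steps taken is one simple way such a bias can keep the slow device useful without letting it dominate.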
Computer Science > Machine Learning
arXiv:2508.08540 (cs)
[Submitted on 12 Aug 2025 (v1), last revised 23 Feb 2026 (this version, v3)]
Title: Biased Local SGD for Efficient Deep Learning on Heterogeneous Systems
Authors: Jihyun Lim, Junhyuk Jo, Chanhyeok Ko, Young Min Go, Jimin Hwa, Sunwoo Lee
Abstract: Most parallel neural network training methods assume homogeneous computing resources. For example, synchronous data-parallel SGD suffers from significant synchronization overhead under heterogeneous workloads, often forcing practitioners to rely only on the fastest devices (e.g., GPUs). In this work, we study local SGD for efficient parallel training on heterogeneous systems. We show that intentionally introducing bias in data sampling and model aggregation can effectively harmonize slower CPUs with faster GPUs. Our extensive empirical results demonstrate that a carefully controlled bias significantly accelerates local SGD while achieving comparable or even higher accuracy than synchronous SGD under the same epoch budget. For instance, our method trains ResNet20 on CIFAR-10 with 2 CPUs and 8 GPUs up to 32x faster than synchronous SGD, with nearly identical accuracy. These results provide practical insights into how to flexibly utilize diverse compute resources for deep learning.
Subjects: Machine Learning (cs.LG)
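The abstract's synchronization-overhead argument can be made concrete with back-of-envelope arithmetic. The per-step timings below are hypothetical, and communication cost is ignored; the paper's 32x figure is an empirical measurement, not derived this way.

```python
# Hypothetical per-minibatch step times (seconds).
gpu_step, cpu_step = 0.01, 0.32   # assume the CPU is 32x slower per step
steps_per_epoch = 1000

# Synchronous SGD: every step waits for the slowest device,
# so the epoch time is set by the straggler.
sync_epoch = steps_per_epoch * max(gpu_step, cpu_step)

# Local SGD: devices run local steps independently between
# infrequent averaging rounds, so fast devices proceed at their own pace.
local_epoch = steps_per_epoch * gpu_step

speedup = sync_epoch / local_epoch
print(speedup)
```

This is why, as the abstract notes, practitioners under synchronous SGD often drop the slow devices entirely; biased local SGD instead keeps them contributing without paying the per-step synchronization penalty.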