[2412.07971] Effectiveness of Distributed Gradient Descent with Local Steps for Overparameterized Models
Computer Science > Machine Learning
arXiv:2412.07971 (cs)
[Submitted on 10 Dec 2024 (v1), last revised 21 Mar 2026 (this version, v2)]

Title: Effectiveness of Distributed Gradient Descent with Local Steps for Overparameterized Models
Authors: Heng Zhu, Harsh Vardhan, Arya Mazumdar

Abstract: In distributed training of machine learning models, gradient descent with local iterative steps, commonly known as Local (Stochastic) Gradient Descent (Local-(S)GD) or Federated Averaging (FedAvg), is a popular method for mitigating the communication burden. In this method, gradient steps based on local datasets are taken independently at distributed compute nodes to update the local models, which are then aggregated intermittently. In the interpolation regime, Local-GD can converge to zero training loss. However, with many potential solutions corresponding to zero training loss, it is not known which solution Local-GD converges to. In this work we answer this question by analyzing the implicit bias of Local-GD for classification tasks with linearly separable data. For the interpolation regime, our analysis shows that the aggregated global model obtained from Local-GD, with an arbitrary number of local steps, converges exactly to the model that would be obtained if all data were in one place (the centralized model) in ...
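The Local-GD/FedAvg procedure the abstract describes — each node takes several gradient steps on its local shard, then the server averages the local models — can be sketched as follows. This is a minimal illustrative toy, not the paper's exact setup: the data split, logistic loss, learning rate, and step counts are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic_grad(w, X, y):
    """Gradient of mean logistic loss, labels y in {-1, +1} (numerically stable)."""
    margins = y * (X @ w)
    s = -y * np.exp(-np.logaddexp(0.0, margins))  # -y * sigmoid(-margin)
    return (X.T @ s) / len(y)

# Linearly separable toy data, split across K workers (illustrative sizes).
K, n_per, d = 4, 20, 5
w_star = rng.normal(size=d)
X = rng.normal(size=(K * n_per, d))
y = np.sign(X @ w_star)                    # separable by construction
shards = [(X[i * n_per:(i + 1) * n_per],
           y[i * n_per:(i + 1) * n_per]) for i in range(K)]

w_global = np.zeros(d)
lr, local_steps, rounds = 0.5, 10, 50
for _ in range(rounds):
    local_models = []
    for Xi, yi in shards:
        w = w_global.copy()
        for _ in range(local_steps):       # H independent local gradient steps
            w -= lr * logistic_grad(w, Xi, yi)
        local_models.append(w)
    w_global = np.mean(local_models, axis=0)  # intermittent aggregation

train_acc = np.mean(np.sign(X @ w_global) == y)
```

On separable data the global model drives the training loss toward zero; the paper's result concerns *which* interpolating solution this averaged iterate approaches, namely the one centralized GD would find.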