[2602.20646] On the Convergence of Stochastic Gradient Descent with Perturbed Forward-Backward Passes
Summary
This paper analyzes the convergence of Stochastic Gradient Descent (SGD) when both the forward and backward passes are perturbed, providing convergence guarantees and experimental validation.
Why It Matters
Understanding how perturbations affect SGD is crucial for improving optimization techniques in machine learning, particularly in deep learning where gradient noise can lead to instability. This research offers a theoretical framework that can help practitioners better manage training dynamics.
Key Takeaways
- Perturbations in SGD can propagate and amplify through computational graphs, affecting convergence.
- The paper provides convergence guarantees for non-convex objectives and conditions for stability.
- Experimental results validate the theory, including its characterization of when training recovers from gradient spikes and when it diverges.
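A toy numerical sketch can make the cascading-perturbation picture concrete. The model below is illustrative only (scalar linear operators, Gaussian noise, and constants chosen by us), not the paper's construction: it injects noise into every intermediate forward output and every back-propagated gradient of a chain of $N$ operators, so backward noise injected early in the chain is multiplied by every remaining operator.

```python
import random

def perturbed_sgd_step(ws, x, target, lr, sigma_f, sigma_b, rng):
    """One SGD step on a chain of scalar linear operators
    f(x) = w_N * ... * w_1 * x with loss 0.5 * (f(x) - target)**2.
    Gaussian perturbations are injected into every intermediate
    forward output (sigma_f) and every back-propagated gradient
    (sigma_b) -- a toy stand-in for the paper's general setting."""
    # Forward pass: each intermediate output picks up a perturbation.
    zs = [x]
    for w in ws:
        zs.append(w * zs[-1] + rng.gauss(0.0, sigma_f))
    # Backward pass: the running gradient is perturbed at each operator,
    # so noise injected deep in the chain is multiplied by every
    # remaining w_i -- the geometric compounding effect.
    g = zs[-1] - target                          # dL/dz_N
    grads = [0.0] * len(ws)
    for i in reversed(range(len(ws))):
        grads[i] = g * zs[i]                     # dL/dw_i
        g = g * ws[i] + rng.gauss(0.0, sigma_b)  # perturbed dL/dz_i
    # Plain SGD update on every operator's parameter.
    return [w - lr * gw for w, gw in zip(ws, grads)]

rng = random.Random(0)
ws = [1.1] * 6                                   # N = 6 operators
for _ in range(200):
    ws = perturbed_sgd_step(ws, 1.0, 2.0, lr=0.01,
                            sigma_f=0.01, sigma_b=0.01, rng=rng)
```

With small `sigma_f` and `sigma_b` the product $w_1 \cdots w_N$ still drifts toward the target; increasing either noise scale, or the chain length, makes the compounded error visible, mirroring the claim that perturbations amplify with the number of operators.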
Mathematics > Optimization and Control — arXiv:2602.20646 (math)
[Submitted on 24 Feb 2026]
Title: On the Convergence of Stochastic Gradient Descent with Perturbed Forward-Backward Passes
Authors: Boao Kong, Hengrui Zhang, Kun Yuan
Abstract: We study stochastic gradient descent (SGD) for composite optimization problems with $N$ sequential operators subject to perturbations in both the forward and backward passes. Unlike classical analyses that treat gradient noise as additive and localized, perturbations to intermediate outputs and gradients cascade through the computational graph, compounding geometrically with the number of operators. We present the first comprehensive theoretical analysis of this setting. Specifically, we characterize how forward and backward perturbations propagate and amplify within a single gradient step, derive convergence guarantees for both general non-convex objectives and functions satisfying the Polyak--Łojasiewicz condition, and identify conditions under which perturbations do not deteriorate the asymptotic convergence order. As a byproduct, our analysis furnishes a theoretical explanation for the gradient spiking phenomenon widely observed in deep learning, precisely characterizing the conditions under which training recovers from spikes or diverges. Experiments on lo...
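For reference, the Polyak--Łojasiewicz (PL) condition named in the abstract is a standard inequality (this recap is general background, not specific to this paper): a differentiable function $f$ with minimum value $f^*$ satisfies the PL condition with constant $\mu > 0$ if

$$\frac{1}{2}\,\lVert \nabla f(x) \rVert^2 \;\ge\; \mu\,\bigl(f(x) - f^*\bigr) \quad \text{for all } x.$$

PL is weaker than strong convexity (it admits non-convex functions whose minimizers form a connected set) yet still yields linear convergence rates for gradient methods, which is why it appears alongside the general non-convex case in the guarantees above.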