[2602.13413] Why is Normalization Preferred? A Worst-Case Complexity Theory for Stochastically Preconditioned SGD under Heavy-Tailed Noise


Summary

This paper develops a worst-case complexity theory for stochastically preconditioned stochastic gradient descent (SPSGD) under heavy-tailed noise, showing why normalization is preferable to clipping for stabilizing training.

Why It Matters

Understanding the theoretical foundations of normalization in SGD is crucial for improving training efficiency in machine learning models, particularly in scenarios with heavy-tailed noise. This research provides insights that can enhance the performance of widely used adaptive methods like Adam and RMSProp.

Key Takeaways

  • Normalization guarantees convergence of SPSGD under heavy-tailed noise, whereas clipping can fail to converge in the worst case (see the sketch after this list).
  • The paper establishes convergence rates for normalized SPSGD that match the optimal rates of normalized SGD, providing a theoretical basis for its empirical success.
  • A novel vector-valued Burkholder-type inequality is introduced, which may have broader applications in optimization.
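
To make the contrast in the first takeaway concrete, below is a minimal sketch of the two update rules for one stochastically preconditioned step with a diagonal preconditioner (as in RMSProp-style methods). The function, parameter names, and defaults are illustrative assumptions, not the paper's algorithm or notation.

```python
import numpy as np

def preconditioned_step(x, grad, precond_diag, lr, mode="normalize", clip_level=1.0):
    """One stochastically preconditioned SGD update (illustrative sketch only).

    precond_diag stands in for a diagonal stochastic preconditioner of the
    kind maintained by RMSProp/Adam; nothing here is the paper's notation.
    """
    d = grad / precond_diag  # preconditioned stochastic gradient direction
    if mode == "normalize":
        # Normalized update: every step has unit length in the preconditioned
        # geometry, the scheme the paper shows converges under heavy tails.
        d = d / (np.linalg.norm(d) + 1e-12)
    elif mode == "clip":
        # Clipped update: rescale only when the norm exceeds clip_level;
        # the paper shows this can fail to converge in the worst case.
        norm = np.linalg.norm(d)
        if norm > clip_level:
            d = d * (clip_level / norm)
    return x - lr * d
```

The design difference is that with mode="normalize" the step length is always lr, no matter how large a heavy-tailed gradient sample is, while with mode="clip" steps below the threshold still scale with the (possibly heavy-tailed) gradient norm.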

Computer Science > Machine Learning
arXiv:2602.13413 (cs) [Submitted on 13 Feb 2026]
Title: Why is Normalization Preferred? A Worst-Case Complexity Theory for Stochastically Preconditioned SGD under Heavy-Tailed Noise
Authors: Yuchen Fang, James Demmel, Javad Lavaei
Abstract: We develop a worst-case complexity theory for stochastically preconditioned stochastic gradient descent (SPSGD) and its accelerated variants under heavy-tailed noise, a setting that encompasses widely used adaptive methods such as Adam, RMSProp, and Shampoo. We assume the stochastic gradient noise has a finite $p$-th moment for some $p \in (1,2]$, and measure convergence after $T$ iterations. While clipping and normalization are parallel tools for stabilizing training of SGD under heavy-tailed noise, there is a fundamental separation in their worst-case properties in stochastically preconditioned settings. We demonstrate that normalization guarantees convergence to a first-order stationary point at rate $\mathcal{O}(T^{-\frac{p-1}{3p-2}})$ when problem parameters are known, and $\mathcal{O}(T^{-\frac{p-1}{2p}})$ when problem parameters are unknown, matching the optimal rates for normalized SGD, respectively. In contrast, we prove that clipping may fail to converge in the worst case due to the statistical d...
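
To make the two rates in the abstract concrete, this small check (illustrative, not from the paper) evaluates the exponents $(p-1)/(3p-2)$ and $(p-1)/(2p)$ for a few tail indices $p$; both collapse to the classical $T^{-1/4}$ rate at $p = 2$ (finite variance) and separate once $p < 2$.

```python
# Illustrative check of the two exponents quoted in the abstract:
# O(T^{-(p-1)/(3p-2)}) when problem parameters are known and
# O(T^{-(p-1)/(2p)}) when they are unknown, for tail indices p in (1, 2].
for p in (1.2, 1.5, 1.8, 2.0):
    known = (p - 1) / (3 * p - 2)   # known-parameter exponent
    unknown = (p - 1) / (2 * p)     # parameter-free exponent
    print(f"p = {p:.1f}: T^(-{known:.3f}) known-parameter vs T^(-{unknown:.3f}) parameter-free")
```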

