[2602.13413] Why is Normalization Preferred? A Worst-Case Complexity Theory for Stochastically Preconditioned SGD under Heavy-Tailed Noise


Summary

This paper develops a worst-case complexity theory for stochastically preconditioned stochastic gradient descent (SPSGD) under heavy-tailed noise, showing why normalization is preferable to clipping for stabilizing training.

Why It Matters

Understanding the theoretical foundations of normalization in SGD is crucial for improving training efficiency in machine learning models, particularly in scenarios with heavy-tailed noise. This research provides insights that can enhance the performance of widely used adaptive methods like Adam and RMSProp.

Key Takeaways

  • Normalization guarantees convergence of SPSGD under heavy-tailed noise, whereas clipping can fail to converge in the worst case (see the sketch after this list).
  • The paper establishes convergence rates for normalized SPSGD that match the optimal rates of normalized SGD, providing a theoretical basis for its empirical success.
  • A novel vector-valued Burkholder-type inequality is introduced, which may have broader applications in optimization.
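
To make the contrast in the first takeaway concrete, below is a minimal sketch of the two update rules for one stochastically preconditioned step with a diagonal preconditioner (as in RMSProp-style methods). The function, parameter names, and defaults are illustrative assumptions, not the paper's algorithm or notation.

```python
import numpy as np

def preconditioned_step(x, grad, precond_diag, lr, mode="normalize", clip_level=1.0):
    """One stochastically preconditioned SGD update (illustrative sketch only).

    precond_diag stands in for a diagonal stochastic preconditioner of the
    kind maintained by RMSProp/Adam; nothing here is the paper's notation.
    """
    d = grad / precond_diag  # preconditioned stochastic gradient direction
    if mode == "normalize":
        # Normalized update: every step has unit length in the preconditioned
        # geometry, the scheme the paper shows converges under heavy tails.
        d = d / (np.linalg.norm(d) + 1e-12)
    elif mode == "clip":
        # Clipped update: rescale only when the norm exceeds clip_level;
        # the paper shows this can fail to converge in the worst case.
        norm = np.linalg.norm(d)
        if norm > clip_level:
            d = d * (clip_level / norm)
    return x - lr * d
```

The design difference is that with mode="normalize" the step length is always lr, no matter how large a heavy-tailed gradient sample is, while with mode="clip" steps below the threshold still scale with the (possibly heavy-tailed) gradient norm.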

Computer Science > Machine Learning
arXiv:2602.13413 (cs) [Submitted on 13 Feb 2026]
Title: Why is Normalization Preferred? A Worst-Case Complexity Theory for Stochastically Preconditioned SGD under Heavy-Tailed Noise
Authors: Yuchen Fang, James Demmel, Javad Lavaei
Abstract: We develop a worst-case complexity theory for stochastically preconditioned stochastic gradient descent (SPSGD) and its accelerated variants under heavy-tailed noise, a setting that encompasses widely used adaptive methods such as Adam, RMSProp, and Shampoo. We assume the stochastic gradient noise has a finite $p$-th moment for some $p \in (1,2]$, and measure convergence after $T$ iterations. While clipping and normalization are parallel tools for stabilizing training of SGD under heavy-tailed noise, there is a fundamental separation in their worst-case properties in stochastically preconditioned settings. We demonstrate that normalization guarantees convergence to a first-order stationary point at rate $\mathcal{O}(T^{-\frac{p-1}{3p-2}})$ when problem parameters are known, and $\mathcal{O}(T^{-\frac{p-1}{2p}})$ when problem parameters are unknown, matching the optimal rates for normalized SGD, respectively. In contrast, we prove that clipping may fail to converge in the worst case due to the statistical d...
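
To make the two rates in the abstract concrete, this small check (illustrative, not from the paper) evaluates the exponents $(p-1)/(3p-2)$ and $(p-1)/(2p)$ for a few tail indices $p$; both collapse to the classical $T^{-1/4}$ rate at $p = 2$ (finite variance) and separate once $p < 2$.

```python
# Illustrative check of the two exponents quoted in the abstract:
# O(T^{-(p-1)/(3p-2)}) when problem parameters are known and
# O(T^{-(p-1)/(2p)}) when they are unknown, for tail indices p in (1, 2].
for p in (1.2, 1.5, 1.8, 2.0):
    known = (p - 1) / (3 * p - 2)   # known-parameter exponent
    unknown = (p - 1) / (2 * p)     # parameter-free exponent
    print(f"p = {p:.1f}: T^(-{known:.3f}) known-parameter vs T^(-{unknown:.3f}) parameter-free")
```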

