[2601.07326] Convergence Rate Analysis of the AdamW-Style Shampoo: Unifying One-sided and Two-Sided Preconditioning

arXiv - Machine Learning · 3 min read

Mathematics > Optimization and Control

arXiv:2601.07326 (math) [Submitted on 12 Jan 2026 (v1), last revised 1 May 2026 (this version, v2)]

Title: Convergence Rate Analysis of the AdamW-Style Shampoo: Unifying One-sided and Two-Sided Preconditioning

Authors: Huan Li, Yiming Dong, Zhouchen Lin

Abstract: This paper studies the AdamW-style Shampoo optimizer, an effective implementation of classical Shampoo that notably won the external tuning track of the AlgoPerf neural network training algorithm competition. Our analysis unifies one-sided and two-sided preconditioning and establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(X_k)\|_*\right]\leq O\left(\frac{\sqrt{m+n}\,C}{K^{1/4}}\right)$, measured in the nuclear norm, where $K$ is the number of iterations, $(m,n)$ is the size of the matrix parameters, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $\|\nabla f(X)\|_F\leq \|\nabla f(X)\|_*\leq \sqrt{m+n}\,\|\nabla f(X)\|_F$, supporting that our convergence rate can be considered analogous to the optimal $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(X_k)\|_F\right]\leq O\left(\frac{C}{K^{1/4}}\right)$ convergence rate of SGD in the ideal case $\|\nabla f(X)\|_*= \Theta(\sqrt{m+n})\,\|\nabla f(X)\|_F$.

Subjects: Optimization and Control (math.OC)...
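To make the one-sided/two-sided distinction concrete, here is a minimal NumPy sketch of a Shampoo-style update with AdamW-style decoupled weight decay. It is an illustrative reconstruction, not the paper's exact algorithm: the hyperparameters (`lr`, `wd`, `beta`, `eps`) and the helper `inv_root` are assumptions made for the sketch, and practical details such as grafting and infrequent root recomputation are omitted.

```python
import numpy as np

def shampoo_step(X, G, L, R, lr=1e-3, wd=1e-2, beta=0.99, eps=1e-8, one_sided=False):
    """One illustrative AdamW-style Shampoo step on an m x n matrix parameter X.

    G is the stochastic gradient; L (m x m) and R (n x n) accumulate the
    Kronecker-factor statistics. Returns the updated (X, L, R). Sketch only.
    """
    # Exponential moving averages of the factor statistics (Adam-style; assumed).
    L = beta * L + (1 - beta) * (G @ G.T)
    R = beta * R + (1 - beta) * (G.T @ G)

    def inv_root(M, p):
        # Hypothetical helper: M^{-1/p} via symmetric eigendecomposition, damped by eps.
        w, V = np.linalg.eigh(M)
        return (V * (np.maximum(w, 0.0) + eps) ** (-1.0 / p)) @ V.T

    if one_sided:
        # One-sided preconditioning: left factor only, with exponent -1/2.
        update = inv_root(L, 2) @ G
    else:
        # Two-sided preconditioning: L^{-1/4} G R^{-1/4}, as in classical Shampoo.
        update = inv_root(L, 4) @ G @ inv_root(R, 4)

    # AdamW-style decoupled weight decay: shrink X directly rather than adding wd*X to G.
    X = X - lr * (update + wd * X)
    return X, L, R
```

The one-sided variant preconditions a single dimension, which is commonly preferred when one side of the matrix is much larger than the other; the abstract's unified analysis covers this case alongside the classical two-sided rule.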
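The norm comparison underlying the rate can also be checked numerically: $\|G\|_F \le \|G\|_* \le \sqrt{m+n}\,\|G\|_F$ for any real $m \times n$ matrix. A small sanity check (the shape and seed are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 32, 48
G = rng.standard_normal((m, n))

fro = np.linalg.norm(G, "fro")  # Frobenius norm: sqrt of the sum of squared entries
nuc = np.linalg.norm(G, "nuc")  # nuclear norm: sum of the singular values

assert fro <= nuc <= np.sqrt(m + n) * fro
print(f"||G||_F = {fro:.2f}, ||G||_* = {nuc:.2f}, sqrt(m+n)*||G||_F = {np.sqrt(m + n) * fro:.2f}")
```

The lower bound holds because the Frobenius norm is the $\ell_2$ norm of the singular values while the nuclear norm is their $\ell_1$ norm; Cauchy-Schwarz gives the tighter bound $\|G\|_* \le \sqrt{\min(m,n)}\,\|G\|_F$, which the abstract's $\sqrt{m+n}$ factor dominates.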

Originally published on May 04, 2026. Curated by AI News.
