[2601.07326] Convergence Rate Analysis of the AdamW-Style Shampoo: Unifying One-sided and Two-Sided Preconditioning

arXiv - Machine Learning · 3 min read

Mathematics > Optimization and Control

arXiv:2601.07326 (math) [Submitted on 12 Jan 2026 (v1), last revised 1 May 2026 (this version, v2)]

Title: Convergence Rate Analysis of the AdamW-Style Shampoo: Unifying One-sided and Two-Sided Preconditioning

Authors: Huan Li, Yiming Dong, Zhouchen Lin

Abstract: This paper studies the AdamW-style Shampoo optimizer, an effective implementation of classical Shampoo that notably won the external tuning track of the AlgoPerf neural network training algorithm competition. Our analysis unifies one-sided and two-sided preconditioning and establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(X_k)\|_*\right]\leq O\left(\frac{\sqrt{m+n}\,C}{K^{1/4}}\right)$, measured in the nuclear norm, where $K$ is the number of iterations, $(m,n)$ is the size of the matrix parameters, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $\|\nabla f(X)\|_F\leq \|\nabla f(X)\|_*\leq \sqrt{m+n}\,\|\nabla f(X)\|_F$, supporting that our convergence rate can be considered analogous to the optimal $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(X_k)\|_F\right]\leq O\left(\frac{C}{K^{1/4}}\right)$ convergence rate of SGD in the ideal case $\|\nabla f(X)\|_*= \Theta(\sqrt{m+n})\,\|\nabla f(X)\|_F$.

Subjects: Optimization and Control (math.OC)...
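To make the one-sided/two-sided distinction concrete, here is a minimal NumPy sketch of a Shampoo-style update with AdamW-style decoupled weight decay. It is an illustrative reconstruction, not the paper's exact algorithm: the hyperparameters (`lr`, `wd`, `beta`, `eps`) and the helper `inv_root` are assumptions made for the sketch, and practical details such as grafting and infrequent root recomputation are omitted.

```python
import numpy as np

def shampoo_step(X, G, L, R, lr=1e-3, wd=1e-2, beta=0.99, eps=1e-8, one_sided=False):
    """One illustrative AdamW-style Shampoo step on an m x n matrix parameter X.

    G is the stochastic gradient; L (m x m) and R (n x n) accumulate the
    Kronecker-factor statistics. Returns the updated (X, L, R). Sketch only.
    """
    # Exponential moving averages of the factor statistics (Adam-style; assumed).
    L = beta * L + (1 - beta) * (G @ G.T)
    R = beta * R + (1 - beta) * (G.T @ G)

    def inv_root(M, p):
        # Hypothetical helper: M^{-1/p} via symmetric eigendecomposition, damped by eps.
        w, V = np.linalg.eigh(M)
        return (V * (np.maximum(w, 0.0) + eps) ** (-1.0 / p)) @ V.T

    if one_sided:
        # One-sided preconditioning: left factor only, with exponent -1/2.
        update = inv_root(L, 2) @ G
    else:
        # Two-sided preconditioning: L^{-1/4} G R^{-1/4}, as in classical Shampoo.
        update = inv_root(L, 4) @ G @ inv_root(R, 4)

    # AdamW-style decoupled weight decay: shrink X directly rather than adding wd*X to G.
    X = X - lr * (update + wd * X)
    return X, L, R
```

The one-sided variant preconditions a single dimension, which is commonly preferred when one side of the matrix is much larger than the other; the abstract's unified analysis covers this case alongside the classical two-sided rule.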
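The norm comparison underlying the rate can also be checked numerically: $\|G\|_F \le \|G\|_* \le \sqrt{m+n}\,\|G\|_F$ for any real $m \times n$ matrix. A small sanity check (the shape and seed are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 32, 48
G = rng.standard_normal((m, n))

fro = np.linalg.norm(G, "fro")  # Frobenius norm: sqrt of the sum of squared entries
nuc = np.linalg.norm(G, "nuc")  # nuclear norm: sum of the singular values

assert fro <= nuc <= np.sqrt(m + n) * fro
print(f"||G||_F = {fro:.2f}, ||G||_* = {nuc:.2f}, sqrt(m+n)*||G||_F = {np.sqrt(m + n) * fro:.2f}")
```

The lower bound holds because the Frobenius norm is the $\ell_2$ norm of the singular values while the nuclear norm is their $\ell_1$ norm; Cauchy-Schwarz gives the tighter bound $\|G\|_* \le \sqrt{\min(m,n)}\,\|G\|_F$, which the abstract's $\sqrt{m+n}$ factor dominates.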

Originally published on May 04, 2026. Curated by AI News.
