[2604.01563] Does Your Optimizer Care How You Normalize? Normalization-Optimizer Coupling in LLM Training
Computer Science > Artificial Intelligence
arXiv:2604.01563 (cs)
[Submitted on 2 Apr 2026]

Title: Does Your Optimizer Care How You Normalize? Normalization-Optimizer Coupling in LLM Training
Authors: Abdelrahman Abouzeid (Georgia Institute of Technology)

Abstract: In LLM training, normalization layers and optimizers are typically treated as independent design choices. In a 3x2 factorial experiment at 1B parameters and 1000 training steps, we show this assumption can fail: Dynamic Erf (Derf; Chen & Liu, 2025) suffers a large negative interaction with Muon (Jordan, 2024), with its gap to RMSNorm growing from +0.31 nats under AdamW to +0.97 nats under Muon, approximately three times larger. Dynamic Tanh (DyT; Zhu et al., 2025), included as a bounded-normalizer control, shows no such penalty. Our evidence points to two failure modes of erf under Muon's faster spectral-norm growth: saturation (lossy compression) and scale blindness (discarding activation magnitude). An EMA-blend that reintroduces running scale estimates recovers ~84% of the gap. Separately, reducing Derf's alpha from the published default of Chen & Liu (2025), 0.5, to 0.3 recovers ~80% by keeping erf in its near-linear regime, where it approximately preserves relative scale. Usin...
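The two failure modes named in the abstract, saturation and scale blindness, can be illustrated with a minimal sketch. The exact parameterizations are not given on this page, so the forms below are assumptions: DyT is taken as gamma * tanh(alpha * x) + beta following Zhu et al. (2025), and Derf is assumed to be the analogous gamma * erf(alpha * x) + beta. The demo checks that a smaller alpha keeps erf closer to its near-linear regime, where relative scale between activations is approximately preserved.

```python
import math

def dyt(x, alpha=0.5, gamma=1.0, beta=0.0):
    """Dynamic Tanh (DyT), per Zhu et al. (2025): gamma * tanh(alpha * x) + beta."""
    return [gamma * math.tanh(alpha * xi) + beta for xi in x]

def derf(x, alpha=0.5, gamma=1.0, beta=0.0):
    """Dynamic Erf (Derf), assumed analogous form: gamma * erf(alpha * x) + beta."""
    return [gamma * math.erf(alpha * xi) + beta for xi in x]

# Saturation: large activations are compressed toward +/-1 (lossy compression),
# and a 10x and a 100x activation become nearly indistinguishable (scale blindness).
big = derf([10.0, 100.0], alpha=0.5)

# Near-linear regime: erf(z) ~ (2/sqrt(pi)) * z for small z, so a smaller alpha
# better preserves the relative scale between two activations (here, a 2:1 ratio).
ratio_small_alpha = derf([2.0], alpha=0.1)[0] / derf([1.0], alpha=0.1)[0]
ratio_default     = derf([2.0], alpha=0.5)[0] / derf([1.0], alpha=0.5)[0]
```

This is only a toy view of the bounded-normalizer behavior; the paper's alpha change (0.5 to 0.3) trades saturation against dynamic range in the same way the two ratios above do.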