[2510.12402] Cautious Weight Decay

arXiv - Machine Learning · 3 min read

Summary

The paper introduces Cautious Weight Decay (CWD), an optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update, improving final loss and accuracy in large-scale machine learning tasks.

Why It Matters

CWD offers a significant advancement in optimization techniques by improving loss and accuracy without requiring additional hyperparameters. This can lead to better performance in various machine learning applications, particularly in language models and image classification, making it relevant for researchers and practitioners in the field.

Key Takeaways

  • CWD applies weight decay only to coordinates whose sign matches the sign of the optimizer update, so decay never opposes the update direction.
  • It is compatible with popular optimizers like AdamW and requires no new hyperparameters.
  • CWD improves performance in language model pre-training and ImageNet classification.
  • The method preserves the original loss function, allowing for a bilevel optimization interpretation.
  • CWD facilitates the search for locally Pareto-optimal stationary points.

Computer Science > Machine Learning (arXiv:2510.12402)
[Submitted on 14 Oct 2025 (v1), last revised 24 Feb 2026 (this version, v2)]

Title: Cautious Weight Decay
Authors: Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, Qiang Liu

Abstract: We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.

Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Cite as: arXiv:2510.12402 [cs.LG] (or arXiv:2510.12402v2 [cs.LG] for this version), https://doi.org/10.48550/arXiv.2510.12402
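
The "one-line" change is concrete enough to sketch. Below is a minimal, hypothetical PyTorch-style sketch (not the authors' implementation) of a single parameter update with the cautious mask, assuming the decoupled AdamW convention and reading "signs align" as the sign of each parameter coordinate matching the sign of the corresponding coordinate of the optimizer update:

    import torch

    @torch.no_grad()
    def cwd_step(param, update, lr=1e-3, weight_decay=0.1):
        # `update` is the optimizer's raw step direction computed before any
        # decay, e.g. Adam's m_hat / (sqrt(v_hat) + eps); the step applied is
        # param <- param - lr * weight_decay * mask * param - lr * update.
        # The mask keeps decay only where sign(update) == sign(param), i.e.
        # where the decay direction (-param) agrees with the update (-update).
        mask = (torch.sign(update) == torch.sign(param)).to(param.dtype)
        param.mul_(1.0 - lr * weight_decay * mask)  # cautious decoupled decay
        param.add_(update, alpha=-lr)               # unchanged optimizer step
        return param

    # Toy usage: a random parameter tensor and a stand-in update direction.
    p = torch.randn(8)
    u = torch.randn(8)
    cwd_step(p, u)

Setting the mask to 1 everywhere recovers standard decoupled weight decay, which is why the modification introduces no new hyperparameters.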

Related Articles

Machine Learning

[D] ICML 2026 Average Score

Hi all, I’m curious about the current review dynamics for ICML 2026, especially after the rebuttal phase. For those who are reviewers (or...

Reddit - Machine Learning · 1 min ·
Machine Learning

[R] VOID: Video Object and Interaction Deletion (physically-consistent video inpainting)

We present VOID, a model for video object removal that aims to handle *physical interactions*, not just appearance. Most existing video i...

Reddit - Machine Learning · 1 min ·
Machine Learning

FLUX 2 Pro (2026) Sketch to Image

I sketched a cow and tested how different models interpret it into a realistic image for downstream 3D generation, turns out some models ...

Reddit - Artificial Intelligence · 1 min ·
Machine Learning

Improving AI models’ ability to explain their predictions

AI News - General · 9 min ·