[2510.12402] Cautious Weight Decay

arXiv - Machine Learning · 3 min read

Summary

The paper introduces Cautious Weight Decay (CWD), an optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update, improving final loss and accuracy in large-scale machine learning tasks.

Why It Matters

CWD offers a significant advancement in optimization techniques by improving loss and accuracy without requiring additional hyperparameters. This can lead to better performance in various machine learning applications, particularly in language models and image classification, making it relevant for researchers and practitioners in the field.

Key Takeaways

  • CWD applies weight decay only to coordinates whose sign matches the sign of the optimizer update, so decay never opposes the update direction.
  • It is compatible with popular optimizers like AdamW and requires no new hyperparameters.
  • CWD improves performance in language model pre-training and ImageNet classification.
  • The method preserves the original loss function, allowing for a bilevel optimization interpretation.
  • CWD facilitates the search for locally Pareto-optimal stationary points.

Computer Science > Machine Learning (arXiv:2510.12402)
[Submitted on 14 Oct 2025 (v1), last revised 24 Feb 2026 (this version, v2)]

Title: Cautious Weight Decay
Authors: Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, Qiang Liu

Abstract: We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.

Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Cite as: arXiv:2510.12402 [cs.LG] (or arXiv:2510.12402v2 [cs.LG] for this version), https://doi.org/10.48550/arXiv.2510.12402
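
The "one-line" change is concrete enough to sketch. Below is a minimal, hypothetical PyTorch-style sketch (not the authors' implementation) of a single parameter update with the cautious mask, assuming the decoupled AdamW convention and reading "signs align" as the sign of each parameter coordinate matching the sign of the corresponding coordinate of the optimizer update:

    import torch

    @torch.no_grad()
    def cwd_step(param, update, lr=1e-3, weight_decay=0.1):
        # `update` is the optimizer's raw step direction computed before any
        # decay, e.g. Adam's m_hat / (sqrt(v_hat) + eps); the step applied is
        # param <- param - lr * weight_decay * mask * param - lr * update.
        # The mask keeps decay only where sign(update) == sign(param), i.e.
        # where the decay direction (-param) agrees with the update (-update).
        mask = (torch.sign(update) == torch.sign(param)).to(param.dtype)
        param.mul_(1.0 - lr * weight_decay * mask)  # cautious decoupled decay
        param.add_(update, alpha=-lr)               # unchanged optimizer step
        return param

    # Toy usage: a random parameter tensor and a stand-in update direction.
    p = torch.randn(8)
    u = torch.randn(8)
    cwd_step(p, u)

Setting the mask to 1 everywhere recovers standard decoupled weight decay, which is why the modification introduces no new hyperparameters.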

Related Articles

Machine Learning

[D] ICML 2026 Average Score

Hi all, I’m curious about the current review dynamics for ICML 2026, especially after the rebuttal phase. For those who are reviewers (or...

Reddit - Machine Learning · 1 min ·
Machine Learning

[R] VOID: Video Object and Interaction Deletion (physically-consistent video inpainting)

We present VOID, a model for video object removal that aims to handle *physical interactions*, not just appearance. Most existing video i...

Reddit - Machine Learning · 1 min ·
Machine Learning

FLUX 2 Pro (2026) Sketch to Image

I sketched a cow and tested how different models interpret it into a realistic image for downstream 3D generation, turns out some models ...

Reddit - Artificial Intelligence · 1 min ·
Machine Learning

Improving AI models’ ability to explain their predictions

AI News - General · 9 min ·