[2602.12039] The Implicit Bias of Logit Regularization
Summary
The paper explores the implicit bias introduced by logit regularization in classifiers, demonstrating its effects on weight alignment and generalization in linear classification.
Why It Matters
Understanding logit regularization is crucial for improving machine learning models' calibration and generalization. This research provides insights into how logit clustering can enhance model performance, particularly in noisy environments, making it relevant for practitioners aiming to optimize classification tasks.
Key Takeaways
- Logit regularization can significantly improve model calibration and generalization.
- The implicit bias of logit clustering aligns weight vectors with Fisher's Linear Discriminant.
- In the paper's signal-plus-noise model, logit regularization halves the critical sample complexity and enhances robustness to noise.
- The study extends theoretical understanding of label smoothing and its implications.
- Insights from this research can inform the development of more effective classification strategies.
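The takeaway about alignment with Fisher's Linear Discriminant can be made concrete: for two Gaussian classes with a shared covariance, the FLD direction is the classical closed form w ∝ Σ_w⁻¹(μ₁ − μ₀). The sketch below (toy data and parameter values are illustrative, not from the paper) estimates that direction from samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: two Gaussian classes with a shared covariance.
mu0, mu1 = np.array([-1.0, 0.0]), np.array([1.0, 0.5])
cov = np.array([[1.0, 0.3], [0.3, 0.5]])
X0 = rng.multivariate_normal(mu0, cov, size=2000)
X1 = rng.multivariate_normal(mu1, cov, size=2000)

# Fisher's Linear Discriminant direction: w ∝ Σ_w^{-1} (μ1 − μ0),
# with Σ_w the pooled within-class covariance.
Sigma_w = 0.5 * (np.cov(X0.T) + np.cov(X1.T))
w_fld = np.linalg.solve(Sigma_w, X1.mean(axis=0) - X0.mean(axis=0))
w_fld /= np.linalg.norm(w_fld)  # unit-normalize the direction
print(w_fld)
```

The paper's claim is that, for such data, logit clustering drives the learned weight vector of a linear classifier to this same direction.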
Statistics > Machine Learning
arXiv:2602.12039 (stat)
[Submitted on 12 Feb 2026 (v1), last revised 13 Feb 2026 (this version, v2)]
Title: The Implicit Bias of Logit Regularization
Authors: Alon Beck, Yohai Bar Sinai, Noam Levi
Abstract: Logit regularization, the addition of a convex penalty directly in logit space, is widely used in modern classifiers, with label smoothing as a prominent example. While such methods often improve calibration and generalization, their mechanism remains under-explored. In this work, we analyze a general class of such logit regularizers in the context of linear classification, and demonstrate that they induce an implicit bias of logit clustering around finite per-sample targets. For Gaussian data, or whenever logits are sufficiently clustered, we prove that logit clustering drives the weight vector to align exactly with Fisher's Linear Discriminant. To demonstrate the consequences, we study a simple signal-plus-noise model in which this transition has dramatic effects: Logit regularization halves the critical sample complexity and induces grokking in the small-noise limit, while making generalization robust to noise. Our results extend the theoretical understanding of label smoothing and highlight the efficacy of a broader class of logit-regularization methods.
Subjects: Machine Learning (stat.ML); Machine Learni...
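The abstract's framing of label smoothing as a logit regularizer can be illustrated with a minimal sketch (my own toy construction, not the paper's formulation): with smoothed targets q ∈ {ε, 1−ε} instead of hard labels, the smoothed cross-entropy has a finite minimizer in logit space, so trained logits cluster near the finite target logit(1−ε) = log((1−ε)/ε) rather than diverging.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linearly separable toy data; labels in {0, 1}.
X = np.r_[rng.normal(-2, 0.5, (200, 2)), rng.normal(2, 0.5, (200, 2))]
y = np.r_[np.zeros(200), np.ones(200)]

eps = 0.1                             # label-smoothing strength (assumed value)
q = y * (1 - eps) + (1 - y) * eps     # smoothed targets: eps and 1 - eps

# Plain gradient descent on the label-smoothed logistic loss.
w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(5000):
    z = X @ w + b                     # logits
    p = 1 / (1 + np.exp(-z))          # sigmoid probabilities
    grad = p - q                      # dL/dz for smoothed cross-entropy
    w -= lr * (X.T @ grad) / len(y)
    b -= lr * grad.mean()

# Logits stay bounded, clustering around the finite per-sample target
# log((1 - eps) / eps), instead of growing without bound as with hard labels.
z = X @ w + b
print(np.abs(z).mean(), np.log((1 - eps) / eps))
```

This is the "logit clustering around finite per-sample targets" behavior the abstract describes, shown here in its simplest binary, linear instance.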