[2602.17565] Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning
Summary
This paper explores optimal unconstrained self-distillation in ridge regression, demonstrating strict improvements in prediction risk and proposing a one-shot tuning method for practical application.
Why It Matters
Self-distillation often improves generalization in practice, but formal guarantees for it have been limited; this study supplies such guarantees for ridge regression. By proving strict risk improvements, deriving a closed-form optimal mixing weight, and proposing a practical one-shot tuning method, it offers concrete guidance for researchers and practitioners in statistics and machine learning.
Key Takeaways
- Self-distillation can improve prediction risk in ridge regression under certain conditions.
- The optimal mixing weight for self-distillation can be negative, especially in over-regularized scenarios.
- A one-shot tuning method for estimating the optimal mixing weight eliminates the need for complex tuning processes.
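The second takeaway follows from the paper's sign rule, which ties the sign of the optimal mixing weight to the slope of the teacher's risk curve. Written out in the abstract's notation (the over-regularized interpretation is spelled out here for clarity):

```latex
% Sign rule for the optimal mixing weight (stated in the abstract):
\[
  \operatorname{sign}\bigl(\xi^\star(\lambda)\bigr)
  = -\operatorname{sign}\bigl(R'(\lambda)\bigr).
\]
% In an over-regularized regime, increasing \lambda raises the teacher's
% risk, i.e. R'(\lambda) > 0, so the rule forces \xi^\star(\lambda) < 0:
% the student moves *away* from the teacher's predictions.
```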
Mathematics > Statistics Theory
arXiv:2602.17565 (math) [Submitted on 19 Feb 2026]
Title: Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning
Authors: Hien Dang, Pratik Patil, Alessandro Rinaldo
Abstract: Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher's own predictions, using the same architecture and training data. Although SD has been empirically shown to often improve generalization, its formal guarantees remain limited. We study SD for ridge regression in an unconstrained setting in which the mixing weight $\xi$ may lie outside the unit interval. Conditioned on the training data and without any distributional assumptions, we prove that for any squared prediction risk (including out-of-distribution), the optimally mixed student strictly improves upon the ridge teacher for every regularization level $\lambda > 0$ at which the teacher ridge risk $R(\lambda)$ is nonstationary (i.e., $R'(\lambda) \neq 0$). We obtain a closed-form expression for the optimal mixing weight $\xi^\star(\lambda)$ for any value of $\lambda$ and show that it obeys the sign rule: $\operatorname{sign}(\xi^\star(\lambda)) = -\operatorname{sign}(R'(\lambda))$. In particular, $\xi^\st...
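The SD procedure described in the abstract (refit the same ridge estimator on labels mixed between ground truth and the teacher's predictions, with the weight $\xi$ unconstrained) can be sketched as follows. This is a minimal illustration, not the paper's code; the exact mixing convention (which term receives weight `xi`) and the function names are assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def self_distill_ridge(X, y, lam, xi):
    """One round of self-distillation for ridge regression.

    The student is refit, with the same regularization level, on a
    mixture of the teacher's in-sample predictions and the ground-truth
    labels. The mixing weight xi is NOT restricted to [0, 1] -- this is
    the unconstrained setting the paper studies, where the optimal
    weight can be negative (e.g. when the teacher is over-regularized).
    """
    beta_teacher = ridge_fit(X, y, lam)       # teacher: plain ridge
    y_teacher = X @ beta_teacher              # teacher's own predictions
    y_mixed = xi * y_teacher + (1 - xi) * y   # mixed training labels
    return ridge_fit(X, y_mixed, lam)         # student: refit on mixture
```

Note that at `xi = 0` the student coincides with the teacher, so the mixing weight interpolates continuously away from the plain ridge solution; the paper's result is that the optimally chosen `xi` strictly reduces prediction risk whenever $R'(\lambda) \neq 0$.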