[2602.17565] Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning

arXiv - Machine Learning · 4 min read

Summary

This paper studies self-distillation for ridge regression with an unconstrained mixing weight, proving that the optimally mixed student strictly improves the teacher's prediction risk and proposing a one-shot method for tuning the mixing weight in practice.
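
Concretely, the teacher is a ridge fit on the ground-truth labels, and the student is the same ridge fit on labels mixed with the teacher's own predictions. A minimal NumPy sketch of that construction (the data, names, and the particular value of xi here are illustrative, not the paper's):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, lam = 200, 50, 1.0

    # Synthetic regression data (illustrative only).
    X = rng.standard_normal((n, p))
    beta = rng.standard_normal(p) / np.sqrt(p)
    y = X @ beta + 0.5 * rng.standard_normal(n)

    def ridge(X, y, lam):
        # Ridge solution (X^T X + lam I)^{-1} X^T y.
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    # Teacher: ordinary ridge fit on the ground-truth labels.
    beta_teacher = ridge(X, y, lam)

    # Student: same model, same data, but labels mixed with the teacher's
    # predictions. In the unconstrained setting, xi need not lie in [0, 1].
    xi = -0.3  # illustrative value; the paper optimizes this weight
    y_mixed = (1 - xi) * y + xi * (X @ beta_teacher)
    beta_student = ridge(X, y_mixed, lam)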

Why It Matters

Self-distillation often improves generalization in practice, but formal guarantees for it have remained limited. By proving strict risk improvements for ridge regression, deriving a closed-form optimal mixing weight, and giving a practical one-shot tuning method, the paper offers concrete guidance for researchers and practitioners in statistics and machine learning.

Key Takeaways

  • Self-distillation strictly improves the prediction risk of ridge regression whenever the teacher's risk is nonstationary in the regularization level, i.e., $R'(\lambda) \neq 0$.
  • The optimal mixing weight can be negative: by the sign rule, it is negative precisely in over-regularized regimes, where $R'(\lambda) > 0$.
  • A one-shot method estimates the optimal mixing weight directly, with no grid search over $\xi$ (see the sketch after this list).
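
Because the student estimator is affine in $\xi$ (see the derivation after the abstract below), any squared prediction risk is a quadratic in $\xi$, so the minimizer can be read off in one shot rather than searched for. The paper's own estimator is not reproduced here; as a hedged stand-in, the sketch below picks $\xi$ by one-dimensional least squares on a held-out set, exploiting the same quadratic structure:

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, lam = 200, 50, 25.0  # lam deliberately large: over-regularized teacher

    X = rng.standard_normal((n, p))
    beta = rng.standard_normal(p) / np.sqrt(p)
    y = X @ beta + 0.5 * rng.standard_normal(n)
    X_val = rng.standard_normal((n, p))          # held-out data for tuning
    y_val = X_val @ beta + 0.5 * rng.standard_normal(n)

    def ridge(X, y, lam):
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    beta_t = ridge(X, y, lam)            # teacher
    beta_1 = ridge(X, X @ beta_t, lam)   # student at xi = 1 (pure teacher labels)

    # The student at weight xi is (1 - xi) * beta_t + xi * beta_1, so the
    # held-out residual is r0 - xi * d, linear in xi; the squared risk is
    # a convex quadratic in xi with a closed-form minimizer.
    r0 = y_val - X_val @ beta_t          # residual at xi = 0 (the teacher)
    d = X_val @ (beta_1 - beta_t)
    xi_star = np.dot(r0, d) / np.dot(d, d)   # argmin over xi of ||r0 - xi * d||^2

    # With an over-regularized lam, the sign rule predicts xi_star < 0.
    risk = lambda b: np.mean((y_val - X_val @ b) ** 2)
    beta_s = (1 - xi_star) * beta_t + xi_star * beta_1
    print(f"xi* = {xi_star:.3f}")
    print(f"teacher risk: {risk(beta_t):.4f}, tuned student risk: {risk(beta_s):.4f}")

By construction the tuned student can do no worse than the teacher ($\xi = 0$) on the held-out risk; the paper's result is the stronger in-sample-conditioned statement that the improvement is strict whenever $R'(\lambda) \neq 0$.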

Mathematics > Statistics Theory
arXiv:2602.17565 (math) · Submitted on 19 Feb 2026

Title: Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning

Authors: Hien Dang, Pratik Patil, Alessandro Rinaldo

Abstract: Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher's own predictions, using the same architecture and training data. Although SD has been empirically shown to often improve generalization, its formal guarantees remain limited. We study SD for ridge regression in an unconstrained setting in which the mixing weight $\xi$ may lie outside the unit interval. Conditioned on the training data and without any distributional assumptions, we prove that for any squared prediction risk (including out-of-distribution), the optimally mixed student strictly improves upon the ridge teacher for every regularization level $\lambda > 0$ at which the teacher ridge risk $R(\lambda)$ is nonstationary (i.e., $R'(\lambda) \neq 0$). We obtain a closed-form expression for the optimal mixing weight $\xi^\star(\lambda)$ for any value of $\lambda$ and show that it obeys the sign rule $\operatorname{sign}(\xi^\star(\lambda)) = -\operatorname{sign}(R'(\lambda))$. In particular, $\xi^\star$ …
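
The closed-form expression for $\xi^\star(\lambda)$ is cut off above, but the structure behind it can be reconstructed under the standard SD construction assumed in the sketches (the paper's exact formula is not reproduced here). With ridge teacher $\hat\beta(\lambda) = (X^\top X + \lambda I)^{-1} X^\top y$, the student trained on the mixed labels is affine in $\xi$:

    \hat\beta_\xi(\lambda)
      = (X^\top X + \lambda I)^{-1} X^\top \bigl[ (1-\xi)\, y + \xi\, X \hat\beta(\lambda) \bigr]
      = (1-\xi)\, \hat\beta(\lambda) + \xi\, (X^\top X + \lambda I)^{-1} X^\top X\, \hat\beta(\lambda).

Any squared prediction risk of $\hat\beta_\xi(\lambda)$ is therefore a convex quadratic $a\xi^2 + b\xi + c$ in $\xi$, so $\xi^\star(\lambda) = -b/(2a)$ is available in closed form whenever $a > 0$, and the optimally mixed student strictly beats the teacher ($\xi = 0$) exactly when $b \neq 0$, the condition the abstract ties to $R'(\lambda) \neq 0$.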
