[2602.14029] Why Self-Training Helps and Hurts: Denoising vs. Signal Forgetting
Summary
This paper investigates the dual effects of iterative self-training in machine learning, focusing on the balance between denoising and signal forgetting in overparameterized linear regression models.
Why It Matters
Understanding the dynamics of self-training is crucial for improving machine learning models, especially in high-dimensional settings. This research provides insights into optimizing model training processes, which can enhance predictive accuracy and efficiency in various applications.
Key Takeaways
- Self-training can lead to both denoising and signal forgetting, impacting model performance.
- The analysis yields a U-shaped test-risk curve and an optimal early-stopping time.
- Iterative self-training acts as a spectral filter, enhancing strong features while suppressing weaker ones.
- A new generalized cross-validation criterion enables data-driven selection of the stopping time.
- Experiments validate the theoretical findings, demonstrating practical implications for model training.
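The procedure the paper studies can be sketched numerically: fit a minimum-norm interpolator on noisy labels, then repeatedly refit on fresh covariates with noiseless pseudo-labels from the previous iterate, tracking the prediction risk per round. The sketch below is illustrative only, under assumed Gaussian covariates with a simple spiked diagonal covariance and arbitrary dimensions/noise levels; it is not the paper's exact setup, and whether the tracked risks trace the U-shape depends on the spectrum and regime.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                      # overparameterized regime: d > n
eigs = np.ones(d); eigs[:5] = 25.0  # assumed spiked diagonal covariance: 5 strong directions
sqrt_sigma = np.sqrt(eigs)
beta = np.zeros(d); beta[:5] = 1.0  # illustrative signal concentrated on the spikes
sigma = 1.0                         # label-noise level

def min_norm_fit(X, y):
    # minimum-l2-norm interpolator beta_hat = X^+ y (least-norm solution when d > n)
    return np.linalg.pinv(X) @ y

def sample_X():
    return rng.normal(size=(n, d)) * sqrt_sigma  # rows ~ N(0, Sigma)

# round 0: fit on genuinely noisy labels
X = sample_X()
y = X @ beta + sigma * rng.normal(size=n)
b = min_norm_fit(X, y)

risks = []
for t in range(10):
    risks.append(float((b - beta) ** 2 @ eigs))  # prediction risk (b - beta)^T Sigma (b - beta)
    X = sample_X()                                # fresh covariates each round
    b = min_norm_fit(X, X @ b)                    # noiseless pseudo-labels from previous model
```

Each refit on noiseless pseudo-labels replaces the estimate with its projection onto the row space of the fresh design, which is the source of both effects: the stochastic noise component is repeatedly projected down (denoising) while the signal component is also gradually shrunk (forgetting).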
Paper Details
arXiv:2602.14029 [stat.ML], submitted 15 Feb 2026
Authors: Mingqi Wu, Archer Y. Yang, Qiang Sun
Abstract
Iterative self-training (self-distillation) repeatedly refits a model on pseudo-labels generated by its own predictions. We study this procedure in overparameterized linear regression: an initial estimator is trained on noisy labels, and each subsequent iterate is trained on fresh covariates with noiseless pseudo-labels from the previous model. In the high-dimensional regime, we derive deterministic-equivalent recursions for the prediction risk and effective noise across iterations, and prove that the empirical quantities concentrate sharply around these limits. The recursion separates two competing forces: a systematic component that grows with iteration due to progressive signal forgetting, and a stochastic component that decays due to denoising via repeated data-dependent projections. Their interaction yields a $U$-shaped test-risk curve and an optimal early-stopping time. In spiked covariance models, iteration further acts as an iteration-dependent spectral filter that preserves strong eigendirections while suppressing weaker ones, inducing an implicit form of soft feature selection distinct from ridge regr...
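The spectral-filter claim can be probed with a small simulation: under an assumed spiked diagonal covariance, repeatedly project a basis vector through minimum-norm refits on pseudo-labels and measure how much of that coordinate survives. This is a hedged illustration with arbitrary dimensions and spike strength, not the paper's formal filter characterization; the `retained` helper is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, T = 50, 200, 5
eigs = np.ones(d); eigs[0] = 50.0  # one assumed strong "spiked" eigendirection
sqrt_sigma = np.sqrt(eigs)         # diagonal covariance, so Sigma^(1/2) is elementwise

def retained(start_coord, trials=20):
    """Average fraction of coordinate `start_coord` surviving T pseudo-label refits."""
    total = 0.0
    for _ in range(trials):
        b = np.zeros(d); b[start_coord] = 1.0
        for _ in range(T):
            X = rng.normal(size=(n, d)) * sqrt_sigma  # fresh rows ~ N(0, Sigma)
            b = np.linalg.pinv(X) @ (X @ b)           # min-norm refit = projection of b
        total += b[start_coord]
    return total / trials

spike_keep = retained(0)      # strong eigendirection: slow decay
bulk_keep = retained(d - 1)   # weak bulk eigendirection: fast decay
```

Each refit projects the estimate onto the row space of the fresh design; because rows are drawn from the spiked covariance, that random subspace aligns preferentially with the strong eigendirection, so `spike_keep` stays much larger than `bulk_keep` across iterations, matching the abstract's description of soft feature selection.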