[2602.14029] Why Self-Training Helps and Hurts: Denoising vs. Signal Forgetting
Summary
This paper investigates the dual effects of iterative self-training in machine learning, focusing on the balance between denoising and signal forgetting in overparameterized linear regression models.
Why It Matters
Understanding the dynamics of self-training is crucial for improving machine learning models, especially in high-dimensional settings. This research provides insights into optimizing model training processes, which can enhance predictive accuracy and efficiency in various applications.
Key Takeaways
- Self-training can lead to both denoising and signal forgetting, impacting model performance.
- The analysis yields a U-shaped test-risk curve and an optimal early-stopping time.
- Iterative self-training acts as a spectral filter, enhancing strong features while suppressing weaker ones.
- A new generalized cross-validation criterion enables data-driven selection of the stopping time.
- Experiments validate the theoretical findings, demonstrating practical implications for model training.
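The procedure the paper studies can be sketched numerically: fit a minimum-norm interpolator on noisy labels, then repeatedly refit on fresh covariates with noiseless pseudo-labels from the previous iterate, tracking the prediction risk per round. The sketch below is illustrative only, under assumed Gaussian covariates with a simple spiked diagonal covariance and arbitrary dimensions/noise levels; it is not the paper's exact setup, and whether the tracked risks trace the U-shape depends on the spectrum and regime.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                      # overparameterized regime: d > n
eigs = np.ones(d); eigs[:5] = 25.0  # assumed spiked diagonal covariance: 5 strong directions
sqrt_sigma = np.sqrt(eigs)
beta = np.zeros(d); beta[:5] = 1.0  # illustrative signal concentrated on the spikes
sigma = 1.0                         # label-noise level

def min_norm_fit(X, y):
    # minimum-l2-norm interpolator beta_hat = X^+ y (least-norm solution when d > n)
    return np.linalg.pinv(X) @ y

def sample_X():
    return rng.normal(size=(n, d)) * sqrt_sigma  # rows ~ N(0, Sigma)

# round 0: fit on genuinely noisy labels
X = sample_X()
y = X @ beta + sigma * rng.normal(size=n)
b = min_norm_fit(X, y)

risks = []
for t in range(10):
    risks.append(float((b - beta) ** 2 @ eigs))  # prediction risk (b - beta)^T Sigma (b - beta)
    X = sample_X()                                # fresh covariates each round
    b = min_norm_fit(X, X @ b)                    # noiseless pseudo-labels from previous model
```

Each refit on noiseless pseudo-labels replaces the estimate with its projection onto the row space of the fresh design, which is the source of both effects: the stochastic noise component is repeatedly projected down (denoising) while the signal component is also gradually shrunk (forgetting).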
Paper Details
arXiv:2602.14029 [stat.ML], submitted 15 Feb 2026
Authors: Mingqi Wu, Archer Y. Yang, Qiang Sun
Abstract
Iterative self-training (self-distillation) repeatedly refits a model on pseudo-labels generated by its own predictions. We study this procedure in overparameterized linear regression: an initial estimator is trained on noisy labels, and each subsequent iterate is trained on fresh covariates with noiseless pseudo-labels from the previous model. In the high-dimensional regime, we derive deterministic-equivalent recursions for the prediction risk and effective noise across iterations, and prove that the empirical quantities concentrate sharply around these limits. The recursion separates two competing forces: a systematic component that grows with iteration due to progressive signal forgetting, and a stochastic component that decays due to denoising via repeated data-dependent projections. Their interaction yields a $U$-shaped test-risk curve and an optimal early-stopping time. In spiked covariance models, iteration further acts as an iteration-dependent spectral filter that preserves strong eigendirections while suppressing weaker ones, inducing an implicit form of soft feature selection distinct from ridge regr...
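The spectral-filter claim can be probed with a small simulation: under an assumed spiked diagonal covariance, repeatedly project a basis vector through minimum-norm refits on pseudo-labels and measure how much of that coordinate survives. This is a hedged illustration with arbitrary dimensions and spike strength, not the paper's formal filter characterization; the `retained` helper is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, T = 50, 200, 5
eigs = np.ones(d); eigs[0] = 50.0  # one assumed strong "spiked" eigendirection
sqrt_sigma = np.sqrt(eigs)         # diagonal covariance, so Sigma^(1/2) is elementwise

def retained(start_coord, trials=20):
    """Average fraction of coordinate `start_coord` surviving T pseudo-label refits."""
    total = 0.0
    for _ in range(trials):
        b = np.zeros(d); b[start_coord] = 1.0
        for _ in range(T):
            X = rng.normal(size=(n, d)) * sqrt_sigma  # fresh rows ~ N(0, Sigma)
            b = np.linalg.pinv(X) @ (X @ b)           # min-norm refit = projection of b
        total += b[start_coord]
    return total / trials

spike_keep = retained(0)      # strong eigendirection: slow decay
bulk_keep = retained(d - 1)   # weak bulk eigendirection: fast decay
```

Each refit projects the estimate onto the row space of the fresh design; because rows are drawn from the spiked covariance, that random subspace aligns preferentially with the strong eigendirection, so `spike_keep` stays much larger than `bulk_keep` across iterations, matching the abstract's description of soft feature selection.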