[2602.20062] A Theory of How Pretraining Shapes Inductive Bias in Fine-Tuning
Summary
This paper presents a theoretical framework for how pretraining shapes the inductive bias of fine-tuning, developed analytically in diagonal linear networks and validated empirically in nonlinear networks.
Why It Matters
Understanding how pretraining shapes fine-tuning is crucial for improving model generalization. This work shows how initialization choices govern the ability to reuse and refine learned features, clarifying which regimes benefit which task statistics.
Key Takeaways
- Different initialization choices place the network into four distinct fine-tuning regimes.
- Smaller initialization scales in earlier layers enhance feature reuse and refinement.
- The study derives exact expressions for the generalization error as a function of initialization parameters and task statistics.
- Empirical results confirm the theoretical findings in nonlinear networks.
- The interaction between data and initialization is pivotal for fine-tuning success.
Abstract
Computer Science > Machine Learning, arXiv:2602.20062 (cs). Submitted on 23 Feb 2026.
Authors: Nicolas Anguita, Francesco Locatello, Andrew M. Saxe, Marco Mondelli, Flavia Mancini, Samuel Lippl, Clementine Domine
Pretraining and fine-tuning are central stages in modern machine learning systems. In practice, feature learning plays an important role across both stages: deep neural networks learn a broad range of useful features during pretraining and further refine those features during fine-tuning. However, an end-to-end theoretical understanding of how choices of initialization impact the ability to reuse and refine features during fine-tuning has remained elusive. Here we develop an analytical theory of the pretraining-fine-tuning pipeline in diagonal linear networks, deriving exact expressions for the generalization error as a function of initialization parameters and task statistics. We find that different initialization choices place the network into four distinct fine-tuning regimes that are distinguished by their ability to support feature learning and reuse, and therefore by the task statistics for which they are beneficial. In particular, a smaller initialization scale in earlier layers enables the network to both reuse and refine…
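To make the abstract's setting concrete, here is a minimal sketch of a two-layer diagonal linear network, f(x) = Σᵢ uᵢvᵢxᵢ, trained by gradient descent at two initialization scales. This is not the paper's exact analysis: the data, teacher vector, learning rate, and step count are invented for illustration, and only the qualitative effect of initialization scale is meant to carry over.

```python
import numpy as np

# Toy diagonal linear network: effective predictor beta = u * v (elementwise).
# We compare a large vs. a small initialization scale for (u, v) on a sparse
# regression task; all data and hyperparameters here are illustrative.

rng = np.random.default_rng(0)
d, n = 50, 20                       # overparameterized: more features than samples
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:3] = 1.0                 # sparse ground-truth regression vector
y = X @ beta_star                   # noiseless targets

def train(scale, lr=0.005, steps=5000):
    """Full-batch GD on 0.5/n * ||X @ (u*v) - y||^2 from init u = v = scale."""
    u = np.full(d, scale)
    v = np.full(d, scale)
    for _ in range(steps):
        grad_beta = X.T @ (X @ (u * v) - y) / n   # gradient w.r.t. beta = u*v
        # chain rule: dL/du = grad_beta * v, dL/dv = grad_beta * u
        u, v = u - lr * grad_beta * v, v - lr * grad_beta * u
    return u * v                    # effective linear predictor

results = {scale: train(scale) for scale in (1.0, 1e-3)}
for scale, beta in results.items():
    print(f"init scale {scale}: ||beta - beta_star|| = "
          f"{np.linalg.norm(beta - beta_star):.3f}")
```

In this kind of model, a small initialization scale biases gradient descent toward sparse, feature-selective solutions that recover the sparse teacher, while a large scale keeps the dynamics close to a dense, "lazy" solution — a toy analogue of the initialization-dependent regimes the paper characterizes.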