[2507.04448] Transfer Learning in Infinite Width Feature Learning Networks
Summary
The paper develops a theoretical framework for transfer learning in infinitely wide neural networks, quantifying when pretraining on a source task improves generalization on a target task.
Why It Matters
Transfer learning underpins much of modern machine learning practice, yet when pretraining actually helps remains poorly understood. A tractable theory of feature learning during pretraining yields interpretable predictions about when reusing pretrained features beats training from scratch, guidance that is relevant across applied machine learning.
Key Takeaways
- The study quantifies the impact of pretraining on generalization in neural networks.
- Two scenarios are analyzed: fine-tuning, where a readout is trained on top of frozen source-induced features, and a jointly rich setting, where both source and target tasks are learned in the feature learning regime.
- Performance depends on the amount of data, the alignment between source and target tasks, and the strength of feature learning.
- The theory is tested on both synthetic and real datasets, providing interpretable conclusions.
- After rich pretraining, the summary statistics of the network are adaptive kernels that depend on both source data and labels, and these kernels govern downstream performance.
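The fine-tuning takeaway can be illustrated with a small toy experiment (entirely illustrative; none of the dimensions, tasks, or kernels below come from the paper). We compare kernel ridge regression on a target task using an isotropic linear kernel versus a "source-adapted" kernel whose dominant direction is estimated from the source task, standing in for features shaped by pretraining:

```python
# Toy sketch (illustrative, not the paper's construction): does a
# kernel adapted to the source task help on an aligned target task?
import numpy as np

rng = np.random.default_rng(0)
d, n_src, n_tgt, n_test, gamma, lam = 20, 200, 10, 500, 10.0, 1e-3

# Linear teachers: the target teacher is a perturbed source teacher.
w_src = rng.standard_normal(d); w_src /= np.linalg.norm(w_src)
w_tgt = w_src + 0.1 * rng.standard_normal(d); w_tgt /= np.linalg.norm(w_tgt)

def task(w, n):
    X = rng.standard_normal((n, d))
    return X, X @ w  # noiseless labels

X_s, y_s = task(w_src, n_src)
X_t, y_t = task(w_tgt, n_tgt)
X_te, y_te = task(w_tgt, n_test)

# "Pretraining": estimate the source direction by least squares.
w_hat, *_ = np.linalg.lstsq(X_s, y_s, rcond=None)
w_hat /= np.linalg.norm(w_hat)

def krr_error(M):
    """Test MSE of kernel ridge regression with K(x, x') = x^T M x'."""
    K = X_t @ M @ X_t.T
    alpha = np.linalg.solve(K + lam * np.eye(n_tgt), y_t)
    pred = X_te @ M @ X_t.T @ alpha
    return np.mean((pred - y_te) ** 2)

M_iso = np.eye(d)                                     # no pretraining
M_adapt = np.eye(d) + gamma * np.outer(w_hat, w_hat)  # source-adapted
err_iso, err_adapt = krr_error(M_iso), krr_error(M_adapt)
print(f"isotropic: {err_iso:.3f}  adapted: {err_adapt:.3f}")
```

With only 10 target samples in 20 dimensions, the isotropic kernel underfits, while the kernel that privileges the source direction transfers most of the aligned signal. The rank-one boost with strength `gamma` is a crude stand-in for the paper's adaptive kernels.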
Computer Science > Machine Learning
arXiv:2507.04448 (cs)
[Submitted on 6 Jul 2025 (v1), last revised 24 Feb 2026 (this version, v2)]
Title: Transfer Learning in Infinite Width Feature Learning Networks
Authors: Clarissa Lauditi, Blake Bordelon, Cengiz Pehlevan
Abstract: We develop a theory of transfer learning in infinitely wide neural networks under gradient flow that quantifies when pretraining on a source task improves generalization on a target task. We analyze both (i) fine-tuning, when the downstream predictor is trained on top of source-induced features and (ii) a jointly rich setting, where both pretraining and downstream tasks can operate in a feature learning regime, but the downstream model is initialized with the features obtained after pre-training. In this setup, the summary statistics of randomly initialized networks after a rich pre-training are adaptive kernels which depend on both source data and labels. For (i), we analyze the performance of a readout for different pretraining data regimes. For (ii), the summary statistics after learning the target task are still adaptive kernels with features from both source and target tasks. We test our theory on linear and polynomial regression tasks as well as real datasets. Our theory allows interpretable conclusions on performance, which depend on the amou...
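The abstract's claim that performance depends on task alignment can be probed with a toy sweep (again illustrative; the setup, `gamma`, and the use of the exact source direction as an idealized pretraining outcome are all assumptions, not the paper's experiment). We rotate the target teacher away from the source teacher and track the transfer benefit of the source-adapted kernel:

```python
# Toy alignment sweep (illustrative, not the paper's experiment):
# transfer benefit of a source-adapted kernel vs. source-target alignment.
import numpy as np

rng = np.random.default_rng(1)
d, n_tgt, n_test, gamma, lam = 20, 10, 1000, 10.0, 1e-3

w_src = rng.standard_normal(d); w_src /= np.linalg.norm(w_src)
# Unit direction orthogonal to w_src, used to rotate the target teacher.
u = rng.standard_normal(d)
u -= (u @ w_src) * w_src; u /= np.linalg.norm(u)

def krr_error(M, X_t, y_t, X_te, y_te):
    """Test MSE of kernel ridge regression with K(x, x') = x^T M x'."""
    K = X_t @ M @ X_t.T
    alpha = np.linalg.solve(K + lam * np.eye(len(y_t)), y_t)
    return np.mean((X_te @ M @ X_t.T @ alpha - y_te) ** 2)

M_iso = np.eye(d)
M_adapt = np.eye(d) + gamma * np.outer(w_src, w_src)  # idealized pretraining

benefits = []
for cos_align in (1.0, 0.8, 0.4, 0.0):
    w_tgt = cos_align * w_src + np.sqrt(1.0 - cos_align**2) * u
    X_t = rng.standard_normal((n_tgt, d)); y_t = X_t @ w_tgt
    X_te = rng.standard_normal((n_test, d)); y_te = X_te @ w_tgt
    benefit = (krr_error(M_iso, X_t, y_t, X_te, y_te)
               - krr_error(M_adapt, X_t, y_t, X_te, y_te))
    benefits.append(benefit)
    print(f"cos(source, target) = {cos_align:.1f}  benefit = {benefit:.3f}")
```

As alignment decreases, the benefit of the adapted kernel shrinks and can turn negative: a kernel that privileges the wrong direction is worse than no pretraining at all, in the spirit of the paper's alignment-dependent conclusions.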