[2603.16177] The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data
Computer Science > Machine Learning
arXiv:2603.16177 (cs)
[Submitted on 17 Mar 2026 (v1), last revised 21 Mar 2026 (this version, v2)]

Title: The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data

Authors: Christina Baek, Ricardo Pio Monti, David Schwab, Amro Abbas, Rishabh Adiga, Cody Blakeney, Maximilian Böther, Paul Burstein, Aldo Gael Carranza, Alvin Deng, Parth Doshi, Vineeth Dorna, Alex Fang, Tony Jiang, Siddharth Joshi, Brett W. Larsen, Jason Chan Lee, Katherine L. Mentzer, Luke Merrick, Haakon Mongstad, Fan Pan, Anshuman Suri, Darren Teh, Jason Telanoff, Jack Urbanek, Zhengping Wang, Josh Wills, Haoli Yin, Aditi Raghunathan, J. Zico Kolter, Bogdan Gaza, Ari Morcos, Matthew Leavitt, Pratyush Maini

Abstract: Real-world model deployments demand strong performance on narrow domains where data is often scarce. Typically, practitioners finetune models to specialize them, but this risks overfitting to the domain and forgetting general knowledge. We study a simple strategy, specialized pretraining (SPT), where a small domain dataset, typically reserved for finetuning, is repeated starting from pretraining as a fraction of the total tokens. Across three specialized domains (ChemPile, MusicPile, and ProofPile), SPT improves domain performance and preserves general capabilities after finetuning compared to ...
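To make the mixing described in the abstract concrete, below is a minimal Python sketch of how a small domain dataset might be repeated into a general pretraining stream at a target fraction of total tokens. The function name spt_stream, the representation of documents as lists of token ids, and the deterministic interleaving schedule are illustrative assumptions, not the paper's implementation.

```python
from itertools import cycle


def spt_stream(general_docs, domain_docs, domain_fraction=0.005):
    """Sketch of an SPT-style data stream (assumed, not the paper's code).

    The small domain dataset is cycled (i.e., repeated for the whole
    pretraining run) and interleaved with the general corpus so that
    roughly `domain_fraction` of all emitted tokens are domain tokens.
    Each document is a list of token ids.
    """
    domain = cycle(domain_docs)  # repeat the scarce domain data indefinitely
    domain_tokens, total_tokens = 0, 0
    for doc in general_docs:
        # Top up with domain documents until their token share
        # reaches the target fraction of the stream so far.
        while domain_tokens < domain_fraction * total_tokens:
            d = next(domain)
            domain_tokens += len(d)
            total_tokens += len(d)
            yield d
        total_tokens += len(doc)
        yield doc
```

Under this sketch, standard finetuning corresponds to withholding domain_docs until after pretraining, whereas SPT spends the same small dataset throughout pretraining as a fixed slice of the token budget.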