[2603.21606] mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT
Computer Science > Machine Learning

arXiv:2603.21606 (cs) [Submitted on 23 Mar 2026]

Title: mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT

Authors: Woosung Koh, Jeyoung Jeon, Youngjin Song, Yujin Cheon, Soowon Oh, Jaehyeong Choi, Se-Young Yun

Abstract: Current language model training commonly applies multi-task Supervised Fine-Tuning (SFT) using a homogeneous compute budget across all sub-datasets. This approach is fundamentally sub-optimal: heterogeneous learning dynamics cause faster-learning tasks to overfit early while slower ones remain under-fitted. To address this, we introduce mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on an active mixture, identifies and excludes the earliest overfitting sub-dataset, and reverts to that sub-dataset's optimal checkpoint before continuing. Extensive evaluations demonstrate that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms that mSFT maintains robust gains across diverse dataset sizes and task granularities, and is insensitive to its single new hyperparameter (the compute budget). Notably, at low compute budgets, mSFT can improve performance while lowering training FLOPs. Ultimately, mSFT establishes a practical overfitting-awa...
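
The abstract describes mSFT's loop only in prose. A minimal sketch of that loop is shown below, under the assumption that "overfitting" is detected as a rise in a sub-dataset's validation loss; the helper callables train_one_round, eval_per_task_loss, and load_checkpoint are hypothetical placeholders inferred from the abstract, not the paper's actual implementation.

```python
from typing import Any, Callable, Dict, List


def msft(
    model: Any,
    sub_datasets: List[str],
    budget_rounds: int,
    train_one_round: Callable[[Any, List[str]], Any],
    eval_per_task_loss: Callable[[Any, List[str]], Dict[str, float]],
    load_checkpoint: Callable[[Any], Any],
) -> Any:
    """Iterative, overfitting-aware search over a multi-task SFT mixture (sketch)."""
    active = list(sub_datasets)                    # current data mixture
    best_loss = {d: float("inf") for d in active}  # best validation loss per task
    best_ckpt = {d: None for d in active}          # checkpoint at that best loss

    for _ in range(budget_rounds):
        if not active:
            break
        ckpt = train_one_round(model, active)        # SFT round on the active mixture
        losses = eval_per_task_loss(model, active)   # per-sub-dataset validation loss

        overfitting = []
        for d in active:
            if losses[d] < best_loss[d]:
                best_loss[d] = losses[d]             # still improving: remember checkpoint
                best_ckpt[d] = ckpt
            else:
                overfitting.append(d)                # loss rose: overfitting candidate

        if overfitting:
            dropped = overfitting[0]                 # earliest-overfitting sub-dataset
            active.remove(dropped)
            # Revert to the checkpoint where the dropped sub-dataset was at its best,
            # then continue training on the remaining mixture.
            model = load_checkpoint(best_ckpt[dropped])
    return model
```

The callables are parameters so the sketch stays framework-agnostic; in practice they would wrap whichever training, evaluation, and checkpointing stack is in use.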