[2510.08095] Beyond Real Data: Synthetic Data through the Lens of Regularization
Statistics > Machine Learning
arXiv:2510.08095 (stat)
[Submitted on 9 Oct 2025 (v1), last revised 31 Mar 2026 (this version, v2)]
Title: Beyond Real Data: Synthetic Data through the Lens of Regularization
Authors: Amitis Shidani, Tyler Farghly, Yang Sun, Habib Ganjgahi, George Deligiannidis

Abstract: Synthetic data can improve generalization when real data is scarce, but excessive reliance may introduce distributional mismatches that degrade performance. In this paper, we present a learning-theoretic framework to quantify the trade-off between synthetic and real data. Our approach leverages algorithmic stability to derive generalization error bounds, characterizing the optimal synthetic-to-real data ratio that minimizes expected test error as a function of the Wasserstein distance between the real and synthetic distributions. We motivate our framework in the setting of kernel ridge regression with mixed data, offering a detailed analysis that may be of independent interest. Our theory predicts the existence of an optimal ratio, leading to a U-shaped behavior of test error with respect to the proportion of synthetic data. Empirically, we validate this prediction on CIFAR-10 and a clinical brain MRI dataset. Our theory extends to the important scenario of domain adaptation, showing that carefully blending syntheti...
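
To make the mixed-data setting concrete, the following is a minimal sketch of kernel ridge regression trained on a small real sample plus a growing number of synthetic samples, reporting test error against the synthetic-to-real ratio. It is an illustration only, not the authors' experimental setup: the sine target, the input shift used as a stand-in for distributional mismatch, the RBF kernel, and all names and default values (`sweep_synthetic`, `shift`, `ridge`, `lengthscale`) are assumptions introduced here. The sketch does not compute a Wasserstein distance, and whether the resulting curve is U-shaped depends on the chosen mismatch, noise level, and sample sizes.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix between rows of a (n, d) and b (m, d)."""
    sq = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale ** 2))


def krr_fit_predict(x_train, y_train, x_test, ridge=1e-2, lengthscale=1.0):
    """Kernel ridge regression: solve (K + n * ridge * I) alpha = y, then predict."""
    n = x_train.shape[0]
    k = rbf_kernel(x_train, x_train, lengthscale)
    alpha = np.linalg.solve(k + n * ridge * np.eye(n), y_train)
    return rbf_kernel(x_test, x_train, lengthscale) @ alpha


def sweep_synthetic(n_real=20, synth_counts=(0, 10, 20, 40, 80, 160),
                    shift=0.5, noise=0.3, seed=0):
    """Return (synthetic-to-real ratio, test MSE) pairs for mixed training sets.

    Assumption: "real" labels follow sin(x) plus noise, while "synthetic"
    labels follow sin(x + shift) plus noise, so `shift` controls the mismatch.
    """
    rng = np.random.default_rng(seed)

    # Small real training set and a noise-free test set from the real target.
    x_real = rng.uniform(-3, 3, size=(n_real, 1))
    y_real = np.sin(x_real[:, 0]) + noise * rng.normal(size=n_real)
    x_test = rng.uniform(-3, 3, size=(500, 1))
    y_test = np.sin(x_test[:, 0])

    # One pool of synthetic samples; each run uses the first m of them.
    n_max = max(synth_counts)
    x_syn = rng.uniform(-3, 3, size=(n_max, 1))
    y_syn = np.sin(x_syn[:, 0] + shift) + noise * rng.normal(size=n_max)

    results = []
    for m in synth_counts:
        x_train = np.vstack([x_real, x_syn[:m]])
        y_train = np.concatenate([y_real, y_syn[:m]])
        pred = krr_fit_predict(x_train, y_train, x_test)
        mse = float(np.mean((pred - y_test) ** 2))
        results.append((m / n_real, mse))
    return results


if __name__ == "__main__":
    for ratio, mse in sweep_synthetic():
        print(f"synthetic/real ratio = {ratio:5.2f}   test MSE = {mse:.4f}")
```

Varying `shift` and `n_real` in this sketch is one way to probe how the preferred amount of synthetic data moves as the mismatch grows or real data becomes less scarce.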