[2510.08095] Beyond Real Data: Synthetic Data through the Lens of Regularization

arXiv - Machine Learning 4 min read

Statistics > Machine Learning

arXiv:2510.08095 (stat)

[Submitted on 9 Oct 2025 (v1), last revised 31 Mar 2026 (this version, v2)]

Title: Beyond Real Data: Synthetic Data through the Lens of Regularization

Authors: Amitis Shidani, Tyler Farghly, Yang Sun, Habib Ganjgahi, George Deligiannidis

Abstract: Synthetic data can improve generalization when real data is scarce, but excessive reliance may introduce distributional mismatches that degrade performance. In this paper, we present a learning-theoretic framework to quantify the trade-off between synthetic and real data. Our approach leverages algorithmic stability to derive generalization error bounds, characterizing the optimal synthetic-to-real data ratio that minimizes expected test error as a function of the Wasserstein distance between the real and synthetic distributions. We motivate our framework in the setting of kernel ridge regression with mixed data, offering a detailed analysis that may be of independent interest. Our theory predicts the existence of an optimal ratio, leading to a U-shaped behavior of test error with respect to the proportion of synthetic data. Empirically, we validate this prediction on CIFAR-10 and a clinical brain MRI dataset. Our theory extends to the important scenario of domain adaptation, showing that carefully blending syntheti...
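The mixed-data kernel ridge regression setting mentioned in the abstract can be explored with a small toy experiment. The sketch below is not the authors' code or experimental setup; it is a minimal illustration under stated assumptions. The target function, the mismatched synthetic generator, the RBF kernel, and every parameter value (`shift`, `noise`, `lam`, `gamma`, the sample counts) are hypothetical choices made only for this example. It fits kernel ridge regression on a fixed set of real samples plus a growing number of synthetic samples and records the test error against the real target, so the effect of the synthetic-to-real ratio can be inspected. Whether a U-shape shows up in this toy depends on those choices.

```python
import numpy as np

def rbf_kernel(a, b, gamma=10.0):
    """Gaussian (RBF) kernel matrix between 1-D input arrays a and b."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def krr_fit_predict(x_train, y_train, x_test, lam=1e-2, gamma=10.0):
    """Kernel ridge regression: solve (K + n*lam*I) alpha = y, then predict."""
    n = len(x_train)
    K = rbf_kernel(x_train, x_train, gamma)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y_train)
    return rbf_kernel(x_test, x_train, gamma) @ alpha

def sweep_synthetic(n_real=20, synth_counts=(0, 10, 20, 40, 80, 160, 320),
                    shift=0.3, noise=0.3, seed=0):
    """Test MSE on the real target as more synthetic samples are mixed in."""
    rng = np.random.default_rng(seed)

    # Hypothetical "real" regression function (an assumption for this toy).
    def target(x):
        return np.sin(2 * np.pi * x)

    # Hypothetical synthetic generator: noiseless but systematically biased,
    # standing in for a distribution mismatch between real and synthetic data.
    def synth_target(x):
        return np.sin(2 * np.pi * x) + shift * x

    x_real = rng.uniform(0, 1, n_real)
    y_real = target(x_real) + noise * rng.normal(size=n_real)
    x_test = np.linspace(0, 1, 200)
    y_test = target(x_test)

    errors = []
    for m in synth_counts:
        x_syn = rng.uniform(0, 1, m)
        y_syn = synth_target(x_syn)
        x = np.concatenate([x_real, x_syn])
        y = np.concatenate([y_real, y_syn])
        pred = krr_fit_predict(x, y, x_test)
        errors.append(float(np.mean((pred - y_test) ** 2)))
    return list(synth_counts), errors

if __name__ == "__main__":
    for m, err in zip(*sweep_synthetic()):
        print(f"synthetic samples: {m:4d}  test MSE: {err:.4f}")
```

Varying `shift` (a crude proxy for the distance between the real and synthetic distributions) and `n_real` is one way to see how the preferred amount of synthetic data moves in this toy setting.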

Originally published on April 02, 2026. Curated by AI News.
