[2602.16065] Can Generative Artificial Intelligence Survive Data Contamination? Theoretical Guarantees under Contaminated Recursive Training
Summary
This paper examines the resilience of generative AI models to data contamination during recursive training, providing theoretical convergence guarantees, with empirical support, even for complex data distributions.
Why It Matters
As generative AI becomes more prevalent, understanding how these models handle data contamination is crucial for ensuring their reliability and effectiveness. This research addresses a significant gap in existing literature by demonstrating that recursive training can still converge under contaminated conditions, which has implications for the development and deployment of AI systems in real-world applications.
Key Takeaways
- Contaminated recursive training can still achieve convergence.
- The convergence rate depends on the baseline model and real data fraction.
- This study provides the first theoretical results without strict distributional assumptions.
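To make the recursive-training-with-contamination setup concrete, here is a minimal toy sketch (an illustrative assumption, not the paper's model or proof technique): each generation fits a one-dimensional Gaussian to a mixture containing a fraction `alpha` of real data and `1 - alpha` of synthetic data sampled from the previous generation's fitted model. The function name and all parameters are hypothetical.

```python
import numpy as np

def recursive_training(alpha, n=100, generations=1000, seed=0):
    """Toy sketch of contaminated recursive training (illustrative only):
    each generation refits a 1-D Gaussian on a mixture of a fraction
    `alpha` of real data and `1 - alpha` of synthetic data drawn from
    the previous generation's model. Returns the final fitted std."""
    rng = np.random.default_rng(seed)
    real = rng.normal(0.0, 1.0, size=n)      # fixed pool of "human" data
    mu, sigma = real.mean(), real.std()      # generation-0 model
    for _ in range(generations):
        n_real = int(alpha * n)
        batch = np.concatenate([
            rng.choice(real, n_real, replace=False),  # real fraction
            rng.normal(mu, sigma, size=n - n_real),   # AI-generated data
        ])
        mu, sigma = batch.mean(), batch.std()         # refit the model
    return sigma

# Purely self-consuming training (alpha = 0) drives the fitted variance
# toward zero (model collapse), while retaining a fraction of real data
# keeps the model's variance stable across generations.
print(recursive_training(alpha=0.0), recursive_training(alpha=0.5))
```

In this toy setting, the real-data fraction `alpha` plays the stabilizing role the takeaways attribute to it: with `alpha = 0` the fitted standard deviation shrinks across generations, while a positive fraction of real data anchors the model.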
Computer Science > Machine Learning
arXiv:2602.16065 (cs)
[Submitted on 17 Feb 2026]
Title: Can Generative Artificial Intelligence Survive Data Contamination? Theoretical Guarantees under Contaminated Recursive Training
Authors: Kevin Wang, Hongqian Niu, Didong Li
Abstract: Generative Artificial Intelligence (AI), such as large language models (LLMs), has become a transformative force across science, industry, and society. As these systems grow in popularity, AI-generated material becomes increasingly interwoven with web data, making it difficult to separate from human-generated content. Because generative models are updated regularly, later models will inevitably be trained on mixtures of human-generated data and AI-generated data from earlier versions, creating a recursive training process with data contamination. Existing theoretical work has examined only highly simplified settings, where both the real data and the generative model are discrete or Gaussian, and has shown that such recursive training leads to model collapse. However, real data distributions are far more complex, and modern generative models are far more flexible than Gaussian and linear mechanisms. To fill this gap, we study recursive training in a general ...