[2602.16601] Error Propagation and Model Collapse in Diffusion Models: A Theoretical Study
Summary
This theoretical study examines error propagation and model collapse in diffusion models, showing how recursive training on synthetic data can degrade performance and cause a progressive drift away from the target distribution.
Why It Matters
As machine learning increasingly relies on synthetic data, understanding the implications of error propagation and model collapse is crucial for developing robust models. This research provides theoretical insights and empirical evidence that can help practitioners mitigate performance degradation in generative models.
Key Takeaways
- Recursive training on synthetic data can lead to significant performance degradation.
- The study provides upper and lower bounds on the divergence between generated and target distributions.
- Different regimes of drift are characterized based on score estimation error and fresh data proportions.
- Empirical results on synthetic data and images support the theoretical findings.
- Understanding these dynamics is essential for improving the reliability of diffusion models.
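The recursive pipeline described above, where each training round fits a model to a mix of fresh target samples and samples generated by the previous model, can be illustrated with a toy simulation. The sketch below is a deliberately simplified 1-D Gaussian stand-in (not the paper's score-based diffusion setup): the "model" is a fitted Gaussian, an additive perturbation plays the role of score-estimation error, and the `fresh_frac` parameter mimics the proportion of fresh data per generation. All function and parameter names are hypothetical.

```python
import math
import random

def simulate_recursive_training(rounds=20, n=5000, fresh_frac=0.2,
                                est_error_std=0.05, seed=0):
    """Toy 1-D illustration of recursive-training drift: each round fits a
    Gaussian to a mix of fresh target samples and samples from the previous
    model, with additive noise standing in for estimation error.
    Returns the KL divergence from model to target after each round."""
    rng = random.Random(seed)
    mu_t, sigma_t = 0.0, 1.0          # target distribution N(0, 1)
    mu, sigma = mu_t, sigma_t         # current generative model
    divergences = []
    for _ in range(rounds):
        # Training set: a fresh_frac fraction from the target,
        # the rest sampled from the current model (synthetic data).
        data = [rng.gauss(mu_t, sigma_t) if rng.random() < fresh_frac
                else rng.gauss(mu, sigma) for _ in range(n)]
        # "Train": fit a Gaussian by moments, perturbed to model
        # imperfect estimation.
        m = sum(data) / n
        v = sum((x - m) ** 2 for x in data) / n
        mu = m + rng.gauss(0.0, est_error_std)
        sigma = max(math.sqrt(v) * (1.0 + rng.gauss(0.0, est_error_std)), 1e-6)
        # KL( N(mu, sigma^2) || N(mu_t, sigma_t^2) ), closed form.
        kl = (math.log(sigma_t / sigma)
              + (sigma**2 + (mu - mu_t)**2) / (2 * sigma_t**2) - 0.5)
        divergences.append(kl)
    return divergences
```

Running this with `fresh_frac=0.0` (purely synthetic retraining) typically shows the divergence growing over rounds, while a positive fraction of fresh data keeps it bounded, qualitatively matching the drift regimes the paper characterizes.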
Statistics > Machine Learning
arXiv:2602.16601 (stat) [Submitted on 18 Feb 2026]
Title: Error Propagation and Model Collapse in Diffusion Models: A Theoretical Study
Authors: Nail B. Khelifa, Richard E. Turner, Ramji Venkataramanan
Abstract: Machine learning models are increasingly trained or fine-tuned on synthetic data. Recursively training on such data has been observed to significantly degrade performance in a wide range of tasks, often characterized by a progressive drift away from the target distribution. In this work, we theoretically analyze this phenomenon in the setting of score-based diffusion models. For a realistic pipeline where each training round uses a combination of synthetic data and fresh samples from the target distribution, we obtain upper and lower bounds on the accumulated divergence between the generated and target distributions. This allows us to characterize different regimes of drift, depending on the score estimation error and the proportion of fresh data used in each generation. We also provide empirical results on synthetic data and images to illustrate the theory.
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Cite as: arXiv:2602.16601 [stat.ML] (or arXiv:2602.16601v1 [stat.ML] for this version), https://doi.org/10.48550/arXiv.2602.16601