Why AI Is Training on Its Own Garbage (and How to Fix It)
Deep web data is the gold we can't touch, yet

Sabrine Bendimerad · Apr 8, 2026 · 7 min read

Image generated with Gemini

If you have been interested in AI for a while, you are probably a regular user of LLMs, agents, or chat tools. But have you ever asked yourself how these tools will be trained in the near future? What if we have already used up the data we need to train them?

Many researchers argue that we are running out of high-quality, human-generated data to train our models. New content goes up every day, that's a reality, but an increasing share of what gets added daily is itself AI-generated. So if you keep training on public web data, you are eventually training on the outputs of your own predecessors. The snake eating its tail. Researchers call this phenomenon model collapse: AI models start learning from the errors of their predecessors until the whole system degrades into nonsense.

But what if I told you we aren't actually running out of data? We've just been looking in the wrong place. In this article, I am going to break down the key insights from this brilliant paper.

The Web We Already Use and the Web That Matters

Most of us think of the web as a single source of information. In reality, there are at least two. There is the Surface Web: the indexed, public world of Reddit, Wikipedia, and news sites. This is what we have already scraped and overused for years to train the mainstream ...