[2602.16136] Retrieval Collapses When AI Pollutes the Web
Summary
The paper introduces 'Retrieval Collapse,' a two-stage failure mode in which AI-generated content first dominates search results and erodes source diversity, and low-quality or adversarial content then infiltrates the retrieval pipeline.
Why It Matters
As AI-generated content proliferates, search engines and Retrieval-Augmented Generation (RAG) systems increasingly draw their evidence from it. Understanding 'Retrieval Collapse' is crucial for developing strategies to maintain content quality and source diversity in search results, which is vital for accurate information dissemination.
Key Takeaways
- Retrieval Collapse occurs when AI-generated content overwhelms search results.
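- Even high-quality, SEO-style AI content is a risk: at 67% pool contamination, over 80% of exposed results were contaminated, while answer accuracy stayed misleadingly stable.
- Under adversarial contamination, baselines like BM25 exposed about 19% of harmful content, whereas LLM-based rankers suppressed it more effectively.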
Computer Science > Information Retrieval
arXiv:2602.16136 (cs)
[Submitted on 18 Feb 2026]
Title: Retrieval Collapses When AI Pollutes the Web
Authors: Hongyeon Yu, Dongchan Kim, Young-Bum Kim
Abstract: The rapid proliferation of AI-generated content on the Web presents a structural risk to information retrieval, as search engines and Retrieval-Augmented Generation (RAG) systems increasingly consume evidence produced by the Large Language Models (LLMs). We characterize this ecosystem-level failure mode as Retrieval Collapse, a two-stage process where (1) AI-generated content dominates search results, eroding source diversity, and (2) low-quality or adversarial content infiltrates the retrieval pipeline. We analyzed this dynamic through controlled experiments involving both high-quality SEO-style content and adversarially crafted content. In the SEO scenario, a 67% pool contamination led to over 80% exposure contamination, creating a homogenized yet deceptively healthy state where answer accuracy remains stable despite the reliance on synthetic sources. Conversely, under adversarial contamination, baselines like BM25 exposed ~19% of harmful content, whereas LLM-based rankers demonstrated stronger suppression capabilities. These findings highlight the risk of retrieval pipelines quietly shifting toward synthetic evidence and the nee...
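The abstract distinguishes pool contamination from exposure contamination. The excerpt does not give the paper's exact definitions, so the following is only a minimal sketch under two assumptions: pool contamination is the AI-generated share of the candidate pool, and exposure contamination is the AI-generated share of the top-k result slots across queries. The `Doc` class, the `is_ai_generated` label, the function names, and the toy rankings are all hypothetical illustrations, not the paper's code or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    is_ai_generated: bool  # hypothetical provenance label (assumption)


def pool_contamination(pool):
    """Fraction of documents in the candidate pool that are AI-generated."""
    if not pool:
        return 0.0
    return sum(d.is_ai_generated for d in pool) / len(pool)


def exposure_contamination(rankings, k):
    """Fraction of top-k result slots, over all queries, filled by AI-generated docs."""
    shown = [d for ranked in rankings for d in ranked[:k]]
    if not shown:
        return 0.0
    return sum(d.is_ai_generated for d in shown) / len(shown)


# Toy pool: 2 human-written docs and 4 AI-generated docs (4/6 of the pool).
pool = [
    Doc("h1", False), Doc("h2", False),
    Doc("a1", True), Doc("a2", True), Doc("a3", True), Doc("a4", True),
]

# Hypothetical ranked lists for two queries; a real ranker (e.g. BM25)
# would produce these from query-document scores.
rankings = [
    [pool[2], pool[3], pool[0], pool[4], pool[1], pool[5]],
    [pool[4], pool[5], pool[2], pool[1], pool[0], pool[3]],
]

pool_rate = pool_contamination(pool)
exposure_rate = exposure_contamination(rankings, k=3)
print(f"pool contamination:     {pool_rate:.2f}")
print(f"exposure contamination: {exposure_rate:.2f}")
```

With these made-up rankings the sketch prints 0.67 and 0.83. The numbers only illustrate the mechanism: when a ranker favors synthetic documents, their share of what users see can exceed their share of the pool. They are not a reproduction of the paper's experiments.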