[2602.15210] ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset
Summary
The paper presents the multilingual data curation strategies behind ÜberWeb, a 20-trillion-token dataset for training foundation models, and shows that targeted improvements in data quality can enhance performance across multiple languages.
Why It Matters
As multilingual capabilities become essential for AI models, understanding how to effectively curate data across languages can lead to significant advancements in model performance. This research highlights the importance of data quality over quantity, providing a roadmap for future multilingual AI development.
Key Takeaways
- Targeted data curation improves multilingual model performance.
- Quality enhancements in one language can benefit others.
- Models trained on the curated 20-trillion-token dataset reach competitive accuracy with fewer training resources.
Computer Science > Machine Learning
arXiv:2602.15210 (cs)
[Submitted on 16 Feb 2026]

Title: ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset

Authors: DatologyAI: Aldo Gael Carranza, Kaleigh Mentzer, Ricardo Pio Monti, Alex Fang, Alvin Deng, Amro Abbas, Anshuman Suri, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Diego Kiner, Fan Pan, Haakon Mongstad, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Luke Merrick, Parth Doshi, Paul Burstein, Pratyush Maini, Spandan Das, Tony Jiang, Vineeth Dorna, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt

Abstract: Multilinguality is a core capability for modern foundation models, yet training high-quality multilingual models remains challenging due to uneven data availability across languages. A further challenge is the performance interference that can arise from joint multilingual training, commonly referred to as the "curse of multilinguality". We study multilingual data curation across thirteen languages and find that many reported regressions are not inherent to multilingual scaling but instead stem from correctable deficiencies in data quality and composition rather than fundamental capacity limits. In controlled bilingual experiments, improving data quality for any single language benefits o...
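
To make the idea of "targeted curation of quality and composition" concrete, here is a minimal illustrative sketch of per-language quality filtering combined with per-language token budgets. This is not the paper's pipeline; the quality heuristic, the threshold, and the budget are hypothetical placeholders chosen only to show the general shape of such a curation step.

```python
# Illustrative sketch only, not the ÜberWeb pipeline.
# Assumptions: quality_score() is a stand-in heuristic; real systems would use
# model- or classifier-based scoring, and budgets would come from a data mix plan.
from collections import defaultdict

def quality_score(doc: str) -> float:
    """Toy quality heuristic: reject very short docs, score the rest by
    lexical diversity (unique tokens / total tokens)."""
    tokens = doc.split()
    if len(tokens) < 20:
        return 0.0
    return len(set(tokens)) / len(tokens)

def curate(corpus, per_language_token_budget, threshold):
    """Keep documents above a quality threshold, capping each language at a
    fixed token budget so no single language dominates the mix."""
    kept = defaultdict(list)
    token_counts = defaultdict(int)
    # corpus: iterable of (language_code, document_text) pairs
    for lang, doc in corpus:
        if quality_score(doc) < threshold:
            continue  # drop low-quality document
        n_tokens = len(doc.split())
        if token_counts[lang] + n_tokens > per_language_token_budget:
            continue  # budget for this language is exhausted
        kept[lang].append(doc)
        token_counts[lang] += n_tokens
    return kept, token_counts

# Toy bilingual usage example (repetitive docs, so a low threshold is used).
corpus = [
    ("en", "the quick brown fox jumps over the lazy dog " * 5),
    ("de", "der schnelle braune Fuchs springt ueber den faulen Hund " * 5),
    ("en", "spam spam spam spam"),  # too short / low quality, filtered out
]
kept, counts = curate(corpus, per_language_token_budget=1000, threshold=0.1)
print(dict(counts))  # tokens kept per language
```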