[2602.15210] ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset

arXiv - Machine Learning · 4 min ·

Summary

The paper presents multilingual data-curation strategies for training foundation models on a 20-trillion-token dataset, showing that targeted improvements in data quality can raise performance across multiple languages.

Why It Matters

As multilingual capabilities become essential for AI models, understanding how to effectively curate data across languages can lead to significant advancements in model performance. This research highlights the importance of data quality over quantity, providing a roadmap for future multilingual AI development.

Key Takeaways

  • Targeted data curation improves multilingual model performance.
  • Quality enhancements in one language can benefit others.
  • Models trained on the curated 20-trillion-token dataset reach competitive accuracy with fewer training resources.

Computer Science > Machine Learning
arXiv:2602.15210 (cs) [Submitted on 16 Feb 2026]

Title: ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset

Authors: DatologyAI: Aldo Gael Carranza, Kaleigh Mentzer, Ricardo Pio Monti, Alex Fang, Alvin Deng, Amro Abbas, Anshuman Suri, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Diego Kiner, Fan Pan, Haakon Mongstad, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Luke Merrick, Parth Doshi, Paul Burstein, Pratyush Maini, Spandan Das, Tony Jiang, Vineeth Dorna, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt

Abstract: Multilinguality is a core capability for modern foundation models, yet training high-quality multilingual models remains challenging due to uneven data availability across languages. A further challenge is the performance interference that can arise from joint multilingual training, commonly referred to as the "curse of multilinguality". We study multilingual data curation across thirteen languages and find that many reported regressions are not inherent to multilingual scaling but instead stem from correctable deficiencies in data quality and composition rather than fundamental capacity limits. In controlled bilingual experiments, improving data quality for any single language benefits o...
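To make "targeted data curation" concrete, below is a minimal, self-contained sketch of per-language quality filtering for web documents. Everything in it (the QualityThresholds and keep_document names, the specific rules, and every threshold value) is a hypothetical illustration in the spirit of common web-filtering heuristics, not the actual ÜberWeb pipeline, which the abstract does not specify.

```python
# Minimal sketch of per-language quality filtering for multilingual web
# data. All rules and thresholds below are hypothetical illustrations,
# NOT the UeberWeb pipeline (the abstract does not specify its filters).
from dataclasses import dataclass

@dataclass
class QualityThresholds:
    min_words: int = 50              # drop very short pages
    max_mean_word_len: float = 12.0  # flags garbled text / boilerplate
    max_symbol_ratio: float = 0.10   # fraction of '#'/'...'-style tokens

# "Targeted" curation: tune filters per language instead of applying one
# global rule set. German gets a looser word-length cap because compound
# nouns are long, not because the text is low quality.
LANG_THRESHOLDS = {
    "en": QualityThresholds(),
    "de": QualityThresholds(max_mean_word_len=16.0),
    "fi": QualityThresholds(max_mean_word_len=15.0),
}

def keep_document(text: str, lang: str) -> bool:
    """Return True if a web document passes its language's quality filters."""
    t = LANG_THRESHOLDS.get(lang, QualityThresholds())
    words = text.split()
    if len(words) < t.min_words:
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    if mean_len > t.max_mean_word_len:
        return False
    symbol_tokens = sum(w in {"#", "...", "…"} for w in words)
    return symbol_tokens / len(words) <= t.max_symbol_ratio

# Example: an 80-word English page passes; a page of 60 very long
# tokens is rejected even under the looser German cap.
assert keep_document("word " * 80, "en")
assert not keep_document("Donaudampfschifffahrtsgesellschaft " * 60, "de")
```

The per-language override table is the point: a single global rule set systematically over-filters some languages (e.g., those with long compounds), which is one plausible way the "correctable deficiencies in data quality and composition" the abstract describes could arise.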

Related Articles

LLMs

[D] How to break free from LLM's chains as a PhD student?

I didn't realize it, but over the past year I have become over-reliant on ChatGPT to write code. I am a second-year PhD student and don...

Reddit - Machine Learning · 1 min ·
LLMs

[R] Reference-model-free behavioral discovery of AuditBench model organisms via Probe-Mediated Adaptive Auditing

Anthropic's AuditBench - 56 Llama 3.3 70B models with planted hidden behaviors - their best agent detects the behaviors 10-13% of the tim...

Reddit - Machine Learning · 1 min ·
LLMs

[P] Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done — here's what I've built.

The problem: If you work with Italian text and local models, you know the pain. Every open-source LLM out there treats Italian as an after...

Reddit - Machine Learning · 1 min ·
LLMs

I have been coding for 11 years and I caught myself completely unable to debug a problem without AI assistance last month. That scared me more than anything I have seen in this industry.

I want to be honest about something that happened to me because I think it is more common than people admit. Last month I hit a bug in a ...

Reddit - Artificial Intelligence · 1 min ·