# Streaming datasets: 100x More Efficient

*Published October 27, 2025 · Andres Marafioti (andito), Quentin Lhoest (lhoestq), Ben Burtenshaw (burtenshaw), Pedro Cuenca (pcuenq), Merve (merve)*

## TLDR

We boosted `load_dataset('dataset', streaming=True)`: stream datasets without downloading them, with one line of code! Start training on multi-TB datasets immediately, with no complex setup, no downloads, no "disk out of space", and no 429 "stop requesting!" errors. It's super fast, too: it outruns our local SSDs when training on 64xH100 with 256 workers downloading data. We improved streaming to achieve:

- 100x fewer requests
- 10x faster data file resolution
- 2x samples/sec throughput
- 0 worker crashes at 256 concurrent workers

Loading data, especially at the terabyte scale, is a major pain in any machine learning workflow. We suffered this while training SmolLM3: at one point we had to wait 3 hours before each run just to download enough data. Streaming has always been possible in the `datasets` library, but large-scale training with massive datasets remained a challenge. That changes today 🔥. We spent a few months improving the backend, focusing on making dataset streaming faster and more efficient. What did we do exactly? ⤵️

## Streaming: The Same Easy API

First things first: our changes are backwards compatible. You can still stream any dataset from the Hub with the same simple `streaming=True` flag. It's as easy as ever. 🚀

from datasets impor...