429 – Hugging Face

Hugging Face Blog 8 min read

About this article

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

Streaming datasets: 100x More Efficient

Published October 27, 2025

Andres Marafioti (andito), Quentin Lhoest (lhoestq), ben burtenshaw (burtenshaw), Pedro Cuenca (pcuenq), merve (merve)

TLDR

We boosted load_dataset('dataset', streaming=True), which streams datasets without downloading them, with one line of code! Start training on multi-TB datasets immediately, without complex setups, downloads, "disk out of space" errors, or 429 "stop requesting!" errors.

It's super fast! It outran our local SSDs when training on 64xH100 with 256 workers downloading data. We've improved streaming to have 100x fewer requests → 10x faster data resolution → 2x samples/sec → 0 worker crashes at 256 concurrent workers.

Loading data, especially at the terabyte scale, is a major pain in any machine learning workflow. We suffered from this while training SmolLM3: at one point we had to wait 3 hours before each run to download enough data.

Streaming has always been possible in the datasets library, but large-scale training with massive datasets remained a challenge. That changes today 🔥. We spent a few months improving the backend, focusing on making dataset streaming faster and more efficient.

What did we do exactly? ⤵️

Streaming: The Same Easy API

First things first: our changes are backwards compatible. You can still stream any dataset from the Hub with the same simple streaming=True flag. It's as easy as ever. 🚀

from datasets impor...
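The original snippet is cut off above, so here is a minimal sketch of what streaming with the streaming=True flag can look like. This is not the article's own code. The helper names take and stream_split are hypothetical, the split="train" default is an assumption, and the repo id is left as a parameter to fill in with a real Hub dataset.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def take(stream: Iterable[Any], n: int) -> list:
    """Return the first n items of any iterable without reading further."""
    return list(islice(stream, n))


def stream_split(repo_id: str, split: str = "train") -> Iterator[dict]:
    """Lazily yield examples from a Hub dataset instead of downloading it first.

    Hypothetical helper: repo_id and split are placeholders chosen for
    illustration, not values taken from the article.
    """
    # Imported lazily so `take` stays usable without the datasets library.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches data on demand.
    dataset = load_dataset(repo_id, split=split, streaming=True)
    yield from dataset
```

Calling take(stream_split("dataset"), 3), with "dataset" replaced by a real repo id, would pull only the first three examples over the network. Because stream_split is a generator, nothing is requested until iteration starts.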

Originally published on February 15, 2026. Curated by AI News.
