I spent years building a 103B-token Usenet corpus (1980–2013) and finally documented it [P]

Reddit - Machine Learning 1 min read

About this article

For the past several years I've been quietly assembling and processing what I believe is one of the larger privately held pretraining corpora around: a complete Usenet archive spanning 1980 to 2013. Here's what it ended up being:

- 103.1 billion tokens (cl100k_base)
- 408 million posts across 9 newsgroup hierarchies
- 18,347 newsgroups covered
- 33 years of continuous coverage

The processing pipeline included full deduplication, binary removal (alt.binaries.* excluded at the hierarchy level before ...
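The excerpt names two filtering steps: excluding alt.binaries.* at the hierarchy level, and deduplication. The post's actual pipeline is not shown here, so the following is only a minimal sketch of how such steps might look. The post dictionary layout (`newsgroups`, `body`), the function names, the choice to drop a cross-posted article if any of its groups is excluded, and the exact-match hashing after whitespace and case normalization are all assumptions made for illustration, not the author's method.

```python
import hashlib

# Assumption: only the hierarchy named in the excerpt is excluded.
EXCLUDED_HIERARCHIES = ("alt.binaries",)


def is_excluded(newsgroup: str) -> bool:
    """True if the group is an excluded hierarchy or sits beneath one."""
    name = newsgroup.strip().lower()
    return any(
        name == root or name.startswith(root + ".")
        for root in EXCLUDED_HIERARCHIES
    )


def body_fingerprint(body: str) -> str:
    """Hash the body after collapsing whitespace and lowercasing.

    Assumption: exact-match dedup on normalized text. Near-duplicate
    detection (for example MinHash) would need a different approach.
    """
    normalized = " ".join(body.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_posts(posts):
    """Yield posts outside excluded hierarchies, skipping repeated bodies."""
    seen = set()
    for post in posts:
        # Hypothetical policy: drop the post if ANY cross-posted group is excluded.
        if any(is_excluded(group) for group in post["newsgroups"]):
            continue
        fingerprint = body_fingerprint(post["body"])
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        yield post


sample_posts = [
    {"newsgroups": ["comp.lang.c"], "body": "Hello  world"},
    {"newsgroups": ["alt.binaries.pictures"], "body": "begin 644 image.gif"},
    {"newsgroups": ["sci.math"], "body": "hello world"},
    {"newsgroups": ["rec.arts.books"], "body": "Another post"},
]
kept = list(filter_posts(sample_posts))
```

Filtering by group name before hashing means excluded posts never enter the `seen` set, which keeps the dedup state smaller. Whether the original pipeline ordered the steps this way is not stated in the excerpt.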


Originally published on May 01, 2026. Curated by AI News.
