I spent years building a 103B-token Usenet corpus (1980–2013) and finally documented it [P]
For the past several years I've been quietly assembling and processing what I believe is one of the larger privately held pretraining corpora around: a complete Usenet archive spanning 1980 to 2013. Here's what it ended up being:

- 103.1 billion tokens (cl100k_base)
- 408 million posts across 9 newsgroup hierarchies
- 18,347 newsgroups covered
- 33 years of continuous coverage

The processing pipeline included full deduplication and binary removal (alt.binaries.* excluded at the hierarchy level before ...
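As a rough illustration of the two pipeline steps named above (hierarchy-level exclusion of alt.binaries.* and deduplication), here is a minimal sketch. All names and the data layout are illustrative assumptions, not the actual pipeline; this shows exact dedup via content hashing, whereas a real corpus pipeline would likely also use fuzzy dedup (e.g. MinHash).

```python
import hashlib

# Hypothetical config: hierarchies dropped wholesale before any per-post work.
EXCLUDED_HIERARCHIES = ("alt.binaries.",)

def keep_post(newsgroup: str) -> bool:
    # Drop any post whose newsgroup falls under an excluded hierarchy.
    return not newsgroup.startswith(EXCLUDED_HIERARCHIES)

def dedup(posts):
    """Yield (newsgroup, body) pairs, skipping excluded hierarchies
    and exact-duplicate bodies (SHA-256 of the UTF-8 text)."""
    seen = set()
    for group, body in posts:
        if not keep_post(group):
            continue
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield group, body

# Toy input: one binaries post and one exact duplicate should be removed.
posts = [
    ("comp.lang.c", "hello world"),
    ("alt.binaries.pictures", "BASE64DATA"),
    ("comp.lang.c", "hello world"),  # exact duplicate
    ("rec.arts.sf", "a post"),
]
kept = list(dedup(posts))
```

Running this leaves only the two unique, non-binaries posts in `kept`.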