Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models
Published March 20, 2024 · Loubna Ben Allal (loubnabnl), Anton Lozhkov (anton-l), Daniel van Strien (davanstrien)

In this blog post, we outline the challenges and solutions involved in generating a synthetic dataset with billions of tokens to replicate Phi-1.5, leading to the creation of Cosmopedia.

Synthetic data has become a central topic in Machine Learning. It refers to artificially generated data, for instance produced by large language models (LLMs), that mimics real-world data. Traditionally, creating datasets for supervised fine-tuning and instruction-tuning required the costly and time-consuming process of hiring human annotators. This practice demanded significant resources, limiting the development of such datasets to a few key players in the field. However, the landscape has recently changed: hundreds of high-quality synthetic fine-tuning datasets have been developed, primarily using GPT-3.5 and GPT-4. The community has also supported this development with numerous publications that guide the process for various domains and address the associated challenges [1][2][3][4][5].

Figure 1. Datasets on the Hugging Face Hub with the tag `synthetic`.

However, this is not another blog post about generating synthetic instruction-tuning datasets, a subject the community is already exploring extensively. We focus on scaling from a few thousand to millions of samples...
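Before getting into scale, it may help to make the basic mechanism concrete. Below is a minimal, hypothetical sketch (the template wording, function name, and `audience` parameter are illustrative, not taken from this post or from Cosmopedia): synthetic data generation in this sense typically means formatting many varied prompts and keeping the LLM completions as samples.

```python
# Hypothetical sketch: build one generation prompt per topic. In a real
# pipeline, each prompt would be sent to an LLM (e.g. via a chat-completions
# API) and the returned completion kept as a synthetic training sample.

TEMPLATE = (
    "Write an educational article about {topic} "
    "aimed at {audience}. Keep it factual and self-contained."
)

def build_prompts(topics, audience="college students"):
    """Format the (illustrative) template once per topic."""
    return [TEMPLATE.format(topic=t, audience=audience) for t in topics]

prompts = build_prompts(["photosynthesis", "binary search"])
print(prompts[0])
```

Varying the topic (and, as we will see, the audience and style) across prompts is what drives diversity in the resulting dataset.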