Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models
Published March 20, 2024 · Loubna Ben Allal (loubnabnl), Anton Lozhkov (anton-l), Daniel van Strien (davanstrien)

In this blog post, we outline the challenges and solutions involved in generating a synthetic dataset with billions of tokens to replicate Phi-1.5, leading to the creation of Cosmopedia.

Synthetic data has become a central topic in Machine Learning. It refers to artificially generated data, for instance produced by large language models (LLMs), that mimics real-world data. Traditionally, creating datasets for supervised fine-tuning and instruction-tuning required the costly and time-consuming process of hiring human annotators. This practice demanded significant resources, limiting the development of such datasets to a few key players in the field. However, the landscape has recently changed: hundreds of high-quality synthetic fine-tuning datasets have been developed, primarily using GPT-3.5 and GPT-4. The community has also supported this development with numerous publications that guide the process for various domains and address the associated challenges [1][2][3][4][5].

Figure 1. Datasets on the Hugging Face Hub with the tag `synthetic`.

However, this is not another blog post about generating synthetic instruction-tuning datasets, a subject the community is already exploring extensively. We focus on scaling from a few thousand to millions of samples...
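Before getting into scale, it may help to make the basic mechanism concrete. Below is a minimal, hypothetical sketch (the template wording, function name, and `audience` parameter are illustrative, not taken from this post or from Cosmopedia): synthetic data generation in this sense typically means formatting many varied prompts and keeping the LLM completions as samples.

```python
# Hypothetical sketch: build one generation prompt per topic. In a real
# pipeline, each prompt would be sent to an LLM (e.g. via a chat-completions
# API) and the returned completion kept as a synthetic training sample.

TEMPLATE = (
    "Write an educational article about {topic} "
    "aimed at {audience}. Keep it factual and self-contained."
)

def build_prompts(topics, audience="college students"):
    """Format the (illustrative) template once per topic."""
    return [TEMPLATE.format(topic=t, audience=audience) for t in topics]

prompts = build_prompts(["photosynthesis", "binary search"])
print(prompts[0])
```

Varying the topic (and, as we will see, the audience and style) across prompts is what drives diversity in the resulting dataset.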