StarCoder2 and The Stack v2
Published February 28, 2024

Leandro von Werra (lvwerra), Loubna Ben Allal (loubnabnl), Anton Lozhkov (anton-l), Nouamane Tazi (nouamanetazi)

BigCode is releasing StarCoder2, the next generation of transparently trained open code LLMs. All StarCoder2 variants were trained on The Stack v2, a new large, high-quality code dataset. We release all models, datasets, and the processing as well as the training code. Check out the paper for details.

What is StarCoder2?

StarCoder2 is a family of open LLMs for code, available in three sizes: 3B, 7B, and 15B parameters. The flagship StarCoder2-15B model was trained on over 4 trillion tokens and 600+ programming languages from The Stack v2. All models use Grouped Query Attention and a context window of 16,384 tokens with sliding window attention of 4,096 tokens, and were trained using the Fill-in-the-Middle objective.

The three models were trained by different partners: the 3 billion-parameter model by ServiceNow, the 7 billion-parameter model by Hugging Face, and the 15 billion-parameter model by NVIDIA using NVIDIA NeMo on NVIDIA accelerated infrastructure:

StarCoder2-3B was trained on 17 programming languages from The Stack v2 on 3+ trillion tokens.
StarCoder2-7B was trained on 17 programming languages from The Stack v2 on 3.5+ trillion tokens.
StarCoder2-15B was trained on 600+ programming languages from The Stack v2 on 4+ trillion ...
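The Fill-in-the-Middle (FIM) objective mentioned above trains the model to complete code given both a prefix and a suffix, rather than only left-to-right context. A minimal sketch of how such a prompt is assembled, assuming the sentinel token names used by the StarCoder family (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`); verify the exact token strings against the StarCoder2 tokenizer before use:

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a FIM prompt: the model is expected to generate the
    missing middle segment after the <fim_middle> sentinel.

    Token names are an assumption based on the StarCoder family;
    check the model's tokenizer special tokens to confirm them.
    """
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"


# Hypothetical gap to fill: the body of a function whose signature
# (prefix) and return statement (suffix) are already written.
prefix = "def fibonacci(n):\n    "
suffix = "\n    return fib"
prompt = build_fim_prompt(prefix, suffix)
```

At inference time, the text the model emits after `<fim_middle>` is spliced between the prefix and suffix, which is what makes FIM-trained models useful for in-editor completion rather than only appending code at the end of a file.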