[2505.00022] Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation

[2505.00022] Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation

arXiv - AI 4 min read

About this article

Abstract page for arXiv paper 2505.00022: Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation

Computer Science > Computation and Language arXiv:2505.00022 (cs) [Submitted on 24 Apr 2025 (v1), last revised 31 Mar 2026 (this version, v3)] Title:Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation Authors:Thomas F Burns, Letitia Parcalabescu, Stephan Wäldchen, Michael Barlow, Gregor Ziegltrum, Volker Stampa, Bastian Harren, Björn Deiseroth View a PDF of the paper titled Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation, by Thomas F Burns and 7 other authors View PDF HTML (experimental) Abstract:Scaling data quantity is essential for large language models (LLMs), yet recent findings show that data quality can significantly boost performance and training efficiency. We introduce a German-language dataset curation pipeline that combines heuristic and model-based filtering techniques with synthetic data generation. We use our pipeline to create Aleph-Alpha-GermanWeb, a 628B-word German pre-training dataset composed of three subsets drawing from: (1) Common Crawl web data (organic subset; 78B words), (2) FineWeb2 (organic subset; 235B), and (3) synthetically-generated data conditioned on actual, organic web data (synthetic subset; 329B). We evaluate our dataset by pre-training both a 1B Llama-style model and an 8B tokeniser-free hierarchical autoregressive transformer (HAT) from scratch. A comparison on German-language bench...

Originally published on April 01, 2026. Curated by AI News.

Related Articles

Llms

AI helped me build a custom PC and 4 apps in 6 months with zero coding experience

Mid-October, early morning at work. I was hunting for a podcast to throw on while I worked and stumbled into something about what AI coul...

Reddit - Artificial Intelligence · 1 min ·
Llms

I thought of something while cooking up a simple RL AI. Please Validate it. [R]

So, I was trying to build a simple AI when I thought of, 'How could I give an AI some emotions? ' This led to one thing after another, an...

Reddit - Machine Learning · 1 min ·
Llms

Open-source list of GenAI-related incidents

I am sharing this open-source list of cases where the ethics of GenAI use were put in the spotlight, in the hopes of sparking discussion ...

Reddit - Artificial Intelligence · 1 min ·
Llms

I built a repo for implementing and training LLM architectures from scratch in minimal PyTorch — contributions welcome! [P]

Hey everyone, I've been working on a repo where I implement large language model architectures using the simplest PyTorch code possible. ...

Reddit - Machine Learning · 1 min ·
More in Llms: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime