[2505.02881] Rewriting Pre-Training Data Boosts LLM Performance in Math and Code
Computer Science > Machine Learning

arXiv:2505.02881 (cs)

[Submitted on 5 May 2025 (v1), last revised 1 Mar 2026 (this version, v4)]

Title: Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

Authors: Kazuki Fujii, Yukito Tajima, Sakae Mizuki, Masaki Kawamura, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Masanari Oi, Taishi Nakamura, Takumi Okamoto, Shigeki Ishida, Kakeru Hattori, Youmi Ma, Hiroya Takamura, Rio Yokota, Jun Sakuma, Naoaki Okazaki

Abstract: The performance of large language models (LLMs) in program synthesis and mathematical reasoning is fundamentally limited by the quality of their pre-training corpora. We introduce two openly licensed pre-training datasets, released under the Llama 3.3 Community License, that significantly enhance LLM performance by systematically rewriting public data. SwallowCode ($\approx$16.1 billion tokens) refines Python snippets from The-Stack-v2 through a novel four-stage pipeline: syntax validation, pylint-based style filtering, and a two-stage LLM rewriting process that enforces style conformity and transforms snippets into self-contained, algorithmically efficient examples. Unlike prior methods that rely on exclusionary filtering or limited transformations, our transform-and-retain approach refines low-quality code, maximizing data utility. SwallowMa...
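
The abstract describes the first two pipeline stages as mechanical filters applied before the LLM rewriting steps. Below is a minimal sketch of how such filters could be implemented, assuming Python's ast module for syntax validation and pylint's programmatic API for style scoring; the helper names, the temporary-file handling, and the 7.0 score cutoff are illustrative assumptions, not values taken from the paper.

import ast
from io import StringIO

from pylint.lint import Run
from pylint.reporters.text import TextReporter


def passes_syntax_check(source: str) -> bool:
    # Stage 1 (assumed): keep only snippets that parse as valid Python.
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def pylint_score(source: str, tmp_path: str = "snippet.py") -> float:
    # Stage 2 (assumed): score a snippet with pylint; higher means cleaner style.
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write(source)
    out = StringIO()
    result = Run([tmp_path], reporter=TextReporter(out), exit=False)
    # pylint's global evaluation score is on a 0-10 scale.
    return result.linter.stats.global_note


snippet = "def add(a, b):\n    return a + b\n"
if passes_syntax_check(snippet) and pylint_score(snippet) >= 7.0:
    print("snippet retained for the LLM rewriting stages")

Snippets surviving both filters would then be passed to the two LLM rewriting stages described in the abstract (style conformity, then transformation into self-contained, algorithmically efficient examples), which this sketch does not attempt to reproduce.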