[2602.17288] ArXiv-to-Model: A Practical Study of Scientific LM Training
Summary
This paper presents a detailed case study of training a 1.36B-parameter scientific language model directly from raw arXiv LaTeX sources. Its focus is the end-to-end engineering pipeline rather than a novel architecture.
Why It Matters
The practical process of training domain-specialized scientific language models from raw sources remains under-documented. This study helps fill that gap with insights aimed at researchers working under constrained compute (here, 2xA100 GPUs), and it emphasizes how strongly preprocessing and infrastructure decisions shape the outcome.
Key Takeaways
- Preprocessing decisions significantly impact the volume of usable tokens.
- Tokenization affects the stability of symbolic representations in models.
- Storage and I/O constraints can be as limiting as compute resources.
- The study provides a transparent account of training a scientific language model.
- Insights are particularly valuable for researchers with moderate compute budgets.
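The tokenization takeaway can be illustrated with a small sketch. The two regular expressions below are hypothetical pre-tokenization rules chosen for illustration; they are not the tokenizer described in the paper. The point is only that a splitter unaware of LaTeX breaks a command such as \alpha into a backslash plus a word, while a LaTeX-aware rule keeps it as one symbol.

```python
import re

# Hypothetical LaTeX-aware rule (an assumption, not the paper's tokenizer):
# a control word (\frac), a control symbol (\{), a word, a digit run,
# or any other single non-space character.
LATEX_AWARE = re.compile(r"\\[A-Za-z]+|\\.|[A-Za-z]+|\d+|\S")

# Naive baseline: words, digit runs, and single non-space characters.
# The backslash is split off from the command name that follows it.
NAIVE = re.compile(r"[A-Za-z]+|\d+|\S")


def pretokenize(text: str, pattern: re.Pattern) -> list[str]:
    """Split text into pre-tokens using the given pattern."""
    return pattern.findall(text)


if __name__ == "__main__":
    sample = r"\frac{\alpha}{2}"
    print(pretokenize(sample, NAIVE))
    print(pretokenize(sample, LATEX_AWARE))
```

On this sample the naive rule yields nine pieces and the LaTeX-aware rule seven, with \frac and \alpha each kept intact. A real subword tokenizer would apply learned merges on top of such pre-tokens, so this shows only the first step.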
Computer Science > Artificial Intelligence
arXiv:2602.17288 (cs)
[Submitted on 19 Feb 2026]
Title: ArXiv-to-Model: A Practical Study of Scientific LM Training
Authors: Anuj Gupta
Abstract: While frontier large language models demonstrate strong reasoning and mathematical capabilities, the practical process of training domain-specialized scientific language models from raw sources remains under-documented. In this work, we present a detailed case study of training a 1.36B-parameter scientific language model directly from raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics. We describe an end-to-end pipeline covering metadata filtering, archive validation, LaTeX extraction, text normalization, domain-aware tokenization, and dense transformer training under constrained compute (2xA100 GPUs). Through 24 experimental runs, we analyze training stability, scaling behavior, data yield losses, and infrastructure bottlenecks. Our findings highlight how preprocessing decisions significantly affect usable token volume, how tokenization impacts symbolic stability, and how storage and I/O constraints can rival compute as limiting factors. We further analyze convergence dynamics and show stable training behavior in a data-rich regime (52B pretraining tokens). Rather than proposing a novel architecture, this work provides a...
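The pipeline stages and the data yield losses named in the abstract lend themselves to simple bookkeeping. The sketch below is one possible way to track how many documents survive each stage. The `StageCount` class, the `yield_report` function, and every count in the example are assumptions made for illustration, not figures or code from the paper.

```python
from dataclasses import dataclass


@dataclass
class StageCount:
    name: str      # pipeline stage label
    docs_in: int   # documents entering the stage
    docs_out: int  # documents surviving the stage


def yield_report(stages: list[StageCount]) -> list[tuple[str, float, float]]:
    """Return (stage name, stage retention, cumulative retention) per stage."""
    if not stages:
        return []
    start = stages[0].docs_in
    report = []
    for s in stages:
        stage_keep = s.docs_out / s.docs_in if s.docs_in else 0.0
        cumulative = s.docs_out / start if start else 0.0
        report.append((s.name, stage_keep, cumulative))
    return report


# Made-up counts purely for illustration; they are NOT results from the paper.
example = [
    StageCount("metadata filtering", 1000, 800),
    StageCount("archive validation", 800, 600),
    StageCount("LaTeX extraction", 600, 480),
]

if __name__ == "__main__":
    for name, keep, cum in yield_report(example):
        print(f"{name:20s} stage={keep:.0%} cumulative={cum:.0%}")
```

Reporting cumulative retention next to per-stage retention shows how individually modest losses compound. In the made-up example, no stage drops more than a quarter of its input, yet fewer than half of the original documents remain.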