[2602.17288] ArXiv-to-Model: A Practical Study of Scientific LM Training

arXiv - AI, February 20, 2026

Summary

The paper is a detailed case study of training a 1.36B-parameter scientific language model from raw arXiv LaTeX sources. Its focus is the engineering involved: data preprocessing, tokenization, and training on two A100 GPUs.

Why It Matters

Practical accounts of training domain-specific language models from raw sources are scarce. This study helps fill that gap with details that researchers on limited resources can apply to their own models, and it stresses how much preprocessing and infrastructure shape the outcome of training.

Key Takeaways

Preprocessing decisions significantly impact the volume of usable tokens.
Tokenization affects the stability of symbolic representations in models.
Storage and I/O constraints can be as limiting as compute resources.
The study provides a transparent account of training a scientific language model.
Insights are particularly valuable for researchers with moderate compute budgets.
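
The first takeaway can be illustrated with a minimal sketch: a single preprocessing choice (here, dropping LaTeX comments) changes how many tokens survive. The helper names, the sample text, and the whitespace-based token count are all assumptions made for illustration; the paper's actual pipeline and tokenizer are not reproduced here.

```python
import re

# Hypothetical helper: a LaTeX comment runs from an unescaped % to end of line.
# Simplification: a literal "\\%" (escaped backslash, then comment) is not handled.
_COMMENT = re.compile(r"(?<!\\)%.*")

def strip_comments(tex: str) -> str:
    """Remove LaTeX comments line by line, keeping escaped percent signs."""
    return "\n".join(_COMMENT.sub("", line).rstrip() for line in tex.splitlines())

def whitespace_tokens(text: str) -> int:
    """Crude token-count proxy; a real pipeline would use its trained tokenizer."""
    return len(text.split())

# Made-up sample, not taken from any arXiv source.
sample = "\n".join([
    r"Let $x \in X$. % TODO: fix notation",
    r"Growth is 5\% per year.",
    r"% whole-line comment",
])

cleaned = strip_comments(sample)
before = whitespace_tokens(sample)
after = whitespace_tokens(cleaned)
print(f"tokens before: {before}, after: {after}, kept: {after / before:.0%}")
```

Even on this toy input a single filter removes a large share of the raw tokens, which is the kind of yield loss the takeaway refers to.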

arXiv:2602.17288 (cs), Computer Science > Artificial Intelligence
Submitted on 19 Feb 2026
Title: ArXiv-to-Model: A Practical Study of Scientific LM Training
Authors: Anuj Gupta

Abstract: While frontier large language models demonstrate strong reasoning and mathematical capabilities, the practical process of training domain-specialized scientific language models from raw sources remains under-documented. In this work, we present a detailed case study of training a 1.36B-parameter scientific language model directly from raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics. We describe an end-to-end pipeline covering metadata filtering, archive validation, LaTeX extraction, text normalization, domain-aware tokenization, and dense transformer training under constrained compute (2xA100 GPUs). Through 24 experimental runs, we analyze training stability, scaling behavior, data yield losses, and infrastructure bottlenecks. Our findings highlight how preprocessing decisions significantly affect usable token volume, how tokenization impacts symbolic stability, and how storage and I/O constraints can rival compute as limiting factors. We further analyze convergence dynamics and show stable training behavior in a data-rich regime (52B pretraining tokens). Rather than proposing a novel architecture, this work provides a...
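
As a rough check on what "data-rich" means here, the abstract's own figures (52B pretraining tokens, 1.36B parameters) give a tokens-per-parameter ratio. The sketch below only does that arithmetic. The reference point of roughly 20 tokens per parameter is a commonly cited compute-optimal rule of thumb and an assumption of this example, not a figure from the paper.

```python
# Figures taken from the abstract.
PARAMS = 1.36e9          # model parameters
PRETRAIN_TOKENS = 52e9   # pretraining tokens

# Assumption (not from the paper): a commonly cited compute-optimal
# rule of thumb of roughly 20 training tokens per parameter.
HEURISTIC_TOKENS_PER_PARAM = 20

tokens_per_param = PRETRAIN_TOKENS / PARAMS
relative_to_heuristic = tokens_per_param / HEURISTIC_TOKENS_PER_PARAM

print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"multiple of the heuristic: {relative_to_heuristic:.2f}x")
```

On these numbers the run sees about 38 tokens per parameter, which is consistent with the abstract's description of a data-rich regime.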


