[2602.17288] ArXiv-to-Model: A Practical Study of Scientific LM Training

Summary

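The paper is a detailed case study of training a 1.36B-parameter scientific language model directly from raw arXiv LaTeX sources. Its focus is the engineering process, from data preparation and tokenization through training on constrained compute, rather than a new architecture.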

Why It Matters

The practical process of training domain-specific language models from raw sources is poorly documented. This study fills part of that gap with an account that researchers with limited resources can use when building their own models, and it shows how much preprocessing and infrastructure matter to the outcome.

Key Takeaways

  • Preprocessing decisions significantly impact the volume of usable tokens.
  • Tokenization affects the stability of symbolic representations in models.
  • Storage and I/O constraints can be as limiting as compute resources.
  • The study provides a transparent account of training a scientific language model.
  • Insights are particularly valuable for researchers with moderate compute budgets.
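The first takeaway can be made concrete by counting what survives each preprocessing stage. The sketch below is a minimal illustration, not the paper's pipeline: the stage functions `has_document_body` and `strip_comments`, the whitespace token counter, and the two sample documents are all hypothetical stand-ins.

```python
import re

# Hypothetical toy inputs; real inputs would be extracted arXiv LaTeX sources.
SAMPLE_DOCS = [
    "\\begin{document} Let $x$ be real. % todo fix\n\\end{document}",
    "just a readme with no latex body",
]


def has_document_body(tex: str) -> bool:
    """Assumed validation rule: keep sources that contain a document body."""
    return "\\begin{document}" in tex and "\\end{document}" in tex


def strip_comments(tex: str) -> str:
    """Assumed normalization rule: drop an unescaped % through end of line."""
    return re.sub(r"(?<!\\)%.*", "", tex)


def count_tokens(text: str) -> int:
    """Whitespace token count, a crude stand-in for a real tokenizer."""
    return len(text.split())


def yield_report(docs: list[str]) -> list[tuple[str, int, int]]:
    """Return (stage name, documents kept, tokens kept) after each stage."""
    def snapshot(name: str, kept: list[str]) -> tuple[str, int, int]:
        return (name, len(kept), sum(count_tokens(d) for d in kept))

    report = [snapshot("raw", docs)]

    # Filtering stage: removes whole documents.
    docs = [d for d in docs if has_document_body(d)]
    report.append(snapshot("has_document_body", docs))

    # Normalization stage: keeps documents but removes tokens inside them.
    docs = [strip_comments(d) for d in docs]
    report.append(snapshot("strip_comments", docs))
    return report


if __name__ == "__main__":
    for stage, n_docs, n_tokens in yield_report(SAMPLE_DOCS):
        print(f"{stage:20s} docs={n_docs} tokens={n_tokens}")
```

Logging document and token counts after every stage shows which decision costs the most data, which is the kind of yield loss the study reports analyzing.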

Computer Science > Artificial Intelligence
arXiv:2602.17288 (cs)
Submitted on 19 Feb 2026

Title: ArXiv-to-Model: A Practical Study of Scientific LM Training
Authors: Anuj Gupta

Abstract: While frontier large language models demonstrate strong reasoning and mathematical capabilities, the practical process of training domain-specialized scientific language models from raw sources remains under-documented. In this work, we present a detailed case study of training a 1.36B-parameter scientific language model directly from raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics. We describe an end-to-end pipeline covering metadata filtering, archive validation, LaTeX extraction, text normalization, domain-aware tokenization, and dense transformer training under constrained compute (2xA100 GPUs). Through 24 experimental runs, we analyze training stability, scaling behavior, data yield losses, and infrastructure bottlenecks. Our findings highlight how preprocessing decisions significantly affect usable token volume, how tokenization impacts symbolic stability, and how storage and I/O constraints can rival compute as limiting factors. We further analyze convergence dynamics and show stable training behavior in a data-rich regime (52B pretraining tokens). Rather than proposing a novel architecture, this work provides a...
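To give a feel for what "constrained compute" means at this scale, the figures in the abstract can be plugged into the common rule of thumb that dense transformer training costs about 6 FLOPs per parameter per token. The sketch below is a rough estimate under stated assumptions, not a result from the paper: the 6ND approximation, the 40% hardware utilization (`ASSUMED_MFU`), and the premise that a single run consumes all 52B tokens are assumptions. The 312 TFLOPS figure is the published dense BF16 peak for an A100.

```python
# Figures taken from the abstract.
PARAMS = 1.36e9           # model parameters
TOKENS = 52e9             # pretraining tokens
NUM_GPUS = 2              # 2xA100

# Assumptions that are NOT from the paper.
A100_PEAK_FLOPS = 312e12  # dense BF16 peak per A100 (vendor specification)
ASSUMED_MFU = 0.40        # assumed fraction of peak actually achieved

SECONDS_PER_DAY = 86_400


def training_flops(params: float, tokens: float) -> float:
    """Approximate training cost with the 6 * N * D rule of thumb."""
    return 6.0 * params * tokens


def wall_clock_days(total_flops: float, num_gpus: int,
                    peak_flops: float, mfu: float) -> float:
    """Convert a FLOP budget into days at a given sustained utilization."""
    sustained_flops_per_second = num_gpus * peak_flops * mfu
    return total_flops / sustained_flops_per_second / SECONDS_PER_DAY


tokens_per_param = TOKENS / PARAMS
flops = training_flops(PARAMS, TOKENS)
days = wall_clock_days(flops, NUM_GPUS, A100_PEAK_FLOPS, ASSUMED_MFU)

if __name__ == "__main__":
    print(f"tokens per parameter: {tokens_per_param:.1f}")
    print(f"estimated training FLOPs: {flops:.3e}")
    print(f"estimated wall-clock days: {days:.1f}")
```

Under these assumptions a single pass over 52B tokens costs roughly 4.2e20 FLOPs, or about 20 days on two A100s. At that pace, time lost to storage and I/O stalls adds up quickly, which fits the abstract's point that such constraints can rival compute.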
