[2603.17074] PRISM: Demystifying Retention and Interaction in Mid-Training
Computer Science > Machine Learning
arXiv:2603.17074 (cs)
[Submitted on 17 Mar 2026 (v1), last revised 21 Mar 2026 (this version, v2)]

Title: PRISM: Demystifying Retention and Interaction in Mid-Training
Authors: Bharat Runwal, Ashish Agrawal, Anurag Roy, Rameswar Panda

Abstract: We present PRISM, a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we show that mid-training on approximately 27B high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 points on code, and +6 to +13 points on science benchmarks while preserving general performance. The full PRISM-to-RL pipeline improves the macro-average across six reasoning benchmarks from under 12 to 29-42 (a 3-4x improvement), whereas RL applied directly to most of the base models remains substantially less effective, with AIME scores near zero. Data composition matters most at mid-training, not during RL: including science data during mid-training unlocks +17 to +28 point GPQA-Diamond gains during RL, while changing the RL mix produces differences of less than 2 points. Mechanistically, mid-train...
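For concreteness, the macro-average cited above is the unweighted mean of per-benchmark scores, so no benchmark dominates by virtue of scale. A minimal sketch, assuming six hypothetical benchmark scores on a 0-100 scale (the names and values are illustrative placeholders, not the paper's data):

    # Macro-average: unweighted mean over per-benchmark scores.
    # Benchmark names and values below are hypothetical, for illustration only.
    from statistics import mean

    def macro_average(scores: dict[str, float]) -> float:
        """Return the unweighted mean of per-benchmark scores."""
        return mean(scores.values())

    post_pipeline = {
        "math_bench": 35.0, "code_bench": 40.0, "science_bench": 30.0,
        "bench_4": 28.0, "bench_5": 33.0, "bench_6": 31.0,
    }
    print(f"macro-average: {macro_average(post_pipeline):.1f}")  # 32.8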