PRISM: Demystifying Retention and Interaction in Mid-Training

A large controlled empirical study of design choices for LLM mid-training, the phase between pretraining and RL post-training. PRISM runs experiments across seven base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention–Mamba hybrid), and scales from 3B to 24B parameters, decomposing the design decisions into their principal axes: data composition, timing, domain interaction, benchmark selection, RL compatibility, and scaling.

Key findings: mid-training on only ~27B high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 on code, and +6 to +13 on science while preserving general performance. The full PRISM→RL pipeline lifts the six-benchmark reasoning macro-average from under 12 to 29–42 (a 3–4× improvement), whereas RL applied directly to most base models stays weak (AIME near zero). Crucially, data composition matters most at mid-training, not at RL: including science data during mid-training unlocks +17 to +28 points on GPQA-Diamond under RL, while changing the RL mix moves results by <2 points. Mechanistically, mid-training densely restructures >90% of weights, while RL makes sparse, front-loaded refinements to ~5%. By IBM Research and the MIT-IBM Watson AI Lab (senior author Rameswar Panda).
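To make the dense-versus-sparse contrast concrete, here is a minimal sketch of one way to quantify how many weights a training stage changes: compare two checkpoints and count the scalars that moved by more than a tolerance. The function name `changed_fraction`, the tolerance `tol`, and the toy checkpoints are all illustrative assumptions; the summary does not say how the paper measures this, and its procedure may differ.

```python
from typing import Dict, List

# Hypothetical checkpoint format: parameter name -> flat list of weights.
Checkpoint = Dict[str, List[float]]


def changed_fraction(before: Checkpoint, after: Checkpoint, tol: float = 1e-6) -> float:
    """Return the fraction of scalar weights whose absolute change exceeds tol."""
    changed = 0
    total = 0
    for name, old in before.items():
        new = after[name]
        if len(new) != len(old):
            raise ValueError(f"size mismatch for parameter {name!r}")
        for w_old, w_new in zip(old, new):
            total += 1
            if abs(w_new - w_old) > tol:
                changed += 1
    if total == 0:
        raise ValueError("no parameters to compare")
    return changed / total


# Toy checkpoints with 20 weights each (made-up values, not from the paper).
base = {"layer0": [0.0] * 10, "layer1": [1.0] * 10}
# "Mid-training" stand-in: 19 of 20 weights move (dense update).
mid = {"layer0": [0.1] * 10, "layer1": [1.0] + [1.1] * 9}
# "RL" stand-in: 1 of 20 weights moves relative to `mid` (sparse update).
rl = {"layer0": [0.1] * 10, "layer1": [1.0, 1.2] + [1.1] * 8}

dense = changed_fraction(base, mid)
sparse = changed_fraction(mid, rl)
print(f"mid-training stand-in changed {dense:.0%} of weights")
print(f"RL stand-in changed {sparse:.0%} of weights")
```

With real models, the same comparison would run over tensors from two saved checkpoints, and the result depends on the chosen tolerance and numeric precision, so any threshold should be reported alongside the fraction.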

Links: Paper (arXiv) · arXiv HTML · IBM Research blog · Project page

Authors: Bharat Runwal · Ashish Agrawal · Anurag Roy · Rameswar Panda

Tags: post-training · reinforcement-learning · reasoning · research