"OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling." Studies why Llama-family base models historically respond worse to RL post-training than Qwen-family bases, and shows the gap can be closed with the right mid-training recipe — the slice of pretraining between the base run and SFT/RL. Two findings drive the result: (1) high-quality mathematical corpora boost both base eval and RL gains; and (2) a Stable-then-Decay schedule (200B constant-rate tokens followed by 20B decay-phase tokens) is the most effective curriculum.

Releases the MegaMath-Web-Pro-Max corpus (70B+ tokens of curated mathematical web text) and the OctoThinker open-weight family at 1B / 3B / 8B sizes (Long and Hybrid variants, each in Base and Zero versions), built on Llama-3 bases. From the GAIR Lab at the Shanghai Innovation Institute (SII), by Zengzhi Wang, Fan Zhou, Xuefeng Li, and Pengfei Liu.
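For readers who want to inspect the corpus, here is a minimal sketch using the Hugging Face `datasets` library with streaming, so the 70B+ tokens are not downloaded up front. The hub ID is a hypothetical path inferred from the corpus name above; check the official release page for the actual repository.

```python
# Stream a few documents from the corpus. The repo ID below is an
# assumption based on the corpus name, not a confirmed path.
from datasets import load_dataset

corpus = load_dataset("OctoThinker/MegaMath-Web-Pro-Max",  # hypothetical ID
                      split="train", streaming=True)
for i, doc in enumerate(corpus):
    print(doc)  # one curated math document per record
    if i == 2:
        break
```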

Model Details

Variants

| Name | Parameters | Notes |
| --- | --- | --- |
| OctoThinker-1B-Long-Base / Long-Zero / Hybrid-Base / Hybrid-Zero | 1B | Built on meta-llama/Llama-3.2-1B |
| OctoThinker-3B-Long-Base / Long-Zero / Hybrid-Base / Hybrid-Zero | 3B | Built on meta-llama/Llama-3.2-3B |
| OctoThinker-8B-Long-Base / Hybrid-Base | 8B | 8B-parameter variants of the same recipe |
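Any of the variants above should load with the standard `transformers` causal-LM API, as in the sketch below. The hub ID shown is an assumed path following the table's naming; substitute the actual repository ID from the release.

```python
# Minimal generation sketch with `transformers`. The repo ID is a
# hypothetical path based on the model names in the table above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OctoThinker/OctoThinker-3B-Hybrid-Base"  # hypothetical hub ID
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             torch_dtype=torch.bfloat16)

prompt = "Question: What is 17 * 24?\nAnswer:"
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```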

Paper

Authors: Zengzhi Wang · Fan Zhou · Xuefeng Li · Pengfei Liu
Tags: reasoning · training · open-weight · foundational