OctoThinker
Paper: "OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling." Studies why Llama-family base models have historically responded worse to RL post-training than Qwen-family bases, and shows that the gap can be closed with the right mid-training recipe (the slice of pretraining between the base run and SFT/RL). Two findings drive the result: (1) high-quality mathematical corpora improve both base-model evaluation scores and downstream RL gains; and (2) a Stable-then-Decay schedule (200B constant-rate tokens followed by 20B decay-phase tokens) is the most effective curriculum.
Releases the MegaMath-Web-Pro-Max corpus (70B+ tokens of curated math data) and the OctoThinker open-weight family at 1B / 3B / 8B sizes (Long and Hybrid variants, each in Base and Zero forms), built on Llama-3.2 bases. From the GAIR Lab at the Shanghai Innovation Institute (SII), by Zengzhi Wang, Fan Zhou, Xuefeng Li, and Pengfei Liu.
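The Stable-then-Decay curriculum is a two-phase token budget: hold the training rate constant for the first ~200B tokens, then anneal over the final ~20B. A minimal sketch of that idea as a learning-rate function follows; the peak LR, cosine decay shape, and floor ratio are illustrative assumptions, not values from the paper:

```python
import math

def stable_then_decay_lr(tokens_seen, peak_lr=3e-5,
                         stable_tokens=200e9, decay_tokens=20e9,
                         min_lr_ratio=0.1):
    """Sketch of a Stable-then-Decay schedule over the token budget.

    Stable phase: constant peak_lr for the first `stable_tokens` tokens.
    Decay phase: cosine anneal to peak_lr * min_lr_ratio over the next
    `decay_tokens` tokens. All hyperparameter values here are
    illustrative assumptions, not the paper's settings.
    """
    if tokens_seen <= stable_tokens:
        return peak_lr
    # Fraction of the decay phase completed, clamped to [0, 1].
    t = min((tokens_seen - stable_tokens) / decay_tokens, 1.0)
    min_lr = peak_lr * min_lr_ratio
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
```

In the paper's recipe the decay phase is also where the data mixture shifts toward higher-quality math; pairing that mixture switch with the annealing step is the core of the curriculum.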
Model Details
Variants
| Name | Parameters | Notes |
|---|---|---|
| OctoThinker-1B-Long-Base / Long-Zero / Hybrid-Base / Hybrid-Zero | 1B | Built on meta-llama/Llama-3.2-1B |
| OctoThinker-3B-Long-Base / Long-Zero / Hybrid-Base / Hybrid-Zero | 3B | Built on meta-llama/Llama-3.2-3B |
| OctoThinker-8B-Long-Base / Hybrid-Base | 8B | Base variants only; no Zero models released at this size |