The world's first open-weight, large-scale hybrid-attention reasoning model (456B total parameters, 45.9B active per token). Its mix of Lightning Attention and softmax attention supports 1M-token contexts at roughly 75% less compute than DeepSeek R1 at 100K-token generation lengths. Reinforcement learning with the proposed CISPO algorithm completed in three weeks on 512 H800 GPUs, at a rental cost of roughly $530K.
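
A minimal sketch of the CISPO idea, assuming the paper's formulation: instead of clipping the token update PPO/GRPO-style (which zeroes the gradient of clipped tokens), the importance-sampling weight itself is clipped and detached, so every generated token keeps a gradient through its log-probability. The function name, tensor shapes, and clip bounds below are illustrative, not the released training code.

```python
import torch

def cispo_loss(logp_new, logp_old, advantages, mask, eps_low=None, eps_high=2.0):
    """CISPO-style policy-gradient loss (illustrative sketch).

    logp_new / logp_old: per-token log-probs under the current / behaviour policy.
    advantages: per-token (group-relative) advantages.
    mask: 1.0 for response tokens, 0.0 elsewhere.
    """
    ratio = torch.exp(logp_new - logp_old)               # importance weight r_t
    lo = 1.0 - eps_low if eps_low is not None else 0.0   # lower clip optional
    clipped = ratio.clamp(min=lo, max=1.0 + eps_high).detach()  # sg(clip(r_t))
    # REINFORCE-style term: clipped weight * advantage * log-prob, every token
    # keeps a gradient through logp_new even when its weight is clipped.
    per_token = clipped * advantages * logp_new
    # Token-level normalization over all response tokens in the batch.
    return -(per_token * mask).sum() / mask.sum().clamp(min=1.0)
```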

Outputs (2)

MiniMax-M1 (model)
Architecture: MoE
Parameters: 456B total
Active params: 45.9B
Context window: 1,000,000 tokens
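
The 1M-token window rests on Lightning Attention, a hardware-efficient linear-attention kernel that the model interleaves with softmax-attention blocks (roughly one softmax block per seven linear blocks, per the paper). Below is a minimal sketch of the blockwise linear-attention recurrence that keeps cost linear in sequence length; the real kernel adds decay factors, normalization, and GPU tiling, and the shapes and names here are illustrative.

```python
import torch

def blockwise_linear_attention(q, k, v, block_size=256):
    """Causal linear attention computed tile by tile (illustrative sketch).

    q, k, v: (batch, seq_len, dim). History is carried in a fixed-size
    key-value state, so cost grows linearly with sequence length instead of
    quadratically.
    """
    b, n, d = q.shape
    kv_state = torch.zeros(b, d, d, dtype=q.dtype, device=q.device)
    outputs = []
    for start in range(0, n, block_size):
        qb, kb, vb = (t[:, start:start + block_size] for t in (q, k, v))
        # Intra-block: causal attention inside the tile (quadratic in block_size only).
        scores = torch.einsum('bqd,bkd->bqk', qb, kb).tril()
        intra = torch.einsum('bqk,bkd->bqd', scores, vb)
        # Inter-block: contribution of all earlier tiles via the running state.
        inter = torch.einsum('bqd,bde->bqe', qb, kv_state)
        outputs.append(intra + inter)
        # Fold this tile's keys/values into the running state.
        kv_state = kv_state + torch.einsum('bkd,bke->bde', kb, vb)
    return torch.cat(outputs, dim=1)
```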

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention (paper)

arXiv: 2506.13585

Tags: moe, reasoning, efficiency, open-weight, training, attention
