The world's first open-weight, large-scale hybrid-attention reasoning model (456B total parameters, 45.9B active per token). Its mix of Lightning Attention and softmax attention supports 1M-token contexts at roughly 75% less compute than DeepSeek R1 at 100K-token generation lengths. Reinforcement learning with the proposed CISPO algorithm completed in three weeks on 512 H800 GPUs, at a rental cost of roughly $530K.
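
A minimal sketch of the CISPO idea, assuming the paper's formulation: instead of clipping the token update PPO/GRPO-style (which zeroes the gradient of clipped tokens), the importance-sampling weight itself is clipped and detached, so every generated token keeps a gradient through its log-probability. The function name, tensor shapes, and clip bounds below are illustrative, not the released training code.

```python
import torch

def cispo_loss(logp_new, logp_old, advantages, mask, eps_low=None, eps_high=2.0):
    """CISPO-style policy-gradient loss (illustrative sketch).

    logp_new / logp_old: per-token log-probs under the current / behaviour policy.
    advantages: per-token (group-relative) advantages.
    mask: 1.0 for response tokens, 0.0 elsewhere.
    """
    ratio = torch.exp(logp_new - logp_old)               # importance weight r_t
    lo = 1.0 - eps_low if eps_low is not None else 0.0   # lower clip optional
    clipped = ratio.clamp(min=lo, max=1.0 + eps_high).detach()  # sg(clip(r_t))
    # REINFORCE-style term: clipped weight * advantage * log-prob, every token
    # keeps a gradient through logp_new even when its weight is clipped.
    per_token = clipped * advantages * logp_new
    # Token-level normalization over all response tokens in the batch.
    return -(per_token * mask).sum() / mask.sum().clamp(min=1.0)
```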

Outputs (2)

MiniMax-M1 (model)
Architecture: MoE
Parameters: 456B total
Active params: 45.9B
Context window: 1,000,000 tokens
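
The 1M-token window rests on Lightning Attention, a hardware-efficient linear-attention kernel that the model interleaves with softmax-attention blocks (roughly one softmax block per seven linear blocks, per the paper). Below is a minimal sketch of the blockwise linear-attention recurrence that keeps cost linear in sequence length; the real kernel adds decay factors, normalization, and GPU tiling, and the shapes and names here are illustrative.

```python
import torch

def blockwise_linear_attention(q, k, v, block_size=256):
    """Causal linear attention computed tile by tile (illustrative sketch).

    q, k, v: (batch, seq_len, dim). History is carried in a fixed-size
    key-value state, so cost grows linearly with sequence length instead of
    quadratically.
    """
    b, n, d = q.shape
    kv_state = torch.zeros(b, d, d, dtype=q.dtype, device=q.device)
    outputs = []
    for start in range(0, n, block_size):
        qb, kb, vb = (t[:, start:start + block_size] for t in (q, k, v))
        # Intra-block: causal attention inside the tile (quadratic in block_size only).
        scores = torch.einsum('bqd,bkd->bqk', qb, kb).tril()
        intra = torch.einsum('bqk,bkd->bqd', scores, vb)
        # Inter-block: contribution of all earlier tiles via the running state.
        inter = torch.einsum('bqd,bde->bqe', qb, kv_state)
        outputs.append(intra + inter)
        # Fold this tile's keys/values into the running state.
        kv_state = kv_state + torch.einsum('bkd,bke->bde', kb, vb)
    return torch.cat(outputs, dim=1)
```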

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention (paper)

arXiv: 2506.13585

Tags: moe, reasoning, efficiency, open-weight, training, attention
