309B MoE model (15B active) optimized for extreme inference speed (150 tokens/sec) and low-cost API access. Introduces a hybrid attention architecture interleaving Sliding Window Attention (SWA) with global attention at a 5:1 ratio, achieving a nearly 6x reduction in KV-cache storage. Pre-trained on 27 trillion tokens with Multi-Token Prediction. Uses Multi-Teacher On-Policy Distillation (MOPD) for efficient post-training. Rivals DeepSeek-V3.2 and Kimi-K2 while using one-half to one-third as many parameters.
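
The quoted KV-cache saving follows from simple arithmetic over the layer pattern: with five sliding-window layers for every global layer, only one layer in six caches the full context. The sketch below illustrates that calculation; the layer count, window size, and context length are illustrative assumptions, not values from the report.

```python
# Back-of-the-envelope KV-cache estimate for a 5:1 SWA-to-global layer pattern.
# All numbers here (layer count, window size, context length) are illustrative
# assumptions, not values from the technical report.

def kv_cache_tokens(num_layers: int, context_len: int, window: int, swa_per_global: int = 5) -> int:
    """Total cached key/value positions across all layers for one sequence."""
    cycle = swa_per_global + 1                      # e.g. 5 SWA layers + 1 global layer
    num_global = num_layers // cycle
    num_swa = num_layers - num_global
    # Global layers cache every position; SWA layers only cache their window.
    return num_global * context_len + num_swa * min(window, context_len)

# Hypothetical configuration for illustration only.
layers, ctx, win = 48, 128_000, 2_048
hybrid = kv_cache_tokens(layers, ctx, win)
full = layers * ctx                                 # all-global baseline
print(f"hybrid/full = {hybrid / full:.3f}")         # ~0.18, i.e. roughly a 5-6x reduction
```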

Outputs (2)

MiMo-V2-Flash

model
Architecture MoE
Total parameters 309B
Active parameters 15B

MiMo-V2-Flash: Unlocking Extreme Inference Efficiency

paper

Technical report detailing hybrid SWA + global attention, Multi-Token Prediction, and Multi-Teacher On-Policy Distillation (MOPD).
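
For readers unfamiliar with Multi-Token Prediction, the sketch below shows a minimal version of such a training objective: auxiliary heads predict tokens several positions ahead in addition to the standard next-token loss. The head count, loss weights, and shapes are assumptions for illustration, not the report's exact recipe.

```python
# Minimal sketch of a multi-token prediction (MTP) objective, assuming one extra
# prediction head per future offset. Weights and shapes are illustrative only.
import torch.nn.functional as F

def mtp_loss(hidden, heads, targets, num_future: int = 2, aux_weight: float = 0.3):
    """hidden: [batch, seq, dim]; heads: list of Linear(dim -> vocab) modules;
    targets: [batch, seq] token ids. Head k predicts the token k+1 steps ahead."""
    total = 0.0
    for k in range(num_future):
        logits = heads[k](hidden[:, : hidden.size(1) - (k + 1)])  # drop last k+1 positions
        labels = targets[:, k + 1 :]                              # shift targets by k+1
        ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
        total = total + (1.0 if k == 0 else aux_weight) * ce      # next-token loss + weighted auxiliary terms
    return total
```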

arXiv: 2601.02780

moe · efficiency · open-weight
