The first fully open mixture-of-experts (MoE) LLM: weights, training data (OLMoE-Mix), code, and training logs are all released. 6.9B total parameters / 1.3B active per token, with 64 experts per layer and top-8 dropless token-choice routing. Trained on 5.13T tokens using 128 H100 GPUs.
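The routing scheme above can be illustrated with a minimal sketch. This is a hypothetical NumPy illustration of top-8 token-choice routing over 64 experts, not OLMoE's actual implementation (which is in PyTorch and includes a load-balancing loss):

```python
import numpy as np

def top_k_route(logits, k=8):
    """Token-choice routing: each token independently picks its top-k experts.
    logits: (n_tokens, n_experts) router scores for one MoE layer."""
    # Indices of the k highest-scoring experts per token.
    topk_idx = np.argsort(logits, axis=-1)[:, -k:]
    topk_logits = np.take_along_axis(logits, topk_idx, axis=-1)
    # Softmax over only the selected experts gives the gate weights
    # used to combine the k expert outputs.
    e = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    gates = e / e.sum(axis=-1, keepdims=True)
    # "Dropless" means every token keeps all k assignments; no expert
    # capacity limit causes tokens to be dropped.
    return topk_idx, gates

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 64))  # 4 tokens, 64 experts per layer
idx, gates = top_k_route(logits)
print(idx.shape, gates.shape)      # (4, 8) (4, 8)
```

Each token activates only 8 of the 64 expert FFNs per layer, which is why just 1.3B of the 6.9B parameters are active per token.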

Demonstrates that fine-grained routing with many small experts (64 per layer) outperforms coarser configurations. Trains roughly 2x faster than an equivalent dense model. Competitive with Llama 2 13B at ~6-7x less compute per forward pass. MMLU: 54.1, HellaSwag: 80.0. Apache 2.0 license.

Model Details

Architecture: MoE
Parameters: 6.9B total
Active params: 1.3B per token
Context window: 4,096 tokens

Paper

arXiv: 2409.02060

moe · open-source · open-weight · research
