The first fully open mixture-of-experts (MoE) LLM: weights, training data (OLMoE-Mix), code, and training logs are all released. 6.9B total parameters / 1.3B active per token, with 64 experts per layer and top-8 dropless token-choice routing. Trained on 5.13T tokens using 128 H100 GPUs.
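The routing scheme above can be illustrated with a minimal sketch. This is a hypothetical NumPy illustration of top-8 token-choice routing over 64 experts, not OLMoE's actual implementation (which is in PyTorch and includes a load-balancing loss):

```python
import numpy as np

def top_k_route(logits, k=8):
    """Token-choice routing: each token independently picks its top-k experts.
    logits: (n_tokens, n_experts) router scores for one MoE layer."""
    # Indices of the k highest-scoring experts per token.
    topk_idx = np.argsort(logits, axis=-1)[:, -k:]
    topk_logits = np.take_along_axis(logits, topk_idx, axis=-1)
    # Softmax over only the selected experts gives the gate weights
    # used to combine the k expert outputs.
    e = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    gates = e / e.sum(axis=-1, keepdims=True)
    # "Dropless" means every token keeps all k assignments; no expert
    # capacity limit causes tokens to be dropped.
    return topk_idx, gates

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 64))  # 4 tokens, 64 experts per layer
idx, gates = top_k_route(logits)
print(idx.shape, gates.shape)      # (4, 8) (4, 8)
```

Each token activates only 8 of the 64 expert FFNs per layer, which is why just 1.3B of the 6.9B parameters are active per token.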

Demonstrates that fine-grained routing with many small experts (64 per layer) outperforms coarser configurations. Trains roughly 2x faster than an equivalent dense model. Competitive with Llama 2 13B at ~6-7x less compute per forward pass. MMLU: 54.1, HellaSwag: 80.0. Apache 2.0 license.

Model Details

Architecture: MoE
Parameters: 6.9B total
Active params: 1.3B per token
Context window: 4,096 tokens

Paper

arXiv: 2409.02060

moe · open-source · open-weight · research
