MoE model with 32 experts (2 active per token), 40B total / 3.7B active parameters. Introduced an "Attention Router" for expert selection, reporting a 3.8% accuracy improvement over classical routers. Surpassed Llama3-70B on MATH and ARC-Challenge while requiring roughly 1/19th the compute. Trained on 2T tokens.
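For intuition, here is a minimal PyTorch sketch of attention-style expert routing: each token's query is scored against learned per-expert key embeddings, and the top-2 experts are activated. The projection names, dimensions, and scoring function here are illustrative assumptions, not the paper's exact "Attention Router" formulation (see arXiv:2405.17976 for the details).

```python
# Illustrative sketch only: an attention-style MoE router.
# Dimensions, names, and the scoring function are assumptions,
# not the exact formulation from arXiv:2405.17976.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionRouter(nn.Module):
    """Scores experts via attention between token queries and learned
    per-expert key embeddings, then selects the top-k experts."""

    def __init__(self, d_model: int, n_experts: int = 32, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.q_proj = nn.Linear(d_model, d_model, bias=False)  # token -> query
        # One learned key embedding per expert (hypothetical parameterization).
        self.expert_keys = nn.Parameter(torch.randn(n_experts, d_model) * d_model**-0.5)

    def forward(self, x: torch.Tensor):
        # x: (batch, d_model) token representations
        q = self.q_proj(x)                                      # (batch, d_model)
        scores = q @ self.expert_keys.t() * x.size(-1) ** -0.5  # (batch, n_experts)
        probs = F.softmax(scores, dim=-1)
        gate_w, expert_idx = probs.topk(self.top_k, dim=-1)    # top-2 experts
        gate_w = gate_w / gate_w.sum(dim=-1, keepdim=True)     # renormalize gates
        return gate_w, expert_idx


router = AttentionRouter(d_model=512)
weights, experts = router(torch.randn(4, 512))
print(experts.shape)  # torch.Size([4, 2]): two active experts per token
```

With 2 of 32 experts active per token, only the selected experts' FFNs run, which is how the model keeps 3.7B active parameters out of 40B total.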

Model Details

Architecture: MoE
Parameters: 40B
Active params: 3.7B

Paper

arXiv: 2405.17976

Tags: moe, open-weight, efficiency