An 88-page technical report on scaling Mixture-of-Experts training from billions to trillions of parameters across thousands of GPUs. Covers the full system stack: memory (recomputation, offloading), communication (optimized MoE dispatchers, overlapping), computation (Grouped GEMM, kernel fusions, CUDA Graphs), and Parallel Folding for flexible multi-dimensional parallelism. Also covers low-precision training in FP8 and NVFP4 formats and efficient long-context training.
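The dispatch-then-compute pattern that the report's communication and computation optimizations target can be sketched in a few lines. The snippet below is a minimal, illustrative top-k MoE forward pass, not the report's optimized dispatcher; `moe_forward`, `router_w`, and `expert_w` are hypothetical names, and a real Grouped GEMM kernel would replace the Python loop with a single batched launch.

```python
import torch

def moe_forward(x, router_w, expert_w, top_k=2):
    """Naive top-k MoE forward: route, dispatch, per-expert GEMM, combine.
    x:        (tokens, hidden)
    router_w: (hidden, num_experts)
    expert_w: (num_experts, hidden, hidden) -- one weight matrix per expert
    """
    logits = x @ router_w                              # (tokens, num_experts)
    probs, experts = logits.softmax(-1).topk(top_k)    # per-token expert choices
    out = torch.zeros_like(x)
    for e in range(expert_w.shape[0]):
        # Gather tokens routed to expert e (the "dispatch" step; this becomes
        # all-to-all communication when experts live on different GPUs).
        tok, slot = (experts == e).nonzero(as_tuple=True)
        if tok.numel() == 0:
            continue
        # One small, variable-sized GEMM per expert -- the pattern that
        # Grouped GEMM fuses into a single kernel instead of a loop.
        out[tok] += probs[tok, slot].unsqueeze(-1) * (x[tok] @ expert_w[e])
    return out

x = torch.randn(16, 64)
y = moe_forward(x, torch.randn(64, 8), torch.randn(8, 64, 64))
print(y.shape)  # torch.Size([16, 64])
```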

Benchmarks on the latest hardware: DeepSeek-V3-685B at 1,233 TFLOPS/GPU (GB300) and 1,048 TFLOPS/GPU (GB200); Qwen3-235B at 974 TFLOPS/GPU (GB300). Offers practical guidance for the MoE training challenges that arise because sparsity lets total parameters grow much faster than per-token compute. By Yan, Bai, Yao, Liu et al. (NVIDIA, 45 authors).
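The sparsity arithmetic behind that last point is easy to check. The numbers below are hypothetical, DeepSeek-V3-like placeholders (not figures from the report): total parameters scale with the expert count, while per-token compute scales only with the number of active experts.

```python
# Hypothetical MoE sparsity arithmetic (assumed, DeepSeek-V3-like shape:
# 256 routed experts, 8 active per token; per-expert and dense parameter
# counts below are illustrative placeholders).
experts_total, experts_active = 256, 8
params_per_expert = 2.2e9   # assumed parameters per expert block
dense_params      = 30e9    # assumed attention/shared parameters

total  = dense_params + experts_total  * params_per_expert
active = dense_params + experts_active * params_per_expert
print(f"total params:   {total / 1e9:.0f}B")   # grows with expert count
print(f"active params:  {active / 1e9:.0f}B")  # fixed per-token compute
print(f"sparsity ratio: {total / active:.1f}x")
```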


infrastructure · moe · efficiency · training-stability
