End-to-end MoE training and inference system that addresses the two main barriers to Mixture-of-Experts adoption: massive model size and high inference latency. It introduces the Pyramid-Residual MoE (PR-MoE) architecture and Mixture-of-Students (MoS) distillation, reducing MoE model size by up to 3.7× while preserving quality.
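PR-MoE keeps a fixed dense MLP in every layer and routes each token to only one expert, whose weighted output is added to the dense path as a residual correction; the "pyramid" part places more experts in the deeper layers. A minimal PyTorch sketch of the residual component only, with illustrative class and parameter names (not the DeepSpeed-MoE API):

```python
import torch
import torch.nn as nn

class ResidualMoELayer(nn.Module):
    """Illustrative Residual-MoE block: every token passes a fixed dense MLP,
    and a single gated expert adds a learned correction on top. Names and
    shapes are simplified assumptions, not the DeepSpeed-MoE implementation."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.dense_mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)   # routing probabilities
        top1 = scores.argmax(dim=-1)            # one expert per token (top-1)
        correction = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():
                # weight the expert output by its gate probability
                correction[mask] = scores[mask, e].unsqueeze(-1) * expert(x[mask])
        # fixed dense path + top-1 expert treated as a residual correction
        return self.dense_mlp(x) + correction

if __name__ == "__main__":
    layer = ResidualMoELayer(d_model=16, d_ff=64, num_experts=4)
    print(layer(torch.randn(8, 16)).shape)  # torch.Size([8, 16])
```

In a pyramid configuration, later layers would simply be constructed with a larger `num_experts` than earlier ones.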

The DeepSpeed-MoE inference engine combines expert, tensor, and data parallelism with optimized communication and kernels to deliver 7.3× lower latency and cost than prior MoE inference solutions, and 4.5× faster / 9× cheaper inference than quality-equivalent dense models. On the training side, it demonstrates a 5× training cost reduction for autoregressive LMs versus dense baselines. ICML 2022. By Rajbhandari, Li, Yao, Zhang, Aminabadi, Awan, Rasley, and He.
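The Mixture-of-Students distillation mentioned above trains shallower student experts against the full teacher MoE for part of training (staged knowledge distillation). A minimal sketch of such an objective, assuming a standard cross-entropy-plus-KL formulation; the function name, `alpha`, and `temperature` are illustrative, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      targets: torch.Tensor,
                      alpha: float = 0.5,
                      temperature: float = 2.0) -> torch.Tensor:
    """Generic staged-KD objective: task cross-entropy plus a KL term pulling
    the student toward the teacher's soft predictions (hypothetical settings)."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.log_softmax(teacher_logits / temperature, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * temperature ** 2
    return (1 - alpha) * ce + alpha * kl
```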

Paper

Venue: ICML 2022
foundational, infrastructure, moe, efficiency

Related