End-to-end MoE training and inference system that addresses the two main barriers to Mixture-of-Experts adoption: massive model size and high inference latency. It introduces the Pyramid-Residual MoE (PR-MoE) architecture and Mixture-of-Students (MoS) distillation, reducing MoE model size by up to 3.7× while preserving quality.
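PR-MoE keeps a fixed dense MLP in every layer and routes each token to only one expert, whose weighted output is added to the dense path as a residual correction; the "pyramid" part places more experts in the deeper layers. A minimal PyTorch sketch of the residual component only, with illustrative class and parameter names (not the DeepSpeed-MoE API):

```python
import torch
import torch.nn as nn

class ResidualMoELayer(nn.Module):
    """Illustrative Residual-MoE block: every token passes a fixed dense MLP,
    and a single gated expert adds a learned correction on top. Names and
    shapes are simplified assumptions, not the DeepSpeed-MoE implementation."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.dense_mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)   # routing probabilities
        top1 = scores.argmax(dim=-1)            # one expert per token (top-1)
        correction = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():
                # weight the expert output by its gate probability
                correction[mask] = scores[mask, e].unsqueeze(-1) * expert(x[mask])
        # fixed dense path + top-1 expert treated as a residual correction
        return self.dense_mlp(x) + correction

if __name__ == "__main__":
    layer = ResidualMoELayer(d_model=16, d_ff=64, num_experts=4)
    print(layer(torch.randn(8, 16)).shape)  # torch.Size([8, 16])
```

In a pyramid configuration, later layers would simply be constructed with a larger `num_experts` than earlier ones.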

The DeepSpeed-MoE inference engine combines expert, tensor, and data parallelism with optimized communication and kernels to deliver 7.3× lower latency and cost than prior MoE inference solutions, and 4.5× faster / 9× cheaper inference than quality-equivalent dense models. On the training side, it demonstrates a 5× training cost reduction for autoregressive LMs versus dense baselines. ICML 2022. By Rajbhandari, Li, Yao, Zhang, Aminabadi, Awan, Rasley, and He.
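The Mixture-of-Students distillation mentioned above trains shallower student experts against the full teacher MoE for part of training (staged knowledge distillation). A minimal sketch of such an objective, assuming a standard cross-entropy-plus-KL formulation; the function name, `alpha`, and `temperature` are illustrative, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      targets: torch.Tensor,
                      alpha: float = 0.5,
                      temperature: float = 2.0) -> torch.Tensor:
    """Generic staged-KD objective: task cross-entropy plus a KL term pulling
    the student toward the teacher's soft predictions (hypothetical settings)."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.log_softmax(teacher_logits / temperature, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * temperature ** 2
    return (1 - alpha) * ce + alpha * kl
```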

Paper

Venue: ICML 2022
foundational, infrastructure, moe, efficiency

Related