NVIDIA's distributed training framework, which combines tensor, pipeline, and data parallelism (PTD-P). It reaches 52% of theoretical peak per-GPU throughput when scaled to thousands of GPUs. Widely used for large-scale model training by NVIDIA, many Chinese labs, and other organizations worldwide.

Paper

Citations 835

Library

Stars 16.7k
infrastructure, open-source, training