Megatron-LM
Library
NVIDIA's distributed training framework implementing tensor, pipeline, and data parallelism (PTD-P). Achieves up to 52% of peak device throughput on thousands of GPUs. Widely used for large-scale model training by NVIDIA, many Chinese labs, and other organizations worldwide.
Paper
arXiv: 1909.08053
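The core idea of the tensor parallelism described above can be illustrated with a minimal sketch. This is not Megatron's API; it simulates a 2-way column-parallel split of a linear layer's weight matrix in NumPy, where each "device" computes a partial output and a concatenation (an all-gather in a real multi-GPU setup) recovers the full result:

```python
import numpy as np

# Hypothetical illustration, not Megatron-LM code: tensor parallelism
# shards a linear layer's weight matrix column-wise across devices.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # activations (batch of 4, hidden dim 8)
W = rng.standard_normal((8, 16))   # full weight matrix

# Simulate 2-way tensor parallelism: each shard lives on one "device".
W_shards = np.split(W, 2, axis=1)             # two 8x16/2 column shards
partials = [x @ w for w in W_shards]          # each device's partial output
y_parallel = np.concatenate(partials, axis=1) # all-gather along columns

# The sharded computation matches the unsharded one.
assert np.allclose(y_parallel, x @ W)
```

In Megatron's scheme, pairing a column-parallel layer with a row-parallel one lets a full transformer MLP run with only a single all-reduce in the forward pass.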