Eliminates the padding requirement in FP8 grouped GEMM on NVIDIA Hopper GPUs via a TMA descriptor pool. Compared to padded implementations, it achieves a 1.7–20.4% speedup and up to 23.8% memory reduction, improving efficiency for MoE training and inference.
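The core mechanism can be sketched with the CUDA driver API: rather than padding each expert's row count up to a tile multiple, one TMA descriptor (CUtensorMap) is encoded per expert group with that group's exact shape, and the kernel selects the matching descriptor from the pool. Below is a minimal host-side sketch under stated assumptions: row-major FP8 (e4m3, moved as raw bytes) activations, 128x128 tiles, and K a multiple of 16. The helper name build_descriptor_pool and the tile sizes are illustrative, not the paper's actual implementation.

```cpp
#include <cuda.h>
#include <cstdint>
#include <vector>

// Illustrative tile shape; the actual kernel tiling may differ.
constexpr uint32_t BLOCK_M = 128;
constexpr uint32_t BLOCK_K = 128;

// Encode one TMA descriptor per expert group over the concatenated
// [sum(m_i), K] FP8 activation matrix at `base`. Each descriptor carries
// the group's exact row count m_i, so no group is padded up to BLOCK_M.
// Assumes K % 16 == 0 so every group's base address and row stride meet
// TMA's 16-byte alignment requirements.
std::vector<CUtensorMap> build_descriptor_pool(
    void* base, const std::vector<uint64_t>& m_per_group, uint64_t K) {
  std::vector<CUtensorMap> pool(m_per_group.size());
  uint64_t row_offset = 0;
  for (size_t g = 0; g < m_per_group.size(); ++g) {
    // Dimensions are listed fastest-moving first: {K, m_g}.
    cuuint64_t global_dim[2] = {K, m_per_group[g]};
    // Byte stride between rows; the innermost stride is implicit.
    cuuint64_t global_stride[1] = {K * sizeof(uint8_t)};
    cuuint32_t box_dim[2] = {BLOCK_K, BLOCK_M};  // per-copy shared-memory tile
    cuuint32_t elem_stride[2] = {1, 1};

    void* group_base = static_cast<uint8_t*>(base) + row_offset * K;
    CUresult rc = cuTensorMapEncodeTiled(
        &pool[g],
        CU_TENSOR_MAP_DATA_TYPE_UINT8,  // FP8 e4m3 moved as raw bytes
        /*tensorRank=*/2, group_base, global_dim, global_stride, box_dim,
        elem_stride, CU_TENSOR_MAP_INTERLEAVE_NONE, CU_TENSOR_MAP_SWIZZLE_128B,
        CU_TENSOR_MAP_L2_PROMOTION_L2_128B, CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    (void)rc;  // real code should check rc == CUDA_SUCCESS
    row_offset += m_per_group[g];
  }
  return pool;
}
```

Because TMA bounds-checks each copy against global_dim and zero-fills out-of-bounds elements, the ragged tail tile of each group is handled in hardware; that is the property a descriptor pool can exploit to avoid padded allocations.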

Paper

arXiv:2508.16584

Library

GitHub Repository

infrastructure · efficiency · research