Systematic study of 4-bit floating-point (FP4) training for LLMs on Huawei's next-generation Ascend NPUs. Compares two FP4 formats: HiFloat4 (Huawei's hierarchical scaling format) and MXFP4 (Microscaling). Tests both dense models (PanGu, LLaMA-style architectures) and MoE models with FP4 linear and expert GEMMs.
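Not from the paper itself, but as a rough illustration of what MXFP4-style block quantization does: the sketch below simulates quantize-dequantize of a tensor using blocks of 32 values that share one power-of-two scale, with each element rounded to the FP4 (E2M1) grid. The function name `quantize_mxfp4` and the block size handling are illustrative assumptions; HiFloat4's hierarchical scaling and the actual Ascend kernels are not modeled here.

```python
import numpy as np

# Representable magnitudes of an FP4 E2M1 element (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4(x, block=32):
    """Simulate MXFP4 quantize->dequantize on a 1-D array (illustrative sketch).

    Each block of `block` values shares one power-of-two scale;
    elements are rounded to the nearest FP4 (E2M1) value.
    """
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % block
    xp = np.pad(x, (0, pad))
    out = np.empty_like(xp)
    for i in range(0, len(xp), block):
        blk = xp[i:i + block]
        amax = np.max(np.abs(blk))
        if amax == 0:
            out[i:i + block] = 0.0
            continue
        # Power-of-two scale so the block maximum lands at or below 6 (FP4 max).
        scale = 2.0 ** np.ceil(np.log2(amax / FP4_GRID[-1]))
        scaled = blk / scale
        # Round each magnitude to the nearest representable FP4 value.
        idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
        out[i:i + block] = np.sign(scaled) * FP4_GRID[idx] * scale
    return out[:len(x)]

if __name__ == "__main__":
    w = np.random.randn(128).astype(np.float32)
    wq = quantize_mxfp4(w)
    print("mean abs quantization error:", np.mean(np.abs(w - wq)))
```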

Develops stabilization techniques that keep training loss within 1% relative error of full-precision baselines while delivering 4× gains in throughput and memory efficiency. Represents the Ascend ecosystem's answer to NVIDIA's NVFP4, closing the gap in ultra-low-precision training infrastructure for frontier-scale LLMs on non-NVIDIA hardware.


infrastructure, efficiency, training-stability
