Hybrid Mamba-Transformer architecture: the 56B model has 54 Mamba-2 layers, 54 MLP layers, and 10 self-attention layers (118 total), with an 8192 hidden dimension and grouped-query attention (64 query heads, 8 KV heads). Largest public FP8 pre-training run to date: the 56B model was trained on 20T tokens entirely in FP8.
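As a rough illustration, the stated layer mix and head geometry can be captured in a config sketch like the one below. The class name and field names are hypothetical (not from the released code), and the interleaving order of Mamba-2, MLP, and attention layers is not specified here, so only the counts and dimensions are grounded in the card above.

```python
# Illustrative sketch of the Nemotron-H-56B layer mix (not the official config).
# Counts and dims come from the card above; names are placeholders.
from dataclasses import dataclass

@dataclass
class NemotronH56BConfig:
    hidden_dim: int = 8192
    num_query_heads: int = 64
    num_kv_heads: int = 8           # grouped-query attention: 8 KV heads shared by 64 query heads
    num_mamba2_layers: int = 54
    num_mlp_layers: int = 54
    num_attention_layers: int = 10

    @property
    def head_dim(self) -> int:
        return self.hidden_dim // self.num_query_heads  # 8192 / 64 = 128

    @property
    def total_layers(self) -> int:
        # 54 Mamba-2 + 54 MLP + 10 self-attention
        return self.num_mamba2_layers + self.num_mlp_layers + self.num_attention_layers

cfg = NemotronH56BConfig()
assert cfg.total_layers == 118 and cfg.head_dim == 128
```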

Up to 3x faster inference than comparably sized Transformers (Qwen-2.5-72B, Llama-3.1-70B). The 47B variant, compressed via MiniPuzzle, is 20% faster than the 56B model with minimal quality loss. Reported benchmarks: MMLU 84.21, ARC-C 94.97.
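One intuition for the inference speedup, sketched below: only the 10 self-attention layers keep a per-token KV cache, while the Mamba-2 layers carry a fixed-size recurrent state, so memory traffic at long context shrinks dramatically. The FP8 cache dtype and 64K sequence length are assumptions for illustration, not figures from the card.

```python
# Back-of-envelope KV-cache size at long context (illustrative assumptions:
# FP8 cache dtype, 64K sequence). Only the 10 self-attention layers keep a
# per-token KV cache; the 54 Mamba-2 layers use a fixed-size state instead.
attn_layers = 10
kv_heads = 8
head_dim = 8192 // 64          # = 128
bytes_per_elem = 1             # FP8 assumption
seq_len = 65_536

kv_bytes = attn_layers * 2 * kv_heads * head_dim * seq_len * bytes_per_elem  # 2 = K and V
print(f"Nemotron-H-56B KV cache: {kv_bytes / 2**30:.2f} GiB")  # ~1.25 GiB

# Llama-3.1-70B has 80 attention layers with the same 8-KV-head, 128-dim geometry,
# so at equal precision and length its cache is 8x larger (~10 GiB).
print(f"Llama-3.1-70B equivalent: {80 / attn_layers * kv_bytes / 2**30:.2f} GiB")
```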

Model Details

Architecture: Dense
Parameters: 56B

Variants

Name             Parameters  Notes
Nemotron-H-8B    8B          -
Nemotron-H-47B   47B         Compressed via MiniPuzzle
Nemotron-H-56B   56B         -

Paper

arXiv: 2504.03624

Tags: open-weight, architecture, efficiency
