Hybrid Mamba-Transformer architecture: the 56B model has 54 Mamba-2 layers, 54 MLP layers, and 10 self-attention layers (118 total), with an 8192 hidden dimension and grouped-query attention (64 query heads, 8 KV heads). Largest public FP8 pre-training run to date: the 56B model was trained on 20T tokens entirely in FP8.
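As a rough illustration, the stated layer mix and head geometry can be captured in a config sketch like the one below. The class name and field names are hypothetical (not from the released code), and the interleaving order of Mamba-2, MLP, and attention layers is not specified here, so only the counts and dimensions are grounded in the card above.

```python
# Illustrative sketch of the Nemotron-H-56B layer mix (not the official config).
# Counts and dims come from the card above; names are placeholders.
from dataclasses import dataclass

@dataclass
class NemotronH56BConfig:
    hidden_dim: int = 8192
    num_query_heads: int = 64
    num_kv_heads: int = 8           # grouped-query attention: 8 KV heads shared by 64 query heads
    num_mamba2_layers: int = 54
    num_mlp_layers: int = 54
    num_attention_layers: int = 10

    @property
    def head_dim(self) -> int:
        return self.hidden_dim // self.num_query_heads  # 8192 / 64 = 128

    @property
    def total_layers(self) -> int:
        # 54 Mamba-2 + 54 MLP + 10 self-attention
        return self.num_mamba2_layers + self.num_mlp_layers + self.num_attention_layers

cfg = NemotronH56BConfig()
assert cfg.total_layers == 118 and cfg.head_dim == 128
```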

Up to 3x faster inference than comparably sized Transformers (Qwen-2.5-72B, Llama-3.1-70B). The 47B variant, compressed via MiniPuzzle, is 20% faster than the 56B model with minimal quality loss. Reported benchmarks: MMLU 84.21, ARC-C 94.97.
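One intuition for the inference speedup, sketched below: only the 10 self-attention layers keep a per-token KV cache, while the Mamba-2 layers carry a fixed-size recurrent state, so memory traffic at long context shrinks dramatically. The FP8 cache dtype and 64K sequence length are assumptions for illustration, not figures from the card.

```python
# Back-of-envelope KV-cache size at long context (illustrative assumptions:
# FP8 cache dtype, 64K sequence). Only the 10 self-attention layers keep a
# per-token KV cache; the 54 Mamba-2 layers use a fixed-size state instead.
attn_layers = 10
kv_heads = 8
head_dim = 8192 // 64          # = 128
bytes_per_elem = 1             # FP8 assumption
seq_len = 65_536

kv_bytes = attn_layers * 2 * kv_heads * head_dim * seq_len * bytes_per_elem  # 2 = K and V
print(f"Nemotron-H-56B KV cache: {kv_bytes / 2**30:.2f} GiB")  # ~1.25 GiB

# Llama-3.1-70B has 80 attention layers with the same 8-KV-head, 128-dim geometry,
# so at equal precision and length its cache is 8x larger (~10 GiB).
print(f"Llama-3.1-70B equivalent: {80 / attn_layers * kv_bytes / 2**30:.2f} GiB")
```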

Model Details

Architecture: Dense
Parameters: 56B

Variants

Name             Parameters  Notes
Nemotron-H-8B    8B          -
Nemotron-H-47B   47B         Compressed via MiniPuzzle
Nemotron-H-56B   56B         -

Paper

arXiv: 2504.03624

Tags: open-weight, architecture, efficiency
