First models of the Nemotron 3 generation, built on a hybrid LatentMoE + Mamba-2 + Attention architecture. The 30B-A3B variant stacks 23 Mamba-2 layers, 23 MoE layers (128 routed experts + 1 shared expert, 6 active per token), and 6 GQA attention layers, for 30B total parameters with 3.5B active. Trained on 25T tokens with a 1M-token context window.
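
For concreteness, a minimal sketch of the 30B-A3B layer stack described above (52 layers total). The interleaving pattern, names, and schedule builder are illustrative assumptions; only the per-type layer counts and the expert configuration come from the card:

```python
from dataclasses import dataclass

@dataclass
class MoELayerConfig:
    n_routed_experts: int = 128  # routed experts per MoE layer
    n_shared_experts: int = 1    # always-active shared expert
    n_active_experts: int = 6    # experts selected per token

def build_layer_schedule() -> list[str]:
    """Lay out 23 Mamba-2 + 23 MoE + 6 GQA layers (52 total).

    The even interleaving below is a guess; the released model may
    order its layers differently.
    """
    # Alternate Mamba-2 and MoE blocks: 46 hybrid layers.
    schedule = [kind for pair in zip(["mamba2"] * 23, ["moe"] * 23)
                for kind in pair]
    # Spread the 6 GQA attention layers roughly evenly through the stack.
    for k, pos in enumerate(range(7, 52, 8)):
        schedule.insert(pos + k, "gqa_attention")
    return schedule

schedule = build_layer_schedule()
assert len(schedule) == 52
assert schedule.count("mamba2") == 23
assert schedule.count("moe") == 23
assert schedule.count("gqa_attention") == 6
```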

Benchmarks: AIME25 89.1 (no tools) / 99.2 (with tools); GPQA 75.0 (with tools); LiveCodeBench 68.3; MMLU-Pro 78.3; MATH-500 95.4 (reasoning mode); RULER@1M 86.3. The 4B variant, compressed from a 9B model via structured pruning, fits on an 8GB Jetson Orin Nano for edge deployment.
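
A quick back-of-the-envelope check on the 8GB edge claim. The deployed precision is not stated on the card, so the bytes-per-parameter figures below are assumptions; the point is simply that 4B parameters leave headroom at 8-bit or lower:

```python
def weight_footprint_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory (GiB), ignoring KV cache and runtime overhead."""
    return n_params * bytes_per_param / 1024**3

N_PARAMS = 4e9  # Nemotron 3 Nano 4B

# Candidate precisions (assumed, not from the card):
for fmt, bpp in [("FP16", 2.0), ("FP8/INT8", 1.0), ("INT4", 0.5)]:
    print(f"{fmt:9s} ~{weight_footprint_gib(N_PARAMS, bpp):.1f} GiB weights")

# FP16      ~7.5 GiB  -> too tight once KV cache and runtime are added
# FP8/INT8  ~3.7 GiB  -> comfortable fit on an 8GB Jetson Orin Nano
# INT4      ~1.9 GiB
```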

Model Details

Architecture    Hybrid MoE (LatentMoE + Mamba-2 + Attention)
Parameters      30B total
Active params   3.5B
Context window  1M tokens (1,000,000)

Variants

Name                       Parameters   Notes
Nemotron 3 Nano 30B-A3B    30B          1M context
Nemotron 3 Nano 4B         4B           Edge deployment, 262K context
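
For orientation, a typical Hugging Face Transformers loading sketch. The repository ID below is a hypothetical placeholder (check the actual Hub page for the released name), and trust_remote_code may or may not be required for the hybrid Mamba-2/MoE layers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nvidia/Nemotron-3-Nano-30B-A3B"  # hypothetical repo ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",      # keep the checkpoint's native precision
    device_map="auto",       # place layers across available devices
    trust_remote_code=True,  # custom hybrid layers may need this
)

prompt = "What is the sum of the first 100 positive integers?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```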

Paper

arXiv: 2512.20856

Tags: moe, open-weight, reasoning, efficiency
