OLMo Hybrid
First fully open hybrid Transformer + linear RNN model at scale. Gated DeltaNet and Transformer attention layers interleaved at a 3:1 ratio (three DeltaNet layers per attention layer). 7B parameters, trained on 6T tokens (the OLMo 3 data mix) on 512 GPUs (H100s, migrated to HGX B200s mid-training).
Reaches OLMo 3 7B quality (measured by Common Crawl loss) in 35% fewer tokens and matches its MMLU score in 49% fewer, with 75% better inference throughput and memory footprint on long contexts, demonstrating that a hybrid is more expressive than either architecture alone. Apache 2.0 licensed.
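A minimal sketch of the 3:1 interleaving pattern described above. `GatedDeltaNetBlock` and `AttentionBlock` are simplified stand-ins, and the 32-layer depth is illustrative; none of these names or numbers come from the model card itself.

```python
# Sketch of a 3:1 hybrid layer stack (assumed names and depth, illustrative only).
import torch
import torch.nn as nn

class GatedDeltaNetBlock(nn.Module):
    """Stand-in for a Gated DeltaNet layer (linear-time recurrent state update)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.mix = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mix(x)  # the real layer carries a recurrent state instead

class AttentionBlock(nn.Module):
    """Stand-in for a standard quadratic softmax self-attention layer."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + out

def build_hybrid_stack(d_model: int, num_layers: int = 32) -> nn.ModuleList:
    # 3 DeltaNet layers per attention layer: every 4th layer is full attention.
    return nn.ModuleList(
        AttentionBlock(d_model) if (i + 1) % 4 == 0 else GatedDeltaNetBlock(d_model)
        for i in range(num_layers)
    )
```

The intuition behind the ratio: the sparse attention layers preserve exact long-range retrieval, while the DeltaNet majority keeps per-token compute and state memory linear in sequence length, which is where the long-context throughput and memory gains come from.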
Model Details
Architecture: Dense
Parameters: 7B