OLMo Hybrid
First fully open hybrid Transformer + linear RNN model at scale. Gated DeltaNet and Transformer attention layers interleaved at a 3:1 ratio (three DeltaNet layers per attention layer). 7B parameters, trained on 6T tokens (the OLMo 3 data mix) on 512 GPUs (H100s, migrated to HGX B200s mid-training).
Reaches OLMo 3 7B quality (measured by Common Crawl loss) in 35% fewer tokens and matches its MMLU score in 49% fewer, with 75% better inference throughput and memory footprint on long contexts, demonstrating that a hybrid is more expressive than either architecture alone. Apache 2.0 licensed.
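A minimal sketch of the 3:1 interleaving pattern described above. `GatedDeltaNetBlock` and `AttentionBlock` are simplified stand-ins, and the 32-layer depth is illustrative; none of these names or numbers come from the model card itself.

```python
# Sketch of a 3:1 hybrid layer stack (assumed names and depth, illustrative only).
import torch
import torch.nn as nn

class GatedDeltaNetBlock(nn.Module):
    """Stand-in for a Gated DeltaNet layer (linear-time recurrent state update)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.mix = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mix(x)  # the real layer carries a recurrent state instead

class AttentionBlock(nn.Module):
    """Stand-in for a standard quadratic softmax self-attention layer."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + out

def build_hybrid_stack(d_model: int, num_layers: int = 32) -> nn.ModuleList:
    # 3 DeltaNet layers per attention layer: every 4th layer is full attention.
    return nn.ModuleList(
        AttentionBlock(d_model) if (i + 1) % 4 == 0 else GatedDeltaNetBlock(d_model)
        for i in range(num_layers)
    )
```

The intuition behind the ratio: the sparse attention layers preserve exact long-range retrieval, while the DeltaNet majority keeps per-token compute and state memory linear in sequence length, which is where the long-context throughput and memory gains come from.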
Model Details
Architecture: Dense
Parameters: 7B