An 11B-parameter dense Transformer trained on 5.5T tokens in four stages with progressive context extension (2K→8K), covering 11 languages. A VLM variant adds a CLIP ViT-L/14 vision encoder. Benchmarks: MMLU 58.4, HellaSwag 82.9.
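The card states only the endpoints (2K→8K) and the stage count (four) of the progressive context extension; the per-stage lengths are not given. As a minimal sketch, assuming geometric growth between the stated endpoints (the interpolation scheme is an assumption, not the model's documented schedule):

```python
# Hedged sketch of a 4-stage progressive context-extension schedule.
# Only the endpoints (2048 -> 8192) and stage count come from the card;
# the geometric interpolation between them is an assumption.

def context_schedule(start: int = 2048, end: int = 8192, stages: int = 4) -> list[int]:
    """Return per-stage context lengths growing geometrically from start to end."""
    ratio = (end / start) ** (1 / (stages - 1))
    return [round(start * ratio ** i) for i in range(stages)]

print(context_schedule())  # first stage 2048, final stage 8192 by construction
```

Any monotone schedule hitting the same endpoints (e.g. doubling with a repeated stage) would be equally consistent with the card.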

Model Details

Architecture: Dense
Parameters: 11B
Context window: 8,192 tokens

Paper

arXiv: 2407.14885

Tags: open-weight, multilingual

Related