12B multimodal model with a 400M-parameter vision encoder trained from scratch. Processes images at their natural resolution and aspect ratio. Scores 52.5% on MMMU and 90.7% on DocVQA, outperforming Llama 3.2 90B while being 7x smaller. 128K-token context window. Apache 2.0 license.
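Because images are processed at their natural resolution and aspect ratio rather than resized to a fixed square, the number of vision tokens varies per image. A minimal sketch of that token-count arithmetic, assuming a 16x16 patch size (an illustrative assumption, not stated in this card):

```python
import math

def image_token_count(width: int, height: int, patch_size: int = 16) -> int:
    """Vision tokens for an image kept at natural resolution: the image is
    tiled into patch_size x patch_size patches, rounding up at the edges,
    so the aspect ratio is preserved instead of squashing to a square.
    patch_size=16 is an assumed value, not taken from this card."""
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    return cols * rows

# A 1024x768 landscape image yields 64 * 48 = 3072 patch tokens, while a
# 512x512 square yields 32 * 32 = 1024 -- tall and wide images keep their
# shape, at the cost of a variable token budget per image.
print(image_token_count(1024, 768))
print(image_token_count(512, 512))
```

Variable-length image token sequences are what let a fixed 128K-token context hold documents at full fidelity (e.g. for DocVQA) without downscaling away fine text.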

Model Details

Architecture: Dense
Parameters: 12B
Context window: 128,000 tokens

Paper

arXiv: 2410.07073

multimodal · vision · open-weight