12B multimodal model with a 400M-parameter vision encoder trained from scratch. Processes images at their natural resolution and aspect ratio. Scores 52.5% on MMMU and 90.7% on DocVQA, outperforming Llama 3.2 90B while being 7x smaller. 128K-token context window. Apache 2.0 license.
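Because images are processed at their natural resolution and aspect ratio rather than resized to a fixed square, the number of vision tokens varies per image. A minimal sketch of that token-count arithmetic, assuming a 16x16 patch size (an illustrative assumption, not stated in this card):

```python
import math

def image_token_count(width: int, height: int, patch_size: int = 16) -> int:
    """Vision tokens for an image kept at natural resolution: the image is
    tiled into patch_size x patch_size patches, rounding up at the edges,
    so the aspect ratio is preserved instead of squashing to a square.
    patch_size=16 is an assumed value, not taken from this card."""
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    return cols * rows

# A 1024x768 landscape image yields 64 * 48 = 3072 patch tokens, while a
# 512x512 square yields 32 * 32 = 1024 -- tall and wide images keep their
# shape, at the cost of a variable token budget per image.
print(image_token_count(1024, 768))
print(image_token_count(512, 512))
```

Variable-length image token sequences are what let a fixed 128K-token context hold documents at full fidelity (e.g. for DocVQA) without downscaling away fine text.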

Model Details

Architecture: Dense
Parameters: 12B
Context window: 128,000 tokens

Paper

arXiv: 2410.07073

multimodal · vision · open-weight