Pixtral 12B
12B multimodal model with a 400M-parameter vision encoder trained from scratch. Processes images at their natural resolution and aspect ratio. MMMU: 52.5%; DocVQA: 90.7%. Outperforms Llama 3.2 90B while being 7x smaller. 128K context window. Apache 2.0 license.
Model Details
Architecture DENSE
Parameters 12B
Context window 128,000 tokens
Paper
arXiv: 2410.07073