hi lab's first vision-language model: a 1.2B-parameter NaViT vision encoder trained entirely from scratch (dynamic resolution, joint visual + text supervision) paired with a DeepSeek-V3 MoE language model. Near-SOTA open-source multimodal performance at release — MMMU 80.1, MathVision 69.6, DocVQA 96.5. MIT-licensed.

Model Details

License: MIT
Base model: deepseek-v3

Benchmark Scores

Benchmark    Score  Mode
MMMU         80.1
MathVision   69.6
DocVQA       96.5
multimodal, vision, open-weight
