dots.vlm1
hi lab's first vision-language model: a 1.2B-parameter NaViT vision encoder trained entirely from scratch (dynamic resolution, joint visual and text supervision), paired with a DeepSeek-V3 MoE language model. Near-SOTA open-source multimodal performance at release, with MMMU 80.1, MathVision 69.6, and DocVQA 96.5. MIT-licensed.
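To make "dynamic resolution" concrete, here is a minimal sketch of how a NaViT-style encoder can turn images of different sizes into variable-length patch sequences and pack them together. It is an illustration, not the model's actual preprocessing: the patch size of 14, the round-up (padding) behaviour, and the helper names `patch_grid` and `pack_offsets` are all assumptions that do not come from this model card.

```python
PATCH = 14  # hypothetical patch size; the model card does not state one


def patch_grid(height, width, patch=PATCH):
    """Rows and columns of patches for an image kept at its native resolution."""
    if height <= 0 or width <= 0 or patch <= 0:
        raise ValueError("height, width and patch must be positive")
    # Integer ceiling division: assumes edges are padded up to a whole patch.
    rows = -(-height // patch)
    cols = -(-width // patch)
    return rows, cols


def pack_offsets(sizes, patch=PATCH):
    """Start offset of each image's tokens in one packed sequence.

    Returns len(sizes) + 1 entries; the last entry is the total token count.
    """
    offsets = [0]
    for height, width in sizes:
        rows, cols = patch_grid(height, width, patch)
        offsets.append(offsets[-1] + rows * cols)
    return offsets


if __name__ == "__main__":
    # Two images with different shapes share one sequence without resizing.
    sizes = [(224, 224), (448, 224)]
    print(pack_offsets(sizes))
```

Under these assumptions, an image contributes a number of tokens proportional to its area, so a tall document page yields more visual tokens than a small thumbnail instead of both being resized to one fixed grid.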
Model Details
License: MIT
Base model: deepseek-v3
Benchmark Scores
| Benchmark | Score | Mode |
|---|---|---|
| MMMU | 80.1 | — |
| MathVision | 69.6 | — |
| DocVQA | 96.5 | — |