dots.vlm1 | Lab Index

hi lab's first vision-language model: a 1.2B-parameter NaViT vision encoder trained entirely from scratch (dynamic resolution, joint visual + text supervision) paired with a DeepSeek-V3 MoE language model. Near-SOTA performance among open-source multimodal models at release: MMMU 80.1, MathVision 69.6, DocVQA 96.5. MIT-licensed.
