An open-source vision-language model that outperformed models roughly ten times its size (such as Qwen-72B) on multimodal reasoning benchmarks. Released in SFT and RL variants.

Outputs (2)

MiMo-VL-7B

model
Architecture: dense
Parameters: 7B

MiMo-VL: From Pre-training to Post-training

paper

Technical report on achieving state-of-the-art multimodal reasoning performance at the 7B scale.

arXiv: 2506.03569

Tags: multimodal, reasoning, open-weight