Vision-language foundation model combining a 532M-parameter vision encoder with a Mixture-of-Experts LLM (20B active parameters). Achieves state-of-the-art results on 38 of 60 public VLM benchmarks and excels at GUI control, gameplay, and visual reasoning tasks.

Model Details

Architecture: MoE
Total parameters: 200B
Active parameters: 20B
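The gap between total (200B) and active (20B) parameters comes from the MoE design: a router selects only a few experts per token, so most weights sit idle on any given forward pass. A minimal sketch of standard top-k expert routing, with toy sizes and NumPy in place of a real framework (nothing here reflects the model's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 10, 1  # 1 of 10 experts active ~ 20B of 200B

# Router: a linear layer scoring each expert for a given token.
router_w = rng.normal(size=(d_model, n_experts))
# Experts: each a small feed-forward weight matrix (toy stand-in).
experts = rng.normal(size=(n_experts, d_model, d_model))

def moe_forward(x):
    """Route one token's hidden state x through its top-k experts."""
    logits = x @ router_w                 # (n_experts,) router scores
    chosen = np.argsort(logits)[-top_k:]  # indices of the selected experts
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                  # softmax over the selected scores
    # Weighted sum of only the chosen experts' outputs.
    out = sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))
    return out, chosen

x = rng.normal(size=d_model)
y, chosen = moe_forward(x)
print(len(chosen), "of", n_experts, "experts active for this token")
```

Only the router and the selected experts' weights participate in the computation, which is why the per-token cost tracks the active-parameter count rather than the total.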

Paper

arXiv: 2505.07062

Tags: multimodal, vision, moe, agentic