Vision-language foundation model combining a 532M-parameter vision encoder with a Mixture-of-Experts LLM (20B active parameters). Achieves state-of-the-art results on 38 of 60 public VLM benchmarks and excels at GUI control, gameplay, and visual reasoning tasks.

Model Details

Architecture: MoE
Total parameters: 200B
Active parameters: 20B
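The gap between total (200B) and active (20B) parameters comes from the MoE design: a router selects only a few experts per token, so most weights sit idle on any given forward pass. A minimal sketch of standard top-k expert routing, with toy sizes and NumPy in place of a real framework (nothing here reflects the model's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 10, 1  # 1 of 10 experts active ~ 20B of 200B

# Router: a linear layer scoring each expert for a given token.
router_w = rng.normal(size=(d_model, n_experts))
# Experts: each a small feed-forward weight matrix (toy stand-in).
experts = rng.normal(size=(n_experts, d_model, d_model))

def moe_forward(x):
    """Route one token's hidden state x through its top-k experts."""
    logits = x @ router_w                 # (n_experts,) router scores
    chosen = np.argsort(logits)[-top_k:]  # indices of the selected experts
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                  # softmax over the selected scores
    # Weighted sum of only the chosen experts' outputs.
    out = sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))
    return out, chosen

x = rng.normal(size=d_model)
y, chosen = moe_forward(x)
print(len(chosen), "of", n_experts, "experts active for this token")
```

Only the router and the selected experts' weights participate in the computation, which is why the per-token cost tracks the active-parameter count rather than the total.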

Paper

arXiv: 2505.07062

Tags: multimodal, vision, moe, agentic