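Open-source vision-language model built on a Mixture-of-Experts (MoE) architecture: 106B total parameters, 12B active. Supports a 64K-token context for multi-image and video inputs. Reported state of the art among models of its scale on 42 public vision-language benchmarks. MIT-licensed.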

Model Details

Architecture MoE
Parameters 106B
Active params 12B
Context window 64,000 tokens
multimodal, moe, open-weight, reasoning
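
The parameter figures above imply some simple arithmetic that may help when sizing hardware. The sketch below is illustrative only: the 16-bit weight precision is an assumption not stated in this listing, and the helper names are hypothetical. It computes the share of parameters that are active and a rough estimate of the memory needed to hold all the weights, since an MoE model typically has to keep every expert loaded even though only a subset is used at a time.

```python
# Figures taken from the model details above.
TOTAL_PARAMS = 106e9   # 106B total parameters
ACTIVE_PARAMS = 12e9   # 12B active parameters

# Assumption (not stated in the listing): weights stored at 16-bit precision.
BYTES_PER_PARAM = 2


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters that are active."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Rough memory for the weights alone, in decimal gigabytes.

    Excludes activations, KV cache, and framework overhead.
    """
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    mem = weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM)
    print(f"Active share of parameters: {frac:.1%}")
    print(f"Approximate weight memory at 16-bit: {mem:.0f} GB")
```

Under these assumptions, roughly 11% of the parameters are active, while the full set of weights would still occupy about 212 GB. Quantized formats would lower that figure, and real deployments need additional headroom for the 64,000-token context.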