Zamba2-VL (Vision-Language)

Vision-language extension of the Zamba2 hybrid-SSM family, in 1.2B, 2.7B, and 7B sizes. A Mamba2 state-space backbone interleaved with a few shared transformer blocks is paired with the Qwen2.5-VL Vision Transformer (chosen for 2D rotary embeddings and native dynamic-resolution processing) via a two-layer MLP adapter. Open weights (Apache 2.0).
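The adapter step described above can be sketched in a few lines. This is a minimal illustration, not Zyphra's implementation: the class name `MLPAdapter`, the GELU activation, the layer widths, and the choice to prepend image tokens to the text tokens are all assumptions made for the example. The source only states that a two-layer MLP connects the vision encoder to the backbone.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of GELU; the activation choice is an assumption.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class MLPAdapter:
    """Hypothetical two-layer MLP projecting vision features to the backbone width."""

    def __init__(self, vision_dim, hidden_dim, text_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Randomly initialised weights stand in for trained parameters.
        self.w1 = rng.normal(0.0, 0.02, size=(vision_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.02, size=(hidden_dim, text_dim))
        self.b2 = np.zeros(text_dim)

    def __call__(self, patches):
        # patches: (num_patches, vision_dim) -> (num_patches, text_dim)
        hidden = gelu(patches @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2


def build_input_sequence(image_embeds, text_embeds):
    # Prepend projected image tokens to the text token embeddings (assumed ordering).
    return np.concatenate([image_embeds, text_embeds], axis=0)


# Toy sizes chosen for illustration only; they are not the model's real dimensions.
adapter = MLPAdapter(vision_dim=32, hidden_dim=64, text_dim=48)
patch_features = np.ones((10, 32))   # 10 image patches from the vision encoder
text_embeds = np.zeros((5, 48))      # 5 text tokens already in backbone space
sequence = build_input_sequence(adapter(patch_features), text_embeds)
print(sequence.shape)
```

Because the vision encoder handles dynamic resolution, the number of patch rows would vary per image; the adapter is applied row by row, so it does not depend on that count.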

The hybrid backbone is the point: Zamba2-VL reaches competitive VLM scores at roughly an order of magnitude lower time-to-first-token than the closest Transformer baseline, with the gap most pronounced on 32K-token prefills. It is especially strong on visual counting (the 1.2B variant scores 62.5 on PixMoCount, nearly double comparable Transformer baselines) and on document and chart understanding. Not currently scored on Artificial Analysis.
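Time-to-first-token is the delay between submitting a prompt and receiving the first generated token, so it is dominated by prefill cost. A minimal way to measure it for any streaming generator might look like the sketch below. The helper `time_to_first_token` and the stand-in `fake_stream` are hypothetical names introduced for illustration; a real measurement would wrap the model's own streaming interface instead.

```python
import time
from typing import Iterable, Iterator, List, Tuple


def time_to_first_token(stream: Iterable[str]) -> Tuple[str, float]:
    """Return the first token of a stream and the seconds spent waiting for it."""
    start = time.perf_counter()
    first = next(iter(stream))
    return first, time.perf_counter() - start


def fake_stream(prefill_seconds: float, tokens: List[str]) -> Iterator[str]:
    # Hypothetical stand-in for a model: the sleep plays the role of prompt prefill.
    time.sleep(prefill_seconds)
    for token in tokens:
        yield token


first, ttft = time_to_first_token(fake_stream(0.05, ["The", " image", " shows"]))
print(f"first token {first!r} after {ttft * 1000:.1f} ms")
```

Running the same helper over the same long prompt on two models gives a like-for-like comparison, provided hardware, batch size, and precision are held fixed.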

Blog (Zyphra) | HuggingFace

Model Details

Architecture DENSE

Parameters 7B (largest variant; also 1.2B and 2.7B)

License Apache 2.0

Variants

Name	Parameters	Notes
Zamba2-VL-1.2B	1.2B	—
Zamba2-VL-2.7B	2.7B	—
Zamba2-VL-7B	7B	—
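As a rough sizing aid, the weight memory for each variant can be estimated as parameter count times bytes per parameter. The sketch below uses the nominal sizes from the table and assumes they cover the whole model; the source does not say whether the vision encoder is included in those counts. The estimate covers weights only and leaves out activations, recurrent state, and any attention cache.

```python
# Nominal parameter counts taken from the variant names (assumed to be totals).
VARIANTS = {
    "Zamba2-VL-1.2B": 1.2e9,
    "Zamba2-VL-2.7B": 2.7e9,
    "Zamba2-VL-7B": 7.0e9,
}

# Storage cost per parameter for a few common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "bf16") -> float:
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in VARIANTS.items():
    sizes = ", ".join(
        f"{dtype}: {weight_memory_gb(params, dtype):.1f} GB" for dtype in BYTES_PER_PARAM
    )
    print(f"{name}: {sizes}")
```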

Tags: multimodal, vision, hybrid-ssm, open-weight, efficiency
