Molmo 2 is an open VLM family that extends Molmo's image strengths to video and multi-image understanding. It ships in three variants: 4B (Qwen 3), 8B (Qwen 3), and 7B-O (OLMo backbone, fully open end-to-end). Training data includes 7 new video datasets and 2 multi-image datasets, all collected without closed VLMs.

State-of-the-art open model for video tracking, reported to outperform Gemini 3 Pro on that task. The 8B variant surpasses the prior 72B Molmo on image QA. Capabilities: video QA, video counting, video tracking, and point-driven grounding across single-image, multi-image, and video inputs.

Model Details

Architecture: Dense

Variants

Name          Parameters  Notes
Molmo 2 4B    4B          -
Molmo 2 8B    8B          -
Molmo 2-O 7B  7B          OLMo backbone, fully open
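
For readers who want to select a variant programmatically, the table can be expressed as a small lookup. This is only a sketch: the `Variant` class, its field names, and the `pick` helper are hypothetical and not part of any Molmo release. The values are transcribed from this card, and `marked_fully_open` records only whether the card labels a variant "fully open"; it makes no claim about the other variants' licensing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    """One row of the variants table (field names are illustrative)."""
    name: str
    params_billion: int
    backbone: str
    marked_fully_open: bool  # True only where the card says "fully open"


# Values transcribed from the card above.
VARIANTS = [
    Variant("Molmo 2 4B", 4, "Qwen 3", False),
    Variant("Molmo 2 8B", 8, "Qwen 3", False),
    Variant("Molmo 2-O 7B", 7, "OLMo", True),
]


def pick(max_params_billion: int, require_fully_open: bool = False) -> Optional[Variant]:
    """Return the largest variant within the parameter budget, or None."""
    candidates = [
        v for v in VARIANTS
        if v.params_billion <= max_params_billion
        and (v.marked_fully_open or not require_fully_open)
    ]
    return max(candidates, key=lambda v: v.params_billion, default=None)
```

Calling `pick(8)` selects the 8B variant, while `pick(8, require_fully_open=True)` falls back to the OLMo-based 7B variant.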

Paper

arXiv: 2601.10611

multimodal, vision, video, open-source, open-weight

Related