Molmo
model
Open family of vision-language models (VLMs) with fully open training data (PixMo). Pioneered image pointing capabilities. Trained in two stages: dense-captioning pre-training, then supervised fine-tuning for QA, document reading, and pointing. Narrows the gap between open and proprietary multimodal systems. Published at CVPR 2025.
Model Details
Architecture: Dense
Paper
arXiv: 2409.17146
Venue: CVPR 2025