Open vision-language model (VLM) family with fully open training data (PixMo). Pioneered image pointing. Trained in two stages: dense-captioning pre-training, followed by supervised fine-tuning for QA, document reading, and pointing. Narrows the gap between open and proprietary multimodal systems. Published at CVPR 2025.

Model Details

Architecture DENSE

Paper

Venue CVPR 2025
multimodal, vision, open-source, open-weight
