An 8B-parameter multimodal model, built on Qwen2-7B, reported to surpass GPT-4V on single-image, multi-image, and video understanding tasks. It supports real-time inference on mobile devices such as iPads, and introduces spatio-temporal compression for video processing along with needle-in-a-haystack retrieval over long-context multimodal inputs.

Model Details

Parameters: 8B

Paper

arXiv: 2408.01800

on-device, multimodal, vision, video, open-weight

Related