A vision-language model series achieving GPT-4V-level performance on mobile devices. MiniCPM-V 2.0 was the first multimodal model deployed natively on a smartphone; the series progressed through V 2.6, gaining world-class OCR and video understanding.

Outputs (4)

MiniCPM-V 2.0

model

First multimodal model deployed natively on a smartphone.

MiniCPM-Llama3-V 2.5

model

GPT-4V-level performance in an 8B-parameter package with world-class OCR capabilities.

Parameters 8B

MiniCPM-V 2.6

model

Introduced multi-image and video understanding, outperforming GPT-4V on major benchmarks.
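
As a usage illustration, here is a minimal single-image inference sketch. It assumes the openbmb/MiniCPM-V-2_6 checkpoint on Hugging Face, loading via transformers with trust_remote_code, and the repository's custom chat() method; names and interface details are taken from the public model card and may change between releases.

```python
# Minimal single-image chat sketch for MiniCPM-V 2.6.
# Assumes the openbmb/MiniCPM-V-2_6 Hugging Face checkpoint and its
# custom chat() interface (loaded via trust_remote_code); verify against
# the model card, as the interface may differ between releases.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-2_6",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-V-2_6", trust_remote_code=True
)

# Images are passed inline in the message content list.
image = Image.open("example.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "What is in the image?"]}]

answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```

Per the model card, multi-image and video inputs follow the same pattern: additional images, or sampled video frames, are appended to the content list of the user message.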

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

paper

arXiv: 2408.01800

multimodal · on-device · vision · video
