Comprehensive benchmark for vision-language models.

Evaluation Details

Questions: 2,374
Tasks: 14
Domains: 7
Scoring: Paired binary yes/no questions, two per image. Subtask score = accuracy + accuracy+, where each term is a percentage and accuracy+ counts an image only if both of its questions are answered correctly. Maximum 200 per subtask; full scores are 2000 for perception (10 subtasks) and 800 for cognition (4 subtasks).
Random baseline: 50% accuracy / 25% accuracy+ (a random guesser answers both questions for an image correctly with probability 0.5 × 0.5)
Domains: coarse-grained object perception (existence, count, position, color), fine-grained recognition (movie posters, celebrities, scenes, landmarks, artworks), OCR, commonsense reasoning, numerical calculation, text translation, code reasoning
Tags: benchmark, multimodal, evaluation
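
The scoring rule described above can be illustrated with a minimal sketch. It assumes each model answer has already been judged correct or incorrect and is supplied as an (image_id, is_correct) pair. The function name `subtask_score` and the record layout are hypothetical, chosen for illustration only; they are not part of any official evaluation script.

```python
from collections import defaultdict


def subtask_score(records):
    """Return accuracy + accuracy+ (range 0 to 200) for one subtask.

    `records` is an iterable of (image_id, is_correct) pairs, normally two
    per image. This layout is an assumption made for illustration.
    """
    per_image = defaultdict(list)
    for image_id, is_correct in records:
        per_image[image_id].append(bool(is_correct))

    n_questions = sum(len(answers) for answers in per_image.values())
    if n_questions == 0:
        raise ValueError("no records to score")

    # accuracy: percentage of individual questions answered correctly
    n_correct = sum(sum(answers) for answers in per_image.values())
    accuracy = 100.0 * n_correct / n_questions

    # accuracy+: percentage of images whose questions are all answered correctly
    n_images_all_correct = sum(all(answers) for answers in per_image.values())
    accuracy_plus = 100.0 * n_images_all_correct / len(per_image)

    return accuracy + accuracy_plus


# Two images: both questions right on img1, one of two right on img2.
records = [("img1", True), ("img1", True), ("img2", True), ("img2", False)]
print(subtask_score(records))  # 75.0 accuracy + 50.0 accuracy+ = 125.0
```

Summing this value over the perception subtasks and, separately, over the cognition subtasks would give the two totals that are compared against the full scores of 2000 and 800.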