SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
The most comprehensive graduate-level knowledge and reasoning benchmark, spanning 285 disciplines (13 major disciplines, 72 fields, 285 subfields) with 26,529 questions, a roughly 130× scale-up over GPQA Diamond's 198 questions in 3 domains. It is the first to include long-tail disciplines such as agriculture, light industry, and service science alongside mainstream STEM. Questions average 9.67 answer options (vs. 4 in GPQA), making random guessing far harder (~10% vs. 25%).
Uses a Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement. DeepSeek-R1 leads at 61.82% accuracy, indicating that substantial headroom remains. Developed by the Doubao (Seed) team at ByteDance in collaboration with the M-A-P open-source community.
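The scale-up and random-guessing figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in this summary; the function and constant names are illustrative and not part of any SuperGPQA tooling. It approximates the chance baseline as 1 divided by the mean option count, which is a slight underestimate if the number of options varies from question to question.

```python
# Figures quoted in the summary above.
SUPERGPQA_QUESTIONS = 26_529
GPQA_DIAMOND_QUESTIONS = 198
SUPERGPQA_MEAN_OPTIONS = 9.67
GPQA_OPTIONS = 4


def scale_factor(new_size: int, old_size: int) -> float:
    """How many times larger the new benchmark is than the old one."""
    return new_size / old_size


def approx_random_accuracy(mean_options: float) -> float:
    """Approximate accuracy of uniform random guessing.

    Assumption: every question has about `mean_options` choices. If option
    counts vary, the true expected accuracy is the mean of 1/k over all
    questions, which is at least this value.
    """
    return 1.0 / mean_options


scale = scale_factor(SUPERGPQA_QUESTIONS, GPQA_DIAMOND_QUESTIONS)
supergpqa_chance = approx_random_accuracy(SUPERGPQA_MEAN_OPTIONS)
gpqa_chance = approx_random_accuracy(GPQA_OPTIONS)

print(f"Scale-up over GPQA Diamond: {scale:.0f}x")
print(f"Chance baseline, SuperGPQA: {supergpqa_chance:.1%}")
print(f"Chance baseline, GPQA:      {gpqa_chance:.1%}")
```

The exact ratio comes out slightly above the rounded 130× in the text, and the chance baseline slightly above 10%, consistent with both being quoted as approximations.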