The paper introduces MedXpertQA, a benchmark motivated by the observation that existing medical benchmarks offer insufficient difficulty and clinical relevance and suffer from data leakage. The authors collect expert-level questions across 17 specialties and 11 body systems, then apply rigorous filtering and augmentation (including expert review and difficulty and diversity controls) to mitigate leakage and raise the difficulty bar. They also add a multimodal (MM) subset with clinical images, along with Reasoning and Understanding question types.

Evaluating leading models reveals a large gap to expert-level performance, positioning MedXpertQA as a probe for medical reasoning in frontier and reasoning models. The paper was published at ICML 2025, and the benchmark is MIT-licensed and integrated into OpenCompass. The work is by the Tsinghua C3I group (lead authors Yuxin Zuo and Shang Qu) and Shanghai AI Laboratory (senior author Bowen Zhou).

Paper

Venue: ICML 2025
Authors: Yuxin Zuo · Shang Qu · Yifei Li · Zhangren Chen · Xuekai Zhu · Ermo Hua · Kaiyan Zhang · Ning Ding · Bowen Zhou
benchmark · evaluation · reasoning · medical · research

Related