LRM-Eval / ROME
Contamination-free evaluation of large reasoning models (LRMs) across 10 text reasoning tasks and 8 visual reasoning tasks. The ROME subset provides 281 image-question pairs that test visual reasoning from visual clues. One of the most comprehensive LRM evaluations to date, benchmarking GPT-5, o3, o4-mini, Gemini, Claude, DeepSeek, Qwen, and others. Identifies concerning signals of misalignment between thinking and answers: cases where the reasoning trace expresses uncertainty, yet the model states its final answer deterministically.
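If the benchmark is distributed via the Hugging Face Hub, a minimal inspection sketch might look like the following. The repository ID and field names here are placeholder assumptions for illustration, not confirmed by this entry; consult the official release for the actual identifiers.

```python
# Minimal sketch for loading and inspecting the ROME visual-reasoning split.
# Assumptions (hypothetical, not from this entry): the dataset lives on the
# Hugging Face Hub under "placeholder-org/ROME" with "image" and "question"
# fields per example and a "test" split.
from datasets import load_dataset

ds = load_dataset("placeholder-org/ROME", split="test")  # hypothetical repo ID

print(len(ds))  # per the description above, ROME has 281 image-question pairs

for example in ds.select(range(3)):
    # "question" is an assumed field name; adjust to the published schema
    print(example["question"])
```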