CUBE: A Standard for Unifying Agent Benchmarks

CUBE is a proposed standard for tackling agent-benchmark fragmentation. The paper describes it as "a universal protocol standard built on MCP and Gym that allows benchmarks to be wrapped once and used everywhere." It separates task, benchmark, package, and registry concerns into distinct API layers, so a benchmark can be integrated once and then run on any compliant evaluation platform without bespoke glue code.
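To make the four-layer separation concrete, here is a minimal Python sketch. It is an illustration only, not the cube-standard API: every name in it (`Task`, `EchoTask`, `Benchmark`, `Package`, `Registry`, `evaluate`) is hypothetical, the task interface is only loosely Gym-like (`reset` and `step`), and the MCP side of the standard is not modeled at all.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol, Tuple


class Task(Protocol):
    """Task layer (hypothetical): one Gym-style episode."""

    def reset(self) -> str: ...

    def step(self, action: str) -> Tuple[str, float, bool]: ...


@dataclass
class EchoTask:
    """Toy task: the agent succeeds by repeating the target word."""

    target: str

    def reset(self) -> str:
        return f"Say the word: {self.target}"

    def step(self, action: str) -> Tuple[str, float, bool]:
        reward = 1.0 if action == self.target else 0.0
        return "done", reward, True  # single-step episode


@dataclass
class Benchmark:
    """Benchmark layer (hypothetical): a named collection of tasks."""

    name: str
    tasks: List[Task]


@dataclass
class Package:
    """Package layer (hypothetical): metadata plus a benchmark factory."""

    name: str
    version: str
    load: Callable[[], Benchmark]


@dataclass
class Registry:
    """Registry layer (hypothetical): look up packages by name."""

    packages: Dict[str, Package] = field(default_factory=dict)

    def register(self, package: Package) -> None:
        self.packages[package.name] = package

    def get(self, name: str) -> Benchmark:
        return self.packages[name].load()


def evaluate(benchmark: Benchmark, agent: Callable[[str], str]) -> float:
    """Platform side: depends only on the Task interface, not on any one benchmark."""
    total = 0.0
    for task in benchmark.tasks:
        obs, done = task.reset(), False
        while not done:
            obs, reward, done = task.step(agent(obs))
            total += reward
    return total / len(benchmark.tasks)


registry = Registry()
registry.register(
    Package("echo-bench", "0.1",
            lambda: Benchmark("echo-bench", [EchoTask("cube"), EchoTask("gym")]))
)
# A trivial agent that repeats whatever follows the last ": " in the observation.
score = evaluate(registry.get("echo-bench"), lambda obs: obs.rsplit(": ", 1)[-1])
```

In this sketch the evaluation loop never refers to `EchoTask` directly. That is the "wrapped once and used everywhere" idea in miniature: any other benchmark exposing the same task interface could be registered and scored by the same `evaluate` function.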

The work is a 10-page position paper with an open-source reference implementation (The-AI-Alliance/cube-standard). Led by ServiceNow AI Research (first author Alexandre Lacoste, with Nicolas Gontier, Massimo Caccia, Alexandre Drouin and colleagues), the authorship spans the AI Alliance: IBM Research (Leshem Choshen, Asaf Yehudai, Michal Shmueli-Scheuer, Elron Bandel), Mila / McGill (Siva Reddy, Xing Han Lù), CMU (Graham Neubig), UC Berkeley (Dawn Song), HKU (Tao Yu), and Ohio State (Yu Su), alongside startups Silverstream.ai and Jetty.

Paper (arXiv) · GitHub (reference implementation)

Authors: Alexandre Lacoste · Nicolas Gontier · Oleh Shliazhko · Aman Jaiswal · Kusha Sareen · Shailesh Nanisetty · Joan Cabezas · Manuel Del Verme · Omar G. Younis · Simone Baratta · Matteo Avalle · Imene Kerboua · Xing Han Lù · Elron Bandel · Michal Shmueli-Scheuer · Asaf Yehudai · Leshem Choshen · Jonathan Lebensold · Sean Hughes · Massimo Caccia · Alexandre Drouin · Siva Reddy · Tao Yu · Yu Su · Graham Neubig · Dawn Song

agents · benchmark · evaluation · infrastructure · research