75 machine-learning engineering competitions drawn from Kaggle, testing whether AI agents can autonomously train models, prepare datasets, and run experiments. Each task is a real Kaggle competition with a defined metric and held-out test set. Agents operate in sandboxed environments with code execution and file I/O, and submissions are scored against Kaggle's medal thresholds (bronze/silver/gold).

With AIDE scaffolding, o1-preview earns at least a bronze medal in 16.9% of competitions. Published at ICLR 2025, it is a key benchmark for measuring the ML research automation capability that labs are racing toward. By Chan, Fishman, Korinek et al. (OpenAI).

Paper

Venue ICLR 2025
benchmark · evaluation · coding · agentic