ICLR 2025

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, Lilian Weng

9 citations

Abstract

This paper develops a theory of search stability for long-running agents operating under finite active context, delayed verification, sparse and expensive feedback, path-dependent lock-in, and lossy state compression. The focus is not only on model quality but also on the mesoscopic law layer that governs how an agent should preserve, retire, substitute, compress, branch, and reset competing hypotheses or route summaries over time. The framework models search state as an active hypothesis portfolio partitioned into coarse families under a context budget. Each item carries promise, verification lag, retention cost, staleness, overlap burden, and inertia. A central contribution is a set-valued adequacy semantics: within each discrimination window, the system is associated with a nonempty random set of operationally adequate families induced by the realized initial information state and downstream randomness. Success is defined as preserving recoverability of at least one adequate family at the first strongly discriminating verification stage, which avoids dependence on a selector-defined pseudo-truth. The paper derives threshold and impossibility results for context contamination, shadow retirement, delayed-verification coverage, reserve feasibility, and budget-limited adequacy. It also develops a theory of within-family semantic substitution, compressed-control alias hazard, reset admissibility, stale-legacy drift, diagnostic regret decomposition, and rolling-window lifting for long-running agents with repeated verification stages and changing task modes. The intended contribution is an audit-and-design law layer for bounded-memory AI systems. The theory is deliberately narrow and conditional, but it aims to make long-horizon agent failures more diagnosable by separating failures caused by bounded-memory hypothesis ecology from failures caused by raw model weakness, and from mixtures of both.
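
The portfolio model and success criterion described above can be illustrated with a minimal Python sketch. It is not the paper's formalism. The names `Hypothesis`, `within_budget`, and `preserves_adequacy` are hypothetical. Charging the context budget as the sum of retention costs is an assumption. So is treating a family as "recoverable" whenever one of its members is still in the active portfolio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One portfolio item with the six attributes named in the abstract."""
    name: str
    family: str              # coarse family this item belongs to
    promise: float
    verification_lag: int
    retention_cost: float    # assumed: context consumed while the item stays active
    staleness: float
    overlap_burden: float
    inertia: float


def within_budget(portfolio, context_budget):
    """Assumed budget rule: total retention cost must not exceed the budget."""
    return sum(h.retention_cost for h in portfolio) <= context_budget


def preserves_adequacy(portfolio, adequate_families):
    """Assumed success test: at least one adequate family is still represented."""
    active_families = {h.family for h in portfolio}
    return bool(active_families & set(adequate_families))


# Illustrative values only; they are not taken from the paper.
portfolio = [
    Hypothesis("h1", "A", 0.8, 2, 3.0, 0.1, 0.2, 0.5),
    Hypothesis("h2", "B", 0.4, 5, 2.0, 0.3, 0.1, 0.2),
]
adequate = {"B", "C"}  # a nonempty set of adequate families, not a single "true" one

print(within_budget(portfolio, 6.0))              # True: 3.0 + 2.0 <= 6.0
print(preserves_adequacy(portfolio, adequate))    # True: family B is still active
print(preserves_adequacy(portfolio[:1], adequate))  # False: retiring h2 loses family B
```

The last line shows the failure the set-valued criterion is meant to flag: retiring the lower-promise item `h2` keeps the portfolio within budget but removes the only adequate family before verification.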