ICML2025

Automated Benchmark Generation for Repository-Level Coding Tasks

Konstantinos Vergopoulos, Mark Niklas Müller, Martin T. Vechev

Abstract

Code Agent development is an extremely active research area, where a reliable performance metric is critical for tracking progress and guiding new developments. This demand is underscored by the meteoric rise in popularity of SWE-Bench. This benchmark challenges code agents to generate patches addressing GitHub issues given the full repository as context. The correctness of generated patches is then evaluated by executing a human-written test suite extracted from the repository after the issue's resolution.