EMNLP 2025

Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors

Zhiyu Yang, Shuo Wang, Yukun Yan, Yang Deng

Abstract

LLMs are transforming software development, yet most code benchmarks still emphasize syntactic or functional correctness in simple, single-error cases. These settings miss the core difficulty of real-world data science debugging, where a runtime error can surface several lines away from the line that caused it (multi-hop) and where several errors often occur together in one script (multi-bug). We introduce DSDBench: Data Science Debugging Benchmark, the first benchmark to systematically evaluate LLMs on this challenge. Unlike general debugging benchmark suites such as SWE-bench, DSDBench targets non-expert, data-centric scripting, where practitioners rely heavily on black-box libraries and write exploratory code that is error-prone and difficult to debug. Evaluations of state-of-the-art LLMs reveal large performance gaps: even frontier models that excel at code generation fail to reliably trace and resolve these errors, exposing a critical "generation versus understanding" gap. DSDBench provides a resource to drive progress toward more robust and trustworthy AI-assisted data science.

cause_error_line: y_pred = model.predict(X_train)
effect_error_line (different from cause): mse = mean_squared_error(y_test, y_pred)
error_message: ValueError: Found input variables with inconsistent numbers of samples

cause_error_line: X = imputer.fit_transform(y)
effect_error_line (different from cause): model.fit(X_train, y_train)
error_message: ValueError: Input y contains NaN.
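The first cause/effect annotation above can be reproduced in miniature. The sketch below is illustrative only and is not DSDBench's evaluation code: `predict` and `mean_squared_error` are hypothetical pure-Python stand-ins for a fitted model and a library metric, and `locate_effect` is an assumed helper that reads the traceback to find where the exception surfaced inside the script. It shows why the traceback alone points at the effect line rather than the cause line.

```python
import traceback


def mean_squared_error(y_true, y_pred):
    # Hypothetical stand-in for a library metric: rejects mismatched lengths.
    if len(y_true) != len(y_pred):
        raise ValueError(
            "Found input variables with inconsistent numbers of samples: "
            f"[{len(y_true)}, {len(y_pred)}]"
        )
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def predict(X):
    # Hypothetical stand-in for a fitted model: one prediction per input row.
    return [2.0 * x for x in X]


def run_snippet():
    X_train, X_test = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0]
    y_test = [10.0, 12.0]
    y_pred = predict(X_train)  # cause: predicts on the training split
    return mean_squared_error(y_test, y_pred)  # effect: the error surfaces here


def locate_effect(func):
    """Return (line offset inside func where the error surfaced, error message)."""
    try:
        func()
    except Exception as exc:
        frames = traceback.extract_tb(exc.__traceback__)
        # Keep only frames that belong to the script under test, not the library.
        own = [f for f in frames if f.name == func.__name__]
        offset = own[-1].lineno - func.__code__.co_firstlineno
        return offset, str(exc)
    return None, None


offset, message = locate_effect(run_snippet)
print("effect line offset in run_snippet:", offset)
print("error_message:", message)
```

The reported offset is that of the `mean_squared_error` call, one line below the `predict(X_train)` call that actually introduced the bug. Recovering the cause line requires reasoning about data flow (here, that `y_pred` has the length of `X_train`, not `y_test`), which the traceback does not provide.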