ACL 2024

DebugBench: Evaluating Debugging Capability of Large Language Models

Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Haotian Hui, Weichuan Liu, Zhiyuan Liu, Maosong Sun

Abstract

Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs' debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce 'DebugBench', an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into source data with GPT-4, and conduct rigorous quality checks. We evaluate two commercial and three open-source models in a zero-shot scenario. We find that (1) while closed-source models exhibit inferior debugging performance compared to humans, open-source models fail to attain any pass rate scores; (2) the complexity of debugging notably fluctuates depending on the bug category; (3) incorporating runtime feedback has a clear impact on debugging performance, but it is not always helpful. As an extension, we also compare LLM debugging and code generation, revealing a strong correlation between them for closed-source models. These findings will benefit the development of LLMs in debugging.

[...] crucial component in programming, consuming 35-50% of the development duration and 50-75% of the total budget (McConnell, 2004). However, unlike coding, the debugging abilities of LLMs remain relatively unexplored.

One primary obstacle in code debugging research is the lack of evaluation benchmarks.
While some basic evaluations (Prenner et al., 2022; Sobania et al., 2023; Xia and Zhang, 2023b; Zhang et al., 2023) verify the effectiveness of LLM-based debugging methods, these evaluations have notable limitations that prevent us from comprehensively assessing the debugging capabilities of LLMs, as exhibited in Table 1. First, as Zhang et al. (2023) revealed, existing debugging benchmarks (Just et al., 2014; Lin et al., 2017) have been more or less [...]

[...] CodeLlama-34b-instruct (Rozière et al., 2023) and BLOOM (Workshop et al., 2022) in zero-shot scenarios. Our empirical study reveals: (1) LLM debugging falls short of human performance. Open-source models attain a pass rate of 0%, struggling to produce meaningful debugging responses. Closed-source LLMs significantly surpass open-source ones but still fall short of human-level performance; (2) The difficulty of fixing different types of errors differs. Multiple errors and logical errors are significantly more challenging to repair than syntax and reference errors; (3) Runtime feedback has a clear impact on LLMs' debugging performance but is not always helpful. While runtime feedback consistently boosts debugging performance on syntax and reference bugs, the feedback information is unhelpful for logic errors.

To gain deeper insights into the overall programming capabilities of LLMs, we also compare closed-source models' performance on debugging and code generation. Experimental results indicate that for closed-source models: (1) fixing syntax or reference errors is generally easier than code generation, while repairing logical or multiple errors can be equally hard or even harder; (2) the debugging and code generation performance of LLMs are correlated, which indicates that the abilities of LLMs to approach these two tasks are positively related.
All these findings are crucial for comprehending the debugging capabilities of LLMs and developing more comprehensive code models.

2 Benchmark Construction

As illustrated in Figure 2, to construct DebugBench, we first collect questions, code snippets, and examples from the LeetCode (2023) community, then employ GPT-4 (OpenAI, 2023) for bug implantation. To ensure the integrity of the benchmark, we conduct automatic filtering and final human inspection.

Figure 2: This figure illustrates the construction of DebugBench. We first collect code snippets from the LeetCode (2023) community, then employ GPT-4 (OpenAI, 2023) for bug implantation, and finally conduct human / LLM evaluation on the benchmark. Automatic filtering and final human inspection are conducted to ensure the integrity of the benchmark. The figure also provides qualitative cases for code snippets, bug instances, and evaluation samples. More examples are accessible in Appendix H.

2.1 Formulation of Debugging

Consider the input-output pairs $(x_i, y_i)$, where each $x_i$ is a program input and $y_i$ is the corresponding desired output; together, they compose a set $R$ that defines the programming problem.

Let $a_\theta(x) = y$ denote a program $a$, based on a code script $\theta$, that maps an input $x$ to an output $y$. W