ACL2024

Revisiting Code Similarity Evaluation with Abstract Syntax Tree Edit Distance

Yewei Song, Cedric Lothritz, Xunzhu Tang, Tegawendé F. Bissyandé, Jacques Klein

摘要

This paper revisits recent code similarity evaluation metrics, particularly focusing on the application of Abstract Syntax Tree (AST) editing distance in diverse programming languages. In particular, we explore the usefulness of these metrics and compare them to traditional sequence similarity metrics. Our experiments showcase the effectiveness of AST editing distance in capturing intricate code structures, revealing a high correlation with established metrics. Furthermore, we explore the strengths and weaknesses of AST editing distance and prompt-based GPT similarity scores in comparison to BLEU score, execution match, and Jaccard Similarity. We propose, optimize, and publish an adaptable metric that demonstrates effectiveness across all tested languages, representing an enhanced version of Tree Similarity of Edit Distance (TSED). 1 Introduction and Related Work In the fields of natural language processing and software engineering, code generation tasks are gaining more and more attention. Assessing the quality of generated code is now critically important, but we still lack evaluation methods other than traditional statistical sequence evaluation methods. Widely used semantic evaluation metrics like BLEU score and Jaccard similarity rely on statistical characteristics, overlooking the intricate grammatical structures and logical relationships inherent in complex programming languages. However, recent developments in the NLP field paved the way for novel evaluation metrics which we explore in this study. For one, the staggering number of powerful large language models (LLMs) such as GPT-3.5/4 (Achiam et al., 2023) revolutionized the NLP landscape and led to noteworthy advancements in the realm of code review and evaluation (Wang et al., 2023; Tang et al., 2024). Another recent study introduced the novel TSED metric and used it to evaluate text-to-SQL tasks (Song et al., 041 2023). For this study, we take advantage of these 042 developments to (1) prompt the GPT-4 model to 043 generate similarity scores for code, and (2) expand 044 on the TSED metric. 045 We utilize these two different metrics (GPT and 046 TSED) to evaluate the structural similarity of differ-047 ent programming languages and how they relate to 048 execution matches. Furthermore, we address how 049 these metrics are correlated to semantic similarity 050 metrics like the BLEU score. Finally, we investi-051 gate some limitations of these metrics by delving 052 into the impact of TSED's penalty weight of tree 053 operations on evaluation accuracy and exploring 054 the stability of outputs from the GPT LLMs. 055 As a result, we have these 3 contributions from 056 this research: (a) we propose and publish a new tool 057 for 48 programming languages 1 , (b) we discuss 2 058 recent evaluation metrics and 2 traditional metrics 059 and compare them via correlation coefficient, recall 060 to execution match, (c) we discuss the unstable 061 nature of GPT similarity scoring and the ways to 062 optimize TSED. 063 2 Approaches 064 2.1 TSED on Programming Languages 065 Applying the TSED evaluation method, initially 066 designed for SQL analysis, we have undergone 067 modifications to extend its applicability to various 068 programming languages. The fundamental TSED 069 approach, illustrated in Figure 1, encompasses AST 070 parsing, AST Editing Distance Calculation, and 071 normalization, closely resembling the methodology 072 outlined in the original paper. However, we have 073 made modifications to both the AST parsing and 074 normalization.