ASE2024

ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code

Jia Feng, Jiachen Liu, Cuiyun Gao, Chun Yong Chong, Chaozheng Wang, Shan Gao, Xin Xia

被引用 7 次

摘要

In recent years, with the widespread attention of academia and industry on the application of large language models (LLMs) to coderelated tasks, an increasing number of large code models (LCMs) have been proposed and corresponding evaluation benchmarks have continually emerged. Although existing evaluation benchmarks are helpful for comparing different LCMs, they may not reflect the performance of LCMs in various development scenarios. Specifically, they might evaluate model performance in only one type of scenario (e.g., code generation or code completion), whereas real development contexts are diverse and may involve multiple tasks such as code generation, code completion, API recommendation, and test function generation. Additionally, the questions may not originate from actual development practices, failing to capture the programming challenges faced by developers during the development process. To address the aforementioned issues, we propose Complex-CodeEval, a new benchmark for evaluating the performance of LCMs in various development scenarios. ComplexCodeEval includes 3,897 Java samples from 1,055 high-star GitHub repositories and 7,184 Python samples from 2,107 high-star repositories. Each function sample in ComplexCodeEval contains multiple annotations (e.g., function signatures, docstrings and reference APIs) to accommodate various downstream tasks. Furthermore, to better * Corresponding author. The author is also affiliated with Peng Cheng Laboratory.