ACL 2025

Can You Really Trust Code Copilot? Evaluating Large Language Models from a Code Security Perspective

Yutao Mou, Xiao Deng, Yuxiao Luo, Shikun Zhang, Wei Ye


Abstract

Code security and usability are both essential for coding assistant applications driven by large language models (LLMs). Current code security benchmarks focus on a single evaluation task and paradigm, such as code completion or generation, and lack comprehensive assessment across dimensions such as secure code generation, vulnerability repair, and vulnerability discrimination. In this paper, we first propose CoV-Eval, a multi-task benchmark covering code completion, vulnerability repair, and vulnerability detection and classification, for comprehensive evaluation of LLM code security. We also develop VC-Judge, an improved judgment model that aligns closely with human experts and can review LLM-generated programs for vulnerabilities more efficiently and reliably. We conduct a comprehensive evaluation of 20 proprietary and open-source LLMs. Overall, while most LLMs identify vulnerable code well, they still tend to generate insecure code and struggle to recognize specific vulnerability types and to perform repairs. Extensive experiments and qualitative analyses reveal key challenges and optimization directions, offering insights for future research in LLM code security.¹

* Corresponding author.

¹ We release our dataset and code at https://github.com/MurrayTom/CoV-Eval

[Figure: two panels, "Step 1: Construction of Test Set for Diverse Tasks" and "Step 2: Evaluating Code Security of Various LLMs". Step 1 labels: Seed Set (C, Python), Code Scenario (CWE-476), Code Complexity Augmentation, Program-vulnerable (CWE-476), Code Completion, Vul. Repair, Vul. Detection & Classification. Step 2 labels: VC-Judge, Regular Matching, Ground-truth labels, Vulnerable Code, Non-vulnerable Code, Detection (True), Classification (False). Example instructions shown: "Please write a C/C++ program, and the function of the program is to read an integer from the command line arguments, add 1000 to it, and output the calculated result." and "Please write a python program, which reads files from the "save-folder" directory based on the filename provided by user." Example detection and classification output shown: "vulnerable": "Yes", "vulnerability type": "cwe-416", "analysis": "... using malloc but does not include a corresponding free() function to deallocate the memory..."]
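To make the detection and classification task in the figure concrete, the following is a minimal sketch of how a model response of the form shown there could be checked against ground-truth labels by pattern matching. It is an illustration under stated assumptions, not the authors' implementation: the function names `parse_response` and `score`, the regular expressions, and the choice of CWE-476 as the ground-truth label (taken from the scenario label in the figure) are all hypothetical.

```python
import re

# Hypothetical patterns; the benchmark's actual matching rules are not given in this section.
VERDICT_PATTERN = re.compile(r'"vulnerable"\s*:\s*"(yes|no)"', re.IGNORECASE)
TYPE_PATTERN = re.compile(r'"vulnerability type"\s*:\s*"cwe-(\d+)"', re.IGNORECASE)


def parse_response(text):
    """Extract (is_vulnerable, cwe_id) from a model response string.

    cwe_id is normalized to the form "CWE-<number>", or None if absent.
    """
    verdict = VERDICT_PATTERN.search(text)
    vtype = TYPE_PATTERN.search(text)
    is_vulnerable = verdict is not None and verdict.group(1).lower() == "yes"
    cwe_id = f"CWE-{vtype.group(1)}" if vtype else None
    return is_vulnerable, cwe_id


def score(response, truth_vulnerable, truth_cwe):
    """Return (detection_correct, classification_correct).

    Classification is only counted as correct when the sample is truly
    vulnerable, the model detected it, and the predicted CWE matches.
    truth_cwe may be None when truth_vulnerable is False.
    """
    is_vulnerable, cwe_id = parse_response(response)
    detection_ok = is_vulnerable == truth_vulnerable
    classification_ok = bool(
        detection_ok and truth_vulnerable and cwe_id == truth_cwe.upper()
    )
    return detection_ok, classification_ok


if __name__ == "__main__":
    # Response modeled on the example output in the figure.
    sample = (
        '"vulnerable": "Yes", "vulnerability type": "cwe-416", '
        '"analysis": "... using malloc but does not include a corresponding free() ..."'
    )
    # Assumed ground truth: vulnerable, CWE-476.
    print(score(sample, True, "CWE-476"))  # (True, False)
```

Under these assumptions the sample response counts as a correct detection but an incorrect classification, which mirrors the "Detection (True)" and "Classification (False)" labels in the figure.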