S&P 2025

SV-TrustEval-C: Evaluating Structure and Semantic Reasoning in Large Language Models for Source Code Vulnerability Analysis

Yansong Li, Paula Branco, Alexander M. Hoole, Manish Marwah, Hari Manassery Koduvely, Guy-Vincent Jourdan, Stephan Jou

Abstract

As Large Language Models (LLMs) evolve in understanding and generating code, accurately evaluating their reliability in analyzing source code vulnerabilities becomes increasingly vital. While studies have examined LLM capabilities in tasks like vulnerability detection and repair, they often overlook the structure and semantic reasoning that trustworthy vulnerability analysis requires. To address this gap, we introduce SV-TRUSTEVAL-C, a benchmark designed to evaluate LLMs' abilities for vulnerability analysis of code written in the C programming language through two key dimensions: structure reasoning, which assesses how models identify relationships between code elements under varying data and control flow complexities; and semantic reasoning, which examines their logical consistency in scenarios where code is structurally and semantically perturbed. Our results show that current LLMs are far from satisfactory in understanding complex code relationships and that their vulnerability analyses rely more on pattern matching than on robust logical reasoning. These findings underscore the effectiveness of the SV-TRUSTEVAL-C benchmark and highlight critical areas for enhancing the reasoning capabilities and trustworthiness of LLMs in real-world vulnerability analysis tasks. Our initial benchmark dataset is available at https://huggingface.co/datasets/LLMs4CodeSecurity/SV-TrustEval-C-1.0

Question: What happens if we replace the following code snippet X with the proposed variant Y? Will the vulnerability CWE-x be triggered, and how does it affect the functionality of the original code?
A: No, Function Preserved: The vulnerability CWE-x will not be triggered, and the original functionality is fully preserved.
B: No, Function Impaired: The vulnerability CWE-x will not be triggered, but the functionality of the original code is bypassed.
C: Yes: The vulnerability CWE-x will still be triggered.
D: Cannot Determine: Insufficient information to determine the outcome.

Question: Considering the code snippet variants provided below, which variant would trigger CWE-x (not CWE-y or a bypass)?
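
To make the multiple-choice format concrete, the following is a minimal sketch of how a question with the four options A to D might be assembled and scored. It is an illustration only, not the benchmark's actual tooling: the names `OPTION_LABELS`, `build_question`, and `is_correct`, the prompt layout, and the sample C fragments are all assumptions introduced here.

```python
# Hypothetical sketch: the option labels mirror the A-D choices of the
# counterfactual question; everything else is an illustrative assumption.
OPTION_LABELS = {
    "A": "No, Function Preserved",
    "B": "No, Function Impaired",
    "C": "Yes",
    "D": "Cannot Determine",
}


def build_question(snippet: str, variant: str, cwe: str) -> str:
    """Fill the question stem with a code snippet, a variant, and a CWE id."""
    options = "\n".join(f"{key}: {label}" for key, label in OPTION_LABELS.items())
    return (
        "Question: What happens if we replace the following code snippet X "
        f"with the proposed variant Y? Will the vulnerability {cwe} be "
        "triggered, and how does it affect the functionality of the "
        "original code?\n\n"
        f"Snippet X:\n{snippet}\n\n"
        f"Variant Y:\n{variant}\n\n"
        f"{options}"
    )


def is_correct(model_answer: str, gold: str) -> bool:
    """Compare the first letter of the model's reply with the gold option key."""
    reply = model_answer.strip().upper()
    return bool(reply) and reply[0] == gold


# Example C fragments (assumed for illustration, not taken from the dataset).
prompt = build_question(
    snippet="char buf[8];\nstrcpy(buf, input);",
    variant="char buf[8];\nstrncpy(buf, input, sizeof(buf) - 1);\nbuf[sizeof(buf) - 1] = '\\0';",
    cwe="CWE-121",
)
print(prompt)
```

Scoring on the first letter of the reply is a deliberate simplification; a real evaluation harness would likely need stricter answer parsing.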