ASE 2025
Should We Evaluate LLM Based Security Analysis Approaches on Open Source Systems?
Kohei Dozono, Jonas Engesser, Benjamin Hummel, Tobias Roehm, Alexander Pretschner
Cited by 1
Abstract
Existing research has demonstrated promising results when applying large language models (LLMs) to detect security vulnerabilities in source code. However, these studies have evaluated their approaches exclusively on benchmarks drawn from open-source systems, using publicly known vulnerabilities that are likely part of the LLMs' training data. This raises concerns that reported performance metrics may be inflated by data contamination, giving a misleading view of the models' actual capabilities. In this paper, we quantify this effect with a case study that evaluates five frontier LLMs on two carefully curated datasets: CWE-Bench-Java (an open-source dataset) and TS-Vuls (based on a closed-source commercial codebase). To provide a second angle, we also split CWE-Bench-Java by CVE record date to explore temporal contamination relative to each LLM's knowledge cutoff date. Our results reveal that the average F1 score dropped by approximately 20 percentage points from the open-source to the closed-source dataset. Additionally, precision dropped from 56% to 34% on average, a difference that is statistically significant (p < 0.05) for four of the five models. This declining trend is consistent across all tested LLMs and metrics. In contrast, the results for the temporal split on the open-source data are inconclusive, suggesting that using a knowledge cutoff may reduce contamination effects but does not ensure their elimination. Although our study is based on a single closed-source system and is thus not generalizable, these findings provide the first empirical evidence that evaluating LLM-based vulnerability detection on open-source benchmarks may lead to overly optimistic results. This motivates the inclusion of closed-source datasets in future LLM evaluations.
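As an illustration of the temporal split and the metrics named in the abstract, the following is a minimal sketch, not the authors' evaluation pipeline. The `Finding` record, its field names, and the helper functions are hypothetical, and the abstract does not specify how items dated exactly on the cutoff are treated; here they are assumed to count as "before".

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Finding:
    """One benchmark item with the model's verdict (hypothetical schema)."""
    cve_date: date       # CVE record date of the item
    is_vulnerable: bool  # ground-truth label
    flagged: bool        # True if the LLM reported a vulnerability


def split_by_cutoff(findings, cutoff):
    """Partition items into those dated on/before and after the knowledge cutoff."""
    before = [f for f in findings if f.cve_date <= cutoff]
    after = [f for f in findings if f.cve_date > cutoff]
    return before, after


def precision_recall_f1(findings):
    """Compute precision, recall, and F1; each is 0.0 when its denominator is zero."""
    tp = sum(f.flagged and f.is_vulnerable for f in findings)
    fp = sum(f.flagged and not f.is_vulnerable for f in findings)
    fn = sum((not f.flagged) and f.is_vulnerable for f in findings)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Comparing `precision_recall_f1` on the two partitions returned by `split_by_cutoff` (using a model-specific cutoff date) gives the kind of before/after comparison the abstract describes. The same metric function could be applied per dataset for the open-source versus closed-source comparison.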