ISSTA 2025
Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection
Niklas Risse, Jing Liu, Marcel Böhme
Cited by: 8
Abstract
According to our survey of machine learning for vulnerability detection (ML4VD), 9 in every 10 papers published in the past five years define ML4VD as a function-level binary classification problem: Given a function, does it contain a security flaw? From our experience as security researchers, faced with deciding whether a given function makes the program vulnerable to attacks, we would often first want to understand the context in which this function is called. In this paper, we study how often this decision can really be made without further context and study both vulnerable and non-vulnerable functions in the most popular ML4VD datasets. We call a function "vulnerable" if it was involved in a patch of an actual security flaw and confirmed to cause the program's vulnerability. It is "non-vulnerable" otherwise. We find that in almost all cases this decision cannot be made without further context. Vulnerable functions are often vulnerable only because a corresponding vulnerability-inducing calling context exists while non-vulnerable functions would often be vulnerable if a corresponding context existed. But why do ML4VD techniques achieve high scores even though there is demonstrably not enough information in these samples? Spurious correlations: We find that high scores can be achieved even when only word counts are available. This shows that these datasets can be exploited to achieve high scores without actually detecting any security vulnerabilities. We conclude that the prevailing problem statement of ML4VD is ill-defined and call into question the internal validity of this growing body of work. Constructively, we call for more effective benchmarking methodologies to evaluate the true capabilities of ML4VD, propose alternative problem statements, and examine broader implications for the evaluation of machine learning and programming analysis research.
CCS Concepts: • Security and privacy → Software and application security; • Software and its engineering → Software testing and debugging; • Computing methodologies → Machine learning.
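
The abstract's point about word counts can be made concrete with a small sketch. This is not the authors' experimental setup: the classifier (a hand-written multinomial Naive Bayes), the four training functions, and all names such as `word_counts`, `train`, and `predict` are hypothetical and chosen only for illustration. The sketch shows how a model that sees nothing but token counts can latch onto a token that happens to co-occur with the "vulnerable" label in the training data.

```python
import math
import re
from collections import Counter


def word_counts(code):
    """Reduce a function to a bag of identifier/keyword counts (no order, no structure)."""
    return Counter(re.findall(r"[A-Za-z_]\w*", code))


def train(samples):
    """Collect per-label token counts from (code, label) pairs; label 1 = 'vulnerable'."""
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter()
    for code, label in samples:
        counts[label].update(word_counts(code))
        n_docs[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, n_docs, vocab


def predict(model, code):
    """Multinomial Naive Bayes with add-one smoothing; tokens outside the vocabulary are ignored."""
    counts, n_docs, vocab = model
    total_docs = sum(n_docs.values())
    scores = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        score = math.log(n_docs[label] / total_docs)
        for tok, n in word_counts(code).items():
            if tok in vocab:
                score += n * math.log((counts[label][tok] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data: the label happens to co-occur with the token `memcpy`.
TRAIN = [
    ("void f(char *dst, char *src, int n) { memcpy(dst, src, n); }", 1),
    ("void g(char *out, char *in, int len) { memcpy(out, in, len); }", 1),
    ("int h(char *s) { return strlen(s); }", 0),
    ("int k(int a, int b) { return a + b; }", 0),
]
model = train(TRAIN)

# A function with a bounds check; whether it is safe depends on MAX and on its callers.
guarded = "void copy(char *dst, char *src, int n) { if (n <= MAX) memcpy(dst, src, n); }"
print(predict(model, guarded))
print(predict(model, "int add(int a, int b) { return a + b; }"))
```

On this toy data the classifier flags the guarded copy as "vulnerable" purely from token frequencies, without any notion of the bounds check or the calling context. It illustrates the kind of shortcut the abstract describes, not the scores reported in the paper.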