ICSE2025

Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models

Kunpeng Zhang, Shuai Wang, Jitao Han, Xiaogang Zhu, Xian Li, Shaohua Wang, Sheng Wen

被引用 6 次

摘要

Deep learning (DL) libraries are widely used to form the basis of various AI applications in computer vision, natural language processing, and software engineering domains. Despite their popularity, DL libraries are known to have vulnerabilities, such as buffer overflows, use-after-free, and integer overflows, that can be exploited to compromise the security or effectiveness of the underlying libraries. While traditional fuzzing techniques have been used to find bugs in software, they are not well-suited for DL libraries. In general, the complexity of DL libraries and the diversity of their APIs make it challenging to test them thoroughly. To date, mainstream DL libraries like TensorFlow and PyTorch have featured over 1,000 APIs, and the number of APIs is still growing. Fuzzing all these APIs is a daunting task, especially when considering the complexity of the input data and the diversity of the API usage patterns. Recent advances in large language models (LLMs) have illustrated the high potential of LLMs in understanding and synthesizing human-like code. Despite their high potential, we find that emerging LLM-based fuzzers are less optimal for DL library API fuzzing, given their lack of in-depth knowledge on API input edge cases and inefficiency in generating test inputs. In this paper, we propose DFuzz, a LLM-driven DL library fuzzing approach. We have two key insights: (1) With high reasoning ability, LLMs can replace human experts to reason edge cases (likely error-triggering inputs) from checks in an API's code, and transfer the extracted knowledge to test other (new or rarely-tested) APIs. (2) With high generation ability, LLMs can synthesize initial test programs with high accuracy that automates API testing. DFuzz provides LLMs with a novel “white-box view” of DL library APIs, and therefore, can leverage LLMs' reasoning and generation abilities to achieve comprehensive fuzzing. Our experimental results on popular DL libraries demonstrate that DFuzz is able to cover more APIs than SOTA (LLM-based) fuzzers on TensorFlow and PyTorch, respectively. Moreover, DFuzz successfully detected 37 bugs, with 8 already fixed and 19 replicated by the developer but still under investigation.