ASE2025

Coverage-Based Harmfulness Testing for LLM Code Transformation

Honghao Tan, Haibo Wang, Diany Pressato, Yisen Xu, Shin Hwei Tan

2 citations

Abstract

Harmful content embedded in program elements within source code may have detrimental impact on mental health of software developers, and promote harmful behavior. Our key insight is that software developers may introduce harmful content into source code via diverse semantic-preserving program transformations when using Code Large Language Models (Code LLMs). To analyze the space of program transformations that may be used to introduce harmful content into auto-generated code, we conduct a preliminary study that revealed 32 different types of transformations that can be used to introduce harmful content in source code. Based on our study, we propose CHT, a novel coverage-based harmfulness testing framework that automatically synthesizes prompts using a set of prompt templates injected with diverse harmful keywords to perform various types of transformations on a set of mined benign programs. Instead of checking if the content moderation has been bypassed as prior testing approaches, CHT performs output damage measurement to assess potential harm that can be incurred by the generated outputs (i.e., natural language explanation and modified code). By considering output damage, CHT revealed several problems in Code LLMs: (1) bugs in content moderation for code (Code LLMs produce the harmful code without providing any warning), (2) inadequacy in performing code-related task (e.g., Code LLMs may resort to explaining the given code instead of performing the instructed transformation task), and (3) lenient content moderation (gives warning but the modified code with harmful content is still produced). Our evaluations of CHT on four Code LLMs and gpt-4o-mini (general LLM) show that content moderation in Code LLMs is relatively easy to bypass where LLMs may generate harmful keywords embedded within identifier names or code comments without giving any warning (65.93% in our evaluation). To improve the robustness of content moderation in code-related tasks, we propose a two-phase approach that checks if the prompt contains any harmful content before generating any output. Our evaluation shows that our proposed approach improves the content moderation of Code LLM by 483.76%.