ISSTA 2025

Causality-Aided Evaluation and Explanation of Large Language Model-Based Code Generation

Zhenlan Ji, Pingchuan Ma, Zongjie Li, Zhaoyu Wang, Shuai Wang

Cited by 1

Abstract

While code generation has been widely used in various software development scenarios, the quality of the generated code is not guaranteed. This has been a particular concern in the era of large language model (LLM)-based code generation, where LLMs, regarded as complex and powerful black-box models, are instructed by a high-level natural language specification, namely a prompt, to generate code. Effectively evaluating and explaining the code generation capability of LLMs is inherently challenging, given their complexity and lack of transparency. Inspired by recent progress in causality analysis and its software engineering applications, this paper proposes a causality-driven approach to systematically analyze the causal relationships between prompts and generated code. This endeavor faces three key technical challenges: (1) representing textual prompts and code in a canonical form, (2) establishing causal relations between high-level concepts and code features, and (3) systematically analyzing diverse prompt variations. To address these challenges, we first propose a novel causal graph-based representation of the prompt and the generated code, which is built over the fine-grained, human-understandable concepts in the input prompts. The resulting causal graph is then used to identify the causal relations between the prompt and the generated code. We illustrate the insights that our framework can provide by conducting studies over four popular LLMs with over 12 prompt adjustment strategies. The results of these studies illustrate the potential of our technique to provide insights into LLM effectiveness and to aid end-users in understanding predictions. Additionally, we demonstrate that our approach provides actionable insights for improving the quality of LLM-generated code by properly calibrating the prompt.