ISSTA2025

Enhanced Prompting Framework for Code Summarization with Large Language Models

Minying Fang, Xing Yuan, Yuying Li, Haojie Li, Chunrong Fang, Junwei Du

3 citations

Abstract

Code summarization is essential for enhancing the efficiency of software development, enabling developers to swiftly comprehend and maintain software projects. Recent efforts utilizing large language models for generating precise code summaries have shown promising performance, primarily due to their advanced generative capabilities. LLMs that employ continuous prompting techniques can explore a broader problem space, potentially unlocking greater capabilities. However, they also present specific challenges, particularly in aligning with task-specific situations—a strength of discrete prompts. Additionally, the inherent differences between programming languages and natural languages can complicate comprehension for LLMs, impacting the accuracy and relevance of the summaries in complex programming scenarios. These challenges may result in outputs that do not align with actual task needs, underscoring the necessity for further research to enhance the effectiveness of LLMs in code summarization. To overcome these limitations, we combine the strengths of the two approaches described above and introduce EP4CS—an Enhanced Prompting framework for Code Summarization with Large Language Models. Firstly, we design Mapper, which undergoes pre-training on <Code, Knowledge> pairs and facilitates the optimization and updating of prompt vectors based on the outputs of LLMs. Additionally, we develop a Struct-Agent that enables LLMs to more accurately interpret the complex code by in-depth analysis of the programming language’s syntax and structure. Experimental results indicate that, compared to existing baseline methods, our enhanced prompting learning framework significantly improves performance while maintaining the same parameter scale. Specifically, when evaluated on Java using StarCoderBase 1 B , EP4CS achieved score improvements of 6.59% on BLEU, 7.06% on METEOR, and 4.43% on ROUGE-L, while also demonstrating strong robustness. And it’s closer to real-world scenarios in terms of semantic metrics SentenceBERT. The results from the human evaluation and case studies show that EP4CS surpasses the baseline methods, producing higher-quality and more relevant summaries.