ASE2025
How Does ChatGPT Make Assumptions When Creating Erroneous Programs?
Sadia Jahan, Xiaoyin Wang
Abstract
Large Language Models (LLMs) such as ChatGPT are increasingly integrated into software development environments because of their strong performance in code generation. However, they often struggle with complex logic and produce code with security vulnerabilities and quality issues. These problems frequently originate from misunderstood problem requirements and logical inconsistencies, which can lead to faulty or vulnerable software. In this study, we conduct an initial empirical analysis of the causes of erroneous code generated by the state-of-the-art LLM GPT-4o. Using the HumanEval dataset, we prompt GPT-4o to generate Python solutions and to list its three most important assumptions for each task. We validate the generated solutions against the test cases provided in the dataset and identify 17 defective programs among the 164 solutions. By analyzing the 17 failures and the 51 assumptions made on these tasks, we find that about 53% of the failures are directly related to wrong or erroneously implemented assumptions stated by the model itself, and that in total 71% of the code generation failures are related to erroneously made or implemented assumptions.
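The validation step can be illustrated with a minimal sketch. It assumes the usual HumanEval layout, in which each task's test code defines a `check(candidate)` function that raises on failure. The helper `passes_tests` and the toy task are hypothetical, written only for illustration, and are not taken from the dataset or from the study's actual harness.

```python
def passes_tests(solution_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the generated solution passes the task's check() function."""
    namespace: dict = {}
    try:
        exec(solution_src, namespace)  # defines the candidate function
        exec(test_src, namespace)      # defines check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # A failed assertion or any other error marks the program as defective.
        return False


# Hypothetical toy task in the HumanEval layout (not from the dataset).
TEST_SRC = """
def check(candidate):
    assert candidate([1, 2, 3]) == 6
    assert candidate([]) == 0
"""

# Two hypothetical generated solutions: the second wrongly assumes
# that an empty list should yield None instead of 0.
GOOD = "def total(xs):\n    return sum(xs)\n"
BAD = "def total(xs):\n    return sum(xs) if xs else None\n"

results = {
    "good": passes_tests(GOOD, TEST_SRC, "total"),
    "bad": passes_tests(BAD, TEST_SRC, "total"),
}
defective = [name for name, ok in results.items() if not ok]
print(defective)
```

Because `exec` runs generated code directly, a real evaluation would normally execute each solution in an isolated process with a timeout; that is omitted here for brevity.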