FSE 2025

Towards Understanding Fine-Grained Programming Mistakes and Fixing Patterns in Data Science

Wei-Hao Chen, Jia Lin Cheoh, Manthan Keim, Sabine Brunswicker, Tianyi Zhang

Abstract

Programming is an essential activity in data science (DS). Unlike regular software developers, DS programmers often use Jupyter notebooks instead of conventional IDEs. Moreover, DS programmers focus on statistics, data analytics, and modeling rather than on writing production-ready code that follows software engineering best practices. Thus, to provide effective tool support that improves their productivity, it is important to understand what kinds of errors they make and how they fix them. Previous studies have analyzed DS code from public code-sharing platforms such as GitHub and Kaggle. However, they only accounted for code changes committed to the version history, omitting the many programming mistakes that are resolved before code is committed. To bridge this gap, we present an in-depth analysis of the fine-grained logs of a DS competition, covering 390 Jupyter notebooks written by participants over six weeks. In addition, we conducted semi-structured interviews with 10 DS programmers from different domains to understand the reasons behind their programming mistakes. We identified several unique programming mistakes and fix patterns that had not been reported before, highlighting opportunities for designing new tool support for DS programming.