ASE2021

DSInfoSearch: Supporting Experimentation Process of Data Scientists

Shangeetha Sivasothy

Abstract

Experimentation plays an important role in the work of data scientists: it lets them explore unfamiliar problem domains, answer questions from data, and develop diverse machine learning applications. Good experimentation requires creativity, builds on prior results, and is informed by the literature. However, finding relevant information in online sources to guide experimentation is inefficient for data scientists. The objective of this research is to help data scientists by presenting context-aware ranked data science experiments that take into account the problem domain, development task, and learning task. Data science experiments for this study were extracted from publicly available interactive notebooks and manually annotated based on a taxonomy of data science techniques and a meta-model of a data science experiment. A ranking algorithm was then developed to rank data science experiments for a given problem domain and development task. As a result, a tool was developed to demonstrate context-aware ranked data science experiments for given problem domains, such as natural language processing, computer vision, and time series, and for development stages, such as feature engineering and model selection. This study shows that tools and techniques can be designed to be aware of the data science context, in fact much more so than for software engineering tools. This study supports these efforts by providing knowledge that can improve the experimentation process of data scientists.