ASE2024

What Makes a High-Quality Training Dataset for Large Language Models: A Practitioners' Perspective

Xiao Yu, Zexian Zhang, Feifei Niu, Xing Hu, Xin Xia, John Grundy

15 citations

Abstract

Large Language Models (LLMs) have demonstrated remarkable performance in various application domains, largely due to their self-supervised pre-training on extensive high-quality text datasets. However, despite the importance of constructing such datasets, many leading LLMs lack documentation of their dataset construction and training procedures, leaving LLM practitioners with a limited understanding of what makes a high-quality training dataset for LLMs. To fill this gap, we initially identified 18 characteristics of high-quality LLM training datasets, as well as 10 potential data pre-processing methods and 6 data quality assessment methods, through detailed interviews with 13 experienced LLM professionals. We then surveyed 219 LLM practitioners from 23 countries across 5 continents. We asked our survey respondents to rate the importance of these characteristics, provide a rationale for their ratings, specify the key data pre-processing and data quality assessment methods they used, and highlight the challenges encountered during these processes. From our analysis, we identified 13 crucial characteristics of high-quality LLM datasets that receive a high rating, accompanied by key rationale provided by respondents. We also identified some widely-used data pre-processing and data quality assessment methods, along with 7 challenges encountered during these processes. Based on our findings, we discuss the implications for researchers and practitioners aiming to construct high-quality training datasets for optimizing LLMs.