ICML 2024

QuRating: Selecting High-Quality Data for Training Language Models

Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen

138 citations

Abstract

Selecting high-quality pre-training data is important for creating capable language models, but existing methods rely on simple heuristics. We introduce QuRating, a method for selecting pre-training data that can capture human intuitions about data quality. In this paper, we investigate four qualities (writing style, required expertise, facts & trivia, and educational value) and find that LLMs are able to discern these qualities, especially when making pairwise judgments of texts. We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B-token training corpus with quality ratings for each of the four criteria. In our experiments, we select 30B tokens according to the different quality ratings and train 1.3B-parameter language models on the selected data. We find that it is important to balance quality and diversity. When we sample using quality ratings as logits over documents, our models obtain lower perplexity and stronger in-context learning performance than baselines. Our best model is based on educational value and performs similarly to a model trained with uniform sampling for 50% more steps. Beyond data selection, we use the quality ratings to construct a training curriculum, which improves performance without changing the training dataset. We extensively analyze the quality ratings and discuss their characteristics, biases, and wider implications.

Figure 4. Distribution of quality ratings, normalized for each criterion to have zero mean and unit standard deviation across the corpus. (Panels: Writing Style, Facts & Trivia, Educational Value, Required Expertise. Domains shown: Wiki-en, -de, -ru, Book, StackEx., Github, ArXiv.)

[...] protein, gene and energy, climate, species are rated highly on required expertise, educational value, and facts & trivia. Meanwhile, the book, author cluster tends to obtain high ratings in writing style. However, almost all clusters encompass a wide range of quality ratings.
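To make the link between normalized ratings (as plotted in Figure 4) and the abstract's "quality ratings as logits over documents" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `normalize` and `sample_by_quality`, the `temperature` and `seed` parameters, and the use of the Gumbel top-k trick for sampling without replacement are all choices made for this example.

```python
import math
import random
from statistics import mean, pstdev


def normalize(ratings):
    """Shift and scale ratings to zero mean and unit (population) standard deviation."""
    mu = mean(ratings)
    sigma = pstdev(ratings)
    if sigma == 0:
        return [0.0 for _ in ratings]
    return [(r - mu) / sigma for r in ratings]


def sample_by_quality(ratings, k, temperature=1.0, seed=0):
    """Draw k distinct document indices with p(i) proportional to exp(rating_i / temperature).

    Assumption for this sketch: sampling is done without replacement via the
    Gumbel top-k trick (add independent Gumbel noise to each logit, keep the
    k largest perturbed values).
    """
    if not 0 <= k <= len(ratings):
        raise ValueError("k must be between 0 and the number of documents")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = random.Random(seed)
    perturbed = []
    for i, r in enumerate(ratings):
        u = max(rng.random(), 1e-12)  # avoid log(0)
        gumbel = -math.log(-math.log(u))
        perturbed.append((r / temperature + gumbel, i))
    perturbed.sort(reverse=True)
    return [i for _, i in perturbed[:k]]


if __name__ == "__main__":
    # Hypothetical ratings for six documents under one criterion.
    raw = [0.3, 2.1, -1.4, 0.9, 1.7, -0.6]
    logits = normalize(raw)
    print(sample_by_quality(logits, k=3, temperature=2.0))
```

A small temperature approaches top-k selection by rating, while a large temperature approaches uniform sampling, which is one way to picture the trade-off between quality and diversity mentioned in the abstract.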
Comparison to perplexity filtering. We compare sequence-level log-likelihood scores from Llama-2-7b (Touvron et al., 2023b) with the quality ratings across 1M training sequences and visualize the relationship in Figure 7 in the appendix. We observe that documents with low quality ratings have a wide range of likelihoods, and the Spearman correlation coefficient ranges from 0.50 for writing style to -0.02 for required expertise. Therefore, QuRating is meaningfully different from selecting texts based on perplexity scores from a strong LLM (Marion et al., 2023).

Data Inspection

We study raw documents from each of the domains and clusters discussed in Section 6.1. We select training examples at the 5th, 30th, 70th, and 95th percentile for each criterion, and feature random extracts in Appendix F without any cherry-picking. While this is a minute sliver of the training data, the documents still exhibit clear qualitative differences, and we invite the reader to inspect them in the appendix.
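The two analyses in this section, a Spearman correlation between log-likelihoods and quality ratings, and picking examples at fixed rating percentiles, can be sketched in a few lines of Python. This is a hedged illustration only: the helper names (`ranks`, `spearman`, `examples_at_percentiles`) are hypothetical, the nearest-rank percentile convention is an assumption, and the sketch returns a single document per percentile rather than random extracts.

```python
import math
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            result[order[j]] = avg
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


def examples_at_percentiles(ratings, percentiles=(5, 30, 70, 95)):
    """Map each integer percentile to the index of the document at that rank.

    Uses the nearest-rank convention: rank = ceil(p / 100 * n), clamped to [1, n].
    """
    order = sorted(range(len(ratings)), key=lambda i: ratings[i])
    n = len(order)
    picks = {}
    for p in percentiles:
        rank = min(n, max(1, -(-p * n // 100)))  # integer ceiling division
        picks[p] = order[rank - 1]
    return picks


if __name__ == "__main__":
    # Hypothetical per-document scores.
    log_likelihoods = [-2.1, -1.8, -3.0, -2.5, -1.2]
    quality_ratings = [0.4, 0.9, -1.1, 0.1, 0.3]
    print(spearman(log_likelihoods, quality_ratings))
    print(examples_at_percentiles(quality_ratings))
```

A coefficient near zero, as reported for required expertise, would indicate that ranking documents by likelihood says little about how they rank on that criterion.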