CVPR 2024

Scaling Laws for Data Filtering - Data Curation Cannot be Compute Agnostic

Sachin Goyal, Pratyush Maini, Zachary C. Lipton, Aditi Raghunathan, J. Zico Kolter

Abstract

Figure 1. (a) The Dynamic Problem of Data Filtering: Web data is non-homogeneous, and past work has proposed metrics that rank data subsets by quality, from highest to lowest (y-axis). However, training on 'high-quality' data for multiple epochs yields diminishing utility (x-axis), an aspect ignored in past work. Suppose we have a compute budget equivalent to 6 passes over a single data pool: one could train on the best pool (E) for 6 epochs, or on the best two pools (E and D) for 3 epochs each, and so on. Our work aims to answer: what is the best allocation of computational resources in such scenarios? (b) Data Filtering Scaling Laws: Our work proposes scaling laws that predict model performance on mixtures of data pools of varying quality. Note that we do not train on data mixtures to fit the scaling curves shown (scatter points are test points); rather, the curves are estimated from the scaling parameters of the individual pools.
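The allocation question in panel (a) can be made concrete with a short enumeration. The sketch below is illustrative only and is not the authors' method. It assumes equal-sized pools, whole-number epochs, and a budget counted in passes over a single pool. The pool names A, B and C and the helper `enumerate_allocations` are hypothetical; the caption names only E and D.

```python
def enumerate_allocations(pools, budget):
    """List ways to spend `budget` pool-passes on the top-k pools.

    Assumptions (not from the paper): pools are equal-sized, ordered
    best-first, and each selected pool is trained for the same whole
    number of epochs.
    """
    allocations = []
    for k in range(1, len(pools) + 1):
        # Keep only splits where the budget divides evenly across k pools.
        if budget % k == 0:
            allocations.append((pools[:k], budget // k))
    return allocations


# Hypothetical quality ordering, best pool first.
pools = ["E", "D", "C", "B", "A"]
allocations = enumerate_allocations(pools, budget=6)

for selected, epochs in allocations:
    print(f"pools {'+'.join(selected)}: {epochs} epochs each")
```

Each candidate trades data quality (fewer, better pools) against repetition (more epochs per pool). Choosing among these candidates is the question the scaling laws in panel (b) are meant to answer.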