ICML 2025

Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training

Mozhi Zhang, Howe Tissue, Lu Wang, Xipeng Qiu

Abstract

The mixture ratio of data from different source domains significantly affects the performance of language model (LM) pretraining. In this paper, we introduce DOMAIN2VEC, a novel approach that decomposes any dataset into a linear combination of several "Meta-Domains", a new concept designed to capture the key underlying features of datasets. DOMAIN2VEC maintains a vocabulary of Meta-Domains and uses a Meta-Domain Classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of the optimal data mixture ratio for LM pretraining in a training-free manner under the Distribution Alignment Assumption (DA²), which suggests that the more closely the data distributions of the training set and the validation set are aligned, the lower the validation loss. Moreover, previous methods can use DOMAIN2VEC to model the relationship between domain vectors and LM performance, which greatly enhances their scalability because no retraining is needed as new datasets are introduced. Extensive experiments demonstrate that DOMAIN2VEC finds data mixture ratios that enhance downstream task performance with minimal computational overhead. Specifically, DOMAIN2VEC achieves the same validation loss on Pile-CC using only 51.5% of the compute required when training on the original mixture of The Pile dataset. Under an equivalent compute budget, DOMAIN2VEC improves downstream performance by an average of 2.72%. DOMAIN2VEC serves as a strong and efficient baseline for data mixture optimization in LM pretraining, offering insights into improving data efficiency in large-scale models.
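To make the training-free idea concrete, the following is a minimal sketch of how domain vectors could be used to pick a mixture under the Distribution Alignment Assumption. It is an illustration under stated assumptions, not the authors' implementation: the function names (`domain_vector`, `project_to_simplex`, `align_mixture`) are hypothetical, the domain vector is assumed to be the average of per-document Meta-Domain Classifier probabilities, squared Euclidean distance is assumed as the alignment measure, and projected gradient descent is one possible solver among many.

```python
import numpy as np


def domain_vector(doc_probs):
    """Average per-document Meta-Domain probabilities into one dataset-level vector.

    Assumption: each row of doc_probs is a classifier output that sums to 1.
    """
    doc_probs = np.asarray(doc_probs, dtype=float)
    return doc_probs.mean(axis=0)


def project_to_simplex(x):
    """Euclidean projection of x onto the probability simplex."""
    u = np.sort(x)[::-1]                      # sort in descending order
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(x) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(x - theta, 0.0)


def align_mixture(train_vectors, valid_vector, steps=2000):
    """Find mixture ratios r so that sum_i r_i * v_i is close to the validation vector.

    Assumption: closeness is measured by squared Euclidean distance.
    """
    V = np.asarray(train_vectors, dtype=float)   # shape (k datasets, n Meta-Domains)
    target = np.asarray(valid_vector, dtype=float)
    r = np.full(V.shape[0], 1.0 / V.shape[0])    # start from the uniform mixture
    # Step size 1/L, where L = 2 * sigma_max(V)^2 bounds the gradient's Lipschitz constant.
    lr = 1.0 / (2.0 * np.linalg.norm(V, 2) ** 2)
    for _ in range(steps):
        grad = 2.0 * V @ (r @ V - target)        # gradient of ||r V - target||^2
        r = project_to_simplex(r - lr * grad)    # keep r a valid mixture
    return r


# Toy usage: three training datasets, three Meta-Domains (made-up numbers).
train_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
valid_vector = [0.2, 0.3, 0.5]
ratios = align_mixture(train_vectors, valid_vector)
print(ratios)
```

No language model is trained at any point in this sketch: once the domain vectors exist, the mixture is obtained from a small constrained optimization, which is the sense in which the search is training-free.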