ICLR 2026

Larger Datasets Can Be Repeated More: A Theoretical Analysis of Multi-Epoch Scaling in Linear Regression

Tingkai Yan, Haodong Wen, Binghui Li, Kairong Luo, Wenguang Chen, Kaifeng Lyu

Abstract

While data scaling laws of large language models (LLMs) have been widely examined in the one-pass regime with massive corpora, their form under limited data and repeated epochs remains largely unexplored. This paper presents a theoretical analysis of how a common workaround, training for multiple epochs on the same dataset, reshapes the data scaling laws in linear regression. Concretely, we ask: to match the performance of training on a dataset of size $N$ for $K$ epochs, how much larger must a dataset be if the model is trained for only one pass? We quantify this using the effective reuse rate of the data, $E(K, N)$, which we define as the multiplicative factor by which the dataset must grow under one-pass training to achieve the same test loss as $K$-epoch training. Our analysis precisely characterizes the scaling behavior of $E(K, N)$ for SGD in linear regression under either strong convexity or Zipf-distributed data: (1) When $K$ is small, we prove that $E(K, N) \approx K$, indicating that every new epoch yields a linear gain; (2) As $K$ increases, $E(K, N)$ plateaus at a problem-dependent value that grows with $N$ ($\Theta(\log N)$ for the strongly-convex case), implying that larger datasets can be repeated more times before the marginal benefit vanishes. These theoretical findings point out a neglected factor in a recent empirical study by Muennighoff et al. (2023), which claimed that training LLMs for up to $4$ epochs results in negligible loss differences compared to using fresh data at each step, i.e., $E(K, N) \approx K$ for $K \le 4$ in our notation. Supported by further empirical validation with LLMs, our results reveal that the maximum $K$ value for which $E(K, N) \approx K$ in fact depends on the data size and distribution, and underscore the need to explicitly model both factors in future studies of scaling laws with data reuse.
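
To make the definition of $E(K, N)$ concrete, the following is a minimal sketch of how one might read it off a one-pass loss curve. It assumes the reuse rate can be computed as the smallest one-pass dataset size $N'$ whose test loss is at most the $K$-epoch loss on $N$ samples, divided by $N$; this is one reading of the abstract's wording, not necessarily the paper's formal definition. The function name `effective_reuse_rate`, the parameter `max_factor`, and the power-law curve `toy_one_pass_loss` are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from typing import Callable


def effective_reuse_rate(
    target_loss: float,
    one_pass_loss: Callable[[int], float],
    n: int,
    max_factor: int = 1 << 20,
) -> float:
    """Return N' / n, where N' is the smallest one-pass dataset size
    whose loss is at most target_loss.

    Assumption: one_pass_loss is non-increasing in the dataset size.
    """
    lo, hi = 1, n
    # Double the upper bound until one-pass training reaches the target loss.
    while one_pass_loss(hi) > target_loss:
        lo = hi + 1
        hi *= 2
        if hi > n * max_factor:
            raise ValueError("target loss not reached within max_factor * n")
    # Binary search for the smallest dataset size that reaches the target.
    while lo < hi:
        mid = (lo + hi) // 2
        if one_pass_loss(mid) <= target_loss:
            hi = mid
        else:
            lo = mid + 1
    return lo / n


def toy_one_pass_loss(m: int) -> float:
    """Hypothetical one-pass loss curve (a power law), for illustration only."""
    return m ** -0.5


n = 1000
# Pretend K-epoch training on n samples reached the one-pass loss of 4n samples.
k_epoch_loss = toy_one_pass_loss(4 * n)
print(effective_reuse_rate(k_epoch_loss, toy_one_pass_loss, n))  # 4.0
```

In this toy setting the multi-epoch run is exactly as good as one pass over four times the data, so the sketch returns a reuse rate of 4. In practice `target_loss` would come from a measured $K$-epoch run and `one_pass_loss` from a fitted or measured one-pass scaling curve.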