ACL 2025
Overlapping Context with Variable-Length Stride Increases Diversity when Training Large Language Model for Code
Geonmo Gu, Jaeho Kwak, Haksoo Moon, Hyun Seung Shim, Yu Jin Kim, Byoungjip Kim, Moontae Lee, Hyejeong Jeon
Abstract
The pretraining of code LLMs typically begins with general data and progresses to domain-specific data through sequential stages. In the later stages, a challenging issue is that the data of a target domain can be limited in size, and the conventional approach of increasing the number of epochs does not lead to a performance gain. In this paper, we propose a novel packing method that extracts overlapping contexts from the training data using a variable-length stride. Our method mitigates the data-scarcity issue by providing more diverse and abundant next-token prediction examples than non-overlapping contexts. Although the training time of our approach increases in proportion to the number of augmented examples, we present space-efficient implementations for storing overlapping contexts. Extensive experiments with real datasets show that our approach outperforms the conventional approach of controlling the number of epochs in terms of the pass@k rate.
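The packing idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the uniform stride distribution, and the choice to store only window start offsets (one possible reading of "space-efficient") are all assumptions made for illustration.

```python
import random
from typing import List, Sequence


def window_starts(n_tokens: int, context_len: int,
                  min_stride: int, max_stride: int,
                  seed: int = 0) -> List[int]:
    """Return start offsets of fixed-length context windows over a token stream.

    Consecutive starts are separated by a stride drawn uniformly from
    [min_stride, max_stride]. The uniform draw is an assumption; the abstract
    does not say how the stride is chosen.
    """
    if not 1 <= min_stride <= max_stride:
        raise ValueError("need 1 <= min_stride <= max_stride")
    rng = random.Random(seed)
    starts: List[int] = []
    pos = 0
    # Keep only windows that fit entirely inside the token stream.
    while pos + context_len <= n_tokens:
        starts.append(pos)
        pos += rng.randint(min_stride, max_stride)
    return starts


def get_context(tokens: Sequence[int], start: int, context_len: int) -> Sequence[int]:
    """Materialize one window on demand instead of storing a copy of it."""
    return tokens[start:start + context_len]


# Toy token stream standing in for a tokenized code corpus (hypothetical data).
tokens = list(range(100))
context_len = 16

# Conventional packing: stride equals the context length, so windows do not overlap.
non_overlapping = window_starts(len(tokens), context_len, context_len, context_len)

# Overlapping packing: strides shorter than the context length yield more windows.
overlapping = window_starts(len(tokens), context_len, 4, 12, seed=0)
```

In this sketch, a stride shorter than the context length makes neighboring windows share tokens, so the same token is predicted from several different left contexts. Keeping only the start offsets means the token stream is stored once, at the cost of slicing each window when it is read.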