ASE2025

Grammar- and Coverage-based Augmentation of Programs for Training LLMs

Shin Saito, Takaaki Tateishi, Yasuharu Katsuno

Abstract

Training large language models (LLMs) for programming tasks, particularly code translation, requires diverse and syntactically valid code datasets. While data augmentation can improve generalization, uncontrolled augmentation can lead to overfitting or invalid examples. In this paper, we introduce a grammar- and coverage-based augmentation method that systematically generates syntactically valid code while taking the coverage of grammar rules into account. This approach ensures both syntactic correctness and diversity in the code dataset while suppressing excessive augmentation. Our preliminary experiment shows that the method produces well-distributed training data, contributing to a better representation of the underlying grammar.
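As a rough illustration of the general idea of coverage-guided generation from a grammar, the following minimal Python sketch derives programs from a toy grammar and stops once every production rule has been used at least once. It is not the authors' algorithm: the grammar, the names `GRAMMAR`, `expand`, and `augment`, the least-used selection heuristic, and the depth limit are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy grammar: nonterminal -> list of productions.
# Assumption: production 0 of every nonterminal leads to termination,
# so it can be used as a fallback when the derivation gets too deep.
GRAMMAR = {
    "stmt": [["ident", "=", "expr", ";"], ["print", "(", "expr", ")", ";"]],
    "expr": [["term"], ["term", "+", "expr"], ["term", "*", "expr"]],
    "term": [["ident"], ["number"], ["(", "expr", ")"]],
    "ident": [["x"], ["y"]],
    "number": [["0"], ["1"]],
}

# Every (nonterminal, production index) pair is one grammar rule to cover.
ALL_RULES = {(nt, i) for nt, prods in GRAMMAR.items() for i in range(len(prods))}


def expand(symbol, counts, depth=0, max_depth=6):
    """Derive a token list from `symbol`, preferring the least-used rule."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal symbol: emit as-is
    prods = GRAMMAR[symbol]
    if depth >= max_depth:
        choice = 0  # force the terminating production
    else:
        # Unused rules have count 0, so they are always tried first.
        choice = min(range(len(prods)), key=lambda i: counts[(symbol, i)])
    counts[(symbol, choice)] += 1
    tokens = []
    for child in prods[choice]:
        tokens.extend(expand(child, counts, depth + 1, max_depth))
    return tokens


def augment(max_programs=10):
    """Generate programs until all rules are covered or the budget is spent."""
    counts = Counter()
    programs = []
    # Stopping at full coverage is what keeps the dataset from growing
    # without bound in this sketch.
    while set(counts) != ALL_RULES and len(programs) < max_programs:
        programs.append(" ".join(expand("stmt", counts)))
    return programs, set(counts)


if __name__ == "__main__":
    programs, covered = augment()
    for program in programs:
        print(program)
    print(f"covered {len(covered)} of {len(ALL_RULES)} rules")
```

Because every output is derived from the grammar, it is syntactically valid by construction. The rule-usage counts steer generation toward productions that are still under-represented, and the loop ends as soon as coverage is complete.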