EMNLP2025

UnitCoder: Scalable Code Synthesis from Pre-training Corpora

Yichuan Ma, Yunfan Shao, Peiji Li, Demin Song, Qipeng Guo, Linyang Li, Xipeng Qiu, Kai Chen

被引用 2 次

摘要

Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. Despite the abundant sources of code data, constructing high-quality training datasets at scale poses a significant challenge. Pre-training code data typically suffers from inconsistent data quality issues. Conversely, instruction-based methods which use a high-quality subset as seed samples suffer from limited task diversity. In this paper, we introduce UnitCoder, which directly supervises pre-training data quality through automatically generated unit tests, while ensuring the correctness via an iterative fix and refine flow. Code synthesized by Unit-Coder benefits from both the diversity of pretraining corpora and the high quality ensured by unit test supervision. Our experiments demonstrate that models fine-tuned on our synthetic dataset exhibit consistent performance improvements. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora, demonstrating the potential for producing diverse and high-quality post-training data at scale. All code and data will be released 1 .