ACL 2025

Tree-of-Evolution: Tree-Structured Instruction Evolution for Code Generation in Large Language Models

Ziyang Luo, Kaixin Li, Hongzhan Lin, Yuchen Tian, Mohan S. Kankanhalli, Jing Ma

Abstract

Data synthesis has become a crucial research area in large language models (LLMs), especially for generating high-quality instruction fine-tuning data to enhance downstream performance. In code generation, a key application of LLMs, manual annotation of code instruction data is costly. Recent methods, such as Code Evol-Instruct and OSS-Instruct, leverage LLMs to synthesize large-scale code instruction data, significantly improving LLM coding capabilities. However, these approaches are limited by unidirectional synthesis and randomness-driven generation, which restrict data quality and diversity. To overcome these challenges, we introduce Tree-of-Evolution (ToE), a novel framework that models the code instruction synthesis process as a tree structure, exploring multiple evolutionary paths to alleviate the constraints of unidirectional generation. Additionally, we propose optimization-driven evolution, which refines each generation step based on the quality of the previous iteration. Experimental results across five widely used coding benchmarks (HumanEval, MBPP, EvalPlus, LiveCodeBench, and BigCodeBench) demonstrate that base models fine-tuned on just 75k samples synthesized by our method achieve comparable or superior performance to the state-of-the-art open-weight Code LLM, Qwen2.5-Coder-Instruct, which was fine-tuned on millions of samples.
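To make the two ideas in the abstract concrete, the following is a minimal sketch of tree-structured evolution in which each expansion step is conditioned on the quality score of the previous step. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the prompts, the quality metric, or how samples are selected, so `Node`, `grow_tree`, `collect`, `toy_evolve`, and `toy_score` are all hypothetical names, and the toy functions stand in for what would presumably be LLM calls and a learned or heuristic quality measure.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One instruction in the evolution tree (hypothetical structure)."""
    instruction: str
    score: float
    depth: int
    children: List["Node"] = field(default_factory=list)


def grow_tree(
    seed: str,
    evolve: Callable[[str, float, int], str],
    score: Callable[[str], float],
    branching: int = 2,
    max_depth: int = 2,
) -> Node:
    """Expand a seed into a tree, so several evolutionary paths are explored."""
    root = Node(seed, score(seed), 0)
    frontier = [root]
    while frontier:
        node = frontier.pop()
        if node.depth >= max_depth:
            continue  # leaves are not expanded further
        for branch in range(branching):
            # The evolve step sees the parent's score, so each step can be
            # steered by the quality of the previous iteration.
            text = evolve(node.instruction, node.score, branch)
            child = Node(text, score(text), node.depth + 1)
            node.children.append(child)
            frontier.append(child)
    return root


def collect(root: Node) -> List[Node]:
    """Return every node in the tree (depth-first order)."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        out.append(node)
        stack.extend(node.children)
    return out


# Toy stand-ins (assumptions): a real pipeline would call an LLM here.
def toy_score(text: str) -> float:
    # Placeholder quality signal: number of distinct words.
    return float(len(set(text.split())))


def toy_evolve(parent: str, parent_score: float, branch: int) -> str:
    # Placeholder evolution: append a branch-specific constraint and
    # record the parent score that the step was conditioned on.
    return f"{parent} + constraint{branch} (parent score {parent_score:.0f})"


if __name__ == "__main__":
    tree = grow_tree("Sort a list of integers", toy_evolve, toy_score)
    best = max(collect(tree), key=lambda n: n.score)
    print(len(collect(tree)), "nodes; best:", best.instruction)
```

With `branching=2` and `max_depth=2` the tree holds 1 + 2 + 4 = 7 instructions, compared with the single chain of 3 that a unidirectional evolution of the same depth would produce. How the final training set is chosen from the tree is left open here, since the abstract does not describe it.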