ACL 2024

OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, Xiang Yue

Abstract

The introduction of large language models has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems like the GPT-4 Code Interpreter. To address this, we introduce OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. Supported by Code-Feedback, a dataset featuring 68K multi-turn interactions, OpenCodeInterpreter integrates execution and human feedback for dynamic code refinement. Our comprehensive evaluation of OpenCodeInterpreter across key benchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus reveals its exceptional performance. Notably, OpenCodeInterpreter-33B achieves an accuracy of 83.2 (76.4) on the average (and plus versions) of HumanEval and MBPP, closely rivaling GPT-4's 84.2 (76.2), and further rises to 91.6 (84.6) with synthesized human feedback from GPT-4. OpenCodeInterpreter narrows the gap between open-source code generation models and proprietary systems like GPT-4 Code Interpreter.

... to 91.6 (84.6). OpenCodeInterpreter thereby establishes a new benchmark in code generation, effectively narrowing the performance gap between open-source models and sophisticated proprietary systems like the GPT-4 Code Interpreter.

2 Code-Feedback

In this section, we detail the creation of our code instruction tuning dataset, Code-Feedback (Figure 2), designed to train OpenCodeInterpreter. Code-Feedback is crafted to meet specific criteria:
1) Diverse and challenging real-world queries: The dataset should encompass a wide range of queries derived from real-world coding tasks, presenting both diversity and complexity.
2) Multi-turn dialogue structure: Code-Feedback is structured as multi-turn dialogues, incorporating two types of feedback: execution feedback, which includes outputs and diagnostics from compilers, and human feedback, consisting of additional guidance or instructions from users.
3) Interleaved text and code responses: Each response is expected to blend natural language explanations with code snippets, offering a holistic approach to solving coding queries.

To assemble a dataset that fulfills these desiderata, we have employed five distinct methods. Examples of these five categories can be found in Appendix E. The sources of our queries fall into two main categories: a variety of open-source datasets and coding challenges from LeetCode. In the next subsections, we will discuss how we develop data construction methods to meet the three aforementioned criteria from the two data sources.

2.1 Coding Queries from Open-source Data

We have aggregated 287k queries from four distinguished open-source code instruction tuning datasets: Magicoder-OSS-Instruct, the Python code subset of ShareGPT, Magicoder-Evol-Instruct, and Evol-Instruct-Code. To refine this extensive collection and isolate the most intricate and informative instructions, we employ a very capable open-source chat model, Qwen-72B-Chat (Bai et al., 2023), for a selective filtering process. This