ICLR2023

Multi-lingual Evaluation of Code Generation Models

Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton, Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna Ramanathan, Ramesh Nallapati

28 citations

DOI arXiv Publisher

Abstract

We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X. These datasets encompass over 10 programming languages and are generated using a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into the corresponding data in the target language. With these benchmarks, we can assess the performance of code generation models in a multilingual context, uncovering the generalization ability of language models on out-of-domain languages, the advantages of multilingual models over monolingual ones, the potential of few-shot prompting to teach models new languages, and zero-shot translation capabilities, even in monolingual settings. Additionally, we utilize our code generation model for large-scale bootstrapping to obtain synthetic canonical solutions in various languages, which can be employed for other code-related evaluations, such as code insertion, robustness, or summarization tasks. Overall, our benchmarks represent a significant step towards a deeper understanding of language models' code generation abilities. We publicly release our code and datasets at https://github.com/amazon-research/mxeval .