ACL 2024
CodeM: Less Data Yields More Versatility via Ability Matrix
Daoguang Zan, Ailun Yu, Wei Liu, Bo Shen, Shaoxin Lin, Yongshun Gong, Yafen Yao, Yan Liu, Bei Guan, Weihua Luo, Yongji Wang, Qianxiang Wang, Lizhen Cui
Abstract
In the era of code large language models (code LLMs), data engineering plays a pivotal role during the instruction fine-tuning phase. To train a versatile model, previous work has devoted tremendous effort to crafting instruction data that covers all downstream scenarios. However, this incurs significant expense in both data construction and model training. This paper therefore introduces CODEM, a novel data construction strategy that efficiently trains a versatile model with less data via our newly proposed ability matrix. CODEM uses the ability matrix to decouple code LLMs' abilities into two dimensions and constructs a lightweight training corpus that covers only a subset of the target scenarios. Extensive experiments on HumanEvalPack and MultiPL-E reveal that code LLMs can combine single-dimensional abilities to master composed abilities, validating the effectiveness of CODEM.
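To make the idea of covering only a subset of scenarios concrete, the following is a minimal illustrative sketch and not the authors' implementation. The abstract does not name the two dimensions or the selection rule, so the axes used here (programming language and task type), the example values, and the helper `sparse_cover` are all assumptions chosen for illustration. The sketch selects one full row and one full column of a hypothetical ability matrix, so that every single-dimensional ability appears in training while most composed cells are held out.

```python
from itertools import product

# Hypothetical axes: the abstract does not specify the two dimensions.
LANGUAGES = ["python", "java", "cpp"]
TASKS = ["generate", "fix", "explain"]


def full_matrix(rows, cols):
    """Return every (row, col) cell, i.e. every composed scenario."""
    return set(product(rows, cols))


def sparse_cover(rows, cols, anchor_row, anchor_col):
    """Select one full row and one full column of the matrix.

    This selection rule is an assumption for illustration only; it
    guarantees that each row value and each column value is seen at
    least once without enumerating every cell.
    """
    cells = {(anchor_row, c) for c in cols}
    cells |= {(r, anchor_col) for r in rows}
    return cells


all_cells = full_matrix(LANGUAGES, TASKS)
train_cells = sparse_cover(LANGUAGES, TASKS, "python", "generate")
held_out = all_cells - train_cells  # composed scenarios never trained on

print(f"train on {len(train_cells)} of {len(all_cells)} cells")
print("held out:", sorted(held_out))
```

Under these assumptions the training subset grows with the sum of the two axis sizes rather than their product, which is one way a lightweight corpus could still expose each single-dimensional ability.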