ACL 2025
Bridge-Coder: Transferring Model Capabilities from High-Resource to Low-Resource Programming Language
Jipeng Zhang, Jianshu Zhang, Yuanzhe Li, Renjie Pi, Rui Pan, Runtao Liu, Ziqiang Zheng, Tong Zhang
Cited by 4
Abstract
Most LLMs excel at generating code for high-resource programming languages (HRPLs) like Python, a capability that has become standard due to the abundance of training data. However, they struggle significantly with low-resource programming languages (LRPLs) such as D, exacerbating the digital divide. This gap prevents developers who use LRPLs from benefiting equally and hinders innovation within underrepresented programming communities. To make matters worse, manually generating data for LRPLs is highly labor-intensive and requires expensive expert effort. In this work, we begin by analyzing the NL-PL Gap, where LRPL data directly generated by LLMs often suffers from subpar quality due to the misalignment between natural language (NL) instructions and programming language (PL) outputs. To address this issue, we introduce Bridge-Assist Generation, a method for generating LRPL data that draws on LLMs' general knowledge, HRPL proficiency, and in-context learning capabilities. To further maximize the utility of the generated data, we propose Bridged Alignment to obtain Bridge-Coder. To thoroughly evaluate our approach, we select four relatively low-resource programming languages: R, D, Racket, and Bash. Experimental results reveal that Bridge-Coder achieves significant improvements over the original model, with average gains of 18.71 and 10.81 on two comprehensive benchmarks, M-HumanEval and M-MBPP.