ACL 2025

Code-Switching Curriculum Learning for Multilingual Transfer in LLMs

Haneul Yoo, Cheonbok Park, Sangdoo Yun, Alice Oh, Hwaran Lee

Cited by 25

Abstract

Large language models (LLMs) now exhibit near human-level performance in various tasks, but their performance drops drastically beyond a handful of high-resource languages due to the imbalance in pre-training data. Inspired by the human process of second language acquisition, particularly code-switching (the practice of language alternation in a conversation), we propose code-switching curriculum learning (CSCL) to enhance cross-lingual transfer for LLMs. CSCL mimics the stages of human language learning by progressively training models with a curriculum consisting of 1) token-level code-switching, 2) sentence-level code-switching, and 3) monolingual corpora. Using Qwen 2 as our underlying model, we demonstrate the efficacy of CSCL in improving language transfer to Korean, achieving significant performance gains compared to monolingual continual pre-training methods. Ablation studies reveal that both token- and sentence-level code-switching significantly enhance cross-lingual transfer and that curriculum learning amplifies these effects. We also extend our findings to various languages, including Japanese (high-resource) and Indonesian (low-resource), and to two additional models (Gemma 2 and Phi 3.5). We further show that CSCL mitigates spurious correlations between language resources and safety alignment, presenting a robust, efficient framework for more equitable language transfer in LLMs. We observe that CSCL is effective in low-resource settings where high-quality monolingual corpora for language transfer are scarce.

* This work was done during an internship at NAVER AI Lab.

[Figure (panel label: Human): the three curriculum stages, each with Korean-English example text.]

1. Token-Level Code-Switching
   1. 자연어처리는 computer science와 artificial intelligence의 세부 분야이다.
   2. The 목표 of NLP is to enable 컴퓨터 to 이해하고 and respond to 인간 언어.
   3. NLP에서는 machine learning, 심층학습, statistical modeling 등 언어를 understand 위한 다양한 techniques을 사용한다.

2. Sentence-Level Code-Switching
   1. 자연어처리는 전산학과 인공지능의 세부 분야이다.
   2. The goal of NLP is to enable computers to understand and respond to human language.
   3. 자연어처리에서는 기계학습, 심층학습, 통계적 모델링 등 언어 이해를 위한 다양한 기법을 사용한다.
   4. Over the years, NLP algorithms and language resources have advanced.

3. Monolingual Text
   1. 자연어처리는 전산학과 인공지능의 세부 분야이다.
   2. 자연어처리는 인간 언어를 이해하고 응답하는 것을 목표한다.
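The three-stage curriculum described in the abstract can be illustrated with a small toy sketch. This is not the authors' data pipeline, which this section does not describe. The function names, the word-level `lexicon`, the `ratio` parameter, and the strict sentence-by-sentence alternation are all assumptions made for illustration only.

```python
import random


def token_level_switch(sentence, lexicon, ratio=0.5, seed=0):
    """Swap a fraction of the words that have a lexicon entry for their translation.

    Assumption: whitespace tokenization and exact (lowercased) word lookup,
    so words with attached punctuation are left unchanged.
    """
    rng = random.Random(seed)
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in lexicon]
    k = round(len(candidates) * ratio)
    chosen = set(rng.sample(candidates, k))
    return " ".join(
        lexicon[w.lower()] if i in chosen else w for i, w in enumerate(words)
    )


def sentence_level_switch(src_sentences, tgt_sentences):
    """Alternate languages sentence by sentence over an aligned document.

    Assumption: even positions take the target-language sentence and odd
    positions take the source-language sentence.
    """
    return [
        tgt if i % 2 == 0 else src
        for i, (src, tgt) in enumerate(zip(src_sentences, tgt_sentences))
    ]


def build_curriculum(src_sentences, tgt_sentences, lexicon, ratio=0.5):
    """Return the three stages in training order: token, sentence, monolingual."""
    stage1 = [
        token_level_switch(s, lexicon, ratio=ratio, seed=i)
        for i, s in enumerate(src_sentences)
    ]
    stage2 = sentence_level_switch(src_sentences, tgt_sentences)
    stage3 = list(tgt_sentences)
    return [
        ("token_level", stage1),
        ("sentence_level", stage2),
        ("monolingual", stage3),
    ]


if __name__ == "__main__":
    # Hypothetical toy data: two aligned sentences and a two-entry lexicon.
    english = [
        "NLP is a field of study",
        "The goal of NLP is to enable computers to respond",
    ]
    korean = [
        "자연어처리는 전산학과 인공지능의 세부 분야이다.",
        "자연어처리는 인간 언어를 이해하고 응답하는 것을 목표한다.",
    ]
    lexicon = {"goal": "목표", "computers": "컴퓨터"}
    for name, examples in build_curriculum(english, korean, lexicon, ratio=1.0):
        # A real setup would run continual pre-training on each stage in this order.
        print(name, examples)
```

In this sketch a model would be trained on each stage in the order returned by `build_curriculum`, so the share of target-language text grows from isolated tokens to whole sentences to fully monolingual text.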