ACL 2025
Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding
Haneul Yoo, Yongjin Yang, Hwaran Lee
Abstract
As large language models (LLMs) have advanced rapidly, concerns regarding their safety have become prominent. In this paper, we discover that code-switching, a common practice in natural language, can effectively elicit undesirable behaviors from LLMs when used in red-teaming queries. We introduce a simple yet effective framework, CSRT, to synthesize code-switching red-teaming queries and to comprehensively investigate the safety and multilingual understanding of LLMs. Through extensive experiments with ten state-of-the-art LLMs and code-switching queries combining up to ten languages, we demonstrate that CSRT significantly outperforms existing multilingual red-teaming techniques, achieving 46.7% more successful attacks than standard attacks in English while remaining effective in conventional safety domains. We also examine the multilingual ability of those LLMs to generate and understand code-switching texts. Additionally, we validate the extensibility of CSRT by generating code-switching attack prompts from monolingual data. Finally, we conduct detailed ablation studies on code-switching and propose that there is an unintended correlation between the resource availability of languages and safety alignment in existing multilingual LLMs.
* This work was done during an internship at NAVER AI Lab.
[Figure: one example query shown in three settings with model responses: Red-teaming (English), Multilingual Red-teaming (Korean; sample from MultiJail), and Code-Switching Red-Teaming (CSRT; sample from CSRT, ours). Annotated metrics: Attack Success Rate (ASR): 1.0; Refusal Rate (RR): 0.0; Comprehension (Cmp. …]
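To make the notion of a code-switching query concrete, the following is a minimal toy sketch in Python. It is not the CSRT synthesis procedure, which the abstract does not specify; it only illustrates what "combining several languages in one query" can look like. The function name `code_switch`, the benign example sentence, and the assumption that the parallel translations are word-aligned with equal token counts are all hypothetical simplifications introduced here.

```python
from itertools import cycle
from typing import Dict, List, Optional


def code_switch(aligned: Dict[str, List[str]],
                order: Optional[List[str]] = None) -> str:
    """Build one code-switched sentence from parallel translations.

    `aligned` maps a language code to a token list. All lists must have
    the same length (a simplifying assumption: real translations are
    rarely word-aligned). Token i is taken from the language that comes
    next in a round-robin pass over `order`.
    """
    langs = order if order is not None else sorted(aligned)
    lengths = {len(aligned[lang]) for lang in langs}
    if len(lengths) != 1:
        raise ValueError("all token lists must have the same length")
    n_tokens = lengths.pop()
    picker = cycle(langs)  # round-robin over the chosen languages
    return " ".join(aligned[next(picker)][i] for i in range(n_tokens))


# Benign, hypothetical example: "where is the library" in three languages.
example = {
    "en": ["where", "is", "the", "library"],
    "es": ["dónde", "está", "la", "biblioteca"],
    "de": ["wo", "ist", "die", "Bibliothek"],
}
mixed = code_switch(example, order=["en", "es", "de"])
```

Round-robin selection is only the simplest deterministic choice. Naturally occurring code-switching follows linguistic constraints that a token-level scheme like this ignores.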