ACL2024

A Chinese Dataset for Evaluating the Safeguards in Large Language Models

Yuxia Wang, Zenan Zhai, Haonan Li, Xudong Han, Shom Lin, Zhenxuan Zhang, Angela Zhao, Preslav Nakov, Timothy Baldwin

摘要

Many studies have demonstrated that large language models (LLMs) can produce harmful responses, exposing users to unexpected risks when LLMs are deployed. Previous studies have proposed comprehensive taxonomies of the risks posed by LLMs, as well as corresponding prompts that can be used to examine the safety mechanisms of LLMs. However, the focus has been almost exclusively on English, and little has been explored for other languages. Here we aim to bridge this gap. We first introduce a dataset for the safety evaluation of Chinese LLMs, and then extend it to two other scenarios that can be used to better identify false negative and false positive examples in terms of risky prompt rejections. We further present a set of fine-grained safety assessment criteria for each risk type, facilitating both manual annotation and automatic evaluation in terms of LLM response harmfulness. Our experiments on five LLMs show that region-specific risks are the prevalent type of risk, presenting the major issue with all Chinese LLMs we experimented with. Warning: this paper contains example data that may be offensive, harmful, or biased. Recently, Wang et al. (2023b) proposed a comprehensive taxonomy covering diverse potential harms of LLM responses, as well as the Do-not-043 answer dataset. However, the questions in this 044 dataset are too straightforward and 90% of them 045 are easily rejected by six mainstreaming LLMs. 046 This limits the utility of the dataset for compar-047 ing the safety mechanisms across different LLMs. 048 Moreover, the dataset is for English only, and is 049 limited to questions reflecting universal human val-050 ues, with no region-or culture-specific questions. 051 Here we aim to bridge these gaps. We first trans-052 late and localize the dataset to Mandarin Chinese, 053 and then we expand it with region-specific ques-054 tions and align it with country-specific AI genera-055 tion regulations. We then extend it with: (i) risky 056 questions posed in an evasive way, aimed at evalu-057 ating an LLM's sensitivity to perceiving risks; and 058 (ii) harmless questions containing seemingly risky 059 words, e.g., fat bomb, aimed at assessing whether 060 the model is oversensitive, which can limit its help-061 fulness. This yields 3,042 Chinese questions for 062 evaluating the safety of LLMs. 063 Our contributions in this paper can be sum-064 marised as follows: 065 • We construct a Chinese LLM safety evaluation 066 dataset from three attack perspectives, aimed 067 to model risk perception and sensitivity to 068 specific words and phrases. 069 • We propose new evaluation guidelines to as-070 sess the response harmfulness for both manual 071 annotation and automatic evaluation, which 072 can better identify why a given response is 073 potentially dangerous. 074 • We evaluate five LLMs using our dataset and 075 show that they all are insensitive to three types 076 of attacks, and the majority of the unsafe re-077 sponses are concentrated on region-specific 078 sensitive topics, which determine the final 079 safety rank of these LLMs. 080 1 2 Related Work 081 2.1 Assessing Particular Types of Risk 082 Numerous studies have been dedicated to particular 083 types of risk, including toxicity in language models 084 (Hartvigsen et al., 2022; Roller et al., 2021), mis-085 information (Van Der Linden, 2022; Wang et al., 086 2023c), and bias (Dhamala et al., 2021). Specific 087 datasets have been created to evaluate LLMs regard-088 ing these risks, such as RealToxicityPrompts for 089 toxicity propensity (Gehman et al., 2020), ToxiGen 090 for hate speech detection (Hartvigsen et al., 2022), 091 BOLD for bias detection (Dhamala et al., 2021), 092 and TruthfulQA for assessing factuality against ad-093 versarial prompts (Lin et al., 2022). These datasets 094 provide resources for developing safer LLMs via 095 fine-tuning and evaluating the safety of existing 096 LLMs. 097 With the popularization of fine-tuning, the ro-098 bustness of alignment, and especially its vulnera-099 bility to fine-tuning, is of growing concern. Wolf 100 et al. (2023) and Gade et al. (2023) showed that 101 fine-tuned LLaMA is susceptible to prompts with 102 malicious intent, and Qi et al. (2023) demonstrated 103 similar susceptibility for GPT-3.5 Turbo even when 104 fine-tuned on benign datasets. These findings un-105 derscore the need to evaluate a model's safety ca-106 pabilities after fine-tuning. Our efforts are a step 107 in this direction: we build an open-source dataset 108 with fine-grained labels covering a range of risks. 1 109 2.2 Prompt Engineering for Jailbreaking 110 Prompt engineering to "jailbreak" aligned mod-111 els has been a focus of recent research. This in-112 cludes hand-crafting complex scenarios, such as 113 deeply nested structures (Li et al., 2023; Ding et al., 114 2023), and carefully modulated personas (Shah 115 et al., 2023). However, the focus has primarily been 116 on prompting inappropriate behaviors, with less 1