ACL 2024
Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination
Nakyeong Yang, Taegwan Kang, Stanley Jungkyu Choi, Honglak Lee, Kyomin Jung
Abstract
Instruction-following language models often show undesirable biases. These biases are amplified in real-world usage of language models, where a wide range of instructions is used through zero-shot example prompting. To solve this problem, we first define the bias neuron, which significantly affects biased outputs, and empirically demonstrate its existence. Furthermore, we propose a novel and practical bias mitigation method, CRISPR, to eliminate bias neurons of language models in instruction-following settings. CRISPR automatically determines biased outputs and categorizes neurons that affect the biased outputs as bias neurons using attribution, an explainability method. Experimental results demonstrate the effectiveness of our method in mitigating biases under zero-shot instruction-following settings without losing the model's task performance and existing knowledge. The experimental results also reveal the generalizability of our method, as it shows robustness under various instructions and datasets. Surprisingly, our method can mitigate the bias in language models by eliminating only a few neurons (e.g., three neurons).
...language models typically arise from the relationship between labels (e.g., "poor people") and tokens (e.g., "drugs") within data instances (Zhao et al., 2021; Fei et al., 2023).
However, the association between labels and instructions also causes a critical bias, since varying instructions cause language models to behave inconsistently. Figure 2 shows the inconsistent behavior of Flan-T5-base under various synonymous instructions on four datasets (Wang et al., 2018; Parrish et al., 2021). These results indicate that a language model is easily distracted by varying instructions even when they are semantically equivalent. These phenomena suggest that language models exhibit significant cognitive biases in instruction settings, and these are some of the most