ACL 2023

Causal-Debias: Unifying Debiasing in Pretrained Language Models and Fine-tuning via Causal Invariant Learning

Fan Zhou, Yuzhou Mao, Liu Yu, Yi Yang, Ting Zhong

Cited by 21

Abstract

Demographic biases and social stereotypes are common in pretrained language models (PLMs), and a burgeoning body of literature focuses on removing the unwanted stereotypical associations from PLMs. However, when these bias-mitigated PLMs are fine-tuned for downstream natural language processing (NLP) applications, such as sentiment classification, the unwanted stereotypical associations resurface or are even amplified. Since pretraining followed by fine-tuning is a major paradigm in NLP applications, separating the debiasing procedure of PLMs from fine-tuning would eventually harm the actual downstream utility. In this paper, we propose a unified debiasing framework, Causal-Debias, to remove unwanted stereotypical associations in PLMs during fine-tuning. Specifically, Causal-Debias mitigates bias from a causal invariant perspective by leveraging the specific downstream task to identify bias-relevant and label-relevant factors. We propose that bias-relevant factors are non-causal, as they should have little impact on downstream tasks, while label-relevant factors are causal. We perform interventions on non-causal factors in different demographic groups and design an invariant risk minimization loss to mitigate bias while maintaining task performance. Experimental results on three downstream tasks show that our proposed method can remarkably reduce unwanted stereotypical associations after PLMs are fine-tuned, while simultaneously minimizing the impact on PLMs and downstream applications. © 2023 Association for Computational Linguistics.
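
The two ingredients named in the abstract, interventions on non-causal (bias-relevant) factors and an invariance-seeking loss, can be illustrated with a minimal sketch. This is not the authors' implementation: the word list `SWAPS`, the helper names `intervene` and `invariant_risk`, and the use of a variance-of-risks penalty as a simple stand-in for the paper's invariant risk minimization loss are all assumptions made for illustration.

```python
import re
from statistics import mean, pvariance

# Hypothetical attribute-word pairs (assumption; the paper's word lists are not given here).
SWAPS = {
    "he": "she", "she": "he",
    "man": "woman", "woman": "man",
    "father": "mother", "mother": "father",
}


def intervene(sentence):
    """Swap each demographic term for its counterpart, leaving all other tokens untouched."""
    def swap(match):
        word = match.group(0)
        repl = SWAPS.get(word.lower())
        if repl is None:
            return word  # not a demographic term, keep as-is
        # Preserve the capitalization of the first letter.
        return repl.capitalize() if word[0].isupper() else repl

    return re.sub(r"[A-Za-z]+", swap, sentence)


def invariant_risk(env_losses, penalty_weight=1.0):
    """Mean risk over environments plus a penalty on how much the risks differ.

    env_losses holds one list of per-example task losses per environment,
    e.g. the original sentences and their intervened counterparts.
    The variance penalty is an illustrative surrogate, not the paper's loss.
    """
    risks = [mean(losses) for losses in env_losses]
    return mean(risks) + penalty_weight * pvariance(risks)


if __name__ == "__main__":
    print(intervene("He is a good father"))
    print(invariant_risk([[0.2, 0.4], [0.5, 0.7]]))
```

In this sketch, each demographic group's version of the data plays the role of an environment. A model whose task loss is the same before and after the swap incurs no penalty, which matches the abstract's premise that bias-relevant factors should have little impact on the downstream label.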