ACL 2025

Robustness and Confounders in the Demographic Alignment of LLMs with Human Perceptions of Offensiveness

Shayan Alipour, Indira Sen, Mattia Samory, Tanushree Mitra

Abstract

Despite a growing literature finding that large language models (LLMs) exhibit demographic biases, reports of whom they align with best are hard to generalize or even contradictory. In this work, we examine the alignment of LLMs with human annotations in five offensive language datasets, comprising approximately 220K annotations. While demographic traits, particularly race, influence alignment, these effects vary across datasets and are often entangled with other factors. Confounders introduced in the annotation process, such as document difficulty, annotator sensitivity, and within-group agreement, account for more variation in alignment patterns than demographic traits. Alignment increases with annotator sensitivity and group agreement, and decreases with document difficulty. Our results underscore the importance of multi-dataset analyses and confounder-aware methodologies in developing robust measures of demographic bias.

Recent work has focused on using generative LLMs, such as Flan-T5 and GPT, for data labeling for various social constructs like offensive language (Zampieri et al., 2023), stance detection (Aiyappa et al., 2024), hate speech (Huang et al., 2023), and framing (Gilardi et al., 2023).