EMNLP 2024

Annotation alignment: Comparing LLM and human annotations of conversational safety

Rajiv Movva, Pang Wei Koh, Emma Pierson


Abstract

Do LLMs align with human perceptions of safety? We study this question via annotation alignment, the extent to which LLMs and humans agree when annotating the safety of user-chatbot conversations. We leverage the recent DICES dataset (Aroyo et al., 2023), in which 350 conversations are each rated for safety by 112 annotators spanning 10 race-gender groups. GPT-4 achieves a Pearson correlation of r = 0.59 with the average annotator rating, higher than the median annotator's correlation with the average (r = 0.51). We show that larger datasets are needed to resolve whether LLMs exhibit disparities in how well they correlate with different demographic groups. Also, there is substantial idiosyncratic variation in correlation within groups, suggesting that race & gender do not fully capture differences in alignment. Finally, we find that GPT-4 cannot predict when one demographic group finds a conversation more unsafe than another.

Here, we compare LLMs as annotators of chatbot safety to a diverse group of human annotators. We use the DICES dataset (Aroyo et al., 2023), in which 350 user-chatbot conversations are each annotated for safety by 112 crowdworkers. Annotators span 10 race-gender groups, and these groups often disagree when rating safety (Figure 1, left). We re-annotate each conversation using five leading LLMs and study three questions (Figure 1, right):

• RQ1: How well do LLM safety ratings align with average annotator ratings?
• RQ2: Race & gender subgroups of annotators often rate safety differently. Do LLM ratings align more with some groups than others?
• RQ3: Can GPT-4 predict when demographic groups disagree about safety?
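The RQ1 comparison sets an LLM's correlation with the average annotator rating against the median annotator's correlation with that same average. It can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the array layout (annotators by conversations), the synthetic ratings, and the choice to include each annotator in the average they are compared against. The text above does not say whether a leave-one-out average is used, and the sketch does not reproduce the reported values.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation between two 1-D sequences of ratings."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


def alignment_summary(human, llm):
    """Compare an LLM to individual annotators (hypothetical helper).

    human: array of shape (n_annotators, n_conversations) with safety ratings.
    llm:   array of shape (n_conversations,) with the LLM's ratings.
    Returns (LLM correlation with the mean rating,
             median annotator correlation with the mean rating).
    Assumption: the mean includes every annotator, including the one
    being compared against it.
    """
    human = np.asarray(human, dtype=float)
    mean_rating = human.mean(axis=0)
    llm_r = pearson(llm, mean_rating)
    annotator_r = [pearson(row, mean_rating) for row in human]
    return llm_r, float(np.median(annotator_r))


if __name__ == "__main__":
    # Synthetic stand-in data, NOT the DICES ratings.
    rng = np.random.default_rng(0)
    latent = rng.random(350)  # hypothetical per-conversation "unsafe" rate
    human = (rng.random((112, 350)) < latent).astype(float)
    llm = latent + rng.normal(scale=0.3, size=350)  # noisy stand-in LLM score
    llm_r, median_r = alignment_summary(human, llm)
    print(f"LLM vs. mean rating: r = {llm_r:.2f}")
    print(f"Median annotator vs. mean rating: r = {median_r:.2f}")
```

An LLM whose correlation exceeds the median annotator's is, by this measure, closer to the group consensus than a typical individual annotator.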