ACL2025

Model-Dependent Moderation: Inconsistencies in Hate Speech Detection Across LLM-based Systems

Neil Fasching, Yphtach Lelkes

Abstract

Content moderation systems powered by large language models (LLMs) are increasingly deployed to detect hate speech; however, no systematic comparison exists between different systems. If different systems produce different outcomes for the same content, it undermines consistency and predictability, leading to moderation decisions that appear arbitrary or unfair. Analyzing seven leading models, spanning dedicated Moderation Endpoints (OpenAI, Mistral), frontier LLMs (Claude 3.5 Sonnet, GPT-4o, Mistral Large, DeepSeek V3), and specialized content moderation APIs (Google Perspective API), we demonstrate that moderation system choice fundamentally determines hate speech classification outcomes. Using a novel synthetic dataset of 1.3+ million sentences from a factorial design, we find identical content receives markedly different classification values across systems, with variations especially pronounced for specific demographic groups. Analysis across 125 distinct groups reveals these divergences reflect systematic differences in how models establish decision boundaries around harmful content, highlighting significant implications for automated content moderation.

Research has shown that online hate speech¹ is on the rise, polarizes public opinion, hurts political discourse, and may even have offline impacts on mental and physical health (Hangartner et al., 2021; Müller and Schwarz, 2021). Further, social media

¹ Following previous research, we define hate speech as communication that disparages a person or group based on their perceived protected characteristics such as race, ethnicity, gender, and sexual orientation (Tonneau et al., 2024; Schmidt and Wiegand, 2017