NDSS 2026

There is No War in Ba Sing Se: A Global Analysis of Content Moderation in Large Language Models

Friedemann Lipphardt, Moonis Ali, Martin Banzer, Anja Feldmann, Devashish Gosain

Cited by 1

Abstract

Large language models (LLMs) are widely used for information access, yet their content moderation behavior varies sharply across geographic and linguistic contexts. This paper presents a first comprehensive analysis of content moderation patterns detected in over 700,000 replies from 15 leading LLMs, evaluated from 12 locations using 1,118 sensitive queries spanning five categories in 13 languages. We find substantial geographic variation, with moderation rates showing relative differences of up to 60% across locations. For instance, soft moderation (e.g., evasive replies) appears in 14.3% of German contexts versus 24.9% of Zulu contexts. By category, miscellaneous (generally unsafe), hate speech, and sexual content are more heavily moderated than political or religious content, with political content showing the most geographic variability. We also observe discrepancies between online and offline model versions, such as DeepSeek exhibiting a 15.2% higher relative soft moderation rate when deployed locally than when accessed via API. Our analysis of response length (and time) reveals that moderated responses are, on average, about 50% shorter than unmoderated ones. These findings have important implications for AI fairness and digital equity, as users in different locations receive inconsistent access to information. We provide the first systematic evidence of geographic and cross-language bias in LLM content moderation and show how strongly model selection affects user experience.

Content Warning: This paper contains examples of or references to potentially distressing content. Reader discretion is advised.

I. INTRODUCTION

Recently, AI chatbots powered by Large Language Models (LLMs) have disrupted traditional ways of seeking information online. As widely used tools like ChatGPT [1], Claude [2], and Gemini [3] shape how billions of people access information, understanding their content moderation practices is crucial.
Yet their behavior across regions and languages remains poorly understood, making it vital to examine these variations to ensure AI fairness and digital equity. Content moderation in LLMs filters or rejects queries deemed inappropriate, such as those involving hate speech or sexual content. However, definitions of "inappropriate" vary across cultures and legal systems, posing challenges for globally deployed AI. While prior work has explored model bias [4] and safety [5], no study has systematically examined moderation consistency across regions, content categories, and languages.

Overview: This paper presents the first comprehensive global analysis of content moderation in commercial LLMs. As shown in Figure 1, we evaluate 15 leading models across 12 regions using 1,118 sensitive queries translated into 13 languages and spanning five categories: hate speech, politics, religion, sexuality, and miscellaneous topics. By issuing these queries through VPN vantage points (VPs), we collect over 700,000 responses. Overall, we answer the following broad research questions: