EMNLP 2025

Incorporating Diverse Perspectives in Cultural Alignment: Survey of Evaluation Benchmarks Through A Three-Dimensional Framework

Meng-Chen Wu, Si-Chi Chin, Tess Wood, Ayush Goyal, Narayanan Sadagopan

Abstract

Large Language Models (LLMs) increasingly serve diverse global audiences, making cultural alignment critical for responsible AI deployment across cultures. While recent works have proposed various approaches to enhance cultural alignment in LLMs, a systematic analysis of their evaluation benchmarks is still needed. We propose a novel framework that conceptualizes alignment along three dimensions: Cultural Group (who to align with), Cultural Elements (what to align), and Awareness Scope (how to align: majority-focused vs. diversity-aware). Through this framework, we analyze 105 cultural alignment evaluation benchmarks, revealing significant imbalances: Region (37.9%) and Language (28.9%) dominate Cultural Group representation; Social and Political Relations (25.1%) and Speech and Language (20.9%) dominate Cultural Elements coverage; and an overwhelming majority (97.1%) of datasets adopt majority-focused Awareness Scope approaches. In a case study examining AI safety evaluation across nine Asian countries (Section 5), we demonstrate how our framework reveals critical gaps between existing benchmarks and real-world cultural biases identified in the study, providing actionable guidance for developing more comprehensive evaluation resources tailored to specific deployment contexts.

1 We extend the demographic proxies from Adilazuarda et al. (2024) to include appearance, sexual orientation, health, socioeconomic status, online communities, and time periods (e.g., dynasties) to better accommodate our dataset analysis.

2 From Thompson et al. (2020), semantic domains include: food and drink, clothing and grooming, speech and language, modern world, physical world, the house, basic actions and technology, agriculture and vegetation, animals, social and political relations, emotions and values, kinship, and cognition, with names from Adilazuarda et al. (2024). We extend this list to include arts, mental perception, and current and historical events based on our dataset findings.

3 These can be understood as prototypical representations of cultural knowledge (cf. Rosch and Mervis 1975; see also Shore 1996). We also recognize that some papers we classified as majority-focused do consider multiple annotator perspec-