ACL 2025
CaLMQA: Exploring culturally specific long-form question answering across 23 languages
Shane Arora, Marzena Karpinska, Hung-Ting Chen, Ipsita Bhattacharjee, Mohit Iyyer, Eunsol Choi
Abstract
Despite rising global usage of large language models (LLMs), their ability to generate long-form answers to culturally specific questions remains unexplored in many languages. To fill this gap, we perform the first study of textual multilingual long-form QA by creating CaLMQA, a dataset of 51.7K culturally specific questions across 23 different languages. We define culturally specific questions as those that refer to concepts unique to one or a few cultures, or that have different answers depending on the cultural or regional context. We obtain these questions by crawling naturally occurring questions from community web forums in high-resource languages, and by hiring native speakers to write questions in under-resourced, rarely studied languages such as Fijian and Kirundi. Our data collection methodologies are translation-free, enabling the collection of culturally unique questions like "Kuber iki umwami wa mbere w'uburundi yitwa Ntare?" (Kirundi; English translation: "Why was the first king of Burundi called Ntare (Lion)?"). We evaluate the factuality, relevance, and surface-level quality of LLM-generated long-form answers, finding that (1) for many languages, even the best models make critical surface-level errors (e.g., answering in the wrong language, repetition), especially for low-resource languages; and (2) answers to culturally specific questions contain more factual errors than answers to culturally agnostic questions, that is, questions that have a consistent meaning and answer across many cultures. We release CaLMQA to facilitate future research in cultural and multilingual long-form QA.

Code: github.com/2015aroras/CaLMQA
Data: hf.co/datasets/shanearora/CaLMQA
License: CC BY 4.0