KDD 2026

IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation

Khanh-Tung Tran, Duc-Hai Nguyen, Barry O'Sullivan, Hoang D. Nguyen

Abstract

Recent advances in Large Language Models (LLMs) have demonstrated promising capabilities, yet their performance in multilingual and low-resource settings remains modest. Existing benchmarks often exhibit cultural bias, restrict evaluation to text-only inputs, rely on multiple-choice formats, and, more importantly, are ineffectual for extremely low-resource languages. To address these gaps, we introduce IRLBench, a benchmark presented in parallel English and Irish, a language that UNESCO classifies as definitely endangered. Our benchmark consists of 12 representative subjects developed from the 2024 Irish Leaving Certificate exam, enabling fine-grained analysis of model capabilities across domains. By framing the task as long-form generation and leveraging the official marking scheme, IRLBench supports a comprehensive evaluation of not only correctness but also language fidelity. Our extensive experiments on leading closed-source and open-source LLMs reveal a persistent performance gap between English and Irish: models produce valid Irish responses less than 80% of the time, and the best-performing model answers correctly 55.8% of the time in Irish, compared to 76.2% in English. With Irish as the case study, our work exposes systemic weaknesses in today's multilingual LLMs and provides a rigorous benchmark for evaluating true multilingual capabilities. We release IRLBench and an accompanying evaluation codebase to enable future research on robust, culturally aware multilingual AI development.