ACL 2025
NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts
Muhammad Farid Adilazuarda, Musa Izzanardi Wijanarko, Lucky Susanto, Khumaisa Nur'aini, Derry Tanti Wijaya, Alham Fikri Aji
Abstract
Indonesia boasts over 700 languages, with a rich diversity of writing systems. However, most NLP development has been based on romanized text, with limited support for native writing systems. We present NusaAksara, a novel public benchmark for Indonesian languages that includes their original scripts. Our benchmark covers both text and image modalities and encompasses diverse tasks such as image segmentation, OCR, transliteration, translation, and language identification. Our data is constructed by human experts through rigorous steps. NusaAksara covers 8 scripts across 7 languages, including low-resource languages not commonly seen in NLP benchmarks. Among the scripts covered in this dataset, the Lampung script is included despite being unsupported by Unicode. We benchmark our data across several models, from LLMs and VLMs such as GPT-4o, Llama 3.2, and Aya 23 to task-specific systems such as PP-OCR and LangID. Our results reveal that most NLP technologies struggle with Indonesia's local scripts, with many achieving near-zero performance.