EMNLP2025

KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts

Taebaek Hwang, Minseo Kim, Gisang Lee, Seonuk Kim, Hyunjun Eun

Abstract

Understanding and reasoning over text within visual contexts poses a significant challenge for Vision-Language Models (VLMs), given the complexity and diversity of real-world scenarios. To address this challenge, text-rich Visual Question Answering (VQA) datasets and benchmarks have emerged for high-resource languages like English. However, a critical gap persists for low-resource languages such as Korean, where the lack of comprehensive benchmarks hinders robust model evaluation and comparison. To bridge this gap, we introduce KRETA, a benchmark for Korean Reading and rEasoning in Text-rich VQA Attuned to diverse visual contexts. KRETA facilitates an in-depth evaluation of both visual text understanding and reasoning capabilities, while also supporting a multifaceted assessment across 15 domains and 26 image types. Additionally, we introduce a semiautomated VQA generation pipeline specifically optimized for text-rich settings, leveraging refined stepwise image decomposition and a rigorous seven-metric evaluation protocol to ensure data quality. While KRETA is tailored for Korean, we hope our adaptable and extensible pipeline will facilitate the development of similar benchmarks in other languages, thereby accelerating multilingual VLM research. The code and dataset for KRETA are available at https://github.com/tabtoyou/KRETA.

advances in Vision-Language Models (VLMs) (Liu et al., 2023a; Wang et al., 2024; Zhang et al., 2024b) designed to handle these diverse visual contexts. Recently, the field has progressed beyond basic text recognition, with new benchmarks (Yue et al., 2024c; Hao et al., 2025) emphasizing higher-order reasoning over textual content within images. Addressing these challenges necessitates tightly integrated cross-modal understanding, leveraging domain knowledge and multi-step reasoning that cannot be achieved by treating visual and linguistic elements in isolation.
However, low-resource languages including Korean lack benchmark suites even for basic text recognition, much less reasoning, impeding comprehensive evaluation and hindering model development across diverse domains (e.g., commerce, education) and image types (e.g., street signs, charts). Although recent multilingual VQA benchmarks (Tang et al., 2024b; Sun et al., 2024) have begun to address this disparity, they often struggle to provide sufficient coverage and depth for all languages. Existing Korean VQA datasets (Ju et al., 2024; Kim and Jung, 2025) often rely on translated English questions and non-Korean images, or are limited in scale (e.g., fewer than 650 samples). To fill the underexplored evaluation gap for Korean text-rich VQA, we propose KRETA, a benchmark for Korean Reading and rEasoning in Text-rich VQA Attuned to diverse visual contexts. Specifically, Figure 1(a) shows how KRETA is built upon a wide range of real-world Korean imagery, which we systematically categorized into 15 domains, following the Korean Standard Industrial Classification (KSIC) (Statistics Korea, 2024), and 26 image types widely used in prior work (Yue et al., 2024a; Tang et al., 2024b). Furthermore, we carefully design a dual-level reasoning framework inspired by the concepts of System 1 and System