ICLR 2026

Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation

Israfel Salazar, Manuel Fernández Burda, Shayekh Bin Islam, Arshia Soltani Moakhar, Shivalika Singh, Fabian Farestam, Angelika Romanou, Danylo Boiko, Dipika Khullar, Mike Zhang, Dominik Krzemiński, Jekaterina Novikova, Luísa Shimabucoro, Joseph Marvin Imperial, Rishabh Maheshwary, Sharad Duwal, Alfonso Amayuelas, Swati Rajwal, Jebish Purbey, Ahmed Ruby, Nicholas Popovič, Marek Suppa, Azmine Toushik Wasi, Ram Mohan Rao Kadiyala, Olga Tsymboi, Maksim Kostritsya, Bardia soltani moakhar, Gabriel da Costa Merlin, Otávio Ferracioli Coletti, Maral Jabbarishiviari, MOHAMMADAMIN FARAHANIFARD, Silvia Andrea Fernandez, María Grandury, Dmitry Abulkhanov, Drishti Sharma, Andre Guarnier De Mitri, Leticia Bossatto Marchezi, Setayesh Heydari, Johan Obando-Ceron, Nazar Kohut, Beyza Ermis, Desmond Elliott, Enzo Ferrante, Sara Hooker, Marzieh Fadaee

Cited by 6


Abstract

blend image and text modalities. Our dataset pushes beyond simple captioning tasks, challenging models to reason about visual content across a range of topics, the way humans are evaluated in exams worldwide. Through a large-scale open science effort across 18 languages, we construct Kaleidoscope (see Figure 1), featuring a diverse selection of knowledge domains across 14 subjects. With 55% of the total 20,911 questions requiring image understanding for accurate resolution, our work aims to establish a comprehensive and inclusive evaluation framework for multimodal language models.

We evaluate a wide range of state-of-the-art models on Kaleidoscope, including Claude 3.5 Sonnet (Anthropic, 2024), GPT-4o (OpenAI et al., 2024), and Gemini-V (Google et al., 2024), as well as smaller open-weight VLMs, such as the Aya-Vision model family (Cohere-For-AI-Team, 2025), Molmo (Deitke et al., 2024), Pangea (Yue et al., 2025), and the Qwen2.5-VL model family (Qwen-Team, 2025). Our key contributions and findings are highlighted here:

et al., 2024), closely mimicking conventional human testing methodologies. Our work is built around three core design principles that guide the selection, curation, processing, and addition of exams:

Multimodality: Images are central to Kaleidoscope, as we aim to evaluate how VLMs integrate and reason about visual information to answer questions. We prioritize multimodal questions with diverse image types, complemented by a similar proportion of text-only questions for a complete assessment and comparison.

Multilinguality: The benchmark contains questions in 18 languages, with a focus on underrepresented mid- and low-resource languages (e.g., Nepali, Lithuanian) alongside high-resource languages (e.g., English, Spanish) for a thorough evaluation across a broad range of languages.

Diversity: Our goal is to collect exams covering as wide a range of topics as possible, from Mathematics and Sociology to Medicine and Driving Licenses, ensuring comprehensive evaluation across various domains. The final collection includes exams from 14 different domains, collected from 18 countries and spanning educational levels from high school to professional exams, allowing detailed clustering and comprehensive evaluation.

Global Collaboration: Our work entailed an extensive, open science process to manually collect data by working directly with native speakers of different languages (