ICLR 2025
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Tianshuo Yang, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao
Abstract
Visual Question Answering (VQA) is a complex multimodal task that requires integrating visual recognition and natural language understanding to answer questions about images. While significant progress has been made for English, resources and models for other languages, such as Italian, remain scarce. This paper addresses this gap by evaluating MiniCPM-V 2.6, a state-of-the-art multimodal Large Language Model, on GQA-it, the first large-scale Italian VQA dataset. The primary goal is to measure how well such a model performs when applied off-the-shelf to this task and, if that performance is unsatisfactory, how much it improves with fine-tuning on Italian data. Applied off-the-shelf, MiniCPM-V 2.6 achieves an accuracy of 33.4%. After fine-tuning on GQA-it, performance improves substantially, reaching a state-of-the-art accuracy of 59.4% (a gain of 26.0 percentage points). These findings highlight the importance of language-specific adaptation in multilingual VQA, especially for under-resourced languages like Italian. The fine-tuned model is released to the community on a dedicated Hugging Face repository (https://huggingface.co/sag-uniroma2/MiniCPM-V-2_6-gqa-it-finetuned).