WWW2026

Bridging Visual Dynamics and Narrative Reasoning: Multimodal Large Language Models for Short Drama Quality Assessment

Qingyang Liu, Jiangtong Li, Zelin Peng, Shaobo Wang, Zhaohe Liao, Shuochen Chang, Bingjie Gao, Haonan Zhao, Mu Liu, Jidong Jiang, Li Niu

Abstract

Visual Question Answering (VQA) is a complex multimodal task that requires integrating visual recognition and natural language understanding to answer questions about images. While significant progress has been made for English, resources and models for other languages, such as Italian, remain scarce. This paper addresses this gap by evaluating MiniCPM-V 2.6, a state-of-the-art multimodal Large Language Model, on GQA-it, the first large-scale Italian VQA dataset. The primary goal of this work is to investigate how such models perform when applied off-the-shelf to this task and, if that performance is unsatisfactory, to explore how much they can improve with fine-tuning on Italian data. Applied off-the-shelf, MiniCPM-V 2.6 achieves an accuracy of 33.4%. After fine-tuning on the GQA-it dataset, its accuracy rises to 59.4%, a new state of the art on this dataset. These findings highlight the importance of language-specific adaptation in multilingual VQA tasks, especially for under-resourced languages like Italian. The trained model is released to the community in a dedicated Hugging Face repository: https://huggingface.co/sag-uniroma2/MiniCPM-V-2_6-gqa-it-finetuned
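
As a rough illustration of how the accuracy figures above could be computed, the sketch below scores predictions by exact match after light normalization. The abstract does not specify the scoring protocol, so the exact-match criterion, the `normalize` helper, and the toy Italian answers are all assumptions made for illustration; only the 33.4% and 59.4% figures come from the abstract.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization, not the paper's)."""
    return " ".join(answer.lower().strip().split())


def exact_match_accuracy(predictions, references):
    """Return the percentage of predictions that match their reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Hypothetical toy answers, not taken from GQA-it.
references = ["sì", "rosso", "sinistra", "no"]
predictions = ["Sì", "blu", "sinistra ", "no"]
toy_accuracy = exact_match_accuracy(predictions, references)

# Absolute gain between the two accuracies reported in the abstract.
off_the_shelf = 33.4
fine_tuned = 59.4
gain_points = round(fine_tuned - off_the_shelf, 1)

print(f"toy accuracy: {toy_accuracy:.1f}%")
print(f"reported gain: {gain_points} percentage points")
```

Under this assumed metric, three of the four toy answers match, and the reported figures correspond to a gain of 26.0 percentage points from fine-tuning.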