ACL2024
Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks
Fakhraddin Alwajih, El Moatez Billah Nagoudi, Gagan Bhatia, Abdelrahman Mohamed, Muhammad Abdul-Mageed
Cited 5 times
Abstract
Multimodal large language models (MLLMs) have proven effective in a wide range of tasks requiring complex reasoning and linguistic comprehension. However, due to a lack of high-quality multimodal resources in languages other than English, the success of MLLMs remains relatively limited to English-based settings. This poses significant challenges in developing comparable models for other languages, including even those with large speaker populations such as Arabic. To alleviate this challenge, we introduce a comprehensive family of Arabic MLLMs, dubbed Peacock, with strong vision and language capabilities. Through comprehensive qualitative and quantitative analysis, we demonstrate the solid performance of our models on various visual reasoning tasks and further show their emerging dialectal potential. Additionally, we introduce Henna, a new benchmark specifically designed for assessing MLLMs on aspects related to Arabic culture, setting the first stone for culturally-aware Arabic MLLMs.

For the first stage, we curate high-quality pretraining data from publicly available English datasets. We translate these datasets into Arabic and apply a carefully designed pipeline to ensure the quality of our training data. Similarly, we curate and translate an instruction finetuning dataset, which is essential for achieving reasoning and conversational capabilities.

We showcase the performance of our models across different tasks such as visual question answering (VQA) and visual reasoning. Our models perform much better than a multilingual baseline, mBLIP (Geigle et al., 2023), on different tasks and datasets, and we set the first comprehensive Arabic vision-language benchmark to facilitate future work in this area. Finally, we demonstrate the promising capabilities of our Peacock models in interacting in dialectal Arabic by conducting a case study on the Egyptian dialect.
When finetuned on a small set of Egyptian dialect data, our models exhibit an interesting level of proficiency in their dialectal responses when prompted in the same dialect. We hope this acts as a spark for future work on dialectal Arabic vision-language models.

To summarize, our contributions in this paper are as follows: (1) We introduce a suite of Arabic MLLMs, dubbed Peacock, capable of instruction following and visual reasoning, in addition to their intriguing dialectal affinity. For developing Peacock, we use existing vision and language models. We also offer a new language model, AraLLaMA, based on LLaMA2-7B (Touvron et al., 2023). (2) We introduce a diverse collection of Arabic translated datasets carefully curated for the training and evaluation of Arabic MLLMs. (3) We adapt the popular LLaVA (Liu et al., 2023b) benchmark and SEED-Bench (Li et al., 2023d) for Arabic MLLM evaluation. (4) We present Henna, a benchmark designed to measure model capabilities in interpreting images related to Arabic culture.

The rest of this paper is organized as follows: In Section 2, we provide an overview of related work. Section 3 introduces Peacock, our family of MLLMs. In Section 4, we describe our evaluation strategies and benchmarks. In Section 5, we present our experiments, human evaluation, and a comprehensive analysis of our models. We conclude in Section 6.

2 Related Work

2.1 Multimodal Large Language Models

Progress in MLLMs is largely dependent on advances in LLMs. Refer to Appendix A.2 for more details on LLM-related works. The common trend in recent MLLMs involves integrating an LLM as their text decoder alongside a vision encoder for image understanding. Several approaches were proposed for aligning the vision encoder with the text decoder.
Flamingo (Alayrac et al., 2022) and Otter (Li et al., 2023c), for example, blend a vision encoder with a resampler and a cross-gated attention layer, reducing the computational load in vision-text cross-attention and enhancing instruction optimization. While BLIP-2 (Li et al., 2023e) [...]

[...] VQA is by Kamel et al. (2023) and explores closed-form VQA without attempting generative VQA. We also know of no native Arabic datasets for either image captioning or VQA, with two exceptions: AraCOCO (Mohamed et al., 2023) for image captioning, which is mainly used for evaluation, and AVQA (Kamel et al., 2023) for VQA, which was automatically generated from MSCOCO for Arabic VQA. In many works, translations of either