ACL 2024

Can Large Multimodal Models Uncover Deep Semantics Behind Images?

Yixin Yang, Zheng Li, Qingxiu Dong, Heming Xia, Zhifang Sui

Abstract

Understanding the deep semantics of images is essential in an era dominated by social media. However, current research focuses primarily on the superficial description of images, revealing a notable deficiency in the systematic investigation of their inherent deep semantics. In this work, we introduce DEEPEVAL, a comprehensive benchmark to assess the capacity of Large Multimodal Models (LMMs) to understand visual deep semantics. DEEPEVAL includes a human-annotated dataset and three progressive subtasks: fine-grained description selection, in-depth title matching, and deep semantics understanding. Utilizing DEEPEVAL, we evaluate 9 open-source LMMs and GPT-4V(ision). Our evaluation demonstrates a substantial gap between the deep semantic comprehension capabilities of existing LMMs and those of humans. For example, GPT-4V is 30% behind humans in understanding deep semantics, even though it achieves human-comparable performance in image description. Further analysis reveals that LMM performance on DEEPEVAL varies according to the specific facets of deep semantics explored, indicating the fundamental challenges remaining in developing LMMs.

[Figure: example annotations and question for one cartoon image]

Please write the image description.
Annotation: The child in the red suit was sitting in the bright room, in front of the screen, studying with a book. He says, "Here." Outside the window, a child in tattered clothes was also studying, he also says "Here".

Please draft the image title.
Annotation: Although Poor, But to Learn

Choose the correct answer to the following question. Which following text is the deep semantics of the image?
A. This picture shows that with the development of technology, ...
B. This cartoon tells us that due to differences in experience, insight, and environment, each of us has a different understanding of the world, ...
C. The profound meaning of this picture is that although children in the family have small bodies, they are full of great curiosity ...
D. Rich or poor, every child has the right to learn. Keep on learning even if you are poor.
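The example question uses a four-option multiple-choice format. As a rough illustration only, the following sketch shows one way such items could be represented and scored by accuracy. The names `MultipleChoiceItem`, `extract_choice`, and `accuracy` are hypothetical, and the answer-extraction rule (take the first standalone option letter in the model response) is an assumption made here for illustration, not the scoring procedure used by DEEPEVAL.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MultipleChoiceItem:
    """Hypothetical container for one multiple-choice question."""
    question: str
    options: Dict[str, str]  # option letter -> option text
    answer: str              # gold option letter, e.g. "D"


def extract_choice(response: str, letters: str = "ABCD") -> Optional[str]:
    """Return the first standalone option letter in a response, or None.

    Assumption: the model names its choice as a single capital letter.
    Free-form answers would need a more careful parser.
    """
    match = re.search(rf"\b([{letters}])\b", response)
    return match.group(1) if match else None


def accuracy(items: List[MultipleChoiceItem], responses: List[str]) -> float:
    """Fraction of items whose extracted choice equals the gold letter."""
    if not items:
        return 0.0
    correct = sum(
        extract_choice(resp) == item.answer
        for item, resp in zip(items, responses)
    )
    return correct / len(items)


# Placeholder items; the option texts and gold letters are illustrative only.
items = [
    MultipleChoiceItem("Which text is the deep semantics of image 1?",
                       {"A": "text a", "B": "text b", "C": "text c", "D": "text d"},
                       "D"),
    MultipleChoiceItem("Which text is the deep semantics of image 2?",
                       {"A": "text a", "B": "text b", "C": "text c", "D": "text d"},
                       "B"),
]
responses = ["D", "The answer is A."]
score = accuracy(items, responses)
```

A model-versus-human gap such as the one reported in the abstract would then be a difference between two such accuracy values computed on the same items.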