ACL 2021

UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning

Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Trung Bui, Kyomin Jung

Abstract

Despite the success of various text generation metrics such as BERTScore, it is still difficult to evaluate image captions without enough reference captions, because the possible descriptions of an image are so diverse. In this paper, we introduce UMIC, an Unreferenced Metric for Image Captioning, which does not require reference captions to evaluate image captions. Building on Vision-and-Language BERT, we train UMIC to discriminate negative captions via contrastive learning. We also observe critical problems in the previous benchmark dataset (i.e., human annotations) for image captioning metrics, and introduce a new collection of human annotations on generated captions. We validate UMIC on four datasets, including our new dataset, and show that UMIC has a higher correlation with human judgments than all previous metrics that require multiple references. We release the benchmark dataset and pre-trained models to compute UMIC.¹

¹ https://github.com/hwanheelee1993/UMIC

[Figure: example image with reference and candidate captions. Ref 1: "A dog standing in the snow with a stick in its mouth." Ref 2: "A little dog holding sticks in its mouth." Candidate: "A dog standing on the snow with a dog". CIDEr with Ref 1: 3.166; CIDEr with Ref 2: 0.281; Human Judgments: 1.875 out of 5.]

[Figure: example image with reference and candidate captions. References: "two giraffe standing next to each other in a field."; "two giraffes are climbing a hill with mountains in the background." Candidate: "three giraffes standing in a field of grass". BLEU1: 0.324; ROUGE-L: 0.320; METEOR: 0.173; CIDER: 0.866; SPICE: 0.289; UMIC: 0.352; UMIC / 𝑪: 0.770; Human: 0.200.]
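The contrastive training idea in the abstract can be illustrated with a small sketch. The abstract does not state the exact loss, so the margin-based ranking (hinge) form below is an assumption, as are the function name `margin_ranking_loss`, the `margin` value, and the example scores. It only shows the general principle: a scorer is penalized whenever a negative caption is not rated sufficiently below the positive caption for the same image.

```python
from typing import Sequence


def margin_ranking_loss(
    pos_score: float,
    neg_scores: Sequence[float],
    margin: float = 0.2,  # assumed value, not taken from the paper
) -> float:
    """Hinge-style contrastive loss (illustrative assumption).

    Each negative caption should score at least `margin` below the
    positive caption; any violation contributes linearly to the loss.
    The result is the mean over all negatives.
    """
    if not neg_scores:
        raise ValueError("need at least one negative caption score")
    losses = [max(0.0, margin - (pos_score - neg)) for neg in neg_scores]
    return sum(losses) / len(losses)


# Hypothetical scores from an image-caption scorer, each in [0, 1].
# The first negative is already well separated; the second is too close
# to the positive score and therefore incurs a penalty.
loss = margin_ranking_loss(pos_score=0.9, neg_scores=[0.3, 0.8])
print(f"contrastive loss: {loss:.3f}")
```

In an actual training setup the scores would come from the Vision-and-Language BERT scorer and the loss would be minimized by gradient descent; the plain-Python version here only makes the ranking objective concrete.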