CVPR 2022

VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Jun Chen, Han Guo, Kai Yi, Boyang Li, Mohamed Elhoseiny

169 citations

Abstract

The limited availability of annotated data often hinders real-world applications of machine learning. To efficiently learn from small quantities of multimodal data, we leverage the linguistic knowledge from a large pre-trained language model (PLM) and quickly adapt it to new domains of image captioning. To effectively utilize a pretrained model, it is critical to balance the visual input and prior linguistic knowledge from pretraining. We propose VisualGPT, which employs a novel self-resurrecting encoder-decoder attention mechanism to quickly adapt the PLM with a small amount of in-domain image-text data. The proposed self-resurrecting activation unit produces sparse activations that prevent accidental overwriting of linguistic knowledge. When trained on 0.1%, 0.5% and 1% of the respective training sets, VisualGPT surpasses the best baseline by up to 10.0% CIDEr on MS COCO [45] and 17.9% CIDEr on Conceptual Captions [69]. Furthermore, VisualGPT achieves the state-of-the-art result on IU X-ray [15], a medical report generation dataset. Our code is available at https://github.com/Vision-CAIR/VisualGPT.