ICCV 2019

Entangled Transformer for Image Captioning

Guang Li, Linchao Zhu, Ping Liu, Yi Yang

346 citations

Abstract

In image captioning, typical attention mechanisms struggle to identify the corresponding visual signals, especially when predicting highly abstract words. This phenomenon is known as the semantic gap between vision and language. The problem can be overcome by providing semantic attributes that are homologous to language. Thanks to their inherently recurrent nature and gated operating mechanism, Recurrent Neural Networks (RNNs) and their variants are the dominant architectures in image captioning. However, when designing elaborate attention mechanisms to integrate visual inputs and semantic attributes, RNN-like variants become inflexible due to their complexity. In this paper, we investigate a Transformer-based sequence modeling framework built only from attention layers and feedforward layers. To bridge the semantic gap, we introduce EnTangled Attention (ETA), which enables the Transformer to exploit semantic and visual information simultaneously. Furthermore, a Gated Bilateral Controller (GBC) is proposed to guide the interactions between the multimodal information. We name our model ETA-Transformer. Remarkably, ETA-Transformer achieves state-of-the-art performance on the MSCOCO image captioning dataset. The ablation studies validate the improvements from the proposed modules.

[Figure: example captions produced by T_v, T_s, and ETA for four images]

Image 1. T_v: a bunch of fruit sitting in a sink. T_s: a table with a lot of food on it. ETA: a bowl of fruits and vegetables on a stove.
Image 2. T_v: a baby girl laying on a bed holding a toy. T_s: a baby girl laying on a bed with a bed. ETA: a baby sitting on a bed with a bottle.
Image 3. T_v: a giraffe eating from a feeder in a zoo. T_s: a giraffe eating a tree with a tree in background. ETA: a giraffe eating hay out of a feeder.
Image 4. T_v: a clock hanging from a wall next to a window. T_s: a large clock sitting on top of a wall. ETA: a clock hanging on the side of a building.
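The abstract does not give the equations behind ETA or GBC, so the following is only a generic illustration of the idea it describes: attending over visual features and semantic attributes with the same query, then combining the two results through sigmoid gates. It is not the authors' formulation. All names (`attend`, `gated_fusion`, `region_feats`, `attr_embs`) and the gate weights `w_v` and `w_s` are hypothetical, and a real model would learn the gate parameters rather than fix them.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, semantic, w_v=(1.0, 1.0), w_s=(1.0, 1.0)):
    """Elementwise gated sum of two attended vectors.

    Assumption: each gate is a sigmoid of a fixed linear combination of the
    two inputs. The weights are illustrative placeholders, not learned values.
    """
    fused = []
    for v, s in zip(visual, semantic):
        g_v = sigmoid(w_v[0] * v + w_v[1] * s)  # how much visual signal passes
        g_s = sigmoid(w_s[0] * v + w_s[1] * s)  # how much semantic signal passes
        fused.append(g_v * v + g_s * s)
    return fused


# Toy inputs: one decoder query, two image regions, two attribute embeddings.
query = [1.0, 0.0]
region_feats = [[1.0, 0.0], [0.0, 1.0]]
attr_embs = [[0.5, 0.5], [0.0, 1.0]]

visual = attend(query, region_feats, region_feats)
semantic = attend(query, attr_embs, attr_embs)
fused = gated_fusion(visual, semantic)
```

Here `fused` has the same dimensionality as the query, so in this toy setup it could be passed on wherever a single attended vector would otherwise be used.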