EMNLP 2021

Region under Discussion for visual dialog

Mauricio Mazuecos, Franco M. Luque, Jorge Sánchez, Hernán Maina, Thomas Vadora, Luciana Benotti

Abstract

Visual Dialog is assumed to require the dialog history to generate correct responses during a dialog. However, it is not clear from previous work how dialog history is needed for visual dialog. In this paper we define what it means for visual questions to require dialog history and we propose a methodology for identifying them. We release a subset of the GuessWhat?! questions whose dialog history completely changes their responses. We propose a novel interpretable representation that visually grounds dialog history: the Region under Discussion. It constrains the image's spatial features according to a semantic representation of the history inspired by the information structure notion of Question under Discussion. We evaluate the architecture on task-specific multimodal models and the visual transformer model LXMERT and show that there is still room for improvement.

Question                                HR    CMO   +RuD
1. is it human?                         no    no    no
2. is it food?                          no    no    no
3. is it on the gas stove?              no    no    no
4. is it on the nearby counter top?     yes   yes   yes
5. is it red?                           no    no    no
6. is the yellow spoon in the plate?    no    no    no
7. is a bottle?                         yes   no    no
8. the big one near the white plate?    yes   no    yes

1. it is a sign?                        no    no    no
2. it is a car?                         yes   yes   yes
3. it is grey?                          no    no    no
4. it is brown?                         yes   no    yes
5. it is front the other car?           yes   no    no

1. is it a vehicle?                     no    no    no
2. is it a person?                      no    no    no
3. is it a building?                    no    no    no
4. is the color red?                    no    no    no
5. is it the sign board?                no    no    no
6. is it a traffic light?               yes   yes   yes
7. is it in middle?                     no    no    no
8. is it the first one?                 yes   no    yes