EMNLP 2021

Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text

Christopher Clark, Jordi Salvador, Dustin Schwenk, Derrick Bonafilia, Mark Yatskar, Eric Kolve, Alvaro Herrasti, Jonghyun Choi, Sachin Mehta, Sam Skjonsberg, Carissa Schoenick, Aaron Sarnat, Hannaneh Hajishirzi, Aniruddha Kembhavi, Oren Etzioni, Ali Farhadi

6 citations

Abstract

Communicating with humans is challenging for AIs because it requires a shared understanding of the world, complex semantics (e.g., metaphors or analogies), and at times multimodal gestures (e.g., pointing with a finger, or an arrow in a diagram). We investigate these challenges in the context of Iconary, a collaborative game of drawing and guessing based on Pictionary, that poses a novel challenge for the research community. In Iconary, a Guesser tries to identify a phrase that a Drawer is drawing by composing icons, and the Drawer iteratively revises the drawing to help the Guesser in response. This back-and-forth often uses canonical scenes, visual metaphor, or icon compositions to express challenging words, making it an ideal test for mixing language and visual/symbolic communication in AI. We propose models to play Iconary and train them on over 55,000 games between human players. Our models are skillful players and are able to employ world knowledge in language models to play with words unseen during training. Elite human players outperform our models, particularly at the drawing task, leaving an important gap for future research to address. We release our dataset, code, and evaluation setup as a challenge to the community at github.com/allenai/iconary.

Annotations (18.5%, 22.0%): A noun is drawn by composing multiple icons, such as drawing 'sprinkler' with a spray bottle, drop, and fountain icons.
Composition (47.5%, 64.0%): Arrows, crosses, checkmarks, or circles guide interpretation, such as an arrow to indicate a part or crosses to specify incorrect options.
Repurposing (22.5%, 32.0%): An icon is used to represent an object different than its intended meaning, such as a scarf for 'scabbard', or box + ring for 'boxing ring'.
Verb Scene (25.0%, 44.5%): Multiple icons are arranged in a scene to indicate the verb.
Verb Icon (41.0%, 25.5%): A single icon is used to indicate the verb, such as hammer for 'building' or eyes for 'reading'.
Verb Arrows (22.0%, 25.5%): The verb is indicated by using arrows to show motion.