ACL 2022

Voxel-informed Language Grounding

Rodolfo Corona, Shizhan Zhu, Dan Klein, Trevor Darrell

Abstract

Natural language applied to natural 2D images describes a fundamentally 3D world. We present the Voxel-informed Language Grounder (VLG), a language grounding model that leverages 3D geometric information in the form of voxel maps derived from the visual input using a volumetric reconstruction model. We show that VLG significantly improves grounding accuracy on SNARE (Thomason et al., 2021), an object reference game task. At the time of writing, VLG holds the top place on the SNARE leaderboard, achieving SOTA results with a 2.0% absolute improvement.