CVPR 2022
Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation
Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
150 citations
Abstract
[Figure 1 labels: Topological Mapping; Graph Update; map 𝑡-1; map 𝑡; step 𝑡: panorama + GPS location; step 𝑡+1: panorama + GPS location; Global Action Planning; Instruction; Panorama Encoding; Coarse-scale Encoding; Fine-scale Encoding; Dynamic Fusion; Shortest Route Planning; Next Location; Local Actions. Example instruction: "go into the living room and water the plant on the table." Map nodes are labeled a to j.]

Figure 1. An agent is required to navigate in unseen environments to reach target locations according to language instructions. It only obtains local observations of the environment and is allowed to make local actions, i.e., moving to neighboring locations. In this work, we propose to build topological maps on-the-fly to enable long-term action planning. The map contains visited nodes and navigable nodes that can be reached from the previously visited nodes. Our method predicts global actions, i.e., all navigable nodes in the map, and trades off complexity by combining a coarse-scale graph encoding with a fine-scale encoding of observations at the current node.
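The map bookkeeping described in the caption can be illustrated with a small sketch. This is not the authors' implementation: the class `TopoMap` and its methods `update`, `global_actions`, and `shortest_route` are hypothetical names chosen for illustration. The sketch also assumes unweighted edges and uses breadth-first search for route planning, whereas the caption only says that a shortest route to the chosen node is planned. The learned parts (panorama encoding, coarse-scale and fine-scale encoding, dynamic fusion) are omitted entirely.

```python
from collections import deque


class TopoMap:
    """Toy topological map (hypothetical, for illustration only).

    Holds visited nodes plus navigable nodes that were observed from a
    visited node but have not been visited yet.
    """

    def __init__(self):
        self.edges = {}       # node -> set of neighbouring nodes
        self.visited = set()  # nodes the agent has already stood on

    def update(self, current, neighbors):
        """Mark `current` as visited and link it to the nodes observed from it."""
        self.visited.add(current)
        self.edges.setdefault(current, set())
        for n in neighbors:
            self.edges[current].add(n)
            self.edges.setdefault(n, set()).add(current)

    def global_actions(self):
        """Candidate targets: every node in the map that is not yet visited."""
        return sorted(set(self.edges) - self.visited)

    def shortest_route(self, start, goal):
        """Breadth-first search (assumes unit edge costs); returns a node list or None."""
        parent = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            for nxt in sorted(self.edges.get(node, ())):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        return None


# Two steps of exploration: the agent stands on "a", then moves to "b".
topo = TopoMap()
topo.update("a", ["b", "c"])
topo.update("b", ["a", "d"])
print(topo.global_actions())          # ['c', 'd']
print(topo.shortest_route("b", "c"))  # ['b', 'a', 'c']
```

In this toy run, the node "c" was seen from "a" but never visited, so it stays available as a global action after the agent has moved to "b". Choosing it yields a route back through "a", which the agent would then execute as a sequence of local actions.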