CVPR 2025

CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

Xinhao Liu, Jintong Li, Yicheng Jiang, Niranjan Sujay, Zhicheng Yang, Juexiao Zhang, John Abanes, Jing Zhang, Chen Feng

Abstract

https://ai4ce.github.io/CityWalker/

[Figure 1 labels: Crossing, CityWalker, Turn, Sign, Obstacle, Traffic Light, Road blocked, Dense Traffic, Proximity; Web-scale Videos (2000+ hours); Expert Data (6 hours)]

Figure 1. Embodied Urban Navigation. Navigating urban spaces is challenging for (especially off-street) mobile agents. The differently colored pins along the route highlight various critical scenarios unique to complex and dynamic urban landscapes. Thumbnails on the right with corresponding colored pins demonstrate the real-world observation of these challenging cases. Our CityWalker model is trained with over 2000 hours of city walking videos and fine-tuned with a small amount of expert data to address these challenges effectively.