NeurIPS 2022
Contact-aware Human Motion Forecasting
Wei Mao, Miaomiao Liu, Richard I. Hartley, Mathieu Salzmann
Abstract
In this paper, we tackle the task of scene-aware 3D human motion forecasting, which consists of predicting future human poses given a 3D scene and a past human motion. A key challenge of this task is to ensure consistency between the human and the scene, accounting for human-scene interactions. Previous attempts to do so model such interactions only implicitly, and thus tend to produce artifacts such as "ghost motion" because of the lack of explicit constraints between the local poses and the global motion. Here, by contrast, we propose to explicitly model the human-scene contacts. To this end, we introduce distance-based contact maps that capture the contact relationships between every joint and every 3D scene point at each time instant. We then develop a two-stage pipeline that first predicts the future contact maps from the past ones and the scene point cloud, and then forecasts the future human poses by conditioning them on the predicted contact maps. During training, we explicitly encourage consistency between the global motion and the local poses via a prior defined using the contact maps and future poses. Our approach outperforms the state-of-the-art human motion forecasting and human synthesis methods on both synthetic and real datasets. Our code is available at https://github.com/wei-mao-2019/ContAwareMotionPred.

Recently, a few works [8, 6] have started to incorporate scene context in motion forecasting. In particular, Corona et al. [8] introduced a semantic-graph model that extracts a joint embedding of the human pose and an object of interest, such as a cup. This method, however, is ill-suited to model interactions with the whole scene itself, for example the floor or stairs that the person touches while walking. In [6], Cao et al. proposed a multi-stage pipeline that breaks down the motion forecasting into three sub-tasks: predicting a 2D goal, planning a 2D and 3D path, forecasting the 3D poses
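
To make the distance-based contact maps described in the abstract more concrete, the following is a minimal sketch in NumPy. The abstract only states that the maps are distance-based and defined for every joint, every scene point, and every time instant; the Gaussian form, the bandwidth `sigma`, the function name `contact_maps`, and the array shapes used here are assumptions for illustration, not necessarily the authors' exact definition.

```python
import numpy as np

def contact_maps(joints, scene_points, sigma=0.2):
    """Per-frame, per-joint, per-scene-point contact scores (illustrative).

    joints:       (T, J, 3) array of 3D joint positions over T frames
    scene_points: (N, 3) array of 3D scene points
    sigma:        assumed bandwidth (same units as the coordinates)

    Returns a (T, J, N) array with values in [0, 1]; a score of 1 means
    the joint coincides with the scene point.
    """
    # Broadcast to (T, J, N, 3): every joint against every scene point.
    diff = joints[:, :, None, :] - scene_points[None, None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Assumed Gaussian mapping from squared distance to contact score.
    return np.exp(-0.5 * sq_dist / sigma ** 2)

# Toy usage with random data (hypothetical sizes).
rng = np.random.default_rng(0)
joints = rng.normal(size=(4, 21, 3))   # 4 frames, 21 joints
scene = rng.normal(size=(100, 3))      # 100 scene points
maps = contact_maps(joints, scene)
print(maps.shape)  # (4, 21, 100)
```

Under this assumption, each frame yields one J-by-N map, so a sequence of past maps together with the scene point cloud could serve as the input from which future maps are predicted, as the first stage of the pipeline describes.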