NeurIPS2023

Object Reprojection Error (ORE): Camera pose benchmarks from lightweight tracking annotations

Xingyu Chen, Weiyao Wang, Hao Tang, Matt Feiszli

Abstract

3D spatial understanding is highly valuable for semantic modeling of environments, agents, and their relationships. Semantic modeling approaches applied to monocular video often ingest outputs from off-the-shelf SLAM/SfM pipelines, which are anecdotally observed to perform poorly or fail completely on some fraction of the videos of interest. These target videos may vary widely in scene complexity, activities, camera trajectory, etc. Unfortunately, such semantically rich video data often comes with no ground-truth 3D information, and in practice it is prohibitively costly or impossible to obtain ground-truth reconstructions or camera poses post hoc. This paper proposes a novel evaluation protocol, Object Reprojection Error (ORE), to benchmark camera trajectories; ORE computes reprojection error for static objects within the video and requires only lightweight object tracklet annotations. These annotations are easy to gather on new or existing video, enabling ORE to be calculated on essentially arbitrary datasets. We show that ORE maintains high rank correlation with standard metrics based on ground truth. Leveraging ORE, we source videos and annotations from Ego4D-EgoTracks, resulting in EgoStatic, a large-scale diverse dataset for evaluating camera trajectories in-the-wild.

37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks.

Contributions. This work proposes a novel, object-centric metric for camera trajectory quality on essentially arbitrary video (e.g. the videos in Fig. 1). As a measure of environmental understanding, the resulting benchmark uses a type of reprojection error that generalizes geometric reprojection error; in this way it is similar to traditional SfM/SLAM benchmarks.
However, using a sparse set of semantic landmarks, specifically static objects whose identities are unknown to the method under evaluation, keeps the benchmark focused on high-level percepts in the vicinity of the camera rather than on global maps and reconstruction accuracy. Object Reprojection Error (ORE) relaxes the requirement for accurate ground-truth camera trajectories. Given an arbitrary video, one identifies a few "suitable" object candidates and annotates 2D bounding boxes across the frames in which the object remains unmoved (Fig. 1). Despite using no ground-truth camera trajectory, ORE's rank statistics agree well with standard ground-truth-based metrics, such as Absolute Trajectory Error (ATE) and Relative Pose Error (RPE), when properly selected tracklets are used. Equipped with this schema, we source a wide variety of egocentric videos from Ego4D, a large-scale egocentric video dataset collected in-the-wild. We leverage the long-term tracking annotations from EgoTracks [86] on Ego4D. Our contributions include:

1. We carefully design ORE, a new evaluation protocol for camera trajectory estimation that requires only static object tracklet annotations and no ground-truth camera trajectory.
2. We benchmark 7 SLAM, Visual Odometry (VO), and Structure-from-Motion (SfM) methods on the ScanNet test set [21] to compare ORE with standard metrics. Rank correlation shows high agreement between ORE and standard metrics.
3. We extend and adapt the Ego4D-EgoTracks [86] benchmark into the first large-scale egocentric camera trajectory benchmark. The resulting benchmark is shown to be quite challenging.
4. Finally, ORE is a useful tool: it is sensitive enough to inform hyperparameter selection and method design. The experiments reveal potential directions for future work.
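
The text describes ORE as a generalization of geometric reprojection error. As a reminder of that classical quantity, the following is a minimal sketch, not the paper's ORE definition. It assumes a pinhole camera, a single known 3D point standing in for a static object, and the center of each annotated 2D bounding box as the observation. All function names, the box format `(x_min, y_min, x_max, y_max)`, and the numbers are hypothetical choices made for illustration.

```python
import math
from typing import Sequence

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def project(point_w: Vec3, R: Mat3, t: Vec3,
            fx: float, fy: float, cx: float, cy: float) -> tuple[float, float]:
    """Pinhole projection of a world point; (R, t) maps world to camera coordinates."""
    xc, yc, zc = (sum(R[i][j] * point_w[j] for j in range(3)) + t[i] for i in range(3))
    if zc <= 0.0:
        raise ValueError("point is behind the camera")
    return fx * xc / zc + cx, fy * yc / zc + cy


def box_center(box: Sequence[float]) -> tuple[float, float]:
    """Center of a box given as (x_min, y_min, x_max, y_max); an assumed format."""
    x_min, y_min, x_max, y_max = box
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0


def mean_reprojection_error(point_w, poses, boxes, intrinsics) -> float:
    """Mean pixel distance between the projected point and each box center."""
    errors = []
    for (R, t), box in zip(poses, boxes):
        u, v = project(point_w, R, t, *intrinsics)
        cu, cv = box_center(box)
        errors.append(math.hypot(u - cu, v - cv))
    return sum(errors) / len(errors)


# Toy example with made-up values: two frames observing one static point.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
poses = [(identity, [0.0, 0.0, 0.0]), (identity, [-0.1, 0.0, 0.0])]
boxes = [(310.0, 230.0, 330.0, 250.0), (290.0, 230.0, 306.0, 250.0)]
intrinsics = (500.0, 500.0, 320.0, 240.0)  # fx, fy, cx, cy
error_px = mean_reprojection_error([0.0, 0.0, 2.0], poses, boxes, intrinsics)
```

In this toy setup the estimated poses are scored purely against 2D annotations of a static target, which is the property that lets a reprojection-style metric avoid ground-truth trajectories.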
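
The agreement between ORE and ground-truth-based metrics is reported through rank correlation. The sketch below shows one way such a comparison could be computed, assuming Spearman's coefficient (the text above does not state which rank statistic is used). The per-method scores are made-up placeholders, not results from the paper, and the function names are hypothetical.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Placeholder scores for five hypothetical methods (lower is better for both).
ore_scores = [12.0, 30.5, 18.2, 55.0, 25.0]   # made-up ORE values
ate_scores = [0.05, 0.21, 0.09, 0.40, 0.25]   # made-up ATE values
rho = spearman_rho(ore_scores, ate_scores)
```

A coefficient near 1 would indicate that the two metrics order the methods almost identically, which is the sense in which a metric computed without ground truth can stand in for one that requires it.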