NeurIPS 2022

TAP-Vid: A Benchmark for Tracking Any Point in a Video

Carl Doersch, Ankush Gupta, Larisa Markeeva, Adrià Recasens, Lucas Smaira, Yusuf Aytar, João Carreira, Andrew Zisserman, Yi Yang

Cited by 281


Abstract

Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful to make inferences about 3D shape, physical properties and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation existed, until now. In this paper, we first formalize the problem, naming it tracking any point (TAP). We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks. Central to the construction of our benchmark is a novel semi-automatic crowdsourced pipeline which uses optical flow estimates to compensate for easier, short-term motion like camera shake, allowing annotators to focus on harder sections of video. We validate our pipeline on synthetic data and propose a simple end-to-end point tracking model TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data. Code and data are available at the following URL: https://github.com/deepmind/tapnet.

[Figure panels: TAP-Vid-DAVIS, TAP-Vid-Kinetics, TAP-Vid-RGB-Stacking, TAP-Vid-Kubric]

Dataset              Type                  #Videos (#Images)   Duration / Time-scale
KITTI [16]           Optical flow          156                 400 frame pairs
COCO-DensePose [19]  Surface Points        (50k)               -
COCO-WholeBody [32]  Semantic Keypoints    (200k)              -
DAVIS [53]           Masks (multi-object)  150                 25 fps @ 2-5s
GOT-10k [24]         BBs (single-object)   10k                 10 fps @ 15s
TAO [11]             BBs (multi-object)    3k                  1 fps @ 37s
YouTube-BB [56]      BBs (single-object)   240k                1 fps @ 20s
PoseTrack [2]        Semantic Keypoints    550 (37k)           train: 30 fps @ 1s; eval: 7 fps @ 3-5s
300VW [63]           Facial Keypoints      300                 30 fps @ 1-2 mins
ScanNet [10]         SfM 3D recons.        1500                ∼1 min
MegaDepth [40]       SfM 3D recons.        200 scenes (130k)   -
TAP-Vid-Kinetics     Arbitrary points      1,189               25 fps @ 10s

The table's remaining columns (Longterm?, Point Precise?, Class Agnostic?, Nonrigid?) held per-dataset marks that are not preserved here.
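The abstract notes that optical flow estimates can absorb easy, short-term motion such as camera shake. As a rough illustration of that general idea only (not the paper's annotation pipeline or TAP-Net), the sketch below chains per-frame flow fields to carry a single point forward through a clip. The flow representation, the function name `propagate_point`, and the nearest-pixel lookup are all assumptions made for this example.

```python
from typing import List, Tuple

# Assumed representation: flow[y][x] = (dx, dy), the displacement of pixel
# (x, y) from one frame to the next. Names here are illustrative only.
Flow = List[List[Tuple[float, float]]]


def propagate_point(
    point: Tuple[float, float], flows: List[Flow]
) -> List[Tuple[float, float]]:
    """Chain per-frame flow to carry one (x, y) point through consecutive frames.

    Uses a nearest-pixel lookup clamped to the frame bounds and returns one
    position per frame, starting with the input point.
    """
    x, y = point
    track = [(x, y)]
    for flow in flows:
        height, width = len(flow), len(flow[0])
        # Nearest pixel, clamped so the lookup stays inside the frame.
        xi = min(max(int(round(x)), 0), width - 1)
        yi = min(max(int(round(y)), 0), height - 1)
        dx, dy = flow[yi][xi]
        x, y = x + dx, y + dy
        track.append((x, y))
    return track


# Toy example: a 4x4 frame in which every pixel shifts 1 px to the right per
# frame, as a uniform camera shake would produce.
shake = [[(1.0, 0.0)] * 4 for _ in range(4)]
track = propagate_point((0.0, 2.0), [shake, shake])
```

Chained flow of this kind accumulates error over long clips and does not handle occlusion, which is consistent with the abstract's framing of flow as a tool for short-term motion while harder sections are left to annotators.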