NeurIPS 2023

A Multi-modal Global Instance Tracking Benchmark (MGIT): Better Locating Target in Complex Spatio-temporal and Causal Relationship

Shiyu Hu, Dailing Zhang, Meiqi Wu, Xiaokun Feng, Xuchen Li, Xin Zhao, Kaiqi Huang

Cited by 26

Abstract

Tracking an arbitrary moving target in a video sequence is the foundation for high-level tasks such as video understanding. Although existing vision-based trackers have demonstrated good tracking capabilities on short video sequences, they tend to perform poorly in complex environments, as represented by the recently proposed global instance tracking task, which consists of longer videos with more complicated narrative content. Recently, several works have introduced natural language into object tracking, aiming to address the limitations of relying on a single visual modality. However, the videos selected in these works are still short sequences with uncomplicated spatio-temporal and causal relationships, and the provided semantic descriptions are too simple to characterize the video content. To address these issues, we (1) propose a new multi-modal global instance tracking benchmark named MGIT. It consists of 150 long video sequences with a total of 2.03 million frames, aiming to fully represent the complex spatio-temporal and causal relationships coupled in longer narrative content. (2) Each video sequence is annotated with three semantic grains (i.e., action, activity, and story) to model the progressive process of human cognition. We expect this multi-granular annotation strategy to provide a favorable environment for multi-modal object tracking research and long video understanding. (3) We also run comparative experiments on existing multi-modal object tracking benchmarks, which not only explore the impact of different annotation methods but also validate that our annotation method is a feasible solution for coupling human understanding into semantic labels. (4) Additionally, we conduct detailed experimental analyses on MGIT, and hope that the performance bottlenecks identified in existing algorithms can support further research in multi-modal object tracking.
The proposed benchmark, experimental results, and toolkit will be released gradually at http://videocube.aitestunion.com/.

[Figure: annotation example. Legend: Target (Who), Motion (What), Location (Where), Time Interval (When), Third-party (Optional). Frame markers: #000001, #003580, #005100, #007120, #009030.]

D1 Story: A male secret agent wearing a black suit walks in the washroom, and stands near a man wearing a light grey suit. They fight, and the male secret agent wins. He then lifts the insensible grey-suit man to the washroom cubicle. The male secret agent crouches in the washroom cubicle and checks the insensible grey-suit man. Suddenly, the grey-suit man wakes up, and they fight again in the washroom. Eventually, the male secret agent wins the fight. After the male secret agent talks with a woman wearing a brown suit, he again lifts the insensible grey-suit man to the washroom cubicle. Finally, the male secret agent leaves the washroom after talking with the brown-suit woman.

Activity 1: A male secret agent wearing a black suit walks in the washroom, and stands near a man wearing a light grey suit. They fight, then the male secret agent wins, and lifts the insensible grey-suit man to the washroom cubicle.
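
To make the multi-granular annotation idea more concrete, the following is a minimal Python sketch of how descriptions at the three semantic grains named in the abstract (action, activity, story) might be attached to frame intervals of one sequence. It is not the MGIT toolkit or its file format: the class names `SemanticAnnotation` and `SequenceAnnotation`, the `describe` method, the sequence name, the shortened description texts, and all frame boundaries are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The grain names come from the abstract; the rest of this schema is hypothetical.
GRAINS = ("action", "activity", "story")


@dataclass
class SemanticAnnotation:
    """One natural-language description covering an inclusive frame interval."""
    grain: str
    start_frame: int
    end_frame: int
    text: str

    def __post_init__(self) -> None:
        # Reject grains outside the three levels and inverted intervals.
        if self.grain not in GRAINS:
            raise ValueError(f"unknown grain: {self.grain!r}")
        if self.start_frame > self.end_frame:
            raise ValueError("start_frame must not exceed end_frame")

    def covers(self, frame: int) -> bool:
        return self.start_frame <= frame <= self.end_frame


@dataclass
class SequenceAnnotation:
    """All descriptions attached to one video sequence."""
    name: str
    annotations: List[SemanticAnnotation] = field(default_factory=list)

    def describe(self, frame: int, grain: str) -> Optional[str]:
        """Return the first description of the given grain covering the frame."""
        for ann in self.annotations:
            if ann.grain == grain and ann.covers(frame):
                return ann.text
        return None


# Illustrative values only: frame boundaries and texts are assumed, not benchmark data.
seq = SequenceAnnotation(
    name="example_sequence",
    annotations=[
        SemanticAnnotation("story", 1, 9030, "An agent fights a man in a washroom and leaves."),
        SemanticAnnotation("activity", 1, 3580, "The agent walks in, fights the man, and wins."),
        SemanticAnnotation("action", 1, 400, "The agent walks in the washroom."),
    ],
)

for grain in GRAINS:
    print(grain, "->", seq.describe(100, grain))
```

Under these assumptions, a query for a frame returns progressively broader context as the grain moves from action to story, which mirrors the coarse-to-fine description levels the abstract describes. A frame outside every interval of a grain yields `None`.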