CVPR 2023

Streaming Video Model

Yucheng Zhao, Chong Luo, Chuanxin Tang, Dongdong Chen, Noel Codella, Zheng-Jun Zha

Abstract

Figure 1. Illustration of the proposed streaming video model, compared with the conventional frame-based and clip-based architectures. (a) The two-stage streaming video model serves different types of video tasks through a unified architecture. The output of the temporal-aware (T-aware) spatial encoder serves frame-based tasks, such as multiple object tracking (MOT), while the output of the temporal decoder serves sequence-based tasks, such as action recognition. (b) The frame-based architecture, which uses a single-image model to extract spatial features independently for each frame, is widely used in frame-based video tasks. (c) The clip-based architecture, which uses a video model to produce spatiotemporal features for an entire clip, is widely used in sequence-based video tasks.
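The data flow of the three architectures in Figure 1 can be sketched with toy stand-ins. This is a minimal illustration, not the model itself. All function names are hypothetical, frames and features are plain lists of floats, and the way the T-aware encoder uses the past (blending with the previous frame's feature) is an assumption made only to show where temporal information enters.

```python
from typing import List, Optional, Tuple

Frame = List[float]    # toy stand-in for an image
Feature = List[float]  # toy stand-in for a feature vector


def spatial_encode(frame: Frame) -> Feature:
    """Hypothetical single-image encoder (here: a simple scaling)."""
    return [2.0 * x for x in frame]


def mean_over_time(features: List[Feature]) -> Feature:
    """Average a list of per-frame features into one vector."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def frame_based(frames: List[Frame]) -> List[Feature]:
    """(b) Each frame is encoded independently; no temporal information."""
    return [spatial_encode(f) for f in frames]


def clip_based(frames: List[Frame]) -> Feature:
    """(c) The whole clip is consumed at once, giving one clip-level feature."""
    return mean_over_time([spatial_encode(f) for f in frames])


def t_aware_encode(frame: Frame, memory: Optional[Feature]) -> Feature:
    """Toy T-aware spatial encoder.

    Assumption: temporal awareness is modeled by blending the current
    spatial feature with a memory of the previous frame's feature.
    """
    spatial = spatial_encode(frame)
    if memory is None:
        return spatial
    return [0.5 * s + 0.5 * m for s, m in zip(spatial, memory)]


def streaming(frames: List[Frame]) -> Tuple[List[Feature], Feature]:
    """(a) Two-stage streaming model.

    Stage 1 yields one feature per frame as frames arrive (frame-based tasks).
    Stage 2 decodes the collected features into a single sequence-level
    feature (sequence-based tasks).
    """
    memory: Optional[Feature] = None
    per_frame: List[Feature] = []
    for frame in frames:
        feature = t_aware_encode(frame, memory)
        per_frame.append(feature)
        memory = feature  # carried forward to the next frame
    return per_frame, mean_over_time(per_frame)


if __name__ == "__main__":
    clip = [[1.0, 2.0], [3.0, 4.0]]
    per_frame_features, sequence_feature = streaming(clip)
    print(per_frame_features)  # would feed a frame-based head, e.g. MOT
    print(sequence_feature)    # would feed a sequence-based head, e.g. action recognition
```

In this sketch a single forward pass produces both kinds of output, whereas the frame-based path yields only per-frame features and the clip-based path only a clip-level feature.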