CVPR 2024

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, Sergey Tulyakov

Abstract

[...] publicly available HD-VILA-100M dataset. We then split them into semantically consistent video clips and apply multiple cross-modality teacher models to obtain captions for each video. Next, we fine-tune a retrieval model on a small subset in which the best caption for each video is manually selected, and then apply the model to the whole dataset to select the best caption as the annotation. In this way, we obtain 70M videos paired with high-quality text captions. We dub the dataset Panda-70M. We show the value of the proposed dataset on three downstream tasks: video captioning, video and text retrieval, and text-driven video generation. Models trained on the proposed data score substantially better on the majority of metrics across all the tasks.

[Figure: sample text annotations for video clips. The only surviving panel label is "HD-VILA-100M". Sample texts: "He thought he was gonna get shows terrible communication on the teams part." / "We're gonna cook this all together stirring it constantly for just a minute until it smells nice and fragrant." / "It is a close-up shot of a brown and white english bulldog with wrinkles on its face, sitting on a person's lap." / "It is a red and purple betta fish swimming in a tank with gravel and plants." / "A person is adding chicken broth to a pot of quinoa on a stove."]

* This work was done while interning at Snap.

This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.
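The caption-selection step described in the abstract can be illustrated with a minimal sketch. It assumes that a retrieval model has already mapped the video and each candidate caption to embedding vectors, and that candidates are ranked by cosine similarity. Both points are assumptions made for illustration, since the abstract does not specify the scoring function. The names `cosine` and `select_best_caption` and the toy embeddings are hypothetical and do not come from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_caption(
    video_emb: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
) -> Tuple[str, float]:
    """Return the candidate caption whose embedding is most similar to the video embedding.

    `candidates` holds (caption_text, caption_embedding) pairs, one per teacher model.
    """
    if not candidates:
        raise ValueError("at least one candidate caption is required")
    scored = [(cosine(video_emb, emb), text) for text, emb in candidates]
    best_score, best_text = max(scored, key=lambda pair: pair[0])
    return best_text, best_score


# Toy, hand-made embeddings (hypothetical; a real system would use a trained model).
video = [1.0, 0.0, 0.0]
candidates = [
    ("a dog on a lap", [0.9, 0.1, 0.0]),
    ("a fish in a tank", [0.0, 1.0, 0.0]),
    ("a pot on a stove", [0.1, 0.0, 1.0]),
]
best_text, best_score = select_best_caption(video, candidates)
print(best_text, round(best_score, 3))
```

In this toy setup the first candidate wins because its embedding points almost in the same direction as the video embedding. In the pipeline the abstract describes, the ranking would come from the fine-tuned retrieval model rather than from fixed vectors.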
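| Dataset | Year | Text | Domain | #Videos | Avg video len | Total video len | Avg text len | Resolution |
|---|---|---|---|---|---|---|---|---|
| HowTo100M [52] | 2019 | ASR | Open | 136M | 3.6s | 134.5K h | 4.0 words | 240p |
| ACAV [32] | 2021 | ASR | Open | 100M | 10.0s | 277.7K h | - | - |
| YT-Temporal-180M [87] | 2021 | ASR | Open | 180M | - | - | - | - |
| HD-VILA-100M [80] | 2022 | ASR | Open | 103M | 13.4s | 371.5K h | 32.5 words | 720p |
| MSVD [13] | 2011 | Manual caption | Open | 1970 | 9.7s | 5.3 h | 8.7 words | - |
| LSMDC [58] | 2015 | Manual caption | Movie | 118K | 4.8s | 158 h | 7.0 words | 1080p |
| MSR-VTT [79] | 2016 | Manual caption | Open | 10K | 15.0s | 40 h | 9.3 words | 240p |
| DiDeMo [3] | 2017 | Manual caption | Flickr | 27K | 6.9s | 87 h | 8.0 words | - |
| ActivityNet [11] | 2017 | Manual caption | Action | 100K | 36.0s | 849 h | 13.5 words | - |
| YouCook2 [93] | 2018 | Manual caption | Cooking | 14K | 19.6s | 176 h | 8.8 words | - |
| VATEX [73] | 2019 | Manual caption | Open | 41K | ~10s | ~115 h | 15.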