CVPR 2025
Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content
Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, Fei Yang, Pengfei Wan, Di Zhang
Abstract
[Figure 1 content, text recovered from the image]
Example 1
Panda-70M: "A woman is cracking eggs into a bowl of spinach in the kitchen."
Koala-36M: "A woman is standing in a modern kitchen, engaging in a conversation or explaining something while gesturing with her hands. She is wearing a black top and has long, braided hair. The kitchen is well-lit with warm lighting, and there are various kitchen items on the counter, including a vase with red flowers, a bowl of eggs, and a green bowl. The woman appears to be in a cheerful mood, smiling……"
Koala-36M: "a person preparing a dish in a kitchen setting. The person is seen cracking eggs into a bowl filled with spinach leaves. The scene is focused on the hands and the bowl, with various kitchen items like a bottle of oil, a container of salt, and a teapot visible in the background. The person's hands are the main focus, showing the careful and deliberate action of cracking the eggs and adding them to the ……"
Example 2
Panda-70M: "A shirtless man flexing his muscles in front of a crowd."
Koala-36M: "A muscular individual with a tattooed torso and arms is standing in front of a microphone, holding a piece of paper. The person is wearing a white tank top and appears to be in a celebratory or victorious mood, as indicated by their raised fists and the expression of triumph on their face. The background suggests a sports event, specifically a boxing match, as indicated by the presence of a microphone……"
Koala-36M: "Two men are engaged in a handshake, with one of them flexing his muscles. The man on the left has a heavily tattooed arm, with visible ink on his forearm and bicep. He is wearing a black sleeveless shirt and has a yellow wristband. The man on the right has a muscular build, with a tattoo on his right arm and a cap on his head. He is shirtless, revealing his well-defined muscles ……"
VTSS values shown for the example clips: 3.72, 4.13, 3.69, 3.56.
Clips left unfiltered in Panda-70M but filtered out of Koala-36M: a freeze-frame video and an overexposed video. VTSS values shown: 1.05, 1.69.

Figure 1. Comparison between Koala-36M and Panda-70M. We propose a large-scale, high-quality dataset that significantly enhances the consistency between multiple conditions and video content. Koala-36M features more accurate temporal splitting, more detailed captions, and improved video filtering based on the proposed Video Training Suitability Score (VTSS).