ACL 2025
HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models
Xiao Wang, Jingyun Hua, Weihong Lin, Yuanxing Zhang, Fuzheng Zhang, Jianlong Wu, Di Zhang, Liqiang Nie
Cited by 1
Abstract
Recent Multi-modal Large Language Models (MLLMs) have made great progress in video understanding. However, their performance on videos involving human actions is still limited by the lack of high-quality data. To address this, we introduce a two-stage data annotation pipeline. First, we design strategies to accumulate videos featuring clear human actions from the Internet. Second, videos are annotated in a standardized caption format that uses human attributes to distinguish individuals and chronologically details their actions and interactions. Through this pipeline, we curate two datasets, namely HAICTrain and HAICBench. HAICTrain comprises 126K video-caption pairs generated by Gemini-Pro and verified for training purposes. Meanwhile, HAICBench includes 412 manually annotated video-caption pairs and 2,000 QA pairs, for a comprehensive evaluation of human action understanding. Experimental results demonstrate that training with HAICTrain not only significantly enhances human understanding abilities across 4 benchmarks, but can also improve text-to-video generation results. Both the HAICTrain and HAICBench are released at https://huggingface.co/datasets/KuaishouHAIC/HAIC.

* Equal contribution. † Work done as an intern at Kuaishou Technology. ‡ Corresponding author.

MSR-VTT: Two men are boxing on a stage.

LLaVA-Video: The video opens with two fighters in a fighting stance inside an octagonal cage. The fighter on the left wears black shorts with green and white accents, while the fighter on the right sports blue shorts with yellow and white details. Both fighters are equipped with gloves and appear to be in the midst of a match. The sequence shows the fighters exchanging punches and grappling near the center of the cage. The audience is visible in the background, watching the fight intently. The fighters move around the cage, with one attempting to land strikes while the other defends and counters. The video concludes with the fighters still engaged in combat, with the referee closely monitoring the action.

HAIC: The main characters are two people. The first subject is a middle-aged man, wearing purple boxing shorts with black and green patterns and white letters, and black and red boxing gloves. Initially, he faces his opponent in an attacking stance, then bends down and rushes towards Subject 2, throwing a punch. After Subject 2 dodges, he chases Subject 2 and throws five right hooks. The second subject is a middle-aged man, wearing black boxing shorts with green patterns and white letters, and black and blue boxing gloves. Initially, he faces Subject 1, then he bends down to dodge Subject 1's attack, stumbles, and is chased and punched by Subject 1. Finally, he straightens up, puts his hands over his head, breaks away from the attack, and the two confront each other.
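
The standardized caption format described in the abstract (human attributes to tell individuals apart, followed by each person's actions in chronological order) can be pictured as a small data structure. The following is a minimal sketch only: the `Subject` class, the `render_caption` function, and the exact sentence templates are hypothetical names and choices made for illustration, modeled on the HAIC example caption above, and are not the authors' annotation schema or released code.

```python
from dataclasses import dataclass, field

# Hypothetical word lists used only by this illustrative sketch.
NUMBER_WORDS = ["one", "two", "three", "four", "five"]
ORDINALS = ["first", "second", "third", "fourth", "fifth"]


@dataclass
class Subject:
    """One person in a video (illustrative structure, not the authors' schema)."""
    attributes: str  # appearance used to distinguish this person from others
    actions: list[str] = field(default_factory=list)  # sentences in chronological order


def render_caption(subjects: list[Subject]) -> str:
    """Join subjects into one caption: attributes first, then actions in order."""
    n = len(subjects)
    if n == 0 or n > len(NUMBER_WORDS):
        raise ValueError("this sketch supports between 1 and 5 subjects")
    if n == 1:
        parts = ["The main character is one person."]
    else:
        parts = [f"The main characters are {NUMBER_WORDS[n - 1]} people."]
    for index, subject in enumerate(subjects):
        parts.append(f"The {ORDINALS[index]} subject is {subject.attributes}.")
        # Actions are kept in the order given, which is assumed to be chronological.
        parts.extend(subject.actions)
    return " ".join(parts)


example = render_caption([
    Subject(
        attributes="a middle-aged man, wearing purple boxing shorts",
        actions=[
            "Initially, he faces his opponent in an attacking stance.",
            "Then he rushes towards Subject 2, throwing a punch.",
        ],
    ),
    Subject(
        attributes="a middle-aged man, wearing black boxing shorts",
        actions=["Initially, he faces Subject 1, then bends down to dodge."],
    ),
])
print(example)
```

Keeping attributes separate from the ordered action list mirrors the two roles the abstract assigns to the caption: the attributes identify who is acting, and the list order carries the timeline.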