ICLR 2025

TeaserGen: Generating Teasers for Long Documentaries

Weihan Xu, Paul Pu Liang, Haven Kim, Julian J. McAuley, Taylor Berg-Kirkpatrick, Hao-Wen Dong

Abstract

Teasers are an effective tool for promoting content in entertainment, commercial, and educational fields. However, creating an effective teaser for a long video is challenging because it requires long-range multimodal modeling of the input video while maintaining audiovisual alignment, managing scene transitions, and preserving factual accuracy in the output teaser. Progress in this research direction has been hindered by the lack of a publicly available dataset. In this work, we present DocumentaryNet, a collection of 1,269 documentaries paired with their teasers, featuring multimodal data streams of video, speech, music, sound effects, and narrations. With DocumentaryNet, we propose a new two-stage system for generating teasers from long documentaries. The proposed TeaserGen system first generates the teaser narration from the documentary's transcribed narration using a pretrained large language model, and then selects the most relevant visual content to accompany the generated narration through language-vision models. For narration-video matching, we explore two approaches: a pretraining-based model using pretrained contrastive language-vision models and a deep sequential model that learns the mapping between the narrations and visuals. Our experimental results show that the pretraining-based approach is more effective at identifying relevant visual content than directly trained deep autoregressive models.
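As a rough illustration of the pretraining-based narration-video matching idea, the sketch below picks, for each narration sentence, the most similar video clip in a shared embedding space. It is a minimal sketch under stated assumptions, not the TeaserGen implementation: the embeddings are assumed to be precomputed by some contrastive language-vision model, the toy vectors are made up, and the function names (`cosine`, `match_clips`) and the greedy no-reuse rule are hypothetical choices made for this example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def match_clips(
    sentence_embs: Sequence[Sequence[float]],
    clip_embs: Sequence[Sequence[float]],
) -> List[int]:
    """Greedily assign each narration sentence its most similar unused clip.

    Assumption: forbidding clip reuse is an illustrative heuristic only;
    it is not stated in the abstract.
    """
    used = set()
    chosen: List[int] = []
    for sentence in sentence_embs:
        best_idx, best_score = -1, -math.inf
        for idx, clip in enumerate(clip_embs):
            if idx in used:
                continue
            score = cosine(sentence, clip)
            if score > best_score:
                best_idx, best_score = idx, score
        if best_idx < 0:
            raise ValueError("fewer clips than narration sentences")
        used.add(best_idx)
        chosen.append(best_idx)
    return chosen


# Toy embeddings standing in for text and video features (hypothetical values).
sentences = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
clips = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
selected = match_clips(sentences, clips)
print(selected)
```

In practice the text and clip vectors would come from the text and image encoders of a pretrained contrastive model, and the selected clips would be concatenated in narration order to accompany the generated speech.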