NeurIPS 2023
StoryBench: A Multifaceted Benchmark for Continuous Story Visualization
Emanuele Bugliarello, H. Hernan Moraldo, Ruben Villegas, Mohammad Babaeizadeh, Mohammad Taghi Saffar, Han Zhang, Dumitru Erhan, Vittorio Ferrari, Pieter-Jan Kindermans, Paul Voigtlaender
Abstract
benefit of training text-to-video models to continue videos (rather than generating them from scratch) on story-like data, but they also suggest discrepancies between human and automated evaluation. We invite the community to report results on STORYBENCH at https://paperswithcode.com/dataset/storybench. We provide data, annotation instructions, human evaluation guidelines, and code for automatic evaluation at https://github.com/google/storybench.
Related Work
Text-guided generative models for vision. We are currently witnessing tremendous progress in the task of text-to-image generation and editing [8] [9] [10] [11] [22] [23] [24] [25] [26] [27], fueled by sequence-to-sequence Transformer models [28] and diffusion models [29] trained on massive amounts of image-text data [15]. Research on text-to-video generation has also received attention recently. GODIVA [30] autoregressively generates videos from text using a three-dimensional sparse attention mechanism. NÜWA [31] presents a unified framework for multi-task learning of various generation tasks, including text-to-video. NÜWA-Infinity [32] is a generative model that can synthesize arbitrarily-sized images or long-duration videos with an autoregressive over autoregressive generation mechanism. In a similar spirit, NÜWA-XL [33] proposes a diffusion over diffusion approach that allows long videos to be generated in parallel through a coarse-to-fine process. CogVideo [34] adds temporal attention modules on top of a frozen text-to-image model to reduce the computational requirements for text-to-video learning. Make-A-Video [35] also starts from a text-to-image model but fine-tunes it while adding pseudo-3D convolution and temporal attention layers. Concurrently, Video Latent Diffusion Models [36] turn pretrained image diffusion models into video generators by fine-tuning them with temporal alignment layers, and Imagen Video [37] generates high-definition videos using a cascade of video diffusion models. Ho et al. [38] train space-time factorized U-Net models [39] on images and videos, and propose a sampling method to improve longer video generations. Our baselines are based on Phenaki [21], which can generate arbitrarily long videos from a sequence of text prompts.
Story visualization. In this paper, we propose STORYBENCH, a benchmark for the task of generating a video from a sequence of text prompts (i.e., a story), which we refer to as continuous story visualization. In the literature, story visualization [40] is the task of generating a sequence of images to narrate a multi-sentence story (one image per sentence) with global visual consistency across dynamic scenes and entities. The authors of [40] created two artificial datasets from CLEVR [41] and Pororo [42], and proposed a model based on sequential conditional GANs. To improve story visualization, Maharana et al. propose a dual learning framework and a copy mechanism in [43], and leverage grammatical and visual structure as well as commonsense information in [44]. In [45], the authors introduce the DiDeMoSV dataset, and propose to 'retro-fit' a pretrained text-to-image model with task-specific modules to improve on the task of story continuation, resulting in StoryDALL-E. Finally, Rahman et al. [46] extend the synthetic MUGEN dataset [47] to multi-sentence storylines, and propose an autoregressive diffusion-based framework with a visual memory module to capture the entities and background context across the generated frames. Unlike previous work, in STORYBENCH we focus on generating continuous videos (rather than key-frames) on natural (rather than cartoon or synthetic) data. Moreover, we also use DiDeMo to visualize stories, but rather than using the existing temporal queries and automatically matching them to key-frames [45], we ask human annotators to thoroughly describe the story of the videos while manually annotating timestamps for each sentence.
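To make the annotation scheme described above concrete, the following is a minimal sketch of how a story with per-sentence timestamps might be represented. The class names, field names, and example values are hypothetical and do not reflect the actual STORYBENCH data format; the ordering check additionally assumes that consecutive segments do not overlap.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One sentence of a story and the video span it describes (hypothetical schema)."""
    sentence: str
    start_s: float  # segment start time, in seconds
    end_s: float    # segment end time, in seconds


@dataclass
class Story:
    """A video paired with an ordered sequence of timestamped sentences."""
    video_id: str
    segments: List[Segment]

    def prompts(self) -> List[str]:
        # The sequence of text prompts a model would be conditioned on.
        return [seg.sentence for seg in self.segments]

    def is_well_ordered(self) -> bool:
        # Assumption: each segment has positive duration and does not
        # start before the previous segment ends.
        prev_end = 0.0
        for seg in self.segments:
            if seg.start_s < prev_end or seg.end_s <= seg.start_s:
                return False
            prev_end = seg.end_s
        return True


# Illustrative, made-up example (not taken from any dataset).
story = Story(
    video_id="example_video",
    segments=[
        Segment("A dog runs across a field.", 0.0, 2.5),
        Segment("The dog picks up a ball.", 2.5, 4.0),
    ],
)
```

In contrast to key-frame story visualization, where each sentence maps to a single image, each sentence in this sketch maps to a time span of a continuous video.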
StoryBench
Aiming for a comprehensive resource to assess the ability of generative models to visualize stories, we propose STORYBENCH, the first real-world benchmark for text-to-video story generation. Unlike previous work, which frames story visualization as the task of generating a single key-frame per text prompt, STORYBENCH evaluates the ability of generative models to synthesize continuous, natural videos from a sequence of text prompts. To do so, we collect rich annotations that give insight into the nuances of a model's capabilities and make failure modes easy to discover. STORYBENCH consists of three different datasets, three tasks of increasing difficulty, and three evaluation setups.
Generating videos is a very complex task for state-of-the-art models. Some of the key challenges involve generating videos that (i) have a coherent storyline, (ii) are visually realistic, and (iii) can be controlled according to user intent. STORYBENCH aims at benchmarking these three challenges by (i) defining three tasks with increased difficulty in storyline; (ii) foc