EMNLP 2025

RACCooN: Versatile Instructional Video Editing with Auto-Generated Narratives

Jaehong Yoon, Shoubin Yu, Mohit Bansal

Abstract

Recent video generative models primarily rely on detailed, labor-intensive text prompts for tasks such as inpainting or style editing, which limits their adaptability to personal or raw videos. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video editing method that supports diverse video editing capabilities, such as removal, addition, and modification, through a unified pipeline. RACCooN consists of two main stages: Video-to-Paragraph (V2P), which automatically generates structured descriptions of scene and object details, and Paragraph-to-Video (P2V), where users can refine these descriptions to guide a video diffusion model for flexible content edits, including removing, changing, or adding objects. Key contributions of RACCooN include: (1) a multi-granular spatiotemporal pooling strategy for structured video understanding, which captures both global context and fine-grained object details to enable precise text-based video editing without complex human annotations; (2) a video generative model fine-tuned on a curated video-paragraph-mask dataset for improved editing and inpainting; and (3) the ability to generate new objects by forecasting their motion via auto-generated mask planning. As a result, users can easily edit complex videos with RACCooN's automatic explanations and guidance. We demonstrate its versatile capabilities in video-to-paragraph generation (up to 9.4%p ↑ improvement in human evaluations) and video content editing (relative 49.7% ↓ in FVD).