NeurIPS 2021

MarioNette: Self-Supervised Sprite Learning

Dmitriy Smirnov, Michaël Gharbi, Matthew Fisher, Vitor Guizilini, Alexei A. Efros, Justin M. Solomon

Abstract

Artists and video game designers often construct 2D animations using libraries of sprites: textured patches of objects and characters. We propose a deep learning approach that decomposes sprite-based video animations into a disentangled representation of recurring graphic elements in a self-supervised manner. By jointly learning a dictionary of possibly transparent patches and training a network that places them onto a canvas, we deconstruct sprite-based content into a sparse, consistent, and explicit representation that can be easily used in downstream tasks, like editing or analysis. Our framework offers a promising approach for discovering recurring visual patterns in image collections without supervision.

Introduction

Since the early days of machine learning, the accepted unit of image synthesis has been the pixel. But while the pixel grid is a natural representation for display hardware and convolutional generators, it does not easily permit high-level reasoning and editing. In this paper, we take inspiration from animation to consider an atomic unit that is richer and easier to edit than the pixel: the sprite.

In sprite-based animation, a popular early technique for drawing cartoons and rendering video games, an artist draws a collection of patches (a sprite sheet) consisting of texture swatches, characters in various poses, static objects, and so on. Then, each frame is assembled by compositing a subset of the patches onto a canvas. Because the sprite sheet is reused, authoring new content requires minimal effort and can even be automated procedurally.

Our goal is to invert this process, simultaneously tackling unsupervised instance segmentation and dictionary learning. Given an image dataset, e.g., frames from a sprite-based video game, we train a model that jointly learns a 2D sprite dictionary, capturing recurring visual elements in an image collection, and explains each input frame as a combination of these potentially transparent sprites.
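To make the forward process concrete, the following is a minimal sketch of how a frame can be assembled from a sprite dictionary by alpha compositing. It is not the model described in this paper, which learns the dictionary and the placements end to end; the names `composite` and `render_frame`, the tuple-based pixel format, and the `(index, top, left)` placement format are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

RGB = Tuple[float, float, float]
RGBA = Tuple[float, float, float, float]
Canvas = List[List[RGB]]
Sprite = Sequence[Sequence[RGBA]]


def composite(canvas: Canvas, sprite: Sprite, top: int, left: int) -> None:
    """Alpha-blend `sprite` onto `canvas` in place with the "over" operator."""
    height, width = len(canvas), len(canvas[0])
    for dy, row in enumerate(sprite):
        for dx, (r, g, b, a) in enumerate(row):
            y, x = top + dy, left + dx
            if not (0 <= y < height and 0 <= x < width):
                continue  # clip sprite pixels that fall outside the canvas
            cr, cg, cb = canvas[y][x]
            canvas[y][x] = (
                a * r + (1 - a) * cr,
                a * g + (1 - a) * cg,
                a * b + (1 - a) * cb,
            )


def render_frame(
    background: Sequence[Sequence[RGB]],
    dictionary: Sequence[Sprite],
    placements: Sequence[Tuple[int, int, int]],
) -> Canvas:
    """Composite dictionary sprites onto a copy of the background.

    Each placement is a hypothetical (sprite_index, top, left) triple;
    later placements are drawn over earlier ones.
    """
    canvas = [list(row) for row in background]
    for index, top, left in placements:
        composite(canvas, dictionary[index], top, left)
    return canvas


# Hypothetical two-sprite dictionary: an opaque red diagonal and a
# half-transparent blue square.
red = (1.0, 0.0, 0.0, 1.0)
clear = (0.0, 0.0, 0.0, 0.0)
half_blue = (0.0, 0.0, 1.0, 0.5)
dictionary = [
    [[red, clear], [clear, red]],
    [[half_blue, half_blue], [half_blue, half_blue]],
]

# White background, 3 rows by 4 columns.
background = [[(1.0, 1.0, 1.0)] * 4 for _ in range(3)]

# The same sprite (index 0) is reused at two positions.
frame = render_frame(background, dictionary, [(0, 0, 0), (1, 1, 2), (0, 2, 3)])
```

Here sprite 0 appears twice in the frame, which is the kind of reuse a learned dictionary is meant to capture. Fully transparent sprite pixels leave the background untouched, and the half-transparent sprite blends with whatever lies beneath it.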
Whereas standard CNN-based generators hide their feature representation in their intermediate layers, our model wears its representation "on its sleeve": by explicitly compositing sprites from its learned dictionary onto a background canvas, rather than synthesizing pixels from hidden neural features, it provides a readily interpretable visual representation.

Our contributions include the following:

• We describe a grid-based anchor system along with a learned dictionary of textured patches (with transparency) to extract a sprite-based image representation.
• We propose a method to learn the patch dictionary and the grid-based representation jointly, in a differentiable, end-to-end fashion.
• We compare to past work on learned disentangled graphics representations for video games.
• We show how our method offers promising avenues for further work towards identifying visual patterns in more complex data such as natural images and video.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).