ACL 2021

Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

Wei-Ning Hsu, David Harwath, Tyler Miller, Christopher Song, James R. Glass

Abstract

In this paper, we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision. Instead, we connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised visual grounding task. We conduct experiments on the Flickr8k spoken caption dataset in addition to a novel corpus of spoken audio captions collected for the popular MSCOCO dataset, demonstrating that our generated captions also capture diverse visual semantics of the images they describe. We investigate several different intermediate speech representations, and empirically find that the representation must satisfy several important properties in order to work well.

[Figure: example spoken captions for an image ("a person in a blue jacket is on a snowboard on a snow covered slope"; "a snowboarder is snowboarding on the side of the mountain"), with panels labeled "Same unit sequence, different speakers" and "Different unit sequences, same speaker".]

* Equal contribution

† The author performed the work while at MIT, and is now at Facebook AI Research.

Self-Supervised Learning for Speech and Audio Processing Workshop @ NeurIPS 2020.