NeurIPS 2022

Masked Autoencoders that Listen

Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer

442 citations

Abstract

This paper studies a simple extension of image-based Masked Autoencoders (MAE) [1] to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training. Our code and models are available at https://github.com/facebookresearch/AudioMAE.

Introduction

Transformers [2] and self-supervised learning [3, 4, 5, 6, 7, 1] are dominating computer vision (CV) and natural language processing (NLP) research. The revolution first started in NLP with the invention of the Transformer architecture and self-attention [8]. Masked autoencoding with BERT [3] set a new state of the art on various NLP tasks through self-supervised pre-training on large-scale language corpora. Similarly, in the CV community, Vision Transformers (ViT) [9] have become popular for CV tasks, and, for self-supervised image representation learning, Masked Autoencoders (MAE) [1] have brought the CV community closer to the success of BERT in NLP. In addition to the existing masked autoencoders that can read (BERT) or see (MAE), in this work we study those that can listen.

Transformer-based models have recently refreshed leaderboards for audio understanding tasks.
For example, AST [10] and MBT [11] improved audio classification performance on AudioSet [12], Environmental Sound Classification [13], etc. The key technique behind this is the initialization of audio model weights with ImageNet pre-trained supervised models (e.g., DeiT [14]) by deflating patch embeddings and interpolating positional embeddings for encoding audio spectrograms. However, exploiting ImageNet pre-trained models could be sub-optimal. Unlike initializing video models with weights from image models (e.g., the initial weights of I3D [15] or 3D-ResNets [16] are inflated from ImageNet pre-trained image models), there are clear and notable discrepancies between spectrograms representing audio content and natural images. It remains unclear why such heterogeneous image-to-audio transfer is useful beyond arguably similar low-level semantics such as shapes of spectrograms and shapes of visual objects. Further, any label bias would inevitably be transferred to audio models.

Addressing these concerns, self-supervised audio representation learning has recently attracted much research attention. Building on BEiT [17], which learns to reconstruct image patches or learnt patch tokens, SS-AST [18] extends this approach to the audio domain, exploiting spectrograms (akin to 1-channel 2D images) and using both contrastive and reconstruction objectives as self-supervision.

Without labels, the key enabler of effective self-supervised representation learning is large-scale pre-training data. In this work we use AudioSet [12] for pre-training, a common dataset containing ∼2 million audio recordings. Performing large-scale training with Transformer architectures is challenging, as self-attention in Transformers has quadratic complexity w.r.t. the length of the input sequence.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
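To make the quadratic-cost argument concrete, the following minimal sketch shows how feeding only the non-masked tokens to the encoder shrinks the number of query-key pairs that self-attention must score. It is an illustration under stated assumptions, not the authors' implementation: the patch grid (64 x 8), the masking ratio of 0.8, and the helper names `random_mask` and `attention_pairs` are all hypothetical choices made for this example.

```python
import random


def random_mask(num_tokens, mask_ratio, seed=0):
    """Randomly split token indices into visible (kept) and masked sets."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)
    # Keep at least one token so the encoder input is never empty.
    num_keep = max(1, round(num_tokens * (1.0 - mask_ratio)))
    keep = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return keep, masked


def attention_pairs(seq_len):
    """Query-key pairs scored by full self-attention: quadratic in length."""
    return seq_len * seq_len


# Assumed spectrogram patch grid: 64 time patches x 8 frequency patches.
num_tokens = 64 * 8
# Assumed masking ratio; the text only says "high".
keep, masked = random_mask(num_tokens, mask_ratio=0.8)

full = attention_pairs(num_tokens)    # cost if every patch were encoded
visible = attention_pairs(len(keep))  # cost when only visible patches are encoded
print(f"visible tokens: {len(keep)} of {num_tokens}")
print(f"relative attention cost: {visible / full:.3f}")
```

Because the cost grows with the square of the sequence length, keeping roughly one fifth of the tokens reduces the pairwise attention work to roughly one twenty-fifth under these assumed settings.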