NeurIPS 2023

Pengi: An Audio Language Model for Audio Tasks

Soham Deshmukh, Benjamin Elizalde, Rita Singh, Huaming Wang

Cited 278 times

Abstract

In the domain of audio processing, Transfer Learning has facilitated the rise of Self-Supervised Learning and Zero-Shot Learning techniques. These approaches have led to the development of versatile models capable of tackling a wide array of tasks while delivering state-of-the-art performance. However, current models inherently lack the capacity to produce the requisite language for open-ended tasks, such as Audio Captioning or Audio Question Answering. We introduce Pengi, a novel Audio Language Model that leverages Transfer Learning by framing all audio tasks as text-generation tasks. It takes an audio recording and text as input and generates free-form text as output. The input audio is represented as a sequence of continuous embeddings by an audio encoder. A text encoder does the same for the corresponding text input. Both sequences are combined as a prefix to prompt a pre-trained frozen language model. The unified architecture of Pengi enables open-ended tasks and close-ended tasks without any additional fine-tuning or task-specific extensions. When evaluated on 21 downstream tasks, our approach yields state-of-the-art performance in several of them. Our results show that connecting language models with audio models is a major step towards general-purpose audio understanding.

Figure 1: Examples of audio and text prompt inputs and their corresponding textual responses. Images are for illustration purposes only. Our proposed model Pengi enables close-ended tasks, such as classification or retrieval, and open-ended tasks, such as captioning or question answering.

Audio Encoder. The audio encoder a_ϕ transforms the raw audio input into an audio embedding. We used the audio transformer backbone from CLAP [17] as our audio encoder due to its success in diverse audio and multimodal tasks. Models in Computer Vision [44, 2, 41] use a frozen image encoder like CLIP, but CLAP is trained on an order of magnitude smaller collection of audio-text pairs. Therefore, we unfroze its weights for our training procedure.
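
The prefix construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_encoder`, `build_prefix`, `EMBED_DIM`, and `TRAINABLE` are hypothetical names, the encoders are replaced by random vectors, and the audio-then-text ordering of the prefix is an assumption, since the text above only says that both sequences are combined. The sketch shows only how two sequences of continuous embeddings become one prefix, and which of the components named above are updated during training.

```python
import random

EMBED_DIM = 8  # hypothetical width of the language model's input embeddings


def toy_encoder(num_vectors, seed, dim=EMBED_DIM):
    """Stand-in for an encoder: returns `num_vectors` continuous embeddings.

    A real audio or text encoder would compute these from its input; random
    values are used here only so the shapes can be inspected.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_vectors)]


def build_prefix(audio_embeddings, text_embeddings):
    """Join the two embedding sequences along the sequence axis.

    Assumption: audio embeddings come first, then text embeddings.
    """
    widths = {len(vec) for vec in audio_embeddings + text_embeddings}
    if len(widths) != 1:
        raise ValueError("all prefix vectors must share one embedding width")
    return audio_embeddings + text_embeddings


# Training status taken from the text: the language model stays frozen,
# while the CLAP-based audio encoder is unfrozen.
TRAINABLE = {"audio_encoder": True, "language_model": False}

audio_prefix = toy_encoder(num_vectors=4, seed=0)  # 4 illustrative audio embeddings
text_prefix = toy_encoder(num_vectors=3, seed=1)   # 3 illustrative text embeddings
prefix = build_prefix(audio_prefix, text_prefix)
print(len(prefix), len(prefix[0]))  # prefix length and embedding width
```

In a real system the prefix would be passed to the frozen language model as its input embeddings, and the model would generate the free-form text response token by token.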