ACL2024

Codec-SUPERB: An In-Depth Analysis of Sound Codec Models

Haibin Wu, Ho-Lam Chung, Yi-Cheng Lin, Yuan-Kuei Wu, Xuanjun Chen, Yu-Chi Pai, Hsiu-Hsuan Wang, Kai-Wei Chang, Alexander H. Liu, Hung-yi Lee

Abstract

The sound codec's dual roles in minimizing data transmission latency and serving as a tokenizer underscore its critical importance. Recent years have witnessed significant developments in codec models. The ideal sound codec should preserve content, paralinguistics, speaker, and audio information. However, the question of which codec achieves optimal sound information preservation remains unanswered, because different papers evaluate models under their own selected experimental settings. This study introduces Codec-SUPERB, an acronym for Codec Sound processing Universal PERformance Benchmark. It is an ecosystem designed to assess codec models across representative sound applications and signal-level metrics rooted in sound domain knowledge. Codec-SUPERB simplifies result sharing through an online leaderboard, promoting collaboration within a community-driven benchmark database, thereby stimulating new development cycles for codecs. Furthermore, we undertake an in-depth analysis to offer insights into codec models from both application and signal perspectives, diverging from previous codec papers, which mainly concentrate on signal-level comparisons. Finally, we will release code, the leaderboard, and data to accelerate progress within the community.

1 Introduction

Neural sound codec models were initially introduced to compress sound for efficient data transmission. The encoder of the codec model encodes the sound into codec codes, which are then transmitted. The codec decoder then resynthesizes the sound from the received codes.

Neural codec codes can also be used as tokens in sound language modeling (LM). LM has proven highly successful in Natural Language Processing (NLP). Sound data contains semantic content and rich information about speaker, emotion, and general audio, offering deeper possibilities for language model applications. Researchers have recently explored the potential of neural codecs (Défossez