ACL 2023

A Theory of Unsupervised Speech Recognition

Liming Wang, Mark Hasegawa-Johnson, Chang Dong Yoo

Cited by 4

Abstract

Unsupervised speech recognition (ASR-U) is the problem of learning automatic speech recognition (ASR) systems from unpaired speech-only and text-only corpora. While various algorithms exist to solve this problem, a theoretical framework for studying their properties and addressing issues such as sensitivity to hyperparameters and training instability is missing. In this paper, we propose a general theoretical framework for studying the properties of ASR-U systems, based on random matrix theory and the theory of neural tangent kernels. This framework allows us to prove various learnability conditions and sample complexity bounds for ASR-U. Extensive ASR-U experiments on synthetic languages with three classes of transition graphs provide strong empirical evidence for our theory (code available at https://github.com/cactuswiththoughts/UnsupASRTheory.git).