EMNLP 2024

Towards Online Continuous Sign Language Recognition and Translation

Ronglai Zuo, Fangyun Wei, Brian Mak

14 citations

Abstract

Research on continuous sign language recognition (CSLR) is essential to bridge the communication gap between deaf and hearing individuals. Numerous previous studies have trained their models using the connectionist temporal classification (CTC) loss. During inference, these CTC-based models generally require the entire sign video as input to make predictions, a process known as offline recognition, which suffers from high latency and substantial memory usage. In this work, we take the first step towards online CSLR. Our approach consists of three phases: 1) developing a sign dictionary; 2) training an isolated sign language recognition model on the dictionary; and 3) employing a sliding window approach on the input sign sequence, feeding each sign clip to the optimized model for online recognition. Additionally, our online recognition model can be extended to support online translation by integrating a gloss-to-text network and can enhance the performance of any offline model. With these extensions, our online approach achieves new state-of-the-art performance on three popular benchmarks across various task settings. Code and models are available at https://github.com/FangyunWei/SLRT.

Figure (a): Training and inference of previous offline recognition models that are trained using the CTC loss. These models require access to the entire sign video before they can make predictions.

Figure (b): Training and inference in our online approach. We utilize a pre-trained CSLR model to segment all continuous sign videos into isolated sign clips. This process creates a dictionary for each CSLR dataset, which supports the subsequent training of an ISLR model. During inference, we apply a sliding window to the input sign stream and perform on-the-fly predictions. The function of post-processing is to eliminate duplicates and background predictions.
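
The sliding-window inference and post-processing described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `classify` callable stands in for a trained ISLR model, and the `BACKGROUND` label, the window size, and the stride are hypothetical choices made only for illustration. The post-processing here is the simplest reading of "eliminate duplicates and background predictions" (collapse consecutive repeats, drop background), and the released code may do something more elaborate.

```python
from collections import Counter, deque
from typing import Callable, Iterable, Iterator, List, Optional

BACKGROUND = "<bg>"  # hypothetical label for clips that contain no sign


def recognize_online(
    frame_stream: Iterable[object],
    classify: Callable[[List[object]], str],  # stand-in for an ISLR model
    window: int = 16,  # assumed clip length in frames
    stride: int = 8,   # assumed step between consecutive clips
) -> Iterator[str]:
    """Yield glosses as frames arrive, keeping only `window` frames in memory."""
    buffer: deque = deque(maxlen=window)  # oldest frames fall out automatically
    previous: Optional[str] = None        # label of the most recent clip
    frames_since_last_clip = 0

    for frame in frame_stream:
        buffer.append(frame)
        frames_since_last_clip += 1
        # Classify once the first full window is available, then every `stride` frames.
        if len(buffer) == window and frames_since_last_clip >= stride:
            frames_since_last_clip = 0
            label = classify(list(buffer))
            # Post-processing: skip consecutive duplicates and background clips.
            if label != previous and label != BACKGROUND:
                yield label
            previous = label


def majority_label(clip: List[str]) -> str:
    """Toy classifier: each 'frame' is already its gloss; return the most common one."""
    return Counter(clip).most_common(1)[0][0]


# Toy stream: background, three signs, background.
frames = (
    [BACKGROUND] * 4 + ["Wind"] * 6 + ["Strong"] * 6 + ["Blow"] * 6 + [BACKGROUND] * 4
)
glosses = list(recognize_online(frames, majority_label, window=4, stride=2))
print(glosses)  # ['Wind', 'Strong', 'Blow']
```

Because the generator emits a gloss as soon as a clip is classified, it never needs the full video, in contrast to the offline CTC setting in panel (a). In this sketch a background clip resets the duplicate check, so the same sign performed twice with a pause in between is reported twice.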