ACL 2024

CLASP: Cross-modal Alignment Using Pre-trained Unimodal Models

Jianing Zhou, Ziheng Zeng, Hongyu Gong, Suma Bhat

Abstract

Recent progress in joint speech-text pre-training has significantly advanced natural language processing. However, a key limitation of these approaches is their reliance on parallel speech-text data, which is often difficult to obtain. To address this, we introduce a framework for joint speech and text processing that requires no parallel corpora during pre-training and uses them only for the downstream tasks. Using pre-trained unimodal models, we extract separate representations for speech and text and align them in a newly defined space with a multi-level contrastive learning mechanism. A swap reconstruction mechanism further strengthens the alignment, after which a multi-head fusion mechanism merges the modality-invariant and modality-specific representations. Experiments on emotion recognition (a Spoken Language Understanding task) and idiom usage detection (a Natural Language Understanding task) show strong performance and robustness to noise in the text or speech data.
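To make the idea of contrastive alignment concrete, the following is a minimal sketch of a symmetric, InfoNCE-style contrastive loss over a batch of paired speech and text embeddings. It is an illustration under stated assumptions, not the loss used in CLASP: the abstract does not specify the exact objective, the sketch covers a single level rather than the multi-level mechanism, and the function names, the temperature value, and the use of cosine similarity are all hypothetical choices made here.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(speech_emb, text_emb, temperature=0.1):
    """Hypothetical single-level contrastive loss (assumption, not the paper's).

    Row i of speech_emb and row i of text_emb are treated as a positive pair;
    every other row in the batch serves as a negative.
    """
    s = l2_normalize(speech_emb)
    t = l2_normalize(text_emb)
    logits = s @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(s))
    # Speech-to-text: each speech row should pick out its own text column.
    loss_s2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-speech: each text column should pick out its own speech row.
    loss_t2s = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_s2t + loss_t2s)


# Toy data standing in for the outputs of pre-trained unimodal encoders.
rng = np.random.default_rng(0)
speech = rng.normal(size=(8, 16))
text_aligned = speech + 0.01 * rng.normal(size=(8, 16))
text_shuffled = np.roll(text_aligned, shift=1, axis=0)  # break the pairing

loss_aligned = symmetric_contrastive_loss(speech, text_aligned)
loss_shuffled = symmetric_contrastive_loss(speech, text_shuffled)
```

In this toy setting the loss is lower when the pairing is intact than when it is shuffled, which is the behavior a contrastive alignment objective relies on: minimizing it pulls matched speech and text representations together and pushes mismatched ones apart.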