ACL2024

Enhancing EEG-to-Text Decoding through Transferable Representations from Pre-trained Contrastive EEG-Text Masked Autoencoder

Jiaqi Wang, Zhenxi Song, Zhengyu Ma, Xipeng Qiu, Min Zhang, Zhiguo Zhang

Abstract

Reconstructing natural language from noninvasive electroencephalography (EEG) holds great promise as a language decoding technology for brain-computer interfaces (BCIs). However, EEG-based language decoding is still in its nascent stages and faces several technical issues: 1) the absence of a hybrid strategy that can effectively integrate cross-modality (between EEG and text) self-learning with intra-modality self-reconstruction of EEG features or textual sequences; 2) the under-utilization of large language models (LLMs) to enhance EEG-based language decoding. To address the above issues, we propose the Contrastive EEG-Text Masked Autoencoder (CET-MAE), a novel model that orchestrates compound self-supervised learning across and within EEG and text through a dedicated multi-stream encoder. Furthermore, we develop a framework called E2T-PTR (EEG-to-Text decoding using Pretrained Transferable Representations), which leverages pre-trained modules alongside the EEG stream from CET-MAE and further enables an LLM (specifically BART) to decode text from EEG sequences. Comprehensive experiments conducted on the popular text-evoked EEG database, ZuCo, demonstrate the superiority of E2T-PTR, which outperforms the state-of-the-art in ROUGE-1 F1 and BLEU-4 scores by 8.34% and 32.21%, respectively. These results indicate significant advancements in the field and underscore the proposed framework's potential to enable more powerful and widespread BCI applications.

[...] vron et al., 2023; Ouyang et al., 2022), translating complex spatio-temporal EEG signals into nuanced textual representations, which is known as EEG-to-Text, is achievable.
Compared to conventional paradigms of brain-computer interfaces (BCIs), such as motor imagery (MI) (Al-Saegh et al., 2021), steady-state visual evoked potential (SSVEP) (Wang et al., 2017), and P300 (Cecotti and Graser, 2011), EEG-to-Text can convey many more intended commands from the human brain to computers, and thus offers a wider range of applications. Its potential as a novel and powerful BCI paradigm marks a significant advancement in the field of BCIs.

Several existing EEG-to-Text studies (Li et al., 2022a; Chien et al., 2022) focused on developing specialized pre-trained models for EEG only, aiming to extract universal semantic representations from the human brain. However, a pre-trained model bridging EEG and text has been overlooked, although it may be important for enhancing the representation learning for inter-modality conver-

[...] can leverage CET-MAE's pre-trained EEG representations and the capabilities of LLMs (BART) for text generation.

• Conducting extensive EEG-to-Text experiments on three, four, and five reading tasks in ZuCo. Our experiments are more comprehensive than previous works, using more data and including more methods for comparison. Results show that our framework surpasses previous works and thus sets new SOTA standards.

2 Related Works

2.1 Self-supervised Representations Learning

Multimodal self-supervised representation learning aims to explore the interactions between different modalities to produce semantically generalizable representations for downstream tasks.

In recent years, there has been substantial progress across various modalities, such as vision-language pre-training (Zhao et al., 2023b; Lin et al., 2023).
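As a rough illustration of two objective families that multimodal self-supervised methods commonly combine, contrastive alignment of paired embeddings and masked reconstruction within a modality, a minimal sketch is given below. It is not the CET-MAE objective: the function names (`info_nce`, `masked_mse`, `combined_loss`), the temperature of 0.07, and the weighting factor `lam` are assumptions chosen for illustration only.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings a[i] <-> b[i]."""
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    n = len(a)
    # Cosine-similarity logits between every a[i] and b[j], scaled by temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def diagonal_cross_entropy(rows):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    # Average the a->b and b->a directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(columns))


def masked_mse(target, prediction, mask):
    """Mean squared error over masked positions only (mask[i] is True if hidden)."""
    errors = [
        sum((t - p) ** 2 for t, p in zip(tv, pv)) / len(tv)
        for tv, pv, hidden in zip(target, prediction, mask)
        if hidden
    ]
    return sum(errors) / len(errors) if errors else 0.0


def combined_loss(emb_a, emb_b, target, prediction, mask, lam=1.0):
    """Contrastive term plus a weighted masked-reconstruction term (hypothetical weighting)."""
    return info_nce(emb_a, emb_b) + lam * masked_mse(target, prediction, mask)
```

In this sketch the contrastive term depends only on whether matched pairs are more similar than mismatched ones, while the reconstruction term depends only on the hidden positions of a single modality, which is why the two are often treated as complementary.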
A range of existing methods rely on contrastive learning, which can effectively draw the global representations of matched pairs closer together in latent space under semantic-level self-supervised constraints. However, contrastive learning sometimes tends to overlook the self-information of individual modalities, particularly at more granular levels. On the other hand, multimodal masked signal modeling integrates cross-modality self-learning with intra-modality self-reconstruction, focusing on reconstructing one modality from another. This approach may help the model learn the associations between modalities. However, it may lead to an excessive emphasis on fine-grained details, potentially weakening the overall cross-modality correlation and causing issues such as insensitivity to whether the inputs are matched pairs. A series of recent works, such as CMAE (Huang et al., 2023), CAV-MAE (Gong et al., 2023) and SimVTP (Ma et al., 2022), have already successfully integrated both contrastive learning and masked signal modeling so that their complementary advantages can be utilized.

Our work draws inspiration from the above SSL methods but adopts a novel strategy. In the proposed CET-MAE, the utilization of both text and EEG streams not only achieves an explicit contrastive learning objective to capture global coordination but also avoids erroneous learning pro-