NeurIPS 2023

Brain encoding models based on multimodal transformers can transfer across language and vision

Jerry Tang, Meng Du, Vy A. Vo, Vasudev Lal, Alexander Huth

Cited by 61

Abstract

Encoding models have been used to assess how the human brain represents concepts in language and vision. While language and vision rely on similar concept representations, current encoding models are typically trained and tested on brain responses to each modality in isolation. Recent advances in multimodal pretraining have produced transformers that can extract aligned representations of concepts in language and vision. In this work, we used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies. We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality, particularly in cortical regions that represent conceptual meaning. Further analysis of these encoding models revealed shared semantic dimensions that underlie concept representations in language and vision. Comparing encoding models trained using representations from multimodal and unimodal transformers, we found that multimodal transformers learn more aligned representations of concepts in language and vision. Our results demonstrate how multimodal transformers can provide insights into the brain's capacity for multimodal processing.

Encoding models predict brain responses from quantitative features of the stimuli that elicited them [1]. In recent years, fitting encoding models to data from functional magnetic resonance imaging (fMRI) experiments has become a powerful approach for understanding information processing in the brain. While encoding models are usually trained and tested on brain responses to a single stimulus modality, such as language [2-8] or vision [9-14], the human brain is remarkable in its ability to integrate information across multiple modalities.
There is growing evidence that this capacity for multimodal processing is supported by aligned cortical representations of the same concepts in different modalities. For instance, hearing the sentence "a dog chases a cat" and seeing a dog chasing a cat may elicit similar patterns of brain activity [15-20].

In this work, we investigated the alignment between language and visual representations in the brain by training encoding models on fMRI responses to one modality and testing them on fMRI responses to the other modality. Encoding models that successfully transfer across modalities can provide insights into how the two modalities are related [19]. Although previous work has compared language and vision encoding models, human annotations were required to map language and visual stimuli into a shared semantic space [19]. To our knowledge, cross-modality transfer has yet to be demonstrated using encoding models trained on stimulus-computable features that capture the rich connections between language and vision.

Preprint. Under review.
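The train-on-one-modality, test-on-the-other procedure described above can be illustrated with a minimal sketch on synthetic data. It is not the authors' pipeline. It assumes a plain ridge regression as the encoding model, random arrays standing in for aligned story and movie features, and a single simulated linear mapping shared by both modalities. The helper names `fit_ridge` and `voxelwise_correlation` are hypothetical, and the sketch leaves out everything specific to real fMRI data, such as hemodynamic delays and regularization selection.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)


def voxelwise_correlation(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    return (yt * yp).sum(axis=0) / (
        np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    )


rng = np.random.default_rng(0)
n_story, n_movie, n_features, n_voxels = 400, 300, 20, 50

# Synthetic stand-ins for stimulus features that live in one aligned space
# (assumption: in practice these would come from a multimodal model).
X_story = rng.standard_normal((n_story, n_features))
X_movie = rng.standard_normal((n_movie, n_features))

# Assumption for the simulation: one linear mapping from features to voxel
# responses is shared by both modalities, plus independent noise.
W_true = rng.standard_normal((n_features, n_voxels))
Y_story = X_story @ W_true + rng.standard_normal((n_story, n_voxels))
Y_movie = X_movie @ W_true + rng.standard_normal((n_movie, n_voxels))

# Train on story responses only, then predict the held-out movie responses.
W_hat = fit_ridge(X_story, Y_story, alpha=10.0)
transfer_scores = voxelwise_correlation(Y_movie, X_movie @ W_hat)

print(f"mean story-to-movie correlation: {transfer_scores.mean():.2f}")
```

Because the simulation builds in a shared mapping, transfer succeeds by construction. With real data, the per-voxel transfer score is the quantity of interest, since it indicates where in cortex a model fit on one modality generalizes to the other.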