ACL 2023

On the Difference of BERT-style and CLIP-style Text Encoders

Zhihong Chen, Guiming Chen, Shizhe Diao, Xiang Wan, Benyou Wang

Cited by 12

Abstract

Masked language modeling (MLM) has been one of the most popular pretraining recipes in natural language processing, with BERT as one of its representative models. Recently, contrastive language-image pretraining (CLIP) has also attracted attention, especially for its vision models, which achieve excellent performance on a broad range of vision tasks. However, few studies have examined the text encoders learned by CLIP. In this paper, we analyze the difference between BERT-style and CLIP-style text encoders through three experiments: (i) general text understanding, (ii) vision-centric text understanding, and (iii) text-to-image generation. Experimental analyses show that although CLIP-style text encoders underperform BERT-style ones on general text understanding tasks, they are equipped with a unique ability for cross-modal association, i.e., synesthesia, which is more similar to the senses of humans. Our code is released at https://github.com/zhjohnchan/probing-clip-dev.

* Equal contribution. † Corresponding authors.