CVPR 2022

Unified Contrastive Learning in Image-Text-Label Space

Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan, Jianfeng Gao

Cited by 182

Abstract

Visual recognition has recently been learned via either supervised learning on human-annotated image-label data or language-image contrastive learning on web-crawled image-text pairs. While supervised learning may result in a more discriminative representation, language-image pretraining shows unprecedented zero-shot recognition capability, largely due to the different properties of data sources and learning objectives. In this work, we introduce a new formulation by combining the two data sources into a common image-text-label space. In this space, we propose a new learning paradigm, called Unified Contrastive Learning (UniCL), with a single learning objective to seamlessly prompt the synergy of two data types. Extensive experiments show that our UniCL is an effective way of learning semantically rich yet discriminative representations, universally for image recognition in zero-shot, linear-probing, full fine-tuning and transfer learning scenarios. Particularly, it attains gains of up to 9.2% and 14.5% on average on zero-shot recognition benchmarks over the language-image contrastive learning and supervised learning methods, respectively. In the linear probe setting, it also boosts the performance over the two methods by 7.3% and 3.4%, respectively. Our study also indicates that UniCL stand-alone is a good learner on pure image-label data, rivaling the supervised learning methods across three image classification datasets and two types of vision backbones, ResNet and Swin Transformer. Code is available at: https://github.com/microsoft/UniCL.
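
The abstract does not spell out the learning objective, so the following is only a minimal sketch of one way a single contrastive objective over (image, text, label) triplets could look. It assumes that triplets sharing a label count as mutual positives in both the image-to-text and text-to-image directions, that each web image-text pair receives its own unique label, and that each image-label sample is paired with a text built from its class name. The function names, the fixed `temperature` value, and the averaging over the batch are illustrative assumptions, not the authors' implementation (which is in the linked repository).

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def unified_contrastive_loss(image_feats, text_feats, labels, temperature=0.07):
    """Label-aware bidirectional contrastive loss (illustrative sketch).

    image_feats[i], text_feats[i], labels[i] form the i-th triplet.
    Entries with equal labels are treated as positives for each other.
    The temperature value and batch averaging are assumptions.
    """
    n = len(labels)
    u = [l2_normalize(f) for f in image_feats]
    v = [l2_normalize(f) for f in text_feats]

    # Cosine similarity between every image i and every text j, scaled.
    logits = [
        [sum(a * b for a, b in zip(u[i], v[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Transposed logits give the text-to-image direction.
    logits_t = [list(col) for col in zip(*logits)]

    def one_direction(matrix):
        total = 0.0
        for i in range(n):
            log_probs = log_softmax(matrix[i])
            positives = [j for j in range(n) if labels[j] == labels[i]]
            # Average the log-likelihood over all positives of anchor i.
            total -= sum(log_probs[j] for j in positives) / len(positives)
        return total / n

    return one_direction(logits) + one_direction(logits_t)


# Two image-label samples of class 0 plus one web image-text pair (label 1).
image_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
text_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
loss = unified_contrastive_loss(image_feats, text_feats, labels=[0, 0, 1])
print(f"loss = {loss:.4f}")
```

Under these assumptions, a batch whose labels are all unique reduces to the usual symmetric image-text contrastive loss, while a batch drawn only from image-label data behaves like a supervised contrastive loss against class-name texts.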