ACL 2023

KG-FLIP: Knowledge-guided Fashion-domain Language-Image Pre-training for E-commerce

Qinjin Jia, Yang Liu, Daoping Wu, Shaoyuan Xu, Huidong Liu, Jinmiao Fu, Roland Vollgraf, Bryan Wang


Abstract

Various Vision-Language Pre-training (VLP) models (e.g., CLIP, BLIP) have emerged and dramatically improved results on public general-domain benchmarks (e.g., COCO, Flickr30k). Such models typically learn cross-modal alignment from large-scale, well-aligned image-text datasets. Adapting these models to downstream applications in specific domains, such as fashion, requires fine-grained in-domain image-text datasets. However, such datasets are usually smaller in scale and less semantically aligned, which calls for more efficient pre-training strategies. In this paper, we propose a knowledge-guided fashion-domain language-image pre-training (KG-FLIP) framework that focuses on learning fine-grained representations in the e-commerce domain and uses external knowledge (i.e., a product attribute schema) to improve pre-training efficiency. Experimental results demonstrate that KG-FLIP outperforms previous state-of-the-art VLP models on Amazon data and the Fashion-Gen dataset by large margins. KG-FLIP has been successfully deployed in the Amazon catalog system to backfill missing attributes and improve the customer shopping experience.