ACL 2021

VisualSparta: An Embarrassingly Simple Approach to Large-scale Text-to-Image Search with Weighted Bag-of-words

Xiaopeng Lu, Tiancheng Zhao, Kyusong Lee

Abstract

Text-to-image retrieval is an essential task in cross-modal information retrieval: given a textual query, retrieve relevant images from a large, unlabelled dataset. In this paper, we propose VisualSparta (Visual-text Sparse Transformer Matching), a novel model that shows significant improvement in both accuracy and efficiency. VisualSparta outperforms previous state-of-the-art scalable methods on MSCOCO and Flickr30K. We also show that it achieves substantial retrieval speed advantages: for a 1 million image index, VisualSparta on CPU achieves a ∼391X speedup over CPU vector search and a ∼5.4X speedup over vector search with GPU acceleration. Experiments show that this speed advantage grows for larger datasets, because VisualSparta can be efficiently implemented as an inverted index. To the best of our knowledge, VisualSparta is the first transformer-based text-to-image retrieval model that can achieve real-time search over large-scale datasets, with significant accuracy improvement compared to previous state-of-the-art methods.
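To make the inverted-index idea concrete, the following is a minimal sketch of weighted bag-of-words retrieval in general, not the VisualSparta implementation. It assumes each image has already been mapped to a sparse set of term weights; in the paper those weights would come from the model, whereas the values in `toy_weights` are invented for illustration. The function names `build_inverted_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(image_term_weights):
    """Map each term to a postings list of (image_id, weight) pairs."""
    index = defaultdict(list)
    for image_id, term_weights in image_term_weights.items():
        for term, weight in term_weights.items():
            index[term].append((image_id, weight))
    return index


def search(index, query_terms, top_k=3):
    """Score images by summing the stored weights of the query terms."""
    scores = defaultdict(float)
    for term in query_terms:
        # Only postings for terms in the query are visited, so the cost
        # depends on postings-list length, not on the total index size.
        for image_id, weight in index.get(term, ()):
            scores[image_id] += weight
    # Highest score first; ties broken by image id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy term weights, invented for illustration only (assumption).
toy_weights = {
    "img_dog": {"dog": 2.0, "grass": 1.0},
    "img_cat": {"cat": 2.5, "sofa": 0.5},
    "img_park": {"dog": 0.5, "grass": 1.5, "tree": 1.0},
}

index = build_inverted_index(toy_weights)
results = search(index, ["dog", "grass"])
```

Because a query touches only the postings lists of its own terms, images that share no term with the query are never scored, which is one plausible reason such an index can scale better than exhaustive vector comparison.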