CVPR 2023

EXIF as Language: Learning Cross-Modal Associations between Images and Camera Metadata

Chenhao Zheng, Ayush Shrivastava, Andrew Owens

Abstract

Figure 1. (a) We learn a joint embedding between image patches and the EXIF metadata that cameras automatically insert into image files. Our model treats this metadata as a language-like modality: we convert the EXIF tags to text, concatenate them, and then process the result with a transformer. (b) We apply our representation to tasks that require understanding camera properties. For example, we can detect image splicing "zero shot" (and without metadata at test time) by finding inconsistent embeddings within an image. We show a manipulated image that contains content from two source photos. Since these photos were captured with different cameras, the two regions have dissimilar embeddings (visualized by PCA). We localize the splice by clustering the image's patch embeddings.
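The two steps named in the caption can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`exif_to_text`, `two_cluster_patches`), the "Tag: value" formatting, and the space separator are assumptions made for illustration. The clustering step is a simple two-cluster spherical k-means standing in for whatever clustering method the full paper uses, and the patch embeddings are hand-written toy vectors rather than outputs of a trained image encoder.

```python
import math


def exif_to_text(exif):
    """Render each EXIF tag as 'Tag: value' and concatenate into one string.

    The exact text format is an assumption; the caption only says the tags
    are converted to text and concatenated.
    """
    return " ".join(f"{tag}: {value}" for tag, value in exif.items())


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def two_cluster_patches(embeddings, iters=10):
    """Assign each patch embedding to one of two clusters by cosine similarity.

    Stand-in for the splice-localization clustering: patches from a second
    camera should land in a different cluster from the rest of the image.
    """
    points = [normalize(e) for e in embeddings]
    # Deterministic init: the first patch, and the patch least similar to it.
    centers = [points[0], min(points, key=lambda p: cosine(p, points[0]))]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            0 if cosine(p, centers[0]) >= cosine(p, centers[1]) else 1
            for p in points
        ]
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centers[k] = normalize(mean)
    return labels


# Toy example: hypothetical metadata and hypothetical 3-D patch embeddings.
caption = exif_to_text({"Make": "Canon", "ISO": 100})
patches = [
    [1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [1.0, 0.0, 0.0], [0.95, 0.05, 0.0],
    [0.1, 1.0, 0.0], [0.0, 0.9, 0.1],  # patches from a different "camera"
]
labels = two_cluster_patches(patches)
print(caption)
print(labels)
```

In this toy setting the last two patches receive a different label from the first four, which mirrors the idea that a spliced region shows up as a group of patches whose embeddings disagree with the rest of the image.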