CVPR 2023

Coreset Sampling from Open-Set for Fine-Grained Self-Supervised Learning

Sungnyun Kim, Sangmin Bae, Se-Young Yun

Abstract

Despite the increased interest in applying deep learning to specific domains, developing algorithms for fine-grained datasets faces two challenges: annotation requires expert knowledge, and a versatile model is needed for the subordinate tasks within a specific domain. The recent self-supervised learning approach can be leveraged to pretrain a model on the fine-grained dataset, serving as an effective initialization for any downstream task. Here, we introduce a novel Open-set Self-Supervised Learning problem, which assumes that a large-scale unlabeled open-set is available during the pretraining phase. In this problem setup, it is crucial to consider the distribution mismatch between the pretraining and target datasets. Hence, we propose SimCore, an algorithm that samples a coreset: a subset of the open-set that has the minimum distance to the target dataset in a latent space. Through extensive experiments with eight fine-grained datasets and two open-sets, we demonstrate that SimCore significantly improves representation learning.
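To make the coreset idea concrete, the following is a minimal sketch, not the SimCore procedure itself. It assumes that latent features for both datasets have already been extracted by some encoder, that distance is measured by cosine similarity, and that the coreset size is given as a fixed budget; none of these details is stated in the abstract. The function name `select_coreset` and its parameters are hypothetical.

```python
import numpy as np


def select_coreset(open_feats, target_feats, budget):
    """Pick the `budget` open-set samples nearest to the target set.

    Simplified nearest-neighbor illustration (an assumption, not the
    authors' algorithm): each open-set sample is scored by its highest
    cosine similarity to any target sample, and the top scores are kept.
    """
    open_feats = np.asarray(open_feats, dtype=float)
    target_feats = np.asarray(target_feats, dtype=float)

    # L2-normalize rows so that a dot product equals cosine similarity.
    open_n = open_feats / np.linalg.norm(open_feats, axis=1, keepdims=True)
    target_n = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)

    # Similarity matrix of shape (num_open, num_target).
    sim = open_n @ target_n.T

    # Score each open-set sample by its closest target sample.
    score = sim.max(axis=1)

    # Indices of the `budget` highest-scoring open-set samples.
    order = np.argsort(-score, kind="stable")
    return order[:budget]


if __name__ == "__main__":
    # Toy 2-D "latent" features, chosen by hand for illustration only.
    target = [[1.0, 0.0]]
    open_set = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.9, 0.2]]
    print(select_coreset(open_set, target, budget=2))
```

In this toy example the two open-set points that lie almost along the target direction are selected, while the orthogonal and opposite points are left out. The selected indices would then be merged with the target dataset for self-supervised pretraining.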