CVPR 2024

On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm

Peng Sun, Bei Shi, Daiwei Yu, Tao Lin

17 citations

Abstract

Contemporary machine learning, which involves training large neural networks on massive datasets, faces significant computational challenges. Dataset distillation, a recently emerging strategy, aims to compress real-world datasets for efficient training. However, this line of research currently struggles with large-scale and high-resolution datasets, hindering its practicality and feasibility. Thus, we re-examine existing methods and identify three properties essential for real-world applications: realism, diversity, and efficiency. As a remedy, we propose RDED, a novel computationally efficient yet effective data distillation paradigm, to enable both diversity and realism of the distilled data. Extensive empirical results over various model architectures and datasets demonstrate the advancement of RDED: we can distill a dataset to 10 images per class from full ImageNet-1K [6] within 7 minutes, achieving a notable 42% accuracy with ResNet-18 [14] on a single RTX-4090 GPU (whereas the SOTA achieves only 21% and requires 6 hours). Code: https://github.com/LINs-lab/RDED.