NeurIPS2023

Synthcity: a benchmark framework for diverse use cases of tabular synthetic data

Zhaozhi Qian, Robert Davis, Mihaela van der Schaar

53 citations

Abstract

Accessible high-quality data is the bread and butter of machine learning research, and the demand for data has exploded as larger and more advanced ML models are built across different domains. Yet, real data often contain sensitive information, subject to various biases, and are costly to acquire, which compromise their quality and accessibility. Synthetic data have thus emerged as a complement, sometimes even a replacement, to real data for ML training. However, the landscape of synthetic data research has been fragmented due to the large number of data modalities (e.g., tabular data, time series data, images, etc.) and various use cases (e.g., privacy, fairness, data augmentation, etc.). This poses practical challenges in comparing and selecting synthetic data generators in different problem settings. To this end, we develop Synthcity, an open-source Python library that allows researchers and practitioners to perform one-click benchmarking of synthetic data generators across data modalities and use cases. In addition, Synthcity's plug-in style API makes it easy to incorporate additional data generators into the framework. Beyond benchmarking, it also offers a single access point to a diverse range of cutting-edge data generators. Through examples on tabular data generation and data augmentation, we illustrate the general applicability of Synthcity, and the insight one can obtain.