NeurIPS2021

Private and Non-private Uniformity Testing for Ranking Data

Róbert Busa-Fekete, Dimitris Fotakis, Emmanouil Zampetakis

被引用 4 次

摘要

We study the problem of uniformity testing for statistical data that consists of rankings over m items, where the alternative class is restricted to Mallows models. Testing ranking data is challenging because of the size of the large domain that is factorial in m, therefore the tester needs to take advantage of some structure of the alternative class. We show that uniform distribution can be distinguished from Mallows model with O(m 1/2 ) samples based on simple pairwise statistics, which allows us to test uniformity using only two samples, if m is large enough. We also consider uniformity testing with central and local differential privacy (DP) constraints. We present a central DP algorithm that requires O(max1/✏ 0 , 1/ p m), where ✏ 0 is the privacy budget parameter. Interestingly, our uniformity testing algorithm is straightforward to apply to the local DP scenario, since it works with binary statistics that is extracted from the ranking data. We carry out large-scale experiments, including m = 10, 000, to show that our uniformity testing algorithms scale gracefully with m. 35th Conference on Neural Information Processing Systems (NeurIPS 2021). Related Work Testing uniformity is one of the most fundamental problem in computer science. Goldreich and Ron [22] considered first uniformity testing problem as a property testing, however with L2 distance. Paninski [29] came up with a coincidence-based approach that used total variation distance with a sample complexity O( p d/✏ 2 ), where d is the domain size, and it was shown to be optimal by with a restriction that ✏ 2 ⌦(d 1/4 ). The test statistic used by this optimal test is based on number of bins into which just one sample has fallen. In principle, this test can be applied to ranking data, since the test statistic is easy to compute. Nevertheless, the lower bound of this test, which is ⌦( p d/✏ 2 ), suggests that it is not the proper choice of method even for ranking data with small m, since d = m!.