NeurIPS2020

Worst-Case Analysis for Randomly Collected Data

Justin Y. Chen, Gregory Valiant, Paul Valiant

4 citations

Abstract

We introduce a framework for statistical estimation that leverages knowledge of how samples are collected but makes no distributional assumptions on the data values. Specifically, we consider a population of elements 1, . . . , n with corresponding values x 1 , . . . , x n . We observe the values for a sample set A ⊂ 1, . . . , n and wish to estimate some statistic of the values for a target set B ⊂ 1, . . . , n where B could be the entire set. Crucially, we assume that the sets A and B are drawn according to some known joint distribution (A, B) ∼ P over pairs of subsets of 1, . . . , n. A given estimation algorithm is evaluated based on its worstcase, expected error where the expectation is with respect to the distribution P from which the sample A and target set B are drawn, and the worst-case is with respect to the data values x 1 , . . . , x n . Within this general framework we give an efficient algorithm to find an estimator for the target mean, as a weighted combination of the input sample-where the weights are a function of the distribution P and the identities of the elements in the sample and target sets A, B. We show that the worst-case expected error achieved by this estimator is at most a multiplicative π/2 factor worse than the optimum for such estimators. A component of this algorithm can also be used to approximate the worst-case expected error of a given estimator. The algorithm and proof leverage a surprising connection to the Grothendieck problem. We extend these results to the setting of linear regression, where each datapoint is not a scalar but a labeled vector (x i , y i ) ∈ R d+1 . Our framework, which makes no distributional assumptions on the data values but rather relies on knowledge of the data collection process via the distribution P , is a significant departure from the typical statistical estimation framework and introduces a uniform algorithmic analysis for the many natural settings where membership in a sample may be correlated with data values, such as when probabilities of sampling vary as in "importance sampling", when individuals are recruited into a sample via a social network as in "snowball sampling" or "respondent-driven sampling" [12, 14] or when samples have chronological structure as in "selective prediction" [10, 21] . We experimentally demonstrate the benefit of this framework and our algorithm in comparison to standard estimators, for several such settings.