NeurIPS2023

Optimality in Mean Estimation: Beyond Worst-Case, Beyond Sub-Gaussian, and Beyond 1+α Moments

Trung Dang, Jasper C. H. Lee, Maoyuan Raymond Song, Paul Valiant

被引用 6 次

摘要

There is growing interest in improving our algorithmic understanding of fundamental statistical problems such as mean estimation, driven by the goal of understanding the fundamental limits of what we can extract from limited and valuable data. The state of the art results for mean estimation in R are 1) the optimal sub-Gaussian mean estimator by [Lee and Valiant, 2022] , attaining the optimal sub-Gaussian error constant for all distributions with finite but unknown variance, and 2) the analysis of the median-of-means algorithm by [Bubeck, Cesa-Bianchi and Lugosi, 2013] and a matching lower bound by [Devroye, Lerasle, Lugosi, and Oliveira, 2016] , characterizing the big-O optimal errors for distributions that have tails heavy enough that only a 1 + α moment exists for some α ∈ (0, 1). Both of these results, however, are optimal only in the worst case. Motivated by the recent effort in the community to go "beyond the worst-case analysis" of algorithms, we initiate the fine-grained study of the mean estimation problem: Is it possible for algorithms to leverage beneficial features/quirks of their input distribution to beat the sub-Gaussian rate, without explicit knowledge of these features? We resolve this question, finding an unexpectedly nuanced answer: "Yes in limited regimes, but in general no". Given a distribution p, assuming only that it has a finite mean and absent any additional assumptions, we show how to construct a distribution q n,δ such that the means of p and q are well-separated, yet p and q are impossible to distinguish with n samples with probability 1δ, and q further preserves the finiteness of moments of p. Moreover, the variance of q is at most twice the variance of p if it exists. The main consequence of our result is that, no reasonable estimator can asymptotically achieve better than the sub-Gaussian error rate for any distribution, up to constant factors, which matches the worst-case result of [Lee and Valiant, 2022] . More generally, we introduce a new definitional framework to analyze the fine-grained optimality of algorithms, which we call "neighborhood optimality", interpolating between the unattainably strong "instance optimality" and the trivially weak admissibility/Pareto optimality definitions. As an application of the new framework, we show that the median-of-means algorithm is neighborhood optimal, up to constant factors. It is an open question to find a neighborhood-optimal estimator without constant factor slackness. 37th Conference on Neural Information Processing Systems (NeurIPS 2023). The classic median-of-means estimator [e.g., NY83; JVV86; AMS99], independently invented several times in the literature, was proposed to mitigate the sensitivity of the sample mean. Its accuracy on any distribution p with finite variance is-up to constant factors-as good as the accuracy of the sample mean on a Gaussian with the same mean and variance as p. In other words, the error is of sub-Gaussian rate. The past decade has seen renewed interest across computer science and statistics in understanding the limits of the mean estimation problem. In one dimension, the current state of the art is 1) for distributions p with finite variance σ 2 p , the recent mean estimator proposed by Lee and Valiant [LV22] has sub-Gaussian rate σ p • ( √ 2 + o(1)) log 1 δ /n, which is tight even in its constants, up to a 1 + o(1) factor, and 2) the work of Bubeck, Cesa-Bianchi and Lugosi [BCL13] showing upper bounds matching the lower bounds of Devroye, Lerasle, Lugosi and Oliveira [DLLO16], which show that for heavy-tailed distributions having only a 1 + α th moment for some α ∈ (0, 1), the median-of-means algorithm in fact still achieves the optimal accuracy up to a constant factor. Both of these results, however, are optimal only in the worst case, meaning that [LV22; DLLO16] have optimal performance with respect to the class of distributions with the same 2 nd or 1 + α th moments respectively; but estimators may do better than these bounds on particular input distributions.