STOC 2025

Testing Support Size More Efficiently Than Learning Histograms

Renato Ferreira Pinto Jr., Nathaniel Harms

Abstract

Consider two problems about an unknown probability distribution $p$: (1) How many samples from $p$ are required to test if $p$ is supported on $n$ elements or not? Specifically, given samples from $p$, determine whether it is supported on at most $n$ elements, or it is “$\varepsilon$-far” (in total variation distance) from being supported on $n$ elements. (2) Given $m$ samples from $p$, what is the largest lower bound on its support size that we can produce? The best known upper bound for problem (1) uses a general algorithm for learning the histogram of the distribution $p$, which requires $\Theta\left(\frac{n}{\varepsilon^2 \log n}\right)$ samples. We show that testing can be done more efficiently than learning the histogram, using only $O\left(\frac{n}{\varepsilon \log n} \log(1/\varepsilon)\right)$ samples, nearly matching the best known lower bound of $\Omega\left(\frac{n}{\varepsilon \log n}\right)$. This algorithm also provides a better solution to problem (2), producing larger lower bounds on support size than what follows from previous work. The proof relies on an analysis of Chebyshev polynomial approximations outside the range where they are designed to be good approximations.
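
To make the comparison between the bounds concrete, their ratios can be worked out directly. This is only a sketch: it ignores the constants hidden in the asymptotic notation, so it indicates how the bounds scale in $\varepsilon$ rather than giving exact sample counts.

```latex
\[
\frac{n / (\varepsilon^{2} \log n)}
     {\bigl(n / (\varepsilon \log n)\bigr) \log(1/\varepsilon)}
  = \frac{1}{\varepsilon \log(1/\varepsilon)},
\qquad
\frac{\bigl(n / (\varepsilon \log n)\bigr) \log(1/\varepsilon)}
     {n / (\varepsilon \log n)}
  = \log(1/\varepsilon).
\]
```

Up to constants, the saving over histogram learning therefore grows like $1/(\varepsilon \log(1/\varepsilon))$ as $\varepsilon \to 0$, and the remaining gap between the new upper bound and the best known lower bound is a factor of $\log(1/\varepsilon)$.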