NeurIPS2020

Truncated Linear Regression in High Dimensions

Constantinos Daskalakis, Dhruv Rohatgi, Emmanouil Zampetakis

被引用 18 次

摘要

As in standard linear regression, in truncated linear regression, we are given access to observations (A i , y i ) i whose dependent variable equals y i = A T i • x * + η i , where x * is some fixed unknown vector of interest and η i is independent noise; except we are only given an observation if its dependent variable y i lies in some "truncation set" S ⊂ R. The goal is to recover x * under some favorable conditions on the A i 's and the noise distribution. We prove that there exists a computationally and statistically efficient method for recovering k-sparse n-dimensional vectors x * from m truncated samples, which attains an optimal 2 reconstruction error of O( (k log n)/m). As a corollary, our guarantees imply a computationally efficient and information-theoretically optimal algorithm for compressed sensing with truncation, which may arise from measurement saturation effects. Our result follows from a statistical and computational analysis of the Stochastic Gradient Descent (SGD) algorithm for solving a natural adaptation of the LASSO optimization problem that accommodates truncation. This generalizes the works of both: (1) Daskalakis et al. [9], where no regularization is needed due to the low-dimensionality of the data, and (2) Wainright [26] , where the objective function is simple due to the absence of truncation. In order to deal with both truncation and high-dimensionality at the same time, we develop new techniques that not only generalize the existing ones but we believe are of independent interest. 1 regularization, i.e. what is called LASSO optimization in Statistics, in order to reward sparsity. Another common deviation from the standard model is the presence of truncation. Truncation occurs when the sample (A i , y i ) is not observed whenever y i falls outside of a subset S ⊆ R. Truncation arises quite often in practice as a result of saturation of measurement devices, bad data collection practices, incorrect 1