ICLR2026

SVD Provably Denoises Nearest Neighbor Data

Ravindran Kannan, Kijun Shin, David Woodruff

Abstract

We study the Nearest Neighbor Search (NNS) problem in a high-dimensional setting where data originates from a low-dimensional subspace and is corrupted by Gaussian noise. Specifically, we consider a semi-random model where nn points from an unknown kk-dimensional subspace of Rd\mathbb{R}^d (kdk \ll d) are perturbed by zero-mean dd-dimensional Gaussian noise with variance σ2\sigma^2 on each coordinate. Without loss of generality, we may assume the nearest neighbor is at distance 11 from the query, and that all other points are at distance at least 1+ε1+\varepsilon. We assume we are given only the noisy data and are required to find NN of the uncorrupted data. We prove the following results:

  1. For σO(1/k1/4)\sigma \in O(1/k^{1/4}), we show that simply performing SVD denoises the data; namely, we provably recover accurate NN of uncorrupted data (Theorem 1.1).
  2. For σ1/k1/4\sigma \gg 1/k^{1/4}, NN in uncorrupted data is not even identifiable from the noisy data in general. This is a matching lower bound on σ\sigma with the above result, demonstrating the necessity of this threshold for NNS (Lemma 3.1).
  3. For σ1/k\sigma \gg 1/\sqrt k, the noise magnitude (σd\sigma \sqrt{d}) is significantly exceeds the inter-point distances in the unperturbed data. Moreover, NN in noisy data is different from NN in the uncorrupted data in general. enumerate

Note that (1) and (3) together imply SVD identifies correct NN in uncorrupted data even in a regime where it is different from NN in noisy data. This was not the case in existing literature (see e.g. (Abdullah et al., 2014)). Another comparison with (Abdullah et al., 2014) is that it requires σ\sigma to be at least an inverse polynomial in the ambient dimension dd. The proof of (1) above uses upper bounds on perturbations of singular spaces of matrices as well as concentration and spherical symmetry of Gaussians. We thus give theoretical justification for the performance of spectral methods in practice. We also provide empirical results on real datasets to corroborate our findings.