KDD2023

OPORP: One Permutation + One Random Projection

Ping Li, Xiaoyun Li

1 citation

Abstract

OPORP is an improved variant of the count-sketch data structure by using a fixed-length binning scheme and a normalization step for the estimation. In our experience, we find engineers like the name "one permutation + one random projection" as it tells the exact steps. Consider two data vectors (e.g., embeddings): u, v ∈ R D . In many embedding-based applications where vectors are generated from trained models, D = 256 ∼ 1024 are common and D > 1024 is not rare (e.g., GPT models). D can be much larger in applications where the vectors are generated without training. With OPORP, we first apply a permutation on the data vectors. A random vector r ∈ R D is generated with moments: Note that s = 3 if r i follows the standard Gaussian distribution. We multiply r (element-wise) with all permuted data vectors. Then we break the D columns into k equal-length bins and aggregate (i.e., sum) the values in each bin to obtain k samples from each data vector. One key step is to normalize the k samples to the unit l 2 norm. In this way, for the two original data vectors u, v ∈ R D , we obtain two new vectors x, y ∈ R D with unit l 2 norms. The inner product of x, y approximates the original correlation ρ (i.e., the cosine) between u and v. Our main contribution is to show that the estimation variance has essentially the following expression: (s -1)A + D -k D -1