NeurIPS2022

Bezier Gaussian Processes for Tall and Wide Data

Martin Jørgensen, Michael A. Osborne

被引用 2 次

摘要

Modern approximations to Gaussian processes are suitable for "tall data", with a cost that scales well in the number of observations, but under-performs on "wide data", scaling poorly in the number of input features. That is, as the number of input features grows, good predictive performance requires the number of summarising variables, and their associated cost, to grow rapidly. We introduce a kernel that allows the number of summarising variables to grow exponentially with the number of input features, but requires only linear cost in both number of observations and input features. This scaling is achieved through our introduction of the Bézier buttress, which allows approximate inference without computing matrix inverses or determinants. We show that our kernel has close similarities to some of the most used kernels in Gaussian process regression, and empirically demonstrate the kernel's ability to scale to both tall and wide datasets. Gaussian processes (GPs) are a probabilistic approach to modelling functions that permit tractable Bayesian inference. They are, however, notorious for their poor scalability. In recent decades, this criticism has been challenged. Several approximate methods now allow GPs to scale to millions of data points. Yet, scalability in the number of data points is merely one challenge of big data. There are still problems associated with the input dimensionality -one aspect of the famed curse of dimensionality. Burt et al. [2020] analysed the most studied approximation, the so-called sparse inducing points methods, and showed it to be accurate for low dimensional inputs. Alarmingly, exponentially many inducing points are still needed in high-dimensional input spaces, that is, for problems with a large number of features. As such, despite modern GP approximations scaling to tall data, they are still discounted when concerning wide data. In response to this, there exist GP approximations built on simplices or grid-structures in the input space [Wilson and Nickisch, 2015 , Gardner et al., 2018 , Kapoor et al., 2021] . These take advantage of attractive fast linear algebra, but are often limited by memory in higher dimensions. Their advantage is the ability to fill the input space with structured points, so all observations have a close neighbour. We propose a new kernel for GP regression that requires neither matrix inversion nor determinant calculation -GPs' two core computational sinners. Additionally, we cover the input space 1 with exponentially many points, but introduce an approximation that grows only linearly in computational complexity. That is, our method scales linearly in both the number of data points and the number of input dimensions, whilst being space-filling in the input domain. GPs are indispensable to fields where uncertainty is a driver in decision-making mechanisms. Such fields include Bayesian optimisation, active learning and reinforcement learning. The critical decision mechanism is the exploration-exploitation trade-off. One ability useful in such fields is to assign high uncertainty to unexplored regions, just as does an exact GP. We show that our proposed model also assigns high uncertainty to unexplored regions, suggesting our model as well-suited to decisionmaking problems. 1 A limiting assumption of our kernel is its restriction to a box-bounded domain in the input space. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).