NeurIPS2021

Coresets for Decision Trees of Signals

Ibrahim Jubran, Ernesto Evgeniy Sanches Shayda, Ilan Newman, Dan Feldman

23 citations

Abstract

A kk-decision tree tt (or kk-tree) is a recursive partition of a matrix (2D-signal) into k1k\geq 1 block matrices (axis-parallel rectangles, leaves) where each rectangle is assigned a real label. Its regression or classification loss to a given matrix DD of NN entries (labels) is the sum of squared differences over every label in DD and its assigned label by tt. Given an error parameter ε(0,1)\varepsilon\in(0,1), a (k,ε)(k,\varepsilon)-coreset CC of DD is a small summarization that provably approximates this loss to every such tree, up to a multiplicative factor of 1±ε1\pm\varepsilon. In particular, the optimal kk-tree of CC is a (1+ε)(1+\varepsilon)-approximation to the optimal kk-tree of DD. We provide the first algorithm that outputs such a (k,ε)(k,\varepsilon)-coreset for every such matrix DD. The size C|C| of the coreset is polynomial in klog(N)/εk\log(N)/\varepsilon, and its construction takes O(Nk)O(Nk) time. This is by forging a link between decision trees from machine learning -- to partition trees in computational geometry. Experimental results on sklearn and lightGBM show that applying our coresets on real-world data-sets boosts the computation time of random forests and their parameter tuning by up to x1010, while keeping similar accuracy. Full open source code is provided.