NeurIPS2023
Online robust non-stationary estimation
Abishek Sankararaman, Balakrishnan Narayanaswamy
被引用 1 次
摘要
The real-time estimation of time-varying parameters from high-dimensional, heavytailed and corrupted data-streams is a common sub-routine in systems ranging from those for network monitoring and anomaly detection to those for traffic scheduling in data-centers. For estimation tasks that can be cast as minimizing a strongly convex loss function, we prove that an appropriately tuned version of the clipped Stochastic Gradient Descent (SGD) is simultaneously (i) adaptive to drift, (ii) robust to heavy-tailed inliers and arbitrary corruptions, (iii) requires no distributional knowledge and (iv) can be implemented in an online streaming fashion. All prior estimation algorithms have only been proven to posses a subset of these practical desiderata. A observation we make is that, neither the O 1 t learning rate for clipped SGD known to be optimal for strongly convex loss functions of a stationary data-stream, nor the O(1) learning rate known to be optimal for being adaptive to drift in a noiseless environment can be used. Instead, a learning rate of T -α for α < 1 where T is the stream-length is needed to balance adaptivity to potential drift and to combat noise. We develop a new inductive argument and combine it with a martingale concentration result to derive high-probability under any learning rate on data-streams exhibiting arbitrary distribution shift -a proof strategy that may be of independent interest. Further, using the classical doubling-trick, we relax the knowledge of the stream length T . Ours is the first online estimation algorithm that is provably robust to heavy-tails, corruptions and distribution shift simultaneously. We complement our theoretical results empirically on synthetic and real data. 1 We denote by [T ] := 1, • • • , T 2 Throughout, we denote by ∥ • ∥ as the L2 norm operator. 3 Convexity implies existence and uniqueness of θ * t ∈ Θ 4 In the literature, this is sometimes also denoted as an optimization algorithm. c.f.