ICML2023

Smooth Non-stationary Bandits

Su Jia, Qian Xie, Nathan Kallus, Peter I. Frazier

被引用 14 次

摘要

In many applications of online decision making, the environment is non-stationary and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time. However, in practice, environments often change smoothly, so such algorithms may incur higher-than-necessary regret. We study a non-stationary bandits problem where each arm's mean reward sequence can be embedded into a β\beta-Hölder function, i.e., a function that is (β1)(\beta-1)-times Lipschitz-continuously differentiable. The non-stationarity becomes more smooth as β\beta increases. When β=1\beta=1, this corresponds to the non-smooth regime, where established a minimax regret of Θ~(T2/3)\tilde \Theta(T^{2/3}). We show the first separation between the smooth (i.e., β2\beta\ge 2) and non-smooth (i.e., β=1\beta=1) regimes by presenting a policy with O~(k4/5T3/5)\tilde O(k^{4/5} T^{3/5}) regret on any kk-armed, 22-Hölder instance. We complement this result by showing that the minimax regret on the β\beta-Hölder family of instances is Ω(T(β+1)/(2β+1))\Omega(T^{(\beta+1)/(2\beta+1)}) for any integer β1\beta\ge 1. This matches our upper bound for β=2\beta=2 up to logarithmic factors. Furthermore, we validated the effectiveness of our policy through a comprehensive numerical study using real-world click-through rate data.