NeurIPS2023

Tracking Most Significant Shifts in Nonparametric Contextual Bandits

Joe Suk, Samory Kpotufe

9 citations

Abstract

We study nonparametric contextual bandits where Lipschitz mean reward functions may change over time. We first establish the minimax dynamic regret rate in this less understood setting in terms of number of changes L and total-variation V , both capturing all changes in distribution over context space, and argue that state-of-the-art procedures are suboptimal in this setting. Next, we tend to the question of an adaptivity for this setting, i.e. achieving the minimax rate without knowledge of L or V . Quite importantly, we posit that the bandit problem, viewed locally at a given context X t , should not be affected by reward changes in other parts of context space X . We therefore propose a notion of change, which we term experienced significant shifts, that better accounts for locality, and thus counts considerably less changes than L and V . Furthermore, similar to recent work on non-stationary MAB [Suk and Kpotufe, 2022] , experienced significant shifts only count the most significant changes in mean rewards, e.g., severe best-arm changes relevant to observed contexts. Our main result is to show that this more tolerant notion of change can in fact be adapted to. We then turn attention to whether such baselines may be achieved adaptively, i.e., without knowledge of L or V . The answer as we show is affirmative, and more importantly, some much weaker notions of change may be adapted to; for intuition, while L or V accounts for any change at any time over the context space (say X ), it may be that all changes are relegated to parts of the space irrelevant to observed contexts X t at the time they are played. For instance, suppose at time t, we observe X t = x 0 , then it may not make sense to count changes that happen at some other x 1 far from x 0 , or changes that happened at x 0 itself but far back in time. We therefore propose a new parameterization of change, termed experienced significant shifts that better accounts for the locality of changes in time and space, and as such may register much less changes than either L or V . As a sanity check, we show that an oracle policy which restarts only at experienced significant shifts can attain enhanced regret rates in terms of the number L = L(X 1 , . . . , X T ) of such experienced shifts (Proposition 2), a rate always no worse that the baseline we first established in terms of L and V . Our main result is to show that experienced significant shifts can be adapted to (Theorem 3), i.e., with no prior knowledge of such shifts. Importantly, the result holds in both stochastic environments, and in (oblivious) adversarial ones with no change to our notion, algorithmic approach, nor analysis. Furthermore, similar to recent advances in the non-contextual case [Abbasi-Yadkori et al., 2022, Suk and Kpotufe, 2022] , an experienced shift is only triggered under severe changes such as changes of best arms locally at a context X t . An added difficulty in the contextual case is that we cannot hope to observe rewards for a given arm (action) repeatedly at X t as the context may only appear once, and have to rely on carefuly chosen nearby points to identify unknown shifts in reward at X t . Other Related Work Nonparametric Contextual Bandits. The stationary bandits with covariates (where rewards and contexts follow a joint distribution) was first introduced in a one-armed bandit problem [Woodroofe, 1979 , Sarkar, 1991] , with the nonparametric model first studied by Yang et al. [2002]. Minimax regret rates, based on a margin condition, were first established for the two-armed bandit in Rigollet and Zeevi [2010] and generalized to any finite number of arms in Perchet and Rigollet [2013], with further insights thereafter [