NeurIPS2023

Context-lumpable stochastic bandits

Chung-Wei Lee, Qinghua Liu, Yasin Abbasi-Yadkori, Chi Jin, Tor Lattimore, Csaba Szepesvári

被引用 2 次

摘要

We consider a contextual bandit problem with S contexts and K actions. In each round t = 1, 2, . . . the learner observes a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action for the round. Under the assumption that the contexts can be lumped into r ≤ minS, K groups such that the mean reward for the various actions is the same for any two contexts that are in the same group, we give an algorithm that outputs an ε-optimal policy after using at most O(r(S + K)/ε 2 ) samples with high probability and provide a matching Ω(r(S + K)/ε 2 ) lower bound.In the regret minimization setting, we give an algorithm whose cumulative regret up to time T is bounded by O( r 3 (S + K)T ). To the best of our knowledge, we are the first to show the near-optimal sample complexity in the PAC setting and O( poly(r)(S + K)T ) minimax regret in the online setting for this problem. We also show our algorithms can be applied to more general low-rank bandits and get improved regret bounds in some scenarios. * most works were done when interning at DeepMind. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).