NeurIPS2020

Restless-UCB, an Efficient and Low-complexity Algorithm for Online Restless Bandits

Siwei Wang, Longbo Huang, John C. S. Lui

被引用 52 次

摘要

We study the online restless bandit problem, where the state of each arm evolves according to a Markov chain, and the reward of pulling an arm depends on both the pulled arm and the current state of the corresponding Markov chain. In this paper, we propose Restless-UCB, a learning policy that follows the explore-then-commit framework. In Restless-UCB, we present a novel method to construct offline instances, which only requires O(N ) time-complexity (N is the number of arms) and is exponentially better than the complexity of existing learning policy. We also prove that Restless-UCB achieves a regret upper bound of Õ((N + M 3 )T 2 3 ), where M is the Markov chain state space size and T is the time horizon. Compared to existing algorithms, our result eliminates the exponential factor (in M, N ) in the regret upper bound, due to a novel exploitation of the sparsity in transitions in general restless bandit problems. As a result, our analysis technique can also be adopted to tighten the regret bounds of existing algorithms. Finally, we conduct experiments based on real-world dataset, to compare the Restless-UCB policy with state-of-the-art benchmarks. Our results show that Restless-UCB outperforms existing algorithms in regret, and significantly reduces the running time. • We show that Restless-UCB can be combined with an efficient offline approximation oracle to guarantee O(N ) time-complexity and an Õ(T 3 ) approximation regret upper bound. Note that existing algorithms suffer from either an exponential complexity or no theoretical guarantee even with an efficient approximation oracle. • We conduct experiments based on real-world datasets, and compare our policy with existing benchmarks. Our results show that Restless-UCB outperforms existing algorithms in both regret and running time. Consider an online restless bandit problem R which has one player (decision maker) and N arms (actions) 1, • • • , N . Each arm i ∈ 1, • • • , N is associated with a Markov chain M i . All the Markov chains M i , i = 1, 2, ..., N have the same state space S = 1, 2, • • • , M , 1 but may have different transition matrices P i , i = 1, 2, ..., N and state-dependent rewards r(i, s), ∀ i, s that 1 This is not restrictive and is only used to simplify notations. Our analysis still works in the case where the state space Si of Markov chain Mi satisfies that |Si| ≤ M .