AAAI2023
Double Doubly Robust Thompson Sampling for Generalized Linear Contextual Bandits
Wonyoung Kim, Kyungbok Lee, Myunghee Cho Paik
被引用 19 次
摘要
We propose a novel algorithm for generalized linear contextual bandits (GLBs) with an Õ( κ -1 φ -1 T ) regret over T rounds where φ is the minimum eigenvalue of the covariance of contexts and κ is a lower bound of the variance of rewards. In several identified cases of φ -1 = O(d), where d is the dimension of contexts, our result is the first regret bound for generalized linear bandits (GLBs) achieving the order √ d without discarding the observed rewards. Previous approaches achieve the regret bound of order √ d by discarding the observed rewards, whereas our algorithm achieves the bound incorporating contexts from all arms in our double doubly-robust (DDR) estimator. The DDR estimator is a subclass of doubly-robust estimator but with a tighter error bound. We also provide an O(κ -1 φ -1 log(N T ) log T ) regret bound for N arms under a probabilistic margin condition. This is the first regret bound under the margin condition for linear models or GLMs when contexts are different for all arms but coefficients are common. We conduct empirical studies using synthetic data and real examples, demonstrating the effectiveness of our algorithm.