NeurIPS2023
Noise-Adaptive Thompson Sampling for Linear Contextual Bandits
Ruitu Xu, Yifei Min, Tianhao Wang
被引用 12 次
摘要
Linear contextual bandits represent a fundamental class of models with numerous real-world applications, and it is critical to developing algorithms that can effectively manage noise with unknown variance, ensuring provable guarantees for both worst-case constant-variance noise and deterministic reward scenarios. In this paper, we study linear contextual bandits with heteroscedastic noise and propose the first noise-adaptive Thompson sampling-style algorithm that achieves a variance-dependent regret upper bound of (cid:101) O (cid:16) d 3 / 2 + d 3 / 2 (cid:113)(cid:80) Tt =1 σ 2 t (cid:17) , where d is the dimension of the context vectors and σ 2 t is the variance of the reward in round t . This recovers the existing (cid:101) O ( d 3 / 2 √ T ) regret guarantee in the constant-variance regime and further improves to (cid:101) O ( d 3 / 2 ) in the deterministic regime, thus achieving a smooth interpolation in between. Our approach utilizes a stratified sampling procedure to overcome the too-conservative optimism in the linear Thompson sampling algorithm for linear contextual bandits.