NeurIPS2022
Minimax Regret for Cascading Bandits
Daniel Vial, Sujay Sanghavi, Sanjay Shakkottai, R. Srikant
16 citations
Abstract
Cascading bandits is a natural and popular model that frames the task of learning to rank from Bernoulli click feedback in a bandit setting. For the case of unstructured rewards, we prove matching upper and lower bounds for the problem-independent (i.e., gap-free) regret, both of which strictly improve the best known. A key observation is that the hard instances of this problem are those with small mean rewards, i.e., the small click-through rates that are most relevant in practice. Based on this, and the fact that small mean implies small variance for Bernoullis, our key technical result shows that variance-aware confidence sets derived from the Bernstein and Chernoff bounds lead to optimal algorithms (up to log terms), whereas Hoeffding-based algorithms suffer order-wise suboptimal regret. This sharply contrasts with the standard (non-cascading) bandit setting, where the variance-aware algorithms only improve constants. In light of this and as an additional contribution, we propose a variance-aware algorithm for the structured case of linear rewards and show its regret strictly improves the state-of-the-art. * These papers contain minimax lower bounds but for click models that are distinct from cascading bandits. † These bounds assume L is large compared to K and n is large compared to L and K. ‡ This bound holds for a Bayesian notion of regret and includes a dependence on the prior not shown above. § These bounds assume n is large compared to d and K to suppress some additive o( √ n) terms. namely, w(e) ≤ 1/K, where the variance w(e)(1 -w(e)) is also small. We emphasize that these worst-case instances are not pathological; rather, they model the low click-through rates that prevail in practice [Richardson et al., 2007] . Further, we argue that adapting to low variance is crucial to cope with the worst-case instances. Put differently, algorithms should be variance-aware, i.e., more exploitative when the variance is small. We provide the intuition behind this key insight in Section 1.2 and show how to formalize it with a proof sketch in Section 5.