ICLR2026

Variance-Dependent Regret Lower Bounds for Contextual Bandits

Jiafan He, Quanquan Gu

4 citations

Abstract

Variance-dependent regret bounds for linear contextual bandits, which improve upon the classical O~(dK)\tilde{O}(d\sqrt{K}) regret bound to O~(dk=1Kσk2)\tilde{O}(d\sqrt{\sum_{k=1}^K\sigma_k^2}), where dd is the context dimension, KK is the number of rounds, and σk2\sigma^2_k is the noise variance in round kk, has been widely studied in recent years. However, most existing works focus on the regret upper bounds instead of lower bounds. To our knowledge, the only lower bound is from Jia et al. (2024), which proved that for any eluder dimension delud_{\textbf{elu}} and total variance budget Λ\Lambda, there exists an instance with k=1Kσk2Λ\sum_{k=1}^K\sigma_k^2\leq \Lambda for which any algorithm incurs a variance-dependent lower bound of Ω(deluΛ)\Omega(\sqrt{d_{\textbf{elu}}\Lambda}). However, this lower bound has a d\sqrt{d} gap with existing upper bounds. Moreover, it only considers a fixed total variance budget Λ\Lambda and does not apply to a general variance sequence {σ12,,σK2}\{\sigma_1^2,\ldots,\sigma_K^2\}. In this paper, to overcome the limitations of Jia et al. (2024), we consider the general variance sequence under two settings. For a prefixed sequence, where the entire variance sequence is revealed to the learner at the beginning of the learning process, we establish a variance-dependent lower bound of Ω(dk=1Kσk2/logK)\Omega(d \sqrt{\sum_{k=1}^K\sigma_k^2 }/\log K) for linear contextual bandits. For an adaptive sequence, where an adversary can generate the variance σk2\sigma_k^2 in each round kk based on historical observations, we show that when the adversary must generate σk2\sigma_k^2 before observing the decision set DkD_k, a similar lower bound of Ω(dk=1Kσk2/log6(dK))\Omega(d\sqrt{ \sum_{k=1}^K\sigma_k^2} /\log^6(dK)) holds. In both settings, our results match the upper bounds of the SAVE algorithm (Zhao et al. 2023) up to logarithmic factors. Furthermore, if the adversary can generate the variance σk\sigma_k after observing the decision set DkD_k, we construct a counter-example showing that it is impossible to construct a variance-dependent lower bound if the adversary properly selects variances in collaboration with the learner. Our lower bound proofs use a novel peeling technique that groups rounds by variance magnitude. For each group, we construct separate instances and assign the learner distinct decision sets. We believe this proof technique may be of independent interest.