ICLR2026

Breaking the Total Variance Barrier: Sharp Sample Complexity for Linear Heteroscedastic Bandits with Fixed Action Set

Heyang Zhao, Tianyuan Jin, Weixin Wang, Vincent Y. F. Tan, Pan Xu, Quanquan Gu

Abstract

Recent years have witnessed increasing interests in tackling heteroscedastic noise in bandits and reinforcement learning . In these works, the cumulative variance of the noise Λ=t=1Tσt2\Lambda = \sum_{t=1}^T \sigma_t^2, where σt2\sigma_t^2 is the variance of the noise at round tt, has been used to characterize the statistical complexity of the problem, yielding simple regret bounds of order O~(dΛ/T2)\tilde{\mathcal O}(d \sqrt{\Lambda / T^2}) for linear bandits with heteroscedastic noise . However, with a closer look, Λ\Lambda remains the same order even if the noise is close to zero at half of the rounds, which indicates that the Λ\Lambda-dependence is not optimal.

In this paper, we revisit the linear bandit problem with heteroscedastic noise. We consider the setting where the action set is fixed throughout the learning process. We propose a novel variance-adaptive algorithm VAEE (Variance-Aware Exploration with Elimination) for large action set, which actively explores actions that maximizes the information gain among a candidate set of actions that are not eliminated. With the active-exploration strategy, we show that VAEE achieves a simple regret with a nearly harmonic-mean dependent rate, i.e. O~(d[t=1T1σt2i=1O~(d)1[σ(i)]2]12)\tilde{\mathcal O}\Big(d\Big[\sum_{t = 1}^T \frac{1}{\sigma_t^2} - \sum_{i = 1}^{\tilde{O}(d)} \frac{1}{[\sigma^{(i)}]^2} \Big]^{-\frac{1}{2}}\Big) where dd is the dimension of the feature space and σ(i)\sigma^{(i)} is the ii-th smallest variance among σtt=1T\\{\sigma_t\\}_{t=1}^T. For finitely many actions, we propose a variance-aware variant of G-optimal design based exploration, which achieves a O~\tilde {\mathcal O} (dlogA[_t=1Tˆ1σ_t2ˆ_i=1O~(d)1[σ(i)]2]12)\bigg(\sqrt{d \log |\mathcal A| }\Big[ \sum\_{t = 1}\^T \frac{1}{\sigma\_t\^2}- \sum\_{i = 1}^{\tilde{O}(d)} \frac{1}{[\sigma^{(i)}]^2} \Big]^{-\frac{1}{2}}\bigg) simple regret bound. We also establish a nearly matching lower bound for the fixed action set setting indicating that harmonic-mean dependent rate is unavoidable. To the best of our knowledge, this is the first work that breaks the Λ\sqrt{\Lambda} barrier for linear bandits with heteroscedastic noise.