ICLR 2025

Second Order Bounds for Contextual Bandits with Function Approximation

Aldo Pacchiano

Abstract

Many works have developed no-regret algorithms for contextual bandits with function approximation, where the mean rewards over context-action pairs belong to a function class $\mathcal{F}$. Although there are many approaches to this problem, algorithms based on the principle of optimism, such as optimistic least squares, have gained in importance. The regret of optimistic least squares scales as $\widetilde{O}\left(\sqrt{d_{\mathrm{eluder}}(\mathcal{F}) \log(|\mathcal{F}|) T}\right)$, where $d_{\mathrm{eluder}}(\mathcal{F})$ is a statistical measure of the complexity of the function class $\mathcal{F}$ known as the eluder dimension. Unfortunately, even if the variance of the measurement noise of the rewards at time $t$ equals $\sigma_t^2$ and these variances are close to zero, the regret of the optimistic least squares algorithm still scales with $\sqrt{T}$. In this work we are the first to develop algorithms that satisfy regret bounds for contextual bandits with function approximation of the form $\widetilde{O}\left(\sigma \sqrt{\log(|\mathcal{F}|)\, d_{\mathrm{eluder}}(\mathcal{F})\, T} + d_{\mathrm{eluder}}(\mathcal{F}) \cdot \log(|\mathcal{F}|)\right)$ when the variances are unknown and satisfy $\sigma_t^2 = \sigma^2$ for all $t$, and of the form $\widetilde{O}\left(d_{\mathrm{eluder}}(\mathcal{F}) \sqrt{\log(|\mathcal{F}|) \sum_{t=1}^{T} \sigma_t^2} + d_{\mathrm{eluder}}(\mathcal{F}) \cdot \log(|\mathcal{F}|)\right)$ when the variances change at every time-step. These bounds generalize existing techniques for deriving second order bounds in contextual linear problems.
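To make the baseline concrete, the following is a minimal sketch of the standard optimistic least squares template for a finite function class: fit by least squares, keep every function that stays close to the fit on the pairs played so far, and act greedily with respect to the largest value in that set. It is not the variance-aware algorithm developed in this work. The tabular layout `preds[f, t, a]`, the Gaussian noise model, the function name, and the confidence radius `beta` are all assumptions made for illustration.

```python
import numpy as np


def optimistic_least_squares(preds, true_idx, noise_std, beta, rng):
    """Run optimistic least squares over a finite function class.

    preds[f, t, a] holds f(x_t, a) for function f, round t, action a
    (a tabular stand-in for a function class; hypothetical layout).
    Returns the cumulative regret against the true function preds[true_idx].
    """
    num_f, horizon, _ = preds.shape
    sq_loss = np.zeros(num_f)            # cumulative squared loss of each f
    played = np.zeros((horizon, num_f))  # played[s, f] = f(x_s, a_s)
    regret = 0.0
    for t in range(horizon):
        # Least squares fit over the finite class.
        f_hat = int(np.argmin(sq_loss))
        # Confidence set: functions close to f_hat on the pairs played so far.
        dist = ((played[:t] - played[:t, [f_hat]]) ** 2).sum(axis=0)
        active = dist <= beta
        # Optimism: play the action with the largest value over the set.
        ucb = preds[:, t, :][active].max(axis=0)
        action = int(np.argmax(ucb))
        # Observe a noisy reward (Gaussian noise is an assumption here).
        mean = preds[true_idx, t, :]
        reward = mean[action] + noise_std * rng.standard_normal()
        sq_loss += (preds[:, t, action] - reward) ** 2
        played[t] = preds[:, t, action]
        regret += mean.max() - mean[action]
    return regret


rng = np.random.default_rng(0)
preds = rng.uniform(size=(20, 300, 5))  # 20 functions, 300 rounds, 5 actions
for noise_std in (0.0, 0.5):
    # Illustrative radius scaled by the noise variance; not a proven constant.
    beta = 2.0 * noise_std**2 * np.log(20 * 300)
    print(noise_std, optimistic_least_squares(preds, 0, noise_std, beta, rng))
```

Note that the radius in this sketch is set using the true noise level, which the setting studied here does not allow: the variances are unknown, and a radius calibrated to a worst-case noise bound is what produces the $\sqrt{T}$ dependence even when the actual noise is small.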