NeurIPS2020

Model Selection in Contextual Stochastic Bandit Problems

Aldo Pacchiano, My Phan, Yasin Abbasi-Yadkori, Anup Rao, Julian Zimmert, Tor Lattimore, Csaba Szepesvári

被引用 103 次

摘要

We study bandit model selection in stochastic environments. Our approach relies on a meta-algorithm that selects between candidate base algorithms. We develop a meta-algorithm-base algorithm abstraction that can work with general classes of base algorithms and different type of adversarial meta-algorithms. Our methods rely on a novel and generic smoothing transformation for bandit algorithms that permits us to obtain optimal O( √ T ) model selection guarantees for stochastic contextual bandit problems as long as the optimal base algorithm satisfies a high probability regret guarantee. We show through a lower bound that even when one of the base algorithms has O(log T ) regret, in general it is impossible to get better than Ω( √ T ) regret in model selection, even asymptotically. Using our techniques, we address model selection in a variety of problems such as misspecified linear contextual bandits [LSW20], linear bandit with unknown dimension [FKL19] and reinforcement learning with unknown feature maps. Our algorithm requires the knowledge of the optimal base regret to adjust the meta-algorithm learning rate. We show that without such prior knowledge any meta-algorithm can suffer a regret larger than the optimal base regret.