NeurIPS2023

Practical Contextual Bandits with Feedback Graphs

Mengxiao Zhang, Yuheng Zhang, Olga Vrousgou, Haipeng Luo, Paul Mineiro

10 citations

Abstract

While contextual bandit has a mature theory, effectively leveraging different feedback patterns to enhance the pace of learning remains unclear. Bandits with feedback graphs, which interpolates between the full information and bandit regimes, provides a promising framework to mitigate the statistical complexity of learning. In this paper, we propose and analyze an approach to contextual bandits with feedback graphs based upon reduction to regression. The resulting algorithms are computationally practical and achieve established minimax rates, thereby reducing the statistical complexity in real-world applications. Introduction This paper is primarily concerned with increasing the pace of learning for contextual bandits [Auer et al., 2002, Langford and Zhang, 2007] . While contextual bandits have enjoyed broad applicability [Bouneffouf et al., 2020] , the statistical complexity of learning with bandit feedback imposes a data lower bound for application scenarios [Agarwal et al., 2012] . This has inspired various mitigation strategies, including exploiting function class structure for improved experimental design [Zhu and Mineiro, 2022], and composing with memory for learning with fewer samples [Rucker et al., 2022] . In this paper we exploit alternative graph feedback patterns to accelerate learning: intuitively, there is no need to explore a potentially suboptimal action if a presumed better action, when exploited, yields the necessary information. The framework of bandits with feedback graphs is mature and provides a solid theoretical foundation for incorporating additional feedback into an exploration strategy [Mannor and Shamir, 2011 , Alon et al., 2015 , 2017]. Succinctly, in this framework, the observation of the learner is decided by a directed feedback graph G: when an action is played, the learner observes the loss of every action to which the chosen action is connected. When the graph only contains self-loops, this problem reduces to the classic bandit case. For non-contextual bandits with feedback graphs, [Alon et al., 2015] provides a full characterization on the minimax regret bound with respect to different graph theoretic quantities associated with G according to the type of the feedback graph. However, contextual bandits with feedback graphs have received less attention [Singh et al., 2020 , Wang et al., 2021] . Specifically, there is no prior work offering a solution for general feedback graphs * Equal contribution. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).