NeurIPS2021

Metadata-based Multi-Task Bandits with Bayesian Hierarchical Models

Runzhe Wan, Lin Ge, Rui Song

31 citations

Abstract

How to explore efficiently is a central problem in multi-armed bandits. In this paper, we introduce the metadata-based multi-task bandit problem, where the agent needs to solve a large number of related multi-armed bandit tasks and can leverage some task-specific features (i.e., metadata) to share knowledge across tasks. As a general framework, we propose to capture task relations through the lens of Bayesian hierarchical models, upon which a Thompson sampling algorithm is designed to efficiently learn task relations, share information, and minimize the cumulative regrets. Two concrete examples for Gaussian bandits and Bernoulli bandits are carefully analyzed. The Bayes regret for Gaussian bandits clearly demonstrates the benefits of information sharing with our algorithm. The proposed method is further supported by extensive experiments. How to explore efficiently is a central problem in MAB. In many modern applications, we usually have a large number of separate but related MAB tasks. For example, in e-commerce, the company needs to find the optimal display mode for each of many products, and in medical applications, each patient needs to individually undergo a series of treatment periods to find the personalized optimal treatment. These tasks typically share similar problem structures, but may have different reward distributions. Intuitively, appropriate information sharing can largely speed up our learning process and reduce the regret, while a naive pooling may cause a linear regret due to the bias. This paper is concerned with the following question: given a large number of MAB tasks, how do we efficiently share information among them? The central task is to capture task relations in a principled way [62] . While a few approaches have been proposed (see Section 2), most of them only utilize the action-reward pairs observed for each task. To our knowledge, none of the existing works can leverage another valuable information source, i.e., the metadata of each task, which contains some static task-specific features. Such metadata is commonly available in real applications [60, 62, 52] , such as the demographic features of each customer or patient, or basic information of each web page or product. Although the metadata can hardly be directly utilized in a single MAB task, it usually contains intrinsic information about each task, and hence can guide us to learn task relations and efficiently share knowledge in the multi-task setup. Specifically, suppose task i has a feature vector x i (i.e., metadata) and its expected reward vector for all arms is r i . We consider a general formulation that r i is sampled from the conditional distribution P(r i |x i ). When P(r i |x i ) is Preprint. Under review.