NeurIPS2023

Multi-task Representation Learning for Pure Exploration in Bilinear Bandits

Subhojyoti Mukherjee, Qiaomin Xie, Josiah Hanna, Robert D. Nowak

被引用 8 次

摘要

We study multi-task representation learning for the problem of pure exploration in bilinear bandits. In bilinear bandits, an action takes the form of a pair of arms from two different entity types and the reward is a bilinear function of the known feature vectors of the arms. In the multi-task bilinear bandit problem, we aim to find optimal actions for multiple tasks that share a common low-dimensional linear representation. The objective is to leverage this characteristic to expedite the process of identifying the best pair of arms for all tasks. We propose the algorithm GOBLIN that uses an experimental design approach to optimize sample allocations for learning the global representation as well as minimize the number of samples needed to identify the optimal pair of arms in individual tasks. To the best of our knowledge, this is the first study to give sample complexity analysis for pure exploration in bilinear bandits with shared representation. Our results demonstrate that by learning the shared representation across tasks, we achieve significantly improved sample complexity compared to the traditional approach of solving tasks independently. 2) Can we design an algorithm for multi-task pure exploration bilinear bandit problem that can learn the latent features and has sample complexity that scales as O(M (k1 + k2)r/∆ 2 )? In this paper, we answer both these questions affirmatively. In doing so, we make the following novel contributions to the growing literature of multi-task representation learning in online settings: 1) We formulate the multi-task bilinear representation learning problem. To our knowledge, this is the first work that explores pure exploration in a multi-task bilinear representation learning setting. 2) We proposed the algorithm GOBLIN for a single-task pure exploration bilinear bandit setting whose sample complexity scales as O((d 1 + d 2 )r/∆ 2 ). This improves over RAGE (Fiez et al., 2019) whose sample complexity scales as O((d 1 d 2 )/∆ 2 ). 3) Our algorithm GOBLIN for multi-task pure exploration bilinear bandit problem learns the latent features and has sample complexity that scales as O(M (k 1 + k 2 )r/∆ 2 ). This improves over DouExpDes (Du et al., 2023) whose samples complexity scales as O(M (k 1 k 2 )/∆ 2 ). Preliminaries: We assume that ∥x∥ 2 ≤ 1, ∥z∥ 2 ≤ 1, ∥Θ * ∥ F ≤ S 0 and the r-th largest singular value of Θ * ∈ R d1×d2 is S r . Let p := d 1 d 2 denote the ambient dimension, and k = (d 1 + d 2 )r denote the effective dimension. Let [n] := 1, 2, . . . , n. Let x * , z * := arg max x,z x ⊤ Θ * z. For any x, z define the gap ∆(x, z) := x ⊤ * Θ * z * -x ⊤ Θ * z and furthermore ∆ = min x̸ =x * ,z̸ =z * ∆(x, z). Similarly, for any arbitrary vector w ∈ W define the gap of w ∈ R p as ∆(w) := (w * -w) ⊤ θ * , for some θ * ∈ R p and furthermore, ∆ = min w̸ =w * ∆(w). If A ∈ R d×d ≥0 is a positive semidefinite matrix, and w ∈ R p is a vector, let ∥w∥ 2 A := w ⊤ Aw denote the induced semi-norm. Given any vector b ∈ R |W| we denote the w-th component as b w . Let ∆ W := b ∈ R |W| : b w ≥ 0, w∈W b w = 1 denote the set of probability distributions on W. We define Y(W) = w -w ′ : ∀w, w ′ ∈ W, w ̸ = w ′ as the directions obtained from the differences between each pair of arms and Y * (W) = w * -w : ∀w ∈ W* as the directions obtained from the differences between the optimal arm and each suboptimal arm.