ICLR2025

Breaking the log⁡(1/Δ2) Barrier: Better Batched Best Arm Identification with Adaptive Grids

Tianyuan Jin, Qin Zhang, Dongruo Zhou

摘要

We consider the problem of best arm identification in the multi-armed bandit model, under fixed confidence. Given a confidence input δ, the goal is to identify the arm with the highest mean reward with a probability of at least 1 -δ, while minimizing the number of arm pulls. While the literature provides solutions to this problem under the assumption of independent arms distributions, we propose a more flexible scenario where arms can be dependent and rewards can be sampled simultaneously. This framework allows the learner to estimate the covariance among the arms distributions, enabling a more efficient identification of the best arm. The relaxed setting we propose is relevant in various applications, such as clinical trials, where similarities between patients or drugs suggest underlying correlations in the outcomes. We introduce new algorithms that adapt to the unknown covariance of the arms and demonstrate through theoretical guarantees that substantial improvement can be achieved over the standard setting. Additionally, we provide new lower bounds for the relaxed setting and present numerical simulations that support their theoretical findings. Introduction and setting Best arm identification (BAI) is a sequential learning and decision problem that refers to finding the arm with the largest mean (average reward) among a finite number of arms in a stochastic multi-armed bandit (MAB) setting. An MAB model ν is a set of K distributions in R: ν 1 , . . . , ν K , with means µ 1 , . . . , µ K . An "arm" is identified with the corresponding distribution index. The observation consists in sequential draws ("queries" or "arm pulls") from these distributions, and each such outcome is a "reward". The learner's goal is to identify the optimal arm i * := ArgMax i∈ K µ i , efficiently. There are two main variants of BAI problems: The fixed budget setting [1, 8] , where given a fixed number of queries T , the learner allocates queries to candidates arms and provides a guess for the optimal arm. The theoretical guarantee in this case takes the form of an upper bound on the probability p T of selecting a sub-optimal arm. The second variant is the fixed confidence setting [14, 20] , where a confidence parameter δ ∈ (0, 1) is given as an input to the learner, and the objective is to output an arm ψ ∈ K , such that P(ψ = i * ) ≥ 1 -δ, using the least number of arm pulls . The complexity in this case corresponds to the total number of queries made before the algorithm stops and gives a guess for the best arm that is valid with probability at least 1 -δ according to a specified stopping rule. In this paper we specifically focus on the fixed confidence setting.