ICML2020

Combinatorial Pure Exploration for Dueling Bandit

Wei Chen, Yihan Du, Longbo Huang, Haoyu Zhao

14 citations

Abstract

In this paper, we study combinatorial pure exploration for dueling bandits (CPE-DB): we have multiple candidates for multiple positions as modeled by a bipartite graph, and in each round we sample a duel of two candidates on one position and observe who wins in the duel, with the goal of finding the best candidate-position matching with high probability after multiple rounds of samples. CPE-DB is an adaptation of the original combinatorial pure exploration for multiarmed bandit (CPE-MAB) problem to the dueling bandit setting. We consider both the Borda winner and the Condorcet winner cases. For Borda winner, we establish a reduction of the problem to the original CPE-MAB setting and design PAC and exact algorithms that achieve both the sample complexity similar to that in the CPE-MAB setting (which is nearly optimal for a subclass of problems) and polynomial running time per round. For Condorcet winner, we first design a fully polynomial time approximation scheme (FPTAS) for the offline problem of finding the Condorcet winner with known winning probabilities, and then use the FPTAS as an oracle to design a novel pure exploration algorithm CAR-Cond with sample complexity analysis. CAR-Cond is the first algorithm with polynomial running time per round for identifying the Condorcet winner in CPE-DB. c 1 c 2 c 3 c 4 s 1 s 2 e 1 e 2 e 3 e 4 e 5 Figure 1. Graph e 1 e 2 e 3 e 4 e 5 e 1 0.5 0.45 1 0 0 e 2 0.55 0.5 0.55 0 0 e 3 0 0.45 0.5 0 0 e 4 0 0 0 0.5 0 e 5 0 0 0 1 0.5 Figure 2. Preference Matrix c 1 c 2 c 3 c 4 s 1 s 2 e 1 e 2 e 3 e 4 e 5 Figure 3. Borda Winner c 1 c 2 c 3 c 4 s 1 s 2 e 1 e 2 e 3 e 4 e 5 * with score χ T M1 P χ M C * < 1/2, and only when x = χ M C * , min y=χM 2 1 ℓ x T P y reaches 1/2 when y = χ M C * . However, the optimization problem is "discrete" and we first use the