NeurIPS2022

When Privacy Meets Partial Information: A Refined Analysis of Differentially Private Bandits

Achraf Azize, Debabrota Basu

被引用 29 次

摘要

We study the problem of multi-armed bandits with ǫ-global Differential Privacy (DP). First, we prove the minimax and problem-dependent regret lower bounds for stochastic and linear bandits that quantify the hardness of bandits with ǫ-global DP. These bounds suggest the existence of two hardness regimes depending on the privacy budget ǫ. In the high-privacy regime (small ǫ), the hardness depends on a coupled effect of privacy and partial information about the reward distributions. In the low-privacy regime (large ǫ), bandits with ǫ-global DP are not harder than the bandits without privacy. For stochastic bandits, we further propose a generic framework to design a near-optimal ǫ-global DP extension of an indexbased optimistic bandit algorithm. The framework consists of three ingredients: the Laplace mechanism, arm-dependent adaptive episodes, and usage of only the rewards collected in the last episode for computing private statistics. Specifically, we instantiate ǫ-global DP extensions of UCB and KL-UCB algorithms, namely AdaP-UCB and AdaP-KLUCB. AdaP-KLUCB is the first algorithm that both satisfies ǫ-global DP and yields a regret upper bound that matches the problemdependent lower bound up to multiplicative constants. from a low to a high privacy regime (Thm. 7 and 8). Both theoretical (Table 1 ) and experimental (Sec. 5) results demonstrate optimality of AdaP-KLUCB than existing algorithms. 3. Technical Tools: (a) To derive the lower bound, we extend the Karwa-Vadhan lemma (Lemma 6.1, (Karwa and Vadhan, 2017)) to the sequential setting (Lemma 2). (b) We present a novel sequential information processing lemma under ǫ-global DP (Thm. 10) that controls the difference between the outcome streams of a differentially private policy when interacting with two different bandit instances. (c) This leads to a generic proof structure, utilised to generate refined regret lower bounds under ǫ-global DP for different settings. (d) To derive the regret upper bound, we develop a novel and general analysis of optimistic algorithms with adaptive episodes. Our regret lower and upper bounds close the open question posed in (Tenenbaum et al., 2021), i.e. the problem-dependent regret bounds of differentially private bandits should depend on the KLdivergence between the reward distributions. Related Works: Lower Bound. (Shariff and Sheffet, 2018) first proposed a problem-dependent lower bound on regret, Ω(max a =a * log T ∆a , K log(T ) ǫ ) 2 , for stochastic bandits with ǫ-global DP. But this bound is restricted to Bernoulli distributions of reward. In this paper, we provide the first problem-dependent regret lower bound that is valid for any reward distribution. This lower bound shows that the effect of privacy does not only depend on ǫ but also on the total variation distance corresponding to the environment. Thus, it explicates the coupled effect of privacy and partial information in a high-privacy regime, which was not observed before. To derive this tighter lower bound, we extend the Karwa-Vadhan lemma (Lemma 6.1, (Karwa and Vadhan, 2017)) to a sequential setting and also propose a generic proof structure leading to problem-dependent and minimax lower bounds for stochastic and linear bandits. To our knowledge, these are the first regret lower bounds for linear bandits with ǫ-global DP. Basu et al. ( 2019 ) also proposed a minimax regret lower bound for stochastic bandits with ǫ-global DP. However, our minimax lower bound (Thm. 2) is tighter and does not need to assume that the reward distributions are Lipschitz continuous. Related Works: Bandit Algorithms with ǫ-global DP. DP-UCB (Mishra and Thakurta, 2015; Tossou and Dimitrakakis, 2016) was the first global DP version of UCB. DP-UCB uses the treebased mechanism (Dwork et al., 2010a; Chan et al., 2011) to compute the sum of rewards. The tree mechanism maintains a binary tree of depth log(T ) over the T streaming observations, where each node in this tree holds an i.i.d sample from a Laplace distribution with zero mean and scale (log(T )/ǫ). At each step t, the mechanism yields the sum of the first t observations and the log(T ) nodes on the root-to-the-leaf path in the binary tree as the private empirical mean. As a result, the noise added to the UCB index per time-step is O log(T ) 2.5 /ǫ , which is responsible for the extra multiplicative factor log(T ) 1.5 in regret compared to the lower bound. DP-SE (Sajed and Sheffet, 2019) was the first ǫ-global DP algorithm to eliminate the additional multiplicative factor log(T ) 1.5 . DP-SE is an ǫ-global DP version of the Successive Elimination algorithm (Even-Dar et al., 2002) . However, the drawbacks were that the algorithm was not anytime, and was optimal only asymptotically, i.e. the horizon should be big enough. On the other hand, a careful analysis of the algorithm suggests that what made the algorithm optimal was the fact that the algorithm was run in doubling episodes, where the private means computed at the end of the e