
Algorithms for Differentially Private Multi-Armed Bandits



HAL Id: hal-01234427

https://hal.inria.fr/hal-01234427

Submitted on 26 Nov 2015


Algorithms for Differentially Private Multi-Armed Bandits

Aristide Tossou, Christos Dimitrakakis

To cite this version:

Aristide Tossou, Christos Dimitrakakis. Algorithms for Differentially Private Multi-Armed Bandits. AAAI 2016, Feb 2016, Phoenix, Arizona, United States. ⟨hal-01234427⟩


Algorithms for Differentially Private Multi-Armed Bandits

Aristide C. Y. Tossou and Christos Dimitrakakis

Chalmers University of Technology, Gothenburg, Sweden
{aristide, chrdimi}@chalmers.se

Abstract

We present differentially private algorithms for the stochastic Multi-Armed Bandit (MAB) problem. This is a problem for applications such as adaptive clinical trials, experiment design, and user-targeted advertising where private information is connected to individual rewards. Our major contribution is to show that there exist $(\epsilon, \delta)$-differentially private variants of Upper Confidence Bound algorithms which have optimal regret, $O(\epsilon^{-1} + \log T)$. This is a significant improvement over previous results, which only achieve poly-log regret $O(\epsilon^{-2} \log^2 T)$, because of our use of a novel interval-based mechanism. We also substantially improve the bounds of a previous family of algorithms which use a continual release mechanism. Experiments clearly validate our theoretical bounds.

1 Introduction

The well-known stochastic $K$-armed bandit problem (Thompson 1933; Robbins and others 1952) involves an agent sequentially choosing among a set of arms $\mathcal{A} = \{1, \ldots, K\}$, and obtaining a sequence of scalar rewards $\{r_t\}$, such that, if the agent's action at time $t$ is $a_t = i$, then it obtains reward $r_t$ drawn from some distribution $P_i$ with expectation $\mu_i \triangleq \mathbb{E}(r_t \mid a_t = i)$. The goal of the decision maker is to draw arms so as to maximize the total reward $\sum_{t=1}^{T} r_t$ obtained.

This problem is a model for many applications where there is a need for trading off exploration and exploitation. This occurs because we only see the reward of the arm we pull. An example is clinical trials, where arms correspond to different treatments or tests, and the goal can be to maximise the number of cured patients over time while being uncertain about the effects of treatments. Other problems, such as search engine advertisement and movie recommendations, can be formalised similarly (Pandey and Olston 2006).

It has been previously noted (Jain, Kothari, and Thakurta 2012; Thakurta and Smith 2013; Mishra and Thakurta 2015; Zhao et al. 2014) that privacy is an important consideration for many multi-armed bandit applications. Indeed, privacy can be easily violated by observing changes in the prediction of the bandit algorithm. This has been demonstrated for recommender systems such as Amazon by (Calandrino et al. 2011) and for user-targeted advertising such as Facebook by (Korolova 2010). In both cases, with a moderate amount of side information and by tracking changes in the output of the system, it was possible to learn private information about any targeted user.

Differential privacy (DP) (Dwork 2006) provides an answer to this privacy issue by making the output of an algorithm almost insensitive to any single user's information. That is, no matter what side information is available to an outside observer, they cannot gain more information about a user than they already had by observing the outputs released by the algorithm. This goal is achieved by formally bounding the loss in privacy through the use of two parameters $(\epsilon, \delta)$, as shown in Definition 2.1.

For bandit problems, differential privacy implies that the actions taken by the bandit algorithm do not reveal information about the sequence of rewards obtained. In the context of clinical trials and diagnostic tests, it guarantees that even an adversary with arbitrary side information, such as the identity of each patient, cannot learn anything from the output of the learning algorithm about patient history, condition, or test results.

1.1 Related Work

Differential privacy (DP) was introduced by (Dwork et al. 2006); a good overview is given in (Dwork and Roth 2013). While the initial focus of DP was static databases, interest in its relation to online learning problems has increased recently. In the full information setting, (Jain, Kothari, and Thakurta 2012) obtained differentially private algorithms with near-optimal bounds. In the bandit setting, (Thakurta and Smith 2013) were the first to present a differentially private algorithm, for the adversarial case, while (Zhao et al. 2014) present an application to smart grids in this setting. Then, (Mishra and Thakurta 2015) provided a differentially private algorithm for the stochastic bandit problem. Their algorithms are based on two non-private stochastic bandit algorithms: Upper Confidence Bound (UCB, (Auer, Cesa-Bianchi, and Fischer 2002)) and Thompson sampling (Thompson 1933). Their results are sub-optimal: although simple index-based algorithms achieving $O(\log T)$ regret exist (Burnetas and Katehakis 1996; Auer, Cesa-Bianchi, and Fischer 2002), these differentially private algorithms add poly-log terms in the time $T$, as well as further linear terms in the number of arms, compared to the non-private optimal regret $O(\log T)$.

We provide a significantly different and improved UCB-style algorithm whose regret only adds a constant, privacy-dependent term to the optimal. We also improve upon previous algorithms by relaxing the need to know the horizon $T$ ahead of time, and as a result we obtain a uniform bound.

Finally, we also obtain significantly improved bounds for a variant of the original algorithm of (Mishra and Thakurta 2015), by using a different proof technique and confidence intervals. Note that, similarly to their result, we only make distributional assumptions on the data for the regret analysis; to ensure privacy, our algorithms make no assumption on the data. We summarize our contributions in the next section.

1.2 Our Contributions

• We present a novel differentially private algorithm (DP-UCB-INT) in the stochastic bandit setting that is almost optimal and only adds an additive constant term (depending on the privacy parameter) to the optimal non-private version. Previous algorithms had large multiplicative factors over the optimal.

• We also provide an incremental but important improvement to the regret of existing differentially private algorithms in the stochastic bandit setting, using the same family of algorithms as previously presented in the literature. This is done by using a simpler confidence bound and a more sophisticated proof technique. These bounds are achieved by the DP-UCB-BOUND and DP-UCB algorithms.

• We present the first set of differentially private algorithms in the bandit setting which run over an unbounded horizon and do not require knowledge of the horizon $T$. Furthermore, all our regret analyses hold for any time step $t$.

2 Preliminaries

2.1 Multi-Armed Bandit

The well-known stochastic $K$-armed bandit problem (Thompson 1933; Lai and Robbins 1985; Auer, Cesa-Bianchi, and Fischer 2002) involves an agent sequentially choosing among a set of $K$ arms $\mathcal{A} = \{1, \ldots, K\}$. At each time step $t$, the player selects an action $a_t = i \in \mathcal{A}$ and obtains a reward $r_t \in [0, 1]$. The reward $r_t$ is drawn from some fixed but unknown distribution $P_i$ such that $\mathbb{E}(r_t \mid a_t) = \mu_i$. The goal of the decision maker is to draw arms so as to maximize the total reward obtained after $T$ interactions. An equivalent notion is to minimize the total regret against an agent who knew the arm with the maximum expectation before the game starts and always plays it. The regret is defined by:

$$\mathcal{R} \triangleq T\mu^* - \mathbb{E}_\pi \sum_{t=1}^{T} r_t, \qquad (2.1)$$

where $\mu^* \triangleq \max_{a \in \mathcal{A}} \mu_a$ is the mean reward of the optimal arm and $\pi(a_t \mid a_{1:t-1}, r_{1:t-1})$ is the policy of the decision maker, defining a probability distribution on the next action $a_t$ given the history of previous actions $a_{1:t-1} = a_1, \ldots, a_{t-1}$ and rewards $r_{1:t-1} = r_1, \ldots, r_{t-1}$. Our goal is to bound the regret uniformly over $T$.

2.2 Differential Privacy

Differential privacy was originally proposed by (Dwork 2006) as a way to formalise the amount of information about the input of an algorithm that is leaked to an adversary observing its output, no matter what the adversary's side information is. In the context of our setup, the algorithm's input is the sequence of rewards, and its output the actions. Consequently, we use the following definition of differentially private bandit algorithms.

Definition 2.1 ($(\epsilon, \delta)$-differentially private bandit algorithm). A bandit algorithm $\pi$ is $(\epsilon, \delta)$-differentially private if, for all reward sequences $r_{1:t-1}$ and $r'_{1:t-1}$ that differ in at most one time step, we have for all $S \subseteq \mathcal{A}$:

$$\pi(a_t \in S \mid a_{1:t-1}, r_{1:t-1}) \le \pi(a_t \in S \mid a_{1:t-1}, r'_{1:t-1})\, e^{\epsilon} + \delta,$$

where $\mathcal{A}$ is the set of actions. When $\delta = 0$, the algorithm is said to be $\epsilon$-differentially private.

Intuitively, this means that changing any reward $r_t$ for a given arm will not change too much the best arm released at time $t$ or later on. If each $r_t$ is a private piece of information, or a point associated with a single individual, then the definition above means that the presence or absence of that individual will not affect the output of the algorithm too much. Hence, the algorithm will not reveal any extra information about this individual, leading to privacy protection. The privacy parameters $(\epsilon, \delta)$ determine the extent to which an individual entry affects the output; lower values of $(\epsilon, \delta)$ imply higher levels of privacy.

A natural way to obtain privacy is to add noise, such as Laplace noise ($\mathrm{Lap}$), to the output of the algorithm. The main challenge is how to get the maximum privacy while adding as little noise as possible. This leads to a trade-off between privacy and utility. In this paper, we demonstrate how to optimally trade off these two notions.
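For concreteness, here is a minimal Python sketch (ours, not from the paper) of the Laplace mechanism applied to an empirical mean of $n$ rewards in $[0, 1]$, whose $L_1$ sensitivity is $1/n$:

```python
import numpy as np

def private_mean(rewards, epsilon, rng=None):
    """Release the mean of rewards in [0, 1] under epsilon-DP.

    Changing a single reward moves the mean by at most 1/n (its L1
    sensitivity), so Laplace noise of scale 1/(n * epsilon) suffices.
    """
    rng = rng or np.random.default_rng()
    n = len(rewards)
    noise = rng.laplace(loc=0.0, scale=1.0 / (n * epsilon))
    return float(np.mean(rewards)) + noise
```

Halving $\epsilon$ doubles the noise scale, which is exactly the privacy-utility trade-off quantified in the rest of the paper.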

2.3 Hybrid Mechanism

The hybrid mechanism is an online algorithm used to continually release the sum of some statistics while preserving differential privacy. More formally, there is a stream $\sigma_t = r_1, r_2, \ldots, r_t$ of statistics with $r_i \in [0, 1]$. At each time step $t$ a new statistic $r_t$ is given. The goal is to output the partial sum $y_t = \sum_{i=1}^{t} r_i$ of the statistics from time step 1 to $t$ without compromising the privacy of the statistics. In other words, we wish to find a randomised mechanism $M(y_t \mid \sigma_t, y_{1:t-1})$ that is $(\epsilon, \delta)$-differentially private.

The hybrid mechanism solves this problem by combining the Logarithm and Binary Noisy Sum mechanisms. Whenever $t = 2^k$ for some integer $k$, it uses the Logarithm mechanism to release a noisy sum by adding Laplace noise of scale $\epsilon^{-1}$. It then builds a binary tree $B$, which is used to release noisy sums until $t = 2^{k+1}$ via the Binary mechanism. This uses the leaf nodes of $B$ to store the inputs $r_i$, while all other nodes store partial sums, with the root containing the sum from $2^k$ to $2^{k+1} - 1$. Since the tree depth is logarithmic, only a logarithmic amount of noise is added to any given sum; more specifically, Laplace noise of scale $\frac{\log t}{\epsilon}$ and mean 0, denoted by $\mathrm{Lap}(\frac{\log t}{\epsilon})$.

(Chan, Shi, and Song 2010) prove that the hybrid mechanism is $\epsilon$-differentially private for any $n$, where $n$ is the number of statistics seen so far. They also show that, with probability at least $1 - \gamma$, the error in the released sum is upper bounded by $\frac{1}{\epsilon} \log(\frac{1}{\gamma}) \log^{1.5} n$. In this paper, we derive and use a tighter bound for this same mechanism (see Appendix in Supplementary Material), namely:

$$\frac{\sqrt{8}}{\epsilon} \log\left(\frac{4}{\gamma}\right) \log n + \frac{\sqrt{8}}{\epsilon} \log\left(\frac{4}{\gamma}\right). \qquad (2.2)$$
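As an illustration, the sketch below implements the Binary mechanism component in a simplified form (our reconstruction in the spirit of (Chan, Shi, and Song 2010), not the authors' code): each node of an implicit binary tree over time holds a partial sum plus Laplace noise, and a prefix sum at time $t$ combines at most $\log_2 t$ noisy nodes. The full hybrid mechanism additionally restarts with the Logarithm mechanism at powers of two, which this sketch omits.

```python
import numpy as np

class BinaryMechanism:
    """Continually release noisy prefix sums of a stream in [0, 1].

    Each stream item ends up in at most log2(T)+1 tree nodes over the
    run, and each node is perturbed once with Lap(log2(T)/eps) noise,
    giving roughly eps-DP by composition.
    """

    def __init__(self, eps, horizon, rng=None):
        self.eps = eps
        self.levels = int(np.ceil(np.log2(max(horizon, 2))))
        self.rng = rng or np.random.default_rng()
        self.t = 0
        self.alpha = [0.0] * (self.levels + 1)   # exact partial sums
        self.noisy = [0.0] * (self.levels + 1)   # their noisy versions

    def insert(self, r):
        """Add the next statistic r (call at most `horizon` times)."""
        self.t += 1
        i = 0                                    # lowest set bit of t
        while (self.t >> i) & 1 == 0:
            i += 1
        self.alpha[i] = r + sum(self.alpha[:i])  # merge lower nodes up
        for j in range(i):
            self.alpha[j] = 0.0
            self.noisy[j] = 0.0
        scale = self.levels / self.eps
        self.noisy[i] = self.alpha[i] + self.rng.laplace(0.0, scale)

    def prefix_sum(self):
        # Combine the noisy nodes picked out by the binary digits of t.
        return sum(self.noisy[j] for j in range(self.levels + 1)
                   if (self.t >> j) & 1)
```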

3 Private Stochastic Multi-Armed Bandits

We describe here the general technique used by our algorithms to obtain differential privacy. Our algorithms are based on the non-private UCB algorithm by (Auer, Cesa-Bianchi, and Fischer 2002). At each time step, UCB bases its action on an optimistic estimate of the expected reward of each arm. This estimate is the sum of the empirical mean and an upper confidence bound equal to $\sqrt{\frac{2 \log t}{n_{a,t}}}$, where $t$ is the time step and $n_{a,t}$ the number of times arm $a$ has been played up to time $t$. We can observe that the only quantity using the value of the rewards is the empirical mean. To achieve differential privacy, it is therefore enough to make the player base its actions on differentially private empirical means for each arm. This is because, once the mean of each arm is computed, the action played is a deterministic function of the means. In particular, we can see the differentially private mechanism as a black box, which keeps track of the vector of non-private empirical means $Y$ for the player, and outputs a vector of private empirical means $X$. This is then used by the player to select an action, as shown in Figure 1.

We provide three different algorithms that use different techniques to privately compute the mean and calculate the index of each arm. The first, DP-UCB-BOUND, employs the Hybrid mechanism to compute a private mean and then adds a suitable term to the confidence bound to take into account the additional uncertainty due to privacy. The second, DP-UCB, employs the same mechanism, but in such a way that all arms have the same privacy-induced uncertainty; consequently the algorithm can use the same index as standard UCB. The final one employs a mechanism that only releases a new mean once at the beginning of each interval. This allows us to obtain the optimal regret rate.

3.1 The DP-UCB-BOUND Algorithm

In Algorithm 1, we compute the sum of the rewards of each arm using the hybrid mechanism (Chan, Shi, and Song 2010). However, the number of Laplace noise terms added by the hybrid mechanism, and their variance, increases as we keep pulling an arm. This means that the sums of the different arms receive different amounts of noise, larger than the original confidence bound used by UCB, which makes it difficult to identify the best arms. To solve this issue, we add the tight upper bound of equation (2.2) on the noise of the hybrid mechanism to the confidence term.

Figure 1: Graphical model for the empirical and private means. $a_t$ is the action of the agent, while $r_t$ is the reward obtained, drawn from the bandit distribution $P_i$. The vector of empirical means $Y$ is then made into a private vector $X$, which the agent uses to select actions. The rewards are essentially hidden from the agent by the DP mechanism.

Algorithm 1 DP-UCB-BOUND
Input: $\epsilon$, the differential privacy parameter.
Instantiate $K$ hybrid mechanisms (Chan, Shi, and Song 2010), one for each arm $a$.
for $t \leftarrow 1$ to $T$ do
  if $t \le K$ then
    play arm $a = t$ and observe the reward $r_t$
    insert $r_t$ into the hybrid mechanism $a$
  else
    for all arms $a$ do
      $s_a(t) \leftarrow$ total sum computed using the hybrid mechanism $a$
      if $n_{a,t}$ is a power of 2 then
        $\nu_a \leftarrow \frac{\sqrt{8}}{\epsilon} \log(4t^4)$
      else
        $\nu_a \leftarrow \frac{\sqrt{8}}{\epsilon} \log(4t^4) \log n_{a,t} + \frac{\sqrt{8}}{\epsilon} \log(4t^4)$
      end if
    end for
    pull arm $a_t = \arg\max_a \left[ \frac{s_a(t)}{n_{a,t}} + \sqrt{\frac{2 \log t}{n_{a,t}}} + \frac{\nu_a}{n_{a,t}} \right]$
    observe the reward $r_t$
    insert $r_t$ into the hybrid mechanism for arm $a_t$
  end if
end for
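The following Python sketch is our illustrative rendering of Algorithm 1; the `NoisySum` class is a hypothetical stand-in with the interface of the hybrid mechanism (it mimics the error profile, not the exact privacy accounting), and $\nu_a$ follows equation (2.2) with $\gamma = t^{-4}$:

```python
import numpy as np

rng = np.random.default_rng(0)

class NoisySum:
    """Stand-in for the hybrid mechanism of (Chan, Shi, and Song 2010):
    an exact running sum released with Laplace noise whose scale grows
    logarithmically with the stream length (illustrative only)."""
    def __init__(self, eps):
        self.eps, self.sum, self.n = eps, 0.0, 0
    def insert(self, r):
        self.sum += r
        self.n += 1
    def release(self):
        scale = max(np.log2(self.n + 1), 1.0) / self.eps
        return self.sum + rng.laplace(0.0, scale)

def dp_ucb_bound(arms, eps, T):
    """Sketch of DP-UCB-BOUND (Algorithm 1); arms[a]() samples a
    reward in [0, 1]. Returns the pull counts of each arm."""
    K = len(arms)
    mech = [NoisySum(eps) for _ in range(K)]
    n = np.zeros(K)
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                      # play each arm once
        else:
            idx = np.empty(K)
            c = (np.sqrt(8) / eps) * np.log(4.0 * t ** 4)
            for a in range(K):
                m = int(n[a])
                # nu_a compensates the mechanism's noise, cf. eq. (2.2)
                nu = c if (m & (m - 1)) == 0 else c * np.log(m) + c
                idx[a] = (mech[a].release() / m
                          + np.sqrt(2 * np.log(t) / m) + nu / m)
            a = int(np.argmax(idx))
        mech[a].insert(arms[a]())
        n[a] += 1
    return n

# Example: two Bernoulli arms with means 0.9 and 0.6.
print(dp_ucb_bound([lambda: float(rng.random() < 0.9),
                    lambda: float(rng.random() < 0.6)], eps=1.0, T=2000))
```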

Theorem 3.2 validates this choice by showing that only $O(\epsilon^{-1} \log \log t)$ factors are added to the optimal non-private regret. In Theorem 3.1, we demonstrate that Algorithm 1 is indeed $\epsilon$-differentially private.

Theorem 3.1. Algorithm 1 is $\epsilon$-differentially private after any number $t$ of plays.

Proof. This follows directly from the fact that the hybrid mechanism is $\epsilon$-DP after any number $t$ of plays and that a single flip of one reward in the sequence of rewards only affects one mechanism. Furthermore, the whole algorithm is a random mapping from the output of the hybrid mechanisms to the action taken, so using Proposition 2.1 of (Dwork and Roth 2013) completes the proof.

Theorem 3.2 gives the regret for Algorithm 1. While here we give only a proof sketch, the complete derivation can be found in the supplementary material (Tossou and Dimitrakakis 2016).

Theorem 3.2. If Algorithm 1 is run with $K$ arms having arbitrary reward distributions, then its expected regret $\mathcal{R}$ after any number $t$ of plays is bounded by:

$$\mathcal{R} \le \sum_{a: \mu_a < \mu^*} \Delta_a \max\left\{ B(\ln B + 7),\ \frac{8}{\lambda_0^2 \Delta_a^2} \log t \right\} + \sum_{a: \mu_a < \mu^*} \left( \Delta_a + \frac{2\pi^2 \Delta_a}{3} \right), \qquad (3.1)$$

$$B = \frac{\sqrt{8}}{\epsilon (1 - \lambda_0) \Delta_a} \ln(4t^4),$$

for any $\lambda_0$ such that $0 < \lambda_0 < 1$, where $\mu_1, \ldots, \mu_K$ are the expected values of $P_1, \ldots, P_K$ and $\Delta_a = \mu^* - \mu_a$.

Proof Sketch. We use the bound on the hybrid mechanism given in equation (2.2) together with the union and Chernoff-Hoeffding bounds. We then select the error probability $\gamma$ at each step to be $t^{-4}$. This leads to a transcendental inequality, solved using the Lambert $W$ function and approximated using Section 3.1 of (Barry et al. 2000).

3.2 The DP-UCB Algorithm

The key observation used in Algorithm 2 is that if, at each time step, we insert a reward into all hybrid mechanisms, then the scale of the noise will be the same for every arm. This means that there is no longer any need to compensate with an additional bound. More precisely, every time we play an arm $a_t = a$ and receive the reward $r_t$, we not only add it to the hybrid mechanism corresponding to arm $a$, but we also add a reward of 0 to the hybrid mechanisms of all other arms. As these mechanisms calculate a sum, this does not affect subsequent calculations.

Theorem 3.3 shows the validity of this approach by demonstrating a regret bound with only an additional factor of $O(\epsilon^{-2} \log^2 \log t)$ over the optimal non-private regret.

Algorithm 2 DP-UCB
Run Algorithm 1 with $\nu_a$ set to 0.
When arm $a_t = a$ is played, insert 0 into all hybrid mechanisms corresponding to arms $a' \ne a$ (do not increase $n_{a',t}$).
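In code, the change relative to the Algorithm 1 sketch above is small; the fragment below (our notation, reusing the hypothetical `NoisySum` and count variables from that sketch) shows the modified update step:

```python
# Sketch of the DP-UCB variation: set nu_a to 0 in the index and,
# after playing arm a at time t, feed an input to *every* mechanism
# so that all sums carry Laplace noise of the same scale.
def dp_ucb_update(mech, n, a, r):
    for b in range(len(mech)):
        mech[b].insert(r if b == a else 0.0)  # zeros leave the sums unchanged
    n[a] += 1                                 # only arm a's count grows
```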

Theorem 3.3. If Algorithm 2 is run with $K$ arms having arbitrary reward distributions, then its expected regret $\mathcal{R}$ after any number $t$ of plays is bounded by:

$$\mathcal{R} \le \sum_{a: \mu_a < \mu^*} \Delta_a \max\left\{ C^2 (\ln C + 7)^2,\ \frac{8}{\Delta_a^2} \log t \right\} + \sum_{a: \mu_a < \mu^*} \left( \Delta_a + 4 \Delta_a \zeta(1.5) \right),$$

$$C = \frac{56(2 + \sqrt{3.5}) \sqrt{\log t}}{\epsilon},$$

where $\zeta$ denotes the Riemann zeta function.

Proof Sketch. The proof is similar to the one for Theorem 3.2, but we have to choose the error probability to be $t^{-3.5}$.

3.3 The DP-UCB-INT Algorithm

Both Algorithms 1 and 2 enjoy logarithmic regret, with only a small additional factor in the time step $t$ over the optimal non-private regret. However, this includes a multiplicative factor of $\epsilon^{-1}$ and $\epsilon^{-2}$ respectively; consequently, increasing privacy scales the total regret proportionally. A natural question is whether or not it is possible to get a differentially private algorithm with only an additive constant to the optimal regret. Algorithm 3 answers this question positively, using novel tricks to achieve differential privacy. Looking at the regret analysis of Algorithms 1 and 2, we observe that by adding noise proportional to $\epsilon^{-1}$, we get a multiplicative factor over the optimal. In other words, to remove this factor, the noise should not depend on $\epsilon$. But how can we get $\epsilon$-DP in this case?

Note that if we compute and use the mean at each time step with an $\epsilon_{n_{a,t}}$-DP algorithm, then after time step $t$ our overall privacy is roughly the sum $E$ of all the $\epsilon_{n_{a,t}}$. We then change the algorithm so that it only uses a released mean once every $\frac{1}{\epsilon}$ steps, making the overall privacy $\epsilon E$. In any case, $\epsilon_{n_{a,t}}$ needs to decrease at least as fast as $n_{a,t}^{-1}$ for the sum to be bounded by $\log n_{a,t}$. However, $\epsilon_{n_{a,t}}$ should also be big enough that the added noise keeps the UCB confidence interval at the same order; otherwise, the regret will be higher.

A natural choice for $\epsilon_{n_{a,t}}$ is a p-series. Indeed, by making $\epsilon_{n_{a,t}}$ of the form $n_{a,t}^{-v/2}$, where $n_{a,t}$ is the number of times action $a$ has been played until time $t$, its sum will converge to the Riemann zeta function when $v$ is appropriately chosen. This choice of $\epsilon_{n_{a,t}}$ leads to the addition of Laplace noise of scale $n_{a,t}^{v/2-1}$ to the mean (see Lemma 3.1). Now our trade-off between high privacy and low regret reduces to choosing a correct value for $v$. Indeed, we could pick $v > 2$ for the privacy sum to converge, but then the noise added at each time step would be increasing and greater than the UCB bound, which is not desirable. To overcome this issue, we use the more sophisticated $k$-fold adaptive composition theorem (III-3 in (Dwork, Rothblum, and Vadhan 2010)). Roughly speaking, this theorem shows that our overall privacy after releasing the mean a number of times depends on the sum of the squares of each individual privacy parameter $\epsilon_{n_{a,t}}$. So $v > 1$ is enough for convergence, and with $v \le 1.5$ the noise added will be decreasing and will eventually become lower than the UCB bound.

In summary, we just need to lazily update the mean of each arm every $\frac{1}{\epsilon}$ steps. However, we show that the interval of release can be much better than $\frac{1}{\epsilon}$, following a series $f$ as defined in Lemma B.1 in the supplements (Tossou and Dimitrakakis 2016). Algorithm 3 summarizes the idea developed in this section.

Algorithm 3 DP-UCB-INT($\epsilon$, $v$, $K$, $\mathcal{A}$)
Input: $\epsilon \in (0, 1]$, $v \in (1, 1.5]$: privacy rate; $K$: the number of arms; $\mathcal{A}$: the set of all arms.
$f \leftarrow \lceil \frac{1}{\epsilon} \rceil$; $\hat{x} \leftarrow 0$
(For simplicity, we take the interval $f$ to be $\lceil \frac{1}{\epsilon} \rceil$ here.)
for $t \leftarrow 1$ to $T$ do
  if $t \le Kf$ then
    play arm $a = (t - 1) \bmod K + 1$ and observe $r_t$
  else
    for all $a \in \mathcal{A}$ do
      if $n_{a,t} \bmod f = 0$ then
        $\hat{x}_a \leftarrow \frac{s_a}{n_{a,t}} + \mathrm{Lap}\!\left(0, \frac{1}{n_{a,t}^{1-v/2}}\right) + \sqrt{\frac{2 \log t}{n_{a,t}}}$
      end if
    end for
    pull arm $a_t = \arg\max_a \hat{x}_a$ and observe $r_t$
  end if
  update the sum $s_{a_t} \leftarrow s_{a_t} + r_t$
end for
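A minimal Python sketch of Algorithm 3 follows (ours, under the simple interval $f = \lceil 1/\epsilon \rceil$; we read the $n_{a,t} \bmod f = 0$ test as releasing at most once per interval, hence the extra bookkeeping):

```python
import numpy as np

def dp_ucb_int(arms, eps, v, T, rng=None):
    """Sketch of DP-UCB-INT (Algorithm 3); arms[a]() samples a reward
    in [0, 1]. Returns the pull counts of each arm."""
    rng = rng or np.random.default_rng(0)
    K = len(arms)
    f = int(np.ceil(1.0 / eps))
    s = np.zeros(K)              # raw reward sums (never released)
    n = np.zeros(K)              # pull counts n_{a,t}
    last = np.full(K, -1.0)      # pull count at the last release
    xhat = np.zeros(K)           # lazily released private indices
    for t in range(1, T + 1):
        if t <= K * f:
            a = (t - 1) % K      # round-robin initialisation
        else:
            for b in range(K):
                if n[b] % f == 0 and n[b] != last[b]:
                    last[b] = n[b]
                    # Private mean + UCB bonus; the Laplace scale
                    # n^{v/2-1} does not depend on eps (Lemma 3.1).
                    noise = rng.laplace(0.0, 1.0 / n[b] ** (1.0 - v / 2))
                    xhat[b] = (s[b] / n[b] + noise
                               + np.sqrt(2 * np.log(t) / n[b]))
            a = int(np.argmax(xhat))
        s[a] += arms[a]()
        n[a] += 1
    return n
```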

The next lemma establishes the privacy obtained each time a new mean is released for a given arm $a$.

Lemma 3.1. The mean $\hat{x}_a$ computed by Algorithm 3 for a given arm $a$ at each interval is $n_{a,t}^{-v/2}$-differentially private with respect to the reward sequence observed for that arm.

Proof Sketch. This follows directly from the fact that we add Laplace noise of scale $n_{a,t}^{v/2-1}$.

The next theorem establishes the overall privacy after having played for $t$ time steps.

Theorem 3.4. After playing for any $t$ time steps, Algorithm 3 is $(\epsilon', \delta)$-differentially private with

$$\epsilon' \le \min\left\{ \epsilon \sum_{n=1}^{t} \frac{1}{\sqrt{n^v}},\ \ \epsilon \sum_{n=1}^{t} \frac{e^{1/\sqrt{n^v}} - 1}{\sqrt{n^v}} + \sqrt{\epsilon \sum_{n=1}^{t} \frac{2 \ln \delta^{-1}}{n^v}} \right\}$$

for any $\delta \in (0, 1]$, $\epsilon \in (0, 1]$.

Proof Sketch. We begin with observations similar to those in Theorem 3.1. Then we compute the privacy of the mean of an arm using the $k$-fold adaptive composition theorem of (Dwork, Rothblum, and Vadhan 2010); see the supplements (Tossou and Dimitrakakis 2016).

The next corollary gives a nicer closed form for the privacy parameter, which is needed in practice.

Corollary 3.1. After playing for $t$ time steps, Algorithm 3 is $(\epsilon', \delta)$-differentially private with

$$\epsilon' \le \min\left\{ \epsilon \cdot \frac{t^{1 - v/2} - v/2}{1 - v/2},\ \ 2\epsilon\zeta(v) + \sqrt{2\epsilon\zeta(v) \ln(1/\delta)} \right\}$$

with $\zeta$ the Riemann zeta function, for any $\delta \in (0, 1]$, $\epsilon \in (0, 1]$, $v \in (1, 1.5]$.

Proof Sketch. We upper bound the first term of Theorem 3.4 by the integral test; for the second term we use $e^x \le 1 + 2x$ for all $x \in [0, 1]$ to conclude the proof.

The following corollary gives the parameter $\epsilon$ with which one should run Algorithm 3 to achieve a given $(\epsilon', \delta')$ privacy guarantee.

Corollary 3.2. If you run Algorithm 3 with parameter

$$\epsilon = \left( \sqrt{\frac{\log \delta'^{-1} + 4\epsilon'}{8\zeta(v)}} - \sqrt{\frac{\log \delta'^{-1}}{8\zeta(v)}} \right)^2$$

for any $\delta' \in (0, 1]$, $\epsilon' \in (0, 1]$, $v \in (1, 1.5]$, you will be at least $(\epsilon', \delta')$-differentially private.

Proof. The proof is obtained by inverting the term containing the Riemann zeta function in Corollary 3.1.
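The inversion is easy to check numerically. The sketch below (our code; `scipy.special.zeta` evaluates the Riemann zeta function) computes the input parameter from Corollary 3.2 and feeds it back through the bound of Corollary 3.1; with $\epsilon' = 1$, $\delta' = e^{-10}$, $v = 1.1$ it returns $\epsilon \approx 0.00396$, the DP-UCB-INT input value that appears in Figure 2.

```python
from math import exp, log, sqrt
from scipy.special import zeta   # Riemann zeta function, for v > 1

def input_epsilon(eps_target, delta, v):
    """Corollary 3.2: parameter eps to run Algorithm 3 with so that
    the overall privacy is (eps_target, delta)-DP."""
    a = log(1.0 / delta)
    z = 8.0 * zeta(v)
    return (sqrt((a + 4.0 * eps_target) / z) - sqrt(a / z)) ** 2

def overall_epsilon(eps, delta, v):
    """Second term of Corollary 3.1: overall privacy when running
    with input parameter eps."""
    return 2 * eps * zeta(v) + sqrt(2 * eps * zeta(v) * log(1 / delta))

eps = input_epsilon(1.0, exp(-10), 1.1)
print(eps)                                  # ~0.0039643, as in Figure 2
print(overall_epsilon(eps, exp(-10), 1.1))  # ~1.0, closing the loop
```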

Finally, we present the regret of Algorithm 3 in Theorem 3.5. A simple observation shows that it has the same regret as the non-private UCB, up to an additive constant.

Theorem 3.5. If Algorithm 3 is run with $K$ arms having arbitrary reward distributions, then its expected regret $\mathcal{R}$ after any number $t$ of plays is bounded by:

$$\mathcal{R} \le \sum_{a: \mu_a < \mu^*} \Delta_a \left( f_0 + \frac{8}{\Delta_a^2} \log t + 1 + 4\zeta(1.5) \right)$$

where $f_0 \approx \frac{1}{\epsilon}$. More precisely, $f_0$ is the first value of the series $f$ defined in Lemma B.1 in the supplements (Tossou and Dimitrakakis 2016).

Proof Sketch. This is proven using a Laplace concentration inequality to bound the estimate of the mean; we then select the error probability to be $t^{-3.5}$.

4 Experiments

We perform experiments using arms with rewards drawn from independent Bernoulli distributions. The plots, in logarithmic scale, show the regret of the algorithms over 100,000 time steps, averaged over 100 runs. We targeted two different privacy levels: $\epsilon' = 0.1$ and $\epsilon' = 1$. For DP-UCB-INT, we pick the input $\epsilon$ such that the overall privacy is $(\epsilon', \delta')$-DP, with $\epsilon$ as defined in Corollary 3.2, $\delta' = e^{-10}$, $\epsilon' \in \{0.1, 1\}$, and $v = 1.1$.

We indicate in parentheses the input privacy of each algorithm. We compared against the non-private UCB algorithm and the algorithm presented in (Mishra and Thakurta 2015) (Private-UCB), with failure probability chosen to be $t^{-4}$.

We consider two scenarios. First, we use two arms, one with expectation 0.9 and the other 0.6. The second scenario is more challenging, with 10 arms all having expectation 0.1 except two, with expectations 0.55 and 0.2.

As expected, the performance of DP-UCB-INT is significantly better than that of all other private algorithms. More importantly, the gap between the regret of DP-UCB-INT and the non-private UCB does not increase with time, confirming the theoretical regret. We can notice that DP-UCB is better than DP-UCB-BOUND for small time steps; however, as the time step increases, DP-UCB-BOUND outperforms DP-UCB and eventually catches up with its regret. The reason is that DP-UCB-BOUND takes longer to distinguish between arms with close rewards, because the additional factor in its regret depends on $\Delta_a$, which is not the case for DP-UCB. Private-UCB performs worse than all other algorithms, which is not surprising.
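A minimal harness in the spirit of this setup is sketched below (our code; it runs the plain UCB baseline on the two-arm scenario, averaged over 10 runs rather than 100 for speed, and the hypothetical `index_bonus` hook is where a private variant's extra confidence term would plug in):

```python
import numpy as np

def run_ucb(means, T, index_bonus=None, rng=None):
    """Average-regret harness: Bernoulli arms with the given means."""
    rng = rng or np.random.default_rng(0)
    K = len(means)
    s, n = np.zeros(K), np.zeros(K)
    regret, best = 0.0, max(means)
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1
        else:
            idx = s / n + np.sqrt(2 * np.log(t) / n)
            if index_bonus is not None:
                # Hook for a private variant's additional index term.
                idx = idx + np.array([index_bonus(b, n[b], t) for b in range(K)])
            a = int(np.argmax(idx))
        s[a] += float(rng.random() < means[a])   # Bernoulli reward
        n[a] += 1
        regret += best - means[a]
    return regret

# Two-arm scenario of the paper: means 0.9 and 0.6.
print(np.mean([run_ucb([0.9, 0.6], 100_000, rng=np.random.default_rng(i))
               for i in range(10)]))
```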

Figure 2: Experimental results with 100 runs, 2 or 10 arms with rewards {0.9, 0.6} or {0.1, ..., 0.1, 0.55, 0.2, 0.1, ..., 0.1}. Panels: (a) regret for $\epsilon' = 1$, 2 arms; (b) regret for $\epsilon' = 0.1$, 2 arms; (c) regret for $\epsilon' = 1$, 10 arms; (d) regret for $\epsilon' = 0.1$, 10 arms. Each panel plots average regret against time step for UCB, DP-UCB-INT (input $\epsilon \approx 0.00396431694945$ for $\epsilon' = 1$ and $\epsilon \approx 0.000253053596336$ for $\epsilon' = 0.1$, with $v = 1.1$), DP-UCB, DP-UCB-BOUND, and Private-UCB (Mishra and Thakurta 2015), with input privacy in parentheses.

Moreover, we noticed that the difference between the best and worst regret (over the 100 runs) is very consistent for all our algorithms and the non-private UCB (it is under 664.5 in the 2-arm scenario). However, this gap reaches 30,000 for Private-UCB. This means that our algorithms correctly trade off exploration and exploitation, which is not the case for Private-UCB.

5 Conclusion and Future Work

In this paper, we have proposed and analysed differentially private algorithms for the stochastic multi-armed bandit problem, significantly improving upon the state of the art. The first two (DP-UCB and DP-UCB-BOUND) are variants of an existing private UCB algorithm (Mishra and Thakurta 2015), while the third one uses an interval-based mechanism.

The first two algorithms are only within factors of $O(\epsilon^{-1} \log \log t)$ and $O(\epsilon^{-2} \log^2 \log t)$ of the non-private algorithm. The last algorithm, DP-UCB-INT, efficiently trades off the privacy level and the regret, and is able to achieve the same regret as the non-private algorithm up to an additional additive constant. This has been achieved by using two key tricks: updating the mean of each arm lazily, with an interval proportional to $\frac{1}{\epsilon}$, and adding noise independent of $\epsilon$. Intuitively, the algorithm achieves better privacy without increasing regret because its output is less dependent on individual rewards.

Perhaps it is possible to improve our bounds further if we are willing to settle for asymptotically low regret (Cowan and Katehakis 2015). A natural direction for future work is to study whether we can use similar methods for other algorithms, such as Thompson sampling (known to be differentially private (Dimitrakakis et al. 2014)), instead of UCB. Another question is whether a similar analysis can be performed for adversarial bandits.

We would also like to connect more to applications through two extensions of our algorithms. The first natural extension is to consider side information. In the drug testing example, this could include information about the drug, the test performed, and the user examined or treated. The second extension would generalise the notion of neighbouring databases to take into account the fact that multiple observations in the sequence (say, $m$) can be associated with a single individual. Our algorithms can easily be extended to deal with this setting (by re-scaling the privacy parameter to $\frac{\epsilon}{m}$). However, in practice $m$ could be quite large, and it will be interesting future work to check whether we can obtain sublinearity in the parameter $m$ under certain conditions.

References

Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2/3):235–256.

Barry, D.; Parlange, J.-Y.; Li, L.; Prommer, H.; Cunningham, C.; and Stagnitti, F. 2000. Analytical approximations for real values of the Lambert W-function. Mathematics and Computers in Simulation 53(1–2):95–103.

Burnetas, A. N., and Katehakis, M. N. 1996. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics 17(2):122–142.

Calandrino, J. A.; Kilzer, A.; Narayanan, A.; Felten, E. W.; and Shmatikov, V. 2011. "You might also like:" Privacy risks of collaborative filtering. In 32nd IEEE Symposium on Security and Privacy, 231–246.

Chan, T. H.; Shi, E.; and Song, D. 2010. Private and continual release of statistics. In Automata, Languages and Programming. Springer. 405–417.

Cowan, W., and Katehakis, M. N. 2015. Asymptotic behavior of minimal-exploration allocation policies: Almost sure, arbitrarily slow growing regret. arXiv preprint arXiv:1505.02865.

Dimitrakakis, C.; Nelson, B.; Mitrokotsa, A.; and Rubinstein, B. 2014. Robust and private Bayesian inference. In Algorithmic Learning Theory.

Dwork, C., and Roth, A. 2013. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9(3–4):211–407.

Dwork, C.; McSherry, F.; Nissim, K.; and Smith, A. 2006. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC'06, 265–284.

Dwork, C.; Rothblum, G. N.; and Vadhan, S. 2010. Boosting and differential privacy. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS '10, 51–60.

Dwork, C. 2006. Differential privacy. In ICALP, 1–12. Springer.

Jain, P.; Kothari, P.; and Thakurta, A. 2012. Differentially private online learning. In Mannor, S.; Srebro, N.; and Williamson, R. C., eds., COLT 2012 - The 25th Annual Conference on Learning Theory, volume 23, 24.1–24.34.

Korolova, A. 2010. Privacy violations using microtargeted ads: A case study. In ICDMW 2010, The 10th IEEE International Conference on Data Mining Workshops, 474–482.

Lai, T. L., and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1):4–22.

Mishra, N., and Thakurta, A. 2015. (Nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence (UAI 2015).

Pandey, S., and Olston, C. 2006. Handling advertisements of unknown quality in search advertising. In Schölkopf, B.; Platt, J. C.; and Hoffman, T., eds., Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, 1065–1072.

Robbins, H., et al. 1952. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58(5):527–535.

Thakurta, A. G., and Smith, A. D. 2013. (Nearly) optimal algorithms for private online learning in full-information and bandit settings. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, 2733–2741.

Thompson, W. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3–4):285–294.

Tossou, A., and Dimitrakakis, C. 2016. Supplementary materials for "Algorithms for Differentially Private Multi-Armed Bandits". https://www.dropbox.com/s/fotezpnx49nz0i3/single-mab-aaai16-sups-final.pdf. [Online; accessed 1 December 2015].

Zhao, J.; Jung, T.; Wang, Y.; and Li, X. 2014. Achieving differential privacy of data disclosure in the smart grid. In 2014 IEEE Conference on Computer Communications, INFOCOM 2014, 504–512.

A Collected proofs

A.1 Proof of Theorem 3.2

In the following, $a$ will be used to indicate the index of an arm, $\epsilon$ is the differential privacy parameter, and $t$ denotes the time step.

From Lemma B.2, we know that the error between the empirical and private mean is bounded as $|Y - X| \le h_n$ with probability at least $1 - \gamma$, where $X$ is the empirical mean returned by the private mechanism, $Y$ the true empirical mean, and $h_n$ the error due to the differentially private mechanism, defined as:

$$h_n = \frac{1}{\epsilon} \cdot \sqrt{8}\, (\log n) \cdot \ln\frac{4}{\gamma} \cdot n^{-1} + \frac{1}{\epsilon} \cdot \sqrt{8} \cdot \ln\frac{4}{\gamma} \cdot n^{-1}.$$

We can rewrite this bound as equations (A.1) and (A.2):

$$\Pr(X \ge Y + h_n) \le \gamma \qquad (A.1)$$
$$\Pr(X \le Y - h_n) \le \gamma \qquad (A.2)$$

Let $T_a(s)$ be the number of times arm $a$ is played in the first $s$ time steps, and let $c_{t,n} \triangleq \sqrt{(2 \ln t)/n}$ denote the original UCB confidence index.

By following similar steps as in the analysis of UCB in (Auer, Cesa-Bianchi, and Fischer 2002), we have

$$T_a(s) = 1 + \sum_{t=K+1}^{s} \mathbb{1}\{a_t = a\} \le \ell + \sum_{t=1}^{\infty} \sum_{n=1}^{t-1} \sum_{n_a=\ell}^{t-1} \mathbb{1}\{X^*_n + c_{t,n} + h_n \le X_{a,n_a} + c_{t,n_a} + h_{n_a}\} \qquad (A.3)$$

In equation (A.3), $X^*_n$ is the mean returned by the private mechanism for the best arm when it has been played $n$ times. Now we can observe that $X^*_n + c_{t,n} + h_n \le X_{a,n_a} + c_{t,n_a} + h_{n_a}$ implies that at least one of the following must hold:

$$X^*_n \le \mu^* - c_{t,n} - h_n \qquad (A.4)$$
$$X_{a,n_a} \ge \mu_a + c_{t,n_a} + h_{n_a} \qquad (A.5)$$
$$\mu^* < \mu_a + 2c_{t,n} + 2h_n \qquad (A.6)$$

We can bound the probability of event (A.4) using equation (A.2), the union bound, and the Chernoff-Hoeffding bound:

$$\Pr(A.4) = \Pr(X^*_n \le \mu^* - c_{t,n} - h_n) = \Pr(X^*_n \le Y^*_n - h_n \vee Y^*_n \le \mu^* - c_{t,n})$$
$$\le \Pr(X^*_n \le Y^*_n - h_n) + \Pr(Y^*_n \le \mu^* - c_{t,n}) \le \gamma + \exp(-4 \log t) = \gamma + t^{-4}. \qquad (A.7)$$

Similarly, to bound the probability of (A.5) we use (A.1), the union bound, and the Chernoff-Hoeffding bound:

$$\Pr(A.5) = \Pr(X_{a,n_a} \ge \mu_a + c_{t,n_a} + h_{n_a}) \le \Pr(X_{a,n_a} \ge Y_{a,n_a} + h_{n_a}) + \Pr(Y_{a,n_a} \ge \mu_a + c_{t,n_a})$$
$$\le \gamma + \exp(-4 \log t) = \gamma + t^{-4}. \qquad (A.8)$$

Let us choose $\gamma = t^{-4}$; this leads respectively to

$$\Pr(A.4) \le 2t^{-4} \qquad (A.9)$$
$$\Pr(A.5) \le 2t^{-4} \qquad (A.10)$$

Now consider the last condition (A.6). For this, we want to find the minimum number $n$ for which event (A.6) is always false. Event (A.6) being false implies $\Delta_a > 2c_{t,n} + 2h_n$, where $\Delta_a = \mu^* - \mu_a$. We observe that for $\Delta_a > 2c_{t,n} + 2h_n$ to hold, it is enough that the following two conditions hold, for any $\lambda_0$ such that $0 < \lambda_0 < 1$:

$$\lambda_0 \Delta_a > 2c_{t,n} \qquad (A.11)$$
$$(1 - \lambda_0) \Delta_a > 2h_n \qquad (A.12)$$

From inequality (A.11), we have

$$n \ge \frac{8}{\lambda_0^2 \Delta_a^2} \log t. \qquad (A.13)$$

Inequality (A.12) leads to $n \ge B \log n + B$ with

$$B = \frac{\sqrt{8}}{\epsilon (1 - \lambda_0) \Delta_a} \cdot \ln\frac{4}{\gamma} = \frac{\sqrt{8}}{\epsilon (1 - \lambda_0) \Delta_a} \cdot \ln(4t^4).$$

We can rewrite the inequality in a more familiar form:

$$e^{B^{-1} n} \ge e \cdot n.$$

This is a standard transcendental algebraic inequality whose solution is given by the Lambert $W$ function. So

$$n \ge -B \cdot W\left(\frac{-1}{e \cdot B}, -1\right),$$

where $W(x, k)$ is the Lambert function of $x$ on branch $k$. Note that the branch here is $-1$, and because $-e^{-1} < \frac{-1}{e \cdot B} < 0$ for all $t > 1$, we are always guaranteed to find a real number. By using the approximation of the Lambert function provided in Section 3.1 of (Barry et al. 2000), we can conclude that

$$n \ge -B \cdot W\left(\frac{-1}{e \cdot B}, -1\right) \approx -B\left(\ln\frac{1}{e \cdot B} - \frac{2}{0.3361}\right) \ge B(\ln B + 7) = \frac{\sqrt{8} \ln(4t^4)}{\epsilon (1 - \lambda_0) \Delta_a}\left(\ln\left(\frac{\sqrt{8} \ln(4t^4)}{\epsilon (1 - \lambda_0) \Delta_a}\right) + 7\right) \qquad (A.14)$$

Combining inequalities (A.13) and (A.14) yields

$$n \ge \max\left\{ B(\ln B + 7),\ \frac{8}{\lambda_0^2 \Delta_a^2} \log t \right\}.$$

In summary,

$$T_a(s) \le \left\lceil \max\left\{ B(\ln B + 7),\ \frac{8}{\lambda_0^2 \Delta_a^2} \log t \right\} \right\rceil + \sum_{t=1}^{\infty} \sum_{n=1}^{t-1} \sum_{n_a=\ell}^{t-1} 4t^{-4}$$
$$\le 1 + \max\left\{ B(\ln B + 7),\ \frac{8}{\lambda_0^2 \Delta_a^2} \log t \right\} + \sum_{t=1}^{\infty} 4t^{-2}$$
$$\le 1 + \max\left\{ B(\ln B + 7),\ \frac{8}{\lambda_0^2 \Delta_a^2} \log t \right\} + \frac{2\pi^2}{3}, \qquad (A.15)$$

which concludes the proof.

A.2 Proof for Theorem 3.3

The proof of this theorem is similar to the one for Theorem 3.5, with

$$h_n = \frac{1}{\epsilon} \cdot \sqrt{8}\, (\log n) \cdot \ln\frac{4}{\gamma} \cdot n^{-1} + \frac{1}{\epsilon} \cdot \sqrt{8} \cdot \ln\frac{4}{\gamma} \cdot n^{-1}.$$

We make the same choice of $\lambda$ and $\gamma$. However, the minimum number $n$ compatible with these choices leads to a transcendental equation of the same form as the one in the proof of Theorem 3.2.


B Proofs for UCB-Interval Algorithm

Fact B.1 (Differential privacy of the Laplace mechanism; see Theorem 4 in (Dwork and Roth 2013)). For any real function $g$ of the data, a mechanism adding Laplace noise with scale parameter $\beta$ is $(\Delta g / \beta)$-differentially private, where $\Delta g$ is the $L_1$ sensitivity of $g$.

B.1 Proof of Lemma 3.1

Proof. Indeed, for each arm, we add Laplace noise of mean 0 and scale $n_{a,t}^{v/2-1}$, where $n_{a,t}$ is the number of times this arm has been played. As the sensitivity of the mean is $\frac{1}{n_{a,t}}$, the differential privacy of the Laplace mechanism (Fact B.1) concludes the proof.

B.2 Proof of Theorem 3.4

Similarly to the proof of Theorem 3.1, the overall privacy of Algorithm 3 will be the same as the overall privacy of the mechanism computing the mean of the rewards received from a single arm.

A new Laplace mechanism is used to compute the mean of each arm. However, a new mean is only released $t/f$ times (every $f$ time steps) after $t$ time steps, where $f$ is the interval used.

According to the $k$-fold adaptive composition theorem (III-3 in (Dwork, Rothblum, and Vadhan 2010)), Algorithm 3 will be $(\epsilon', \delta)$-differentially private for any $\delta \in (0, 1]$ with $\epsilon' \le \min\{C, D\}$, where

$$C = \sum_{n \in I_{n_a}^{1/\epsilon}} \frac{1}{\sqrt{n^v}}$$

$$D = \sum_{n \in I_{n_a}^{1/\epsilon}} \frac{1}{\sqrt{n^v}}\left(e^{1/\sqrt{n^v}} - 1\right) + \sqrt{\sum_{n \in I_{n_a}^{1/\epsilon}} \frac{2}{n^v} \ln(1/\delta)}$$

and $I_{n_a}^{1/\epsilon} = \{\frac{1}{\epsilon}, \frac{2}{\epsilon}, \frac{3}{\epsilon}, \cdots, n_a\}$. We have

$$D = \sum_{n \in I_{n_a}^{1/\epsilon}} \frac{1}{\sqrt{n^v}}\left(e^{1/\sqrt{n^v}} - 1\right) + \sqrt{\sum_{n \in I_{n_a}^{1/\epsilon}} \frac{2}{n^v} \ln(1/\delta)} \qquad (B.1)$$
$$\le \sum_{n \in I_{t}^{1/\epsilon}} \frac{1}{\sqrt{n^v}}\left(e^{1/\sqrt{n^v}} - 1\right) + \sqrt{\sum_{n \in I_{t}^{1/\epsilon}} \frac{2}{n^v} \ln(1/\delta)} \qquad (B.2)$$
$$\le \epsilon \sum_{n=1}^{t} \frac{1}{\sqrt{n^v}}\left(e^{1/\sqrt{n^v}} - 1\right) + \sqrt{\epsilon \sum_{n=1}^{t} \frac{2}{n^v} \ln(1/\delta)} \qquad (B.3)$$

$$C = \sum_{n \in I_{n_a}^{1/\epsilon}} \frac{1}{\sqrt{n^v}} \qquad (B.4)$$
$$\le \sum_{n \in I_{t}^{1/\epsilon}} \frac{1}{\sqrt{n^v}} \qquad (B.5)$$
$$\le \epsilon \sum_{n=1}^{t} \frac{1}{\sqrt{n^v}} \qquad (B.6)$$

which concludes the proof.

Lemma B.1 (Values of the interval of release of the means). The interval $f$ by which we should update the mean follows a series such that $f_n = W_{n+1} - W_n$ with:

$$W_0 = 0, \quad W_{n+1} \text{ is either } x \text{ or } y \text{ with}$$
$$x = \inf\left\{ x \in \mathbb{N},\ x \ge W_n + 1 : \sum_{i=W_n+1}^{x} \frac{1}{\sqrt{i^v}} \ge \frac{1}{\epsilon \sqrt{x^v}} \right\}$$
$$y = \inf\left\{ y \in \mathbb{N},\ y \ge W_n + 1 : \sum_{i=W_n+1}^{y} \frac{1}{i^v} \ge \frac{1}{\epsilon y^v} \right\}$$

Furthermore, $f_n$ is always such that $f_n \le \frac{1}{\epsilon}$. Note that only one of the two expressions for $W_{n+1}$ should be used, up to the horizon $T$.

Proof. From the proof of Theorem 3.4, we can easily see that we can pass from line (B.2) to (B.3) if the condition on $y$ in Lemma B.1 is verified. Similarly, we can pass from line (B.5) to (B.6) if the condition on $x$ in Lemma B.1 is verified. In both cases, $\frac{1}{\epsilon}$ is an upper bound on the numbers $x - W_n$ and $y - W_n$, as the original interval used in both lines (B.2) and (B.5) is $\frac{1}{\epsilon}$.

B.3 Proof of Corollary 3.1

From Theorem 3.4, we know that Algorithm 3 will be $(\epsilon', \delta)$-differentially private for any $\delta \in (0, 1]$, $1 < v \le 1.5$, with $\epsilon' \le \min\{C, D\}$, where

$$C = \epsilon \sum_{n=1}^{t} \frac{1}{\sqrt{n^v}}, \qquad D = \epsilon \sum_{n=1}^{t} \frac{e^{1/\sqrt{n^v}} - 1}{\sqrt{n^v}} + \sqrt{\epsilon \sum_{n=1}^{t} \frac{2 \ln \delta^{-1}}{n^v}}.$$

It is easy to get an upper bound for $C$ using the integral test inequality, giving:

$$C \le \epsilon \cdot \frac{t^{1 - v/2} - v/2}{1 - v/2}.$$

We now simplify $D$ using a standard approximation of the exponential function:

$$D = \epsilon \sum_{n=1}^{t} \frac{e^{1/\sqrt{n^v}} - 1}{\sqrt{n^v}} + \sqrt{\epsilon \sum_{n=1}^{t} \frac{2 \ln \delta^{-1}}{n^v}} \qquad (B.7)$$
$$\le \epsilon \sum_{t=1}^{\infty} \frac{1}{\sqrt{t^v}}\left(e^{1/\sqrt{t^v}} - 1\right) + \sqrt{\epsilon \sum_{t=1}^{\infty} \frac{2}{t^v} \ln(1/\delta)} \qquad (B.8)$$
$$\le \epsilon \sum_{t=1}^{\infty} 2\left(\frac{1}{\sqrt{t^v}}\right)^2 + \sqrt{\epsilon \sum_{t=1}^{\infty} \frac{2}{t^v} \ln(1/\delta)} \qquad (B.9)$$
$$= 2\epsilon\zeta(v) + \sqrt{2\epsilon\zeta(v) \ln(1/\delta)} \qquad (B.10)$$

where $\zeta$ is the Riemann zeta function.

B.4 Proof of Corollary 3.2

The proof is immediate from Corollary 3.1; it is obtained by inverting the term containing the Riemann zeta function in Corollary 3.1.

B.5 Proof of Theorem 3.5

Fact B.2 (Fact 3.7 in (Dwork and Roth 2013)). If $Y \sim \mathrm{Lap}(b)$ then

$$\Pr\left(|Y| \ge b \ln\frac{1}{\gamma}\right) = \gamma. \qquad (B.11)$$

First, let us ignore the effect of computing the means per interval and assume that we compute the mean at each time step, after playing each arm $f_0$ times, where $f_0 \approx \frac{1}{\epsilon}$ is the first value of the series $f$ defined in Lemma B.1. As much of the proof is quite similar to that of Theorem 3.2, we shall omit some steps.

Similarly to the proof of Theorem 3.2, we will take a bad arm when one of the following three events happens:

$$X^*_n \le \mu^* - c_{t,n} \qquad (B.12)$$
$$X_{a,n_a} \ge \mu_a + c_{t,n_a} \qquad (B.13)$$
$$\mu^* < \mu_a + 2c_{t,n} \qquad (B.14)$$

Since we are adding Laplace noise, we can use Fact B.2 to show that

$$\Pr\left\{ |Y - X| \ge \log\left(\frac{1}{\gamma}\right) n^{v/2 - 1} \right\} \le \gamma. \qquad (B.15)$$

Let $h_n = \log(\frac{1}{\gamma})\, n^{v/2 - 1}$. Now we can bound the probability of event (B.13) using equation (B.15), the union bound, and the Chernoff-Hoeffding bound. To be precise, we use $X_{a,n_a}$ for $\hat{x}_a$ and $Y$ for the empirical mean $\bar{x}_a$ at time $t$, while $c_{t,n_a} = \sqrt{2 \log(t)/n_a}$ as before.

$$\Pr(B.13) = \Pr(X_{a,n_a} \ge \mu_a + c_{t,n_a})$$
$$= \Pr(X_{a,n_a} \ge Y_{a,n_a} + h_{n_a} \vee Y_{a,n_a} \ge \mu_a + c_{t,n_a} - h_{n_a})$$
$$\le \Pr(X_{a,n_a} \ge Y_{a,n_a} + h_{n_a}) + \Pr(Y_{a,n_a} \ge \mu_a + c_{t,n_a} - h_{n_a}) \qquad (B.16)$$
$$\le \gamma + \exp(-2 n_a (c_{t,n_a} - h_{n_a})^2) \qquad (B.17)$$
$$\le \gamma + t^{-3.5} \qquad (B.18)$$
$$\le 2t^{-3.5} \qquad (B.19)$$

In the above, we choose $h_{n_a} \le \lambda_4 c_{t,n_a}$ with $\lambda_4 = 1 - \frac{\sqrt{3.5}}{2}$ and $\gamma = t^{-3.5}$.

Similarly, we prove a bound on the probability of (B.12):

$$\Pr(B.12) = \Pr(X^*_n \le \mu^* - c_{t,n}) \le \Pr(X^*_n \le Y^*_n - h_{n}) + \Pr(Y^*_n \le \mu^* - c_{t,n} + h_{n}) \le \gamma + t^{-3.5} = 2t^{-3.5}. \qquad (B.20)$$

Now, let us compute the minimum value for $n_a$ which is consistent with our choice of $\lambda_4$ and $\gamma$. It is easy to show that $n_a \ge \left( \frac{2 - \sqrt{3.5}}{3.5 \sqrt{2 \log t}} \right)^{\frac{2}{v-1}}$ suffices, and this number converges to 0 as $t$ increases.

The final event (B.14) is exactly the standard UCB event, and (B.14) will be false for all $n_a \ge \frac{8}{\Delta_a^2} \log t$.

Now, we can easily see that the effect of the interval in the algorithm will not change this number. Indeed, because we only update the mean every $f$ steps, the number $n_a$ should be divided by $f$. But after updating the mean, we play this arm for a number of steps lower than or equal to $f$; so we should multiply $n_a$ by $f$, which cancels out the effect of the division.

In summary (and similarly to Theorem 3.2),

$$T_a(s) \le f_0 + \left\lceil \max\left\{ \left( \frac{2 - \sqrt{3.5}}{3.5 \sqrt{2 \log t}} \right)^{\frac{2}{v-1}},\ \frac{8}{\Delta_a^2} \log t \right\} \right\rceil + \sum_{t=1}^{\infty} \sum_{n=1}^{t-1} \sum_{n_a=\ell}^{t-1} 4t^{-3.5}$$
$$\le f_0 + 1 + \frac{8}{\Delta_a^2} \log t + \sum_{t=1}^{\infty} 4t^{-1.5}$$
$$\le f_0 + 1 + \frac{8}{\Delta_a^2} \log t + 4\zeta(1.5). \qquad (B.21)$$

Lemma B.2 (Improved (Chan, Shi, and Song 2010) bound). For any $\gamma \le n^{-b}$, where $b > 0$, Chan's hybrid mechanism is $\epsilon$-differentially private and has an error bounded, with probability at least $1 - \gamma$, by $\frac{\sqrt{8}}{\epsilon} \log(\frac{4}{\gamma}) \log n + \frac{\sqrt{8}}{\epsilon} \log(\frac{4}{\gamma})$.

Proof. Our improvement is based on two observations. Firstly, if $\gamma \le n^{-b}$ then we can use the alternative concentration bound for the sum of Laplace distributions. Secondly, we note that, given all the previously released sums, the current sum is only drawn from one of the two mechanisms; thus there is no need to apply a composition theorem.
