
Algorithms for Differentially Private Multi-Armed Bandits



HAL Id: hal-01234427

https://hal.inria.fr/hal-01234427

Submitted on 26 Nov 2015


Algorithms for Differentially Private Multi-Armed Bandits

Aristide Tossou, Christos Dimitrakakis

To cite this version:

Aristide Tossou, Christos Dimitrakakis. Algorithms for Differentially Private Multi-Armed Bandits. AAAI 2016, Feb 2016, Phoenix, Arizona, United States. ⟨hal-01234427⟩


Algorithms for Differentially Private Multi-Armed Bandits

Aristide C. Y. Tossou and Christos Dimitrakakis

Chalmers University of Technology, Gothenburg, Sweden
{aristide, chrdimi}@chalmers.se

Abstract

We present differentially private algorithms for the stochastic Multi-Armed Bandit (MAB) problem. This is a problem for applications such as adaptive clinical trials, experiment design, and user-targeted advertising where private information is connected to individual rewards. Our major contribution is to show that there exist $(\epsilon, \delta)$-differentially private variants of Upper Confidence Bound algorithms which have optimal regret, $O(\epsilon^{-1} + \log T)$. This is a significant improvement over previous results, which only achieve poly-log regret $O(\epsilon^{-2} \log^2 T)$, because of our use of a novel interval-based mechanism. We also substantially improve the bounds of a previous family of algorithms which use a continual release mechanism. Experiments clearly validate our theoretical bounds.

1 Introduction

The well-known stochastic $K$-armed bandit problem (Thompson 1933; Robbins and others 1952) involves an agent sequentially choosing among a set of arms $\mathcal{A} = \{1, \ldots, K\}$, and obtaining a sequence of scalar rewards $\{r_t\}$, such that, if the agent's action at time $t$ is $a_t = i$, then it obtains reward $r_t$ drawn from some distribution $P_i$ with expectation $\mu_i \triangleq \mathbb{E}(r_t \mid a_t = i)$. The goal of the decision maker is to draw arms so as to maximize the total reward $\sum_{t=1}^{T} r_t$ obtained.

This problem is a model for many applications where there is a need for trading off exploration and exploitation. This occurs because we only see the reward of the arm we pull. An example is clinical trials, where arms correspond to different treatments or tests, and the goal can be to maximise the number of cured patients over time while being uncertain about the effects of treatments. Other problems, such as search engine advertisement and movie recommendations, can be formalised similarly (Pandey and Olston 2006).

It has been previously noted (Jain, Kothari, and Thakurta 2012; Thakurta and Smith 2013; Mishra and Thakurta 2015; Zhao et al. 2014) that privacy is an important consideration for many multi-armed bandit applications. Indeed, privacy can be easily violated by observing changes in the prediction of the bandit algorithm. This has been demonstrated for recommender systems such as Amazon by (Calandrino et al. 2011) and for user-targeted advertising such as Facebook by (Korolova 2010). In both cases, with a moderate amount of side information and by tracking changes in the output of the system, it was possible to learn private information about any targeted user.

Differential privacy (DP) (Dwork 2006) provides an answer to this privacy issue by making the output of an algorithm almost insensitive to any single user's information. That is, no matter what side information is available to an outside observer, they cannot gain more information about a user than they already had by observing the outputs released by the algorithm. This goal is achieved by formally bounding the loss in privacy through the use of two parameters $(\epsilon, \delta)$, as shown in Definition 2.1.

For bandit problems, differential privacy implies that the actions taken by the bandit algorithm do not reveal information about the sequence of rewards obtained. In the context of clinical trials and diagnostic tests, it guarantees that even an adversary with arbitrary side information, such as the identity of each patient, cannot learn anything from the output of the learning algorithm about patient history, condition, or test results.

1.1 Related Work

Differential privacy (DP) was introduced by (Dwork et al. 2006); a good overview is given in (Dwork and Roth 2013). While the initial focus of DP was static databases, interest in its relation to online learning problems has increased recently. In the full information setting, (Jain, Kothari, and Thakurta 2012) obtained differentially private algorithms with near-optimal bounds. In the bandit setting, (Thakurta and Smith 2013) were the first to present a differentially private algorithm, for the adversarial case, while (Zhao et al. 2014) present an application to smart grids in this setting. Then, (Mishra and Thakurta 2015) provided a differentially private algorithm for the stochastic bandit problem. Their algorithms are based on two non-private stochastic bandit algorithms: Upper Confidence Bound (UCB, (Auer, Cesa-Bianchi, and Fischer 2002)) and Thompson sampling (Thompson 1933). Their results are sub-optimal: although simple index-based algorithms achieving $O(\log T)$ regret exist (Burnetas and Katehakis 1996; Auer, Cesa-Bianchi, and Fischer 2002), these differentially private algorithms add poly-log terms in the time $T$, as well as further linear terms in the number of arms, compared to the non-private optimal regret $O(\log T)$.

We provide a significantly different and improved UCB-style algorithm whose regret only adds a constant, privacy-dependent term to the optimal. We also improve upon previous algorithms by relaxing the need to know the horizon $T$ ahead of time, and as a result we obtain a uniform bound.

Finally, we also obtain significantly improved bounds for a variant of the original algorithm of (Mishra and Thakurta 2015), by using a different proof technique and confidence intervals. Note that, similarly to their result, we only make distributional assumptions on the data for the regret analysis; to ensure privacy, our algorithms make no assumption on the data. We summarize our contributions in the next section.

1.2 Our Contributions

• We present a novel differentially private algorithm (DP-UCB-INT) in the stochastic bandit setting that is almost optimal and only adds an additive constant term (depending on the privacy parameter) to the optimal non-private version. Previous algorithms had large multiplicative factors over the optimal.

• We also provide an incremental but important improvement to the regret of existing differentially private algorithms in the stochastic bandit setting, using the same family of algorithms as previously presented in the literature. This is done by using a simpler confidence bound and a more sophisticated proof technique. These bounds are achieved by the DP-UCB-BOUND and DP-UCB algorithms.

• We present the first set of differentially private algorithms in the bandit setting which run over an unbounded horizon and do not require knowledge of the horizon $T$. Furthermore, all our regret analyses hold for any time step $t$.

2 Preliminaries

2.1 Multi-Armed Bandit

The well-known stochastic $K$-armed bandit problem (Thompson 1933; Lai and Robbins 1985; Auer, Cesa-Bianchi, and Fischer 2002) involves an agent sequentially choosing among a set of $K$ arms $\mathcal{A} = \{1, \ldots, K\}$. At each time step $t$, the player selects an action $a_t = i \in \mathcal{A}$ and obtains a reward $r_t \in [0, 1]$. The reward $r_t$ is drawn from some fixed but unknown distribution $P_i$ such that $\mathbb{E}(r_t \mid a_t) = \mu_i$. The goal of the decision maker is to draw arms so as to maximize the total reward obtained after $T$ interactions. An equivalent notion is to minimize the total regret against an agent who knew the arm with the maximum expectation before the game starts and always plays it. The regret is defined by:

$$\mathcal{R} \triangleq T\mu^* - \mathbb{E}_\pi \sum_{t=1}^{T} r_t, \qquad (2.1)$$

where $\mu^* \triangleq \max_{a \in \mathcal{A}} \mu_a$ is the mean reward of the optimal arm and $\pi(a_t \mid a_{1:t-1}, r_{1:t-1})$ is the policy of the decision maker, defining a probability distribution on the next action $a_t$ given the history of previous actions $a_{1:t-1} = a_1, \ldots, a_{t-1}$ and rewards $r_{1:t-1} = r_1, \ldots, r_{t-1}$. Our goal is to bound the regret uniformly over $T$.

2.2 Differential Privacy

Differential privacy was originally proposed by (Dwork 2006) as a way to formalise the amount of information about the input of an algorithm that is leaked to an adversary observing its output, no matter what the adversary's side information is. In the context of our setup, the algorithm's input is the sequence of rewards, and its output the actions. Consequently, we use the following definition of differentially private bandit algorithms.

Definition 2.1 ($(\epsilon, \delta)$-differentially private bandit algorithm). A bandit algorithm $\pi$ is $(\epsilon, \delta)$-differentially private if, for all reward sequences $r_{1:t-1}$ and $r'_{1:t-1}$ that differ in at most one time step, we have for all $S \subseteq \mathcal{A}$:

$$\pi(a_t \in S \mid a_{1:t-1}, r_{1:t-1}) \le \pi(a_t \in S \mid a_{1:t-1}, r'_{1:t-1})\, e^{\epsilon} + \delta,$$

where $\mathcal{A}$ is the set of actions. When $\delta = 0$, the algorithm is said to be $\epsilon$-differentially private.

Intuitively, this means that changing any reward $r_t$ for a given arm will not change too much the best arm released at time $t$ or later on. If each $r_t$ is a private piece of information, or a point associated with a single individual, then the definition above means that the presence or absence of that individual will not affect the output of the algorithm too much. Hence, the algorithm will not reveal any extra information about this individual, leading to privacy protection. The privacy parameters $(\epsilon, \delta)$ determine the extent to which an individual entry affects the output; lower values of $(\epsilon, \delta)$ imply higher levels of privacy.

A natural way to obtain privacy is to add noise, such as Laplace noise ($\mathrm{Lap}$), to the output of the algorithm. The main challenge is how to get the maximum privacy while adding as little noise as possible. This leads to a trade-off between privacy and utility. In this paper, we demonstrate how to optimally trade off these two notions.
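For concreteness, here is a minimal Python sketch (ours, not from the paper) of the Laplace mechanism applied to an empirical mean of $n$ rewards in $[0, 1]$, whose $L_1$ sensitivity is $1/n$:

```python
import numpy as np

def private_mean(rewards, epsilon, rng=None):
    """Release the mean of rewards in [0, 1] under epsilon-DP.

    Changing a single reward moves the mean by at most 1/n (its L1
    sensitivity), so Laplace noise of scale 1/(n * epsilon) suffices.
    """
    rng = rng or np.random.default_rng()
    n = len(rewards)
    noise = rng.laplace(loc=0.0, scale=1.0 / (n * epsilon))
    return float(np.mean(rewards)) + noise
```

Halving $\epsilon$ doubles the noise scale, which is exactly the privacy-utility trade-off quantified in the rest of the paper.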

2.3 Hybrid Mechanism

The hybrid mechanism is an online algorithm used to continually release the sum of some statistics while preserving differential privacy. More formally, there is a stream $\sigma_t = r_1, r_2, \ldots, r_t$ of statistics with $r_i \in [0, 1]$. At each time step $t$ a new statistic $r_t$ is given. The goal is to output the partial sum $y_t = \sum_{i=1}^{t} r_i$ of the statistics from time step 1 to $t$ without compromising the privacy of the statistics. In other words, we wish to find a randomised mechanism $M(y_t \mid \sigma_t, y_{1:t-1})$ that is $(\epsilon, \delta)$-differentially private.

The hybrid mechanism solves this problem by combining the Logarithm and Binary Noisy Sum mechanisms. Whenever $t = 2^k$ for some integer $k$, it uses the Logarithm mechanism to release a noisy sum by adding Laplace noise of scale $\epsilon^{-1}$. It then builds a binary tree $B$, which is used to release noisy sums until $t = 2^{k+1}$ via the Binary mechanism. This uses the leaf nodes of $B$ to store the inputs $r_i$, while all other nodes store partial sums, with the root containing the sum from $2^k$ to $2^{k+1} - 1$. Since the tree depth is logarithmic, only a logarithmic amount of noise is added to any given sum; more specifically, Laplace noise of scale $\frac{\log t}{\epsilon}$ and mean 0, denoted by $\mathrm{Lap}(\frac{\log t}{\epsilon})$.

(Chan, Shi, and Song 2010) prove that the hybrid mechanism is $\epsilon$-differentially private for any $n$, where $n$ is the number of statistics seen so far. They also show that, with probability at least $1 - \gamma$, the error in the released sum is upper bounded by $\frac{1}{\epsilon} \log(\frac{1}{\gamma}) \log^{1.5} n$. In this paper, we derive and use a tighter bound for this same mechanism (see Appendix in Supplementary Material), namely:

$$\frac{\sqrt{8}}{\epsilon} \log\left(\frac{4}{\gamma}\right) \log n + \frac{\sqrt{8}}{\epsilon} \log\left(\frac{4}{\gamma}\right). \qquad (2.2)$$
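As an illustration, the sketch below implements the Binary mechanism component in a simplified form (our reconstruction in the spirit of (Chan, Shi, and Song 2010), not the authors' code): each node of an implicit binary tree over time holds a partial sum plus Laplace noise, and a prefix sum at time $t$ combines at most $\log_2 t$ noisy nodes. The full hybrid mechanism additionally restarts with the Logarithm mechanism at powers of two, which this sketch omits.

```python
import numpy as np

class BinaryMechanism:
    """Continually release noisy prefix sums of a stream in [0, 1].

    Each stream item ends up in at most log2(T)+1 tree nodes over the
    run, and each node is perturbed once with Lap(log2(T)/eps) noise,
    giving roughly eps-DP by composition.
    """

    def __init__(self, eps, horizon, rng=None):
        self.eps = eps
        self.levels = int(np.ceil(np.log2(max(horizon, 2))))
        self.rng = rng or np.random.default_rng()
        self.t = 0
        self.alpha = [0.0] * (self.levels + 1)   # exact partial sums
        self.noisy = [0.0] * (self.levels + 1)   # their noisy versions

    def insert(self, r):
        """Add the next statistic r (call at most `horizon` times)."""
        self.t += 1
        i = 0                                    # lowest set bit of t
        while (self.t >> i) & 1 == 0:
            i += 1
        self.alpha[i] = r + sum(self.alpha[:i])  # merge lower nodes up
        for j in range(i):
            self.alpha[j] = 0.0
            self.noisy[j] = 0.0
        scale = self.levels / self.eps
        self.noisy[i] = self.alpha[i] + self.rng.laplace(0.0, scale)

    def prefix_sum(self):
        # Combine the noisy nodes picked out by the binary digits of t.
        return sum(self.noisy[j] for j in range(self.levels + 1)
                   if (self.t >> j) & 1)
```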

3 Private Stochastic Multi-Armed Bandits

We describe here the general technique used by our algorithms to obtain differential privacy. Our algorithms are based on the non-private UCB algorithm by (Auer, Cesa-Bianchi, and Fischer 2002). At each time step, UCB bases its action on an optimistic estimate of the expected reward of each arm. This estimate is the sum of the empirical mean and an upper confidence bound equal to $\sqrt{\frac{2 \log t}{n_{a,t}}}$, where $t$ is the time step and $n_{a,t}$ the number of times arm $a$ has been played up to time $t$. We can observe that the only quantity using the value of the rewards is the empirical mean. To achieve differential privacy, it is therefore enough to make the player base its actions on differentially private empirical means for each arm. This is because, once the mean of each arm is computed, the action played is a deterministic function of the means. In particular, we can see the differentially private mechanism as a black box, which keeps track of the vector of non-private empirical means $Y$ for the player, and outputs a vector of private empirical means $X$. This is then used by the player to select an action, as shown in Figure 1.

We provide three different algorithms that use different techniques to privately compute the mean and calculate the index of each arm. The first, DP-UCB-BOUND, employs the Hybrid mechanism to compute a private mean and then adds a suitable term to the confidence bound to take into account the additional uncertainty due to privacy. The second, DP-UCB, employs the same mechanism, but in such a way that all arms have the same privacy-induced uncertainty; consequently the algorithm can use the same index as standard UCB. The final one employs a mechanism that only releases a new mean once at the beginning of each interval. This allows us to obtain the optimal regret rate.

3.1 The DP-UCB-BOUND Algorithm

In Algorithm 1, we compute the sum of the rewards of each arm using the hybrid mechanism (Chan, Shi, and Song 2010). However, the number of Laplace noise terms added by the hybrid mechanism, and their variance, increases as we keep pulling an arm. This means that the sums of the different arms receive different amounts of noise, larger than the original confidence bound used by UCB, which makes it difficult to identify the best arms. To solve this issue, we add the tight upper bound of equation (2.2) on the noise of the hybrid mechanism to the confidence term.

Figure 1: Graphical model for the empirical and private means. $a_t$ is the action of the agent, while $r_t$ is the reward obtained, drawn from the bandit distribution $P_i$. The vector of empirical means $Y$ is then made into a private vector $X$, which the agent uses to select actions. The rewards are essentially hidden from the agent by the DP mechanism.

Algorithm 1 DP-UCB-BOUND
Input: $\epsilon$, the differential privacy parameter.
Instantiate $K$ hybrid mechanisms (Chan, Shi, and Song 2010), one for each arm $a$.
for $t \leftarrow 1$ to $T$ do
  if $t \le K$ then
    play arm $a = t$ and observe the reward $r_t$
    insert $r_t$ into the hybrid mechanism $a$
  else
    for all arms $a$ do
      $s_a(t) \leftarrow$ total sum computed using the hybrid mechanism $a$
      if $n_{a,t}$ is a power of 2 then
        $\nu_a \leftarrow \frac{\sqrt{8}}{\epsilon} \log(4t^4)$
      else
        $\nu_a \leftarrow \frac{\sqrt{8}}{\epsilon} \log(4t^4) \log n_{a,t} + \frac{\sqrt{8}}{\epsilon} \log(4t^4)$
      end if
    end for
    pull arm $a_t = \arg\max_a \left[ \frac{s_a(t)}{n_{a,t}} + \sqrt{\frac{2 \log t}{n_{a,t}}} + \frac{\nu_a}{n_{a,t}} \right]$
    observe the reward $r_t$
    insert $r_t$ into the hybrid mechanism for arm $a_t$
  end if
end for
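The following Python sketch is our illustrative rendering of Algorithm 1; the `NoisySum` class is a hypothetical stand-in with the interface of the hybrid mechanism (it mimics the error profile, not the exact privacy accounting), and $\nu_a$ follows equation (2.2) with $\gamma = t^{-4}$:

```python
import numpy as np

rng = np.random.default_rng(0)

class NoisySum:
    """Stand-in for the hybrid mechanism of (Chan, Shi, and Song 2010):
    an exact running sum released with Laplace noise whose scale grows
    logarithmically with the stream length (illustrative only)."""
    def __init__(self, eps):
        self.eps, self.sum, self.n = eps, 0.0, 0
    def insert(self, r):
        self.sum += r
        self.n += 1
    def release(self):
        scale = max(np.log2(self.n + 1), 1.0) / self.eps
        return self.sum + rng.laplace(0.0, scale)

def dp_ucb_bound(arms, eps, T):
    """Sketch of DP-UCB-BOUND (Algorithm 1); arms[a]() samples a
    reward in [0, 1]. Returns the pull counts of each arm."""
    K = len(arms)
    mech = [NoisySum(eps) for _ in range(K)]
    n = np.zeros(K)
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                      # play each arm once
        else:
            idx = np.empty(K)
            c = (np.sqrt(8) / eps) * np.log(4.0 * t ** 4)
            for a in range(K):
                m = int(n[a])
                # nu_a compensates the mechanism's noise, cf. eq. (2.2)
                nu = c if (m & (m - 1)) == 0 else c * np.log(m) + c
                idx[a] = (mech[a].release() / m
                          + np.sqrt(2 * np.log(t) / m) + nu / m)
            a = int(np.argmax(idx))
        mech[a].insert(arms[a]())
        n[a] += 1
    return n

# Example: two Bernoulli arms with means 0.9 and 0.6.
print(dp_ucb_bound([lambda: float(rng.random() < 0.9),
                    lambda: float(rng.random() < 0.6)], eps=1.0, T=2000))
```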

Theorem 3.2 validates this choice by showing that only $O(\epsilon^{-1} \log \log t)$ factors are added to the optimal non-private regret. In Theorem 3.1, we demonstrate that Algorithm 1 is indeed $\epsilon$-differentially private.

Theorem 3.1. Algorithm 1 is $\epsilon$-differentially private after any number $t$ of plays.

Proof. This follows directly from the fact that the hybrid mechanism is $\epsilon$-DP after any number $t$ of plays and that a single flip of one reward in the sequence of rewards only affects one mechanism. Furthermore, the whole algorithm is a random mapping from the output of the hybrid mechanisms to the action taken, so using Proposition 2.1 of (Dwork and Roth 2013) completes the proof.

Theorem 3.2 gives the regret for Algorithm 1. While here we give only a proof sketch, the complete derivation can be found in the supplementary material (Tossou and Dimitrakakis 2016).

Theorem 3.2. If Algorithm 1 is run with $K$ arms having arbitrary reward distributions, then its expected regret $\mathcal{R}$ after any number $t$ of plays is bounded by:

$$\mathcal{R} \le \sum_{a: \mu_a < \mu^*} \Delta_a \max\left\{ B(\ln B + 7),\ \frac{8}{\lambda_0^2 \Delta_a^2} \log t \right\} + \sum_{a: \mu_a < \mu^*} \left( \Delta_a + \frac{2\pi^2 \Delta_a}{3} \right), \qquad (3.1)$$

$$B = \frac{\sqrt{8}}{\epsilon (1 - \lambda_0) \Delta_a} \ln(4t^4),$$

for any $\lambda_0$ such that $0 < \lambda_0 < 1$, where $\mu_1, \ldots, \mu_K$ are the expected values of $P_1, \ldots, P_K$ and $\Delta_a = \mu^* - \mu_a$.

Proof Sketch. We use the bound on the hybrid mechanism given in equation (2.2) together with the union and Chernoff-Hoeffding bounds. We then select the error probability $\gamma$ at each step to be $t^{-4}$. This leads to a transcendental inequality, solved using the Lambert $W$ function and approximated using Section 3.1 of (Barry et al. 2000).

3.2 The DP-UCB Algorithm

The key observation used in Algorithm 2 is that if, at each time step, we insert a reward into all hybrid mechanisms, then the scale of the noise will be the same for every arm. This means that there is no longer any need to compensate with an additional bound. More precisely, every time we play an arm $a_t = a$ and receive the reward $r_t$, we not only add it to the hybrid mechanism corresponding to arm $a$, but we also add a reward of 0 to the hybrid mechanisms of all other arms. As these mechanisms calculate a sum, this does not affect subsequent calculations.

Theorem 3.3 shows the validity of this approach by demonstrating a regret bound with only an additional factor of $O(\epsilon^{-2} \log^2 \log t)$ over the optimal non-private regret.

Algorithm 2 DP-UCB
Run Algorithm 1 with $\nu_a$ set to 0.
When arm $a_t = a$ is played, insert 0 into all hybrid mechanisms corresponding to arms $a' \ne a$ (do not increase $n_{a',t}$).
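In code, the change relative to the Algorithm 1 sketch above is small; the fragment below (our notation, reusing the hypothetical `NoisySum` and count variables from that sketch) shows the modified update step:

```python
# Sketch of the DP-UCB variation: set nu_a to 0 in the index and,
# after playing arm a at time t, feed an input to *every* mechanism
# so that all sums carry Laplace noise of the same scale.
def dp_ucb_update(mech, n, a, r):
    for b in range(len(mech)):
        mech[b].insert(r if b == a else 0.0)  # zeros leave the sums unchanged
    n[a] += 1                                 # only arm a's count grows
```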

Theorem 3.3. If Algorithm 2 is run with $K$ arms having arbitrary reward distributions, then its expected regret $\mathcal{R}$ after any number $t$ of plays is bounded by:

$$\mathcal{R} \le \sum_{a: \mu_a < \mu^*} \Delta_a \max\left\{ C^2 (\ln C + 7)^2,\ \frac{8}{\Delta_a^2} \log t \right\} + \sum_{a: \mu_a < \mu^*} \left( \Delta_a + 4 \Delta_a \zeta(1.5) \right),$$

$$C = \frac{56(2 + \sqrt{3.5}) \sqrt{\log t}}{\epsilon},$$

where $\zeta$ denotes the Riemann zeta function.

Proof Sketch. The proof is similar to the one for Theorem 3.2, but we have to choose the error probability to be $t^{-3.5}$.

3.3 The DP-UCB-INT Algorithm

Both Algorithms 1 and 2 enjoy logarithmic regret, with only a small additional factor in the time step $t$ over the optimal non-private regret. However, this includes a multiplicative factor of $\epsilon^{-1}$ and $\epsilon^{-2}$ respectively; consequently, increasing privacy scales the total regret proportionally. A natural question is whether or not it is possible to get a differentially private algorithm with only an additive constant to the optimal regret. Algorithm 3 answers this question positively, using novel tricks to achieve differential privacy. Looking at the regret analysis of Algorithms 1 and 2, we observe that by adding noise proportional to $\epsilon^{-1}$, we get a multiplicative factor over the optimal. In other words, to remove this factor, the noise should not depend on $\epsilon$. But how can we get $\epsilon$-DP in this case?

Note that if we compute and use the mean at each time step with an $\epsilon_{n_{a,t}}$-DP algorithm, then after time step $t$ our overall privacy is roughly the sum $E$ of all the $\epsilon_{n_{a,t}}$. We then change the algorithm so that it only uses a released mean once every $\frac{1}{\epsilon}$ steps, making the overall privacy $\epsilon E$. In any case, $\epsilon_{n_{a,t}}$ needs to decrease at least as fast as $n_{a,t}^{-1}$ for the sum to be bounded by $\log n_{a,t}$. However, $\epsilon_{n_{a,t}}$ should also be big enough that the added noise keeps the UCB confidence interval at the same order; otherwise, the regret will be higher.

A natural choice for $\epsilon_{n_{a,t}}$ is a p-series. Indeed, by making $\epsilon_{n_{a,t}}$ of the form $n_{a,t}^{-v/2}$, where $n_{a,t}$ is the number of times action $a$ has been played until time $t$, its sum will converge to the Riemann zeta function when $v$ is appropriately chosen. This choice of $\epsilon_{n_{a,t}}$ leads to the addition of Laplace noise of scale $n_{a,t}^{v/2-1}$ to the mean (see Lemma 3.1). Now our trade-off between high privacy and low regret reduces to choosing a correct value for $v$. Indeed, we could pick $v > 2$ for the privacy sum to converge, but then the noise added at each time step would be increasing and greater than the UCB bound, which is not desirable. To overcome this issue, we use the more sophisticated $k$-fold adaptive composition theorem (III-3 in (Dwork, Rothblum, and Vadhan 2010)). Roughly speaking, this theorem shows that our overall privacy after releasing the mean a number of times depends on the sum of the squares of each individual privacy parameter $\epsilon_{n_{a,t}}$. So $v > 1$ is enough for convergence, and with $v \le 1.5$ the noise added will be decreasing and will eventually become lower than the UCB bound.

In summary, we just need to lazily update the mean of each arm every $\frac{1}{\epsilon}$ steps. However, we show that the interval of release can be much better than $\frac{1}{\epsilon}$, following a series $f$ as defined in Lemma B.1 in the supplements (Tossou and Dimitrakakis 2016). Algorithm 3 summarizes the idea developed in this section.

Algorithm 3 DP-UCB-INT($\epsilon$, $v$, $K$, $\mathcal{A}$)
Input: $\epsilon \in (0, 1]$, $v \in (1, 1.5]$: privacy rate; $K$: the number of arms; $\mathcal{A}$: the set of all arms.
$f \leftarrow \lceil \frac{1}{\epsilon} \rceil$; $\hat{x} \leftarrow 0$
(For simplicity, we take the interval $f$ to be $\lceil \frac{1}{\epsilon} \rceil$ here.)
for $t \leftarrow 1$ to $T$ do
  if $t \le Kf$ then
    play arm $a = (t - 1) \bmod K + 1$ and observe $r_t$
  else
    for all $a \in \mathcal{A}$ do
      if $n_{a,t} \bmod f = 0$ then
        $\hat{x}_a \leftarrow \frac{s_a}{n_{a,t}} + \mathrm{Lap}\!\left(0, \frac{1}{n_{a,t}^{1-v/2}}\right) + \sqrt{\frac{2 \log t}{n_{a,t}}}$
      end if
    end for
    pull arm $a_t = \arg\max_a \hat{x}_a$ and observe $r_t$
  end if
  update the sum $s_{a_t} \leftarrow s_{a_t} + r_t$
end for
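A minimal Python sketch of Algorithm 3 follows (ours, under the simple interval $f = \lceil 1/\epsilon \rceil$; we read the $n_{a,t} \bmod f = 0$ test as releasing at most once per interval, hence the extra bookkeeping):

```python
import numpy as np

def dp_ucb_int(arms, eps, v, T, rng=None):
    """Sketch of DP-UCB-INT (Algorithm 3); arms[a]() samples a reward
    in [0, 1]. Returns the pull counts of each arm."""
    rng = rng or np.random.default_rng(0)
    K = len(arms)
    f = int(np.ceil(1.0 / eps))
    s = np.zeros(K)              # raw reward sums (never released)
    n = np.zeros(K)              # pull counts n_{a,t}
    last = np.full(K, -1.0)      # pull count at the last release
    xhat = np.zeros(K)           # lazily released private indices
    for t in range(1, T + 1):
        if t <= K * f:
            a = (t - 1) % K      # round-robin initialisation
        else:
            for b in range(K):
                if n[b] % f == 0 and n[b] != last[b]:
                    last[b] = n[b]
                    # Private mean + UCB bonus; the Laplace scale
                    # n^{v/2-1} does not depend on eps (Lemma 3.1).
                    noise = rng.laplace(0.0, 1.0 / n[b] ** (1.0 - v / 2))
                    xhat[b] = (s[b] / n[b] + noise
                               + np.sqrt(2 * np.log(t) / n[b]))
            a = int(np.argmax(xhat))
        s[a] += arms[a]()
        n[a] += 1
    return n
```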

The next lemma establishes the privacy obtained each time a new mean is released for a given arm $a$.

Lemma 3.1. The mean $\hat{x}_a$ computed by Algorithm 3 for a given arm $a$ at each interval is $n_{a,t}^{-v/2}$-differentially private with respect to the reward sequence observed for that arm.

Proof Sketch. This follows directly from the fact that we add Laplace noise of scale $n_{a,t}^{v/2-1}$.

The next theorem establishes the overall privacy after having played for $t$ time steps.

Theorem 3.4. After playing for any $t$ time steps, Algorithm 3 is $(\epsilon', \delta)$-differentially private with

$$\epsilon' \le \min\left\{ \epsilon \sum_{n=1}^{t} \frac{1}{\sqrt{n^v}},\ \ \epsilon \sum_{n=1}^{t} \frac{e^{1/\sqrt{n^v}} - 1}{\sqrt{n^v}} + \sqrt{\epsilon \sum_{n=1}^{t} \frac{2 \ln \delta^{-1}}{n^v}} \right\}$$

for any $\delta \in (0, 1]$, $\epsilon \in (0, 1]$.

Proof Sketch. We begin with observations similar to those in Theorem 3.1. Then we compute the privacy of the mean of an arm using the $k$-fold adaptive composition theorem of (Dwork, Rothblum, and Vadhan 2010); see the supplements (Tossou and Dimitrakakis 2016).

The next corollary gives a nicer closed form for the privacy parameter, which is needed in practice.

Corollary 3.1. After playing for $t$ time steps, Algorithm 3 is $(\epsilon', \delta)$-differentially private with

$$\epsilon' \le \min\left\{ \epsilon \cdot \frac{t^{1 - v/2} - v/2}{1 - v/2},\ \ 2\epsilon\zeta(v) + \sqrt{2\epsilon\zeta(v) \ln(1/\delta)} \right\}$$

with $\zeta$ the Riemann zeta function, for any $\delta \in (0, 1]$, $\epsilon \in (0, 1]$, $v \in (1, 1.5]$.

Proof Sketch. We upper bound the first term of Theorem 3.4 by the integral test; for the second term we use $e^x \le 1 + 2x$ for all $x \in [0, 1]$ to conclude the proof.

The following corollary gives the parameter $\epsilon$ with which one should run Algorithm 3 to achieve a given $(\epsilon', \delta')$ privacy guarantee.

Corollary 3.2. If you run Algorithm 3 with parameter

$$\epsilon = \left( \sqrt{\frac{\log \delta'^{-1} + 4\epsilon'}{8\zeta(v)}} - \sqrt{\frac{\log \delta'^{-1}}{8\zeta(v)}} \right)^2$$

for any $\delta' \in (0, 1]$, $\epsilon' \in (0, 1]$, $v \in (1, 1.5]$, you will be at least $(\epsilon', \delta')$-differentially private.

Proof. The proof is obtained by inverting the term containing the Riemann zeta function in Corollary 3.1.
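The inversion is easy to check numerically. The sketch below (our code; `scipy.special.zeta` evaluates the Riemann zeta function) computes the input parameter from Corollary 3.2 and feeds it back through the bound of Corollary 3.1; with $\epsilon' = 1$, $\delta' = e^{-10}$, $v = 1.1$ it returns $\epsilon \approx 0.00396$, the DP-UCB-INT input value that appears in Figure 2.

```python
from math import exp, log, sqrt
from scipy.special import zeta   # Riemann zeta function, for v > 1

def input_epsilon(eps_target, delta, v):
    """Corollary 3.2: parameter eps to run Algorithm 3 with so that
    the overall privacy is (eps_target, delta)-DP."""
    a = log(1.0 / delta)
    z = 8.0 * zeta(v)
    return (sqrt((a + 4.0 * eps_target) / z) - sqrt(a / z)) ** 2

def overall_epsilon(eps, delta, v):
    """Second term of Corollary 3.1: overall privacy when running
    with input parameter eps."""
    return 2 * eps * zeta(v) + sqrt(2 * eps * zeta(v) * log(1 / delta))

eps = input_epsilon(1.0, exp(-10), 1.1)
print(eps)                                  # ~0.0039643, as in Figure 2
print(overall_epsilon(eps, exp(-10), 1.1))  # ~1.0, closing the loop
```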

Finally, we present the regret of Algorithm 3 in Theorem 3.5. A simple observation shows that it has the same regret as the non-private UCB, up to an additive constant.

Theorem 3.5. If Algorithm 3 is run with $K$ arms having arbitrary reward distributions, then its expected regret $\mathcal{R}$ after any number $t$ of plays is bounded by:

$$\mathcal{R} \le \sum_{a: \mu_a < \mu^*} \Delta_a \left( f_0 + \frac{8}{\Delta_a^2} \log t + 1 + 4\zeta(1.5) \right)$$

where $f_0 \approx \frac{1}{\epsilon}$. More precisely, $f_0$ is the first value of the series $f$ defined in Lemma B.1 in the supplements (Tossou and Dimitrakakis 2016).

Proof Sketch. This is proven using a Laplace concentration inequality to bound the estimate of the mean; we then select the error probability to be $t^{-3.5}$.

4 Experiments

We perform experiments using arms with rewards drawn from independent Bernoulli distributions. The plots, in logarithmic scale, show the regret of the algorithms over 100,000 time steps, averaged over 100 runs. We targeted two different privacy levels: $\epsilon' = 0.1$ and $\epsilon' = 1$. For DP-UCB-INT, we pick the input $\epsilon$ such that the overall privacy is $(\epsilon', \delta')$-DP, with $\epsilon$ as defined in Corollary 3.2, $\delta' = e^{-10}$, $\epsilon' \in \{0.1, 1\}$, and $v = 1.1$.

We indicate in parentheses the input privacy of each algorithm. We compared against the non-private UCB algorithm and the algorithm presented in (Mishra and Thakurta 2015) (Private-UCB), with failure probability chosen to be $t^{-4}$.

We consider two scenarios. First, we use two arms, one with expectation 0.9 and the other 0.6. The second scenario is more challenging, with 10 arms all having expectation 0.1 except two, with expectations 0.55 and 0.2.

As expected, the performance of DP-UCB-INT is significantly better than that of all other private algorithms. More importantly, the gap between the regret of DP-UCB-INT and the non-private UCB does not increase with time, confirming the theoretical regret. We can notice that DP-UCB is better than DP-UCB-BOUND for small time steps; however, as the time step increases, DP-UCB-BOUND outperforms DP-UCB and eventually catches up with its regret. The reason is that DP-UCB-BOUND takes longer to distinguish between arms with close rewards, because the additional factor in its regret depends on $\Delta_a$, which is not the case for DP-UCB. Private-UCB performs worse than all other algorithms, which is not surprising.
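A minimal harness in the spirit of this setup is sketched below (our code; it runs the plain UCB baseline on the two-arm scenario, averaged over 10 runs rather than 100 for speed, and the hypothetical `index_bonus` hook is where a private variant's extra confidence term would plug in):

```python
import numpy as np

def run_ucb(means, T, index_bonus=None, rng=None):
    """Average-regret harness: Bernoulli arms with the given means."""
    rng = rng or np.random.default_rng(0)
    K = len(means)
    s, n = np.zeros(K), np.zeros(K)
    regret, best = 0.0, max(means)
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1
        else:
            idx = s / n + np.sqrt(2 * np.log(t) / n)
            if index_bonus is not None:
                # Hook for a private variant's additional index term.
                idx = idx + np.array([index_bonus(b, n[b], t) for b in range(K)])
            a = int(np.argmax(idx))
        s[a] += float(rng.random() < means[a])   # Bernoulli reward
        n[a] += 1
        regret += best - means[a]
    return regret

# Two-arm scenario of the paper: means 0.9 and 0.6.
print(np.mean([run_ucb([0.9, 0.6], 100_000, rng=np.random.default_rng(i))
               for i in range(10)]))
```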

Figure 2: Experimental results with 100 runs, 2 or 10 arms with rewards {0.9, 0.6} or {0.1, ..., 0.1, 0.55, 0.2, 0.1, ..., 0.1}. Panels: (a) regret for $\epsilon' = 1$, 2 arms; (b) regret for $\epsilon' = 0.1$, 2 arms; (c) regret for $\epsilon' = 1$, 10 arms; (d) regret for $\epsilon' = 0.1$, 10 arms. Each panel plots average regret against time step for UCB, DP-UCB-INT (input $\epsilon \approx 0.00396431694945$ for $\epsilon' = 1$ and $\epsilon \approx 0.000253053596336$ for $\epsilon' = 0.1$, with $v = 1.1$), DP-UCB, DP-UCB-BOUND, and Private-UCB (Mishra and Thakurta 2015), with input privacy in parentheses.

Moreover, we noticed that the difference between the best and worst regret (over the 100 runs) is very consistent for all our algorithms and the non-private UCB (it is under 664.5 in the 2-arm scenario). However, this gap reaches 30,000 for Private-UCB. This means that our algorithms correctly trade off exploration and exploitation, which is not the case for Private-UCB.

5 Conclusion and Future Work

In this paper, we have proposed and analysed differentially private algorithms for the stochastic multi-armed bandit problem, significantly improving upon the state of the art. The first two (DP-UCB and DP-UCB-BOUND) are variants of an existing private UCB algorithm (Mishra and Thakurta 2015), while the third one uses an interval-based mechanism.

The first two algorithms are only within factors of $O(\epsilon^{-1} \log \log t)$ and $O(\epsilon^{-2} \log^2 \log t)$ of the non-private algorithm. The last algorithm, DP-UCB-INT, efficiently trades off the privacy level and the regret, and is able to achieve the same regret as the non-private algorithm up to an additional additive constant. This has been achieved by using two key tricks: updating the mean of each arm lazily, with an interval proportional to $\frac{1}{\epsilon}$, and adding noise independent of $\epsilon$. Intuitively, the algorithm achieves better privacy without increasing regret because its output is less dependent on individual rewards.

Perhaps it is possible to improve our bounds further if we are willing to settle for asymptotically low regret (Cowan and Katehakis 2015). A natural direction for future work is to study whether we can use similar methods for other algorithms, such as Thompson sampling (known to be differentially private (Dimitrakakis et al. 2014)), instead of UCB. Another question is whether a similar analysis can be performed for adversarial bandits.

We would also like to connect more to applications through two extensions of our algorithms. The first natural extension is to consider side information. In the drug testing example, this could include information about the drug, the test performed, and the user examined or treated. The second extension would generalise the notion of neighbouring databases to take into account the fact that multiple observations in the sequence (say, $m$) can be associated with a single individual. Our algorithms can easily be extended to deal with this setting (by re-scaling the privacy parameter to $\frac{\epsilon}{m}$). However, in practice $m$ could be quite large, and it will be interesting future work to check whether we can obtain sublinearity in the parameter $m$ under certain conditions.

References

Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2/3):235–256.

Barry, D.; Parlange, J.-Y.; Li, L.; Prommer, H.; Cunningham, C.; and Stagnitti, F. 2000. Analytical approximations for real values of the Lambert W-function. Mathematics and Computers in Simulation 53(1–2):95–103.

Burnetas, A. N., and Katehakis, M. N. 1996. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics 17(2):122–142.

Calandrino, J. A.; Kilzer, A.; Narayanan, A.; Felten, E. W.; and Shmatikov, V. 2011. "You might also like:" Privacy risks of collaborative filtering. In 32nd IEEE Symposium on Security and Privacy, 231–246.

Chan, T. H.; Shi, E.; and Song, D. 2010. Private and continual release of statistics. In Automata, Languages and Programming. Springer. 405–417.

Cowan, W., and Katehakis, M. N. 2015. Asymptotic behavior of minimal-exploration allocation policies: Almost sure, arbitrarily slow growing regret. arXiv preprint arXiv:1505.02865.

Dimitrakakis, C.; Nelson, B.; Mitrokotsa, A.; and Rubinstein, B. 2014. Robust and private Bayesian inference. In Algorithmic Learning Theory.

Dwork, C., and Roth, A. 2013. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9(3–4):211–407.

Dwork, C.; McSherry, F.; Nissim, K.; and Smith, A. 2006. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC'06, 265–284.

Dwork, C.; Rothblum, G. N.; and Vadhan, S. 2010. Boosting and differential privacy. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS '10, 51–60.

Dwork, C. 2006. Differential privacy. In ICALP, 1–12. Springer.

Jain, P.; Kothari, P.; and Thakurta, A. 2012. Differentially private online learning. In Mannor, S.; Srebro, N.; and Williamson, R. C., eds., COLT 2012 - The 25th Annual Conference on Learning Theory, volume 23, 24.1–24.34.

Korolova, A. 2010. Privacy violations using microtargeted ads: A case study. In ICDMW 2010, The 10th IEEE International Conference on Data Mining Workshops, 474–482.

Lai, T. L., and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1):4–22.

Mishra, N., and Thakurta, A. 2015. (Nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence (UAI 2015).

Pandey, S., and Olston, C. 2006. Handling advertisements of unknown quality in search advertising. In Schölkopf, B.; Platt, J. C.; and Hoffman, T., eds., Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, 1065–1072.

Robbins, H., et al. 1952. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58(5):527–535.

Thakurta, A. G., and Smith, A. D. 2013. (Nearly) optimal algorithms for private online learning in full-information and bandit settings. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, 2733–2741.

Thompson, W. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3–4):285–294.

Tossou, A., and Dimitrakakis, C. 2016. Supplementary materials for "Algorithms for Differentially Private Multi-Armed Bandits". https://www.dropbox.com/s/fotezpnx49nz0i3/single-mab-aaai16-sups-final.pdf. [Online; accessed 1 December 2015].

Zhao, J.; Jung, T.; Wang, Y.; and Li, X. 2014. Achieving differential privacy of data disclosure in the smart grid. In 2014 IEEE Conference on Computer Communications, INFOCOM 2014, 504–512.

A Collected proofs

A.1 Proof of Theorem 3.2

In the following, $a$ will be used to indicate the index of an arm, $\epsilon$ is the differential privacy parameter, and $t$ denotes the time step.

From Lemma B.2, we know that the error between the empirical and private mean is bounded as $|Y - X| \le h_n$ with probability at least $1 - \gamma$, where $X$ is the empirical mean returned by the private mechanism, $Y$ the true empirical mean, and $h_n$ the error due to the differentially private mechanism, defined as:

$$h_n = \frac{1}{\epsilon} \cdot \sqrt{8}\, (\log n) \cdot \ln\frac{4}{\gamma} \cdot n^{-1} + \frac{1}{\epsilon} \cdot \sqrt{8} \cdot \ln\frac{4}{\gamma} \cdot n^{-1}.$$

We can rewrite this bound as equations (A.1) and (A.2):

$$\Pr(X \ge Y + h_n) \le \gamma \qquad (A.1)$$
$$\Pr(X \le Y - h_n) \le \gamma \qquad (A.2)$$

Let $T_a(s)$ be the number of times arm $a$ is played in the first $s$ time steps, and let $c_{t,n} \triangleq \sqrt{(2 \ln t)/n}$ denote the original UCB confidence index.

By following similar steps as in the analysis of UCB in (Auer, Cesa-Bianchi, and Fischer 2002), we have

$$T_a(s) = 1 + \sum_{t=K+1}^{s} \mathbb{1}\{a_t = a\} \le \ell + \sum_{t=1}^{\infty} \sum_{n=1}^{t-1} \sum_{n_a=\ell}^{t-1} \mathbb{1}\{X^*_n + c_{t,n} + h_n \le X_{a,n_a} + c_{t,n_a} + h_{n_a}\} \qquad (A.3)$$

In equation (A.3), $X^*_n$ is the mean returned by the private mechanism for the best arm when it has been played $n$ times. Now we can observe that $X^*_n + c_{t,n} + h_n \le X_{a,n_a} + c_{t,n_a} + h_{n_a}$ implies that at least one of the following must hold:

$$X^*_n \le \mu^* - c_{t,n} - h_n \qquad (A.4)$$
$$X_{a,n_a} \ge \mu_a + c_{t,n_a} + h_{n_a} \qquad (A.5)$$
$$\mu^* < \mu_a + 2c_{t,n} + 2h_n \qquad (A.6)$$

We can bound the probability of event (A.4) using equation (A.2), the union bound, and the Chernoff-Hoeffding bound:

$$\Pr(A.4) = \Pr(X^*_n \le \mu^* - c_{t,n} - h_n) = \Pr(X^*_n \le Y^*_n - h_n \vee Y^*_n \le \mu^* - c_{t,n})$$
$$\le \Pr(X^*_n \le Y^*_n - h_n) + \Pr(Y^*_n \le \mu^* - c_{t,n}) \le \gamma + \exp(-4 \log t) = \gamma + t^{-4}. \qquad (A.7)$$

Similarly, to bound the probability of (A.5) we use (A.1), the union bound, and the Chernoff-Hoeffding bound:

$$\Pr(A.5) = \Pr(X_{a,n_a} \ge \mu_a + c_{t,n_a} + h_{n_a}) \le \Pr(X_{a,n_a} \ge Y_{a,n_a} + h_{n_a}) + \Pr(Y_{a,n_a} \ge \mu_a + c_{t,n_a})$$
$$\le \gamma + \exp(-4 \log t) = \gamma + t^{-4}. \qquad (A.8)$$

Let us choose $\gamma = t^{-4}$; this leads respectively to

$$\Pr(A.4) \le 2t^{-4} \qquad (A.9)$$
$$\Pr(A.5) \le 2t^{-4} \qquad (A.10)$$

Now consider the last condition (A.6). For this, we want to find the minimum number $n$ for which event (A.6) is always false. Event (A.6) being false implies $\Delta_a > 2c_{t,n} + 2h_n$, where $\Delta_a = \mu^* - \mu_a$. We observe that for $\Delta_a > 2c_{t,n} + 2h_n$ to hold, it is enough that the following two conditions hold, for any $\lambda_0$ such that $0 < \lambda_0 < 1$:

$$\lambda_0 \Delta_a > 2c_{t,n} \qquad (A.11)$$
$$(1 - \lambda_0) \Delta_a > 2h_n \qquad (A.12)$$

From inequality (A.11), we have

$$n \ge \frac{8}{\lambda_0^2 \Delta_a^2} \log t. \qquad (A.13)$$

Inequality (A.12) leads to $n \ge B \log n + B$ with

$$B = \frac{\sqrt{8}}{\epsilon (1 - \lambda_0) \Delta_a} \cdot \ln\frac{4}{\gamma} = \frac{\sqrt{8}}{\epsilon (1 - \lambda_0) \Delta_a} \cdot \ln(4t^4).$$

We can rewrite the inequality in a more familiar form:

$$e^{B^{-1} n} \ge e \cdot n.$$

This is a standard transcendental algebraic inequality whose solution is given by the Lambert $W$ function. So

$$n \ge -B \cdot W\left(\frac{-1}{e \cdot B}, -1\right),$$

where $W(x, k)$ is the Lambert function of $x$ on branch $k$. Note that the branch here is $-1$, and because $-e^{-1} < \frac{-1}{e \cdot B} < 0$ for all $t > 1$, we are always guaranteed to find a real number. By using the approximation of the Lambert function provided in Section 3.1 of (Barry et al. 2000), we can conclude that

$$n \ge -B \cdot W\left(\frac{-1}{e \cdot B}, -1\right) \approx -B\left(\ln\frac{1}{e \cdot B} - \frac{2}{0.3361}\right) \ge B(\ln B + 7) = \frac{\sqrt{8} \ln(4t^4)}{\epsilon (1 - \lambda_0) \Delta_a}\left(\ln\left(\frac{\sqrt{8} \ln(4t^4)}{\epsilon (1 - \lambda_0) \Delta_a}\right) + 7\right) \qquad (A.14)$$

Combining inequalities (A.13) and (A.14) yields

$$n \ge \max\left\{ B(\ln B + 7),\ \frac{8}{\lambda_0^2 \Delta_a^2} \log t \right\}.$$

In summary,

$$T_a(s) \le \left\lceil \max\left\{ B(\ln B + 7),\ \frac{8}{\lambda_0^2 \Delta_a^2} \log t \right\} \right\rceil + \sum_{t=1}^{\infty} \sum_{n=1}^{t-1} \sum_{n_a=\ell}^{t-1} 4t^{-4}$$
$$\le 1 + \max\left\{ B(\ln B + 7),\ \frac{8}{\lambda_0^2 \Delta_a^2} \log t \right\} + \sum_{t=1}^{\infty} 4t^{-2}$$
$$\le 1 + \max\left\{ B(\ln B + 7),\ \frac{8}{\lambda_0^2 \Delta_a^2} \log t \right\} + \frac{2\pi^2}{3}, \qquad (A.15)$$

which concludes the proof.

A.2 Proof for Theorem 3.3

The proof of this theorem is similar to the one for Theorem 3.5, with

$$h_n = \frac{1}{\epsilon} \cdot \sqrt{8}\, (\log n) \cdot \ln\frac{4}{\gamma} \cdot n^{-1} + \frac{1}{\epsilon} \cdot \sqrt{8} \cdot \ln\frac{4}{\gamma} \cdot n^{-1}.$$

We make the same choice of $\lambda$ and $\gamma$. However, the minimum number $n$ compatible with these choices leads to a transcendental equation of the same form as the one in the proof of Theorem 3.2.


B Proofs for UCB-Interval Algorithm

Fact B.1 (Differential privacy of the Laplace mechanism; see Theorem 4 in (Dwork and Roth 2013)). For any real function $g$ of the data, a mechanism adding Laplace noise with scale parameter $\beta$ is $(\Delta g / \beta)$-differentially private, where $\Delta g$ is the $L_1$ sensitivity of $g$.

B.1 Proof of Lemma 3.1

Proof. Indeed, for each arm, we add Laplace noise of mean 0 and scale $n_{a,t}^{v/2-1}$, where $n_{a,t}$ is the number of times this arm has been played. As the sensitivity of the mean is $\frac{1}{n_{a,t}}$, the differential privacy of the Laplace mechanism (Fact B.1) concludes the proof.

B.2 Proof of Theorem 3.4

Similarly to the proof of Theorem 3.1, the overall privacy of Algorithm 3 will be the same as the overall privacy of the mechanism computing the mean of the rewards received from a single arm.

A new Laplace mechanism is used to compute the mean of each arm. However, a new mean is only released $t/f$ times (every $f$ time steps) after $t$ time steps, where $f$ is the interval used.

According to the $k$-fold adaptive composition theorem (III-3 in (Dwork, Rothblum, and Vadhan 2010)), Algorithm 3 will be $(\epsilon', \delta)$-differentially private for any $\delta \in (0, 1]$ with $\epsilon' \le \min\{C, D\}$, where

$$C = \sum_{n \in I_{n_a}^{1/\epsilon}} \frac{1}{\sqrt{n^v}}$$

$$D = \sum_{n \in I_{n_a}^{1/\epsilon}} \frac{1}{\sqrt{n^v}}\left(e^{1/\sqrt{n^v}} - 1\right) + \sqrt{\sum_{n \in I_{n_a}^{1/\epsilon}} \frac{2}{n^v} \ln(1/\delta)}$$

and $I_{n_a}^{1/\epsilon} = \{\frac{1}{\epsilon}, \frac{2}{\epsilon}, \frac{3}{\epsilon}, \cdots, n_a\}$. We have

$$D = \sum_{n \in I_{n_a}^{1/\epsilon}} \frac{1}{\sqrt{n^v}}\left(e^{1/\sqrt{n^v}} - 1\right) + \sqrt{\sum_{n \in I_{n_a}^{1/\epsilon}} \frac{2}{n^v} \ln(1/\delta)} \qquad (B.1)$$
$$\le \sum_{n \in I_{t}^{1/\epsilon}} \frac{1}{\sqrt{n^v}}\left(e^{1/\sqrt{n^v}} - 1\right) + \sqrt{\sum_{n \in I_{t}^{1/\epsilon}} \frac{2}{n^v} \ln(1/\delta)} \qquad (B.2)$$
$$\le \epsilon \sum_{n=1}^{t} \frac{1}{\sqrt{n^v}}\left(e^{1/\sqrt{n^v}} - 1\right) + \sqrt{\epsilon \sum_{n=1}^{t} \frac{2}{n^v} \ln(1/\delta)} \qquad (B.3)$$

$$C = \sum_{n \in I_{n_a}^{1/\epsilon}} \frac{1}{\sqrt{n^v}} \qquad (B.4)$$
$$\le \sum_{n \in I_{t}^{1/\epsilon}} \frac{1}{\sqrt{n^v}} \qquad (B.5)$$
$$\le \epsilon \sum_{n=1}^{t} \frac{1}{\sqrt{n^v}} \qquad (B.6)$$

which concludes the proof.

Lemma B.1 (Values of the interval of release of the means). The interval $f$ by which we should update the mean follows a series such that $f_n = W_{n+1} - W_n$ with:

$$W_0 = 0, \quad W_{n+1} \text{ is either } x \text{ or } y \text{ with}$$
$$x = \inf\left\{ x \in \mathbb{N},\ x \ge W_n + 1 : \sum_{i=W_n+1}^{x} \frac{1}{\sqrt{i^v}} \ge \frac{1}{\epsilon \sqrt{x^v}} \right\}$$
$$y = \inf\left\{ y \in \mathbb{N},\ y \ge W_n + 1 : \sum_{i=W_n+1}^{y} \frac{1}{i^v} \ge \frac{1}{\epsilon y^v} \right\}$$

Furthermore, $f_n$ is always such that $f_n \le \frac{1}{\epsilon}$. Note that only one of the two expressions for $W_{n+1}$ should be used, up to the horizon $T$.

Proof. From the proof of Theorem 3.4, we can easily see that we can pass from line (B.2) to (B.3) if the condition on $y$ in Lemma B.1 is verified. Similarly, we can pass from line (B.5) to (B.6) if the condition on $x$ in Lemma B.1 is verified. In both cases, $\frac{1}{\epsilon}$ is an upper bound on the numbers $x - W_n$ and $y - W_n$, as the original interval used in both lines (B.2) and (B.5) is $\frac{1}{\epsilon}$.

B.3 Proof of Corollary 3.1

From Theorem 3.4, we know that Algorithm 3 will be $(\epsilon', \delta)$-differentially private for any $\delta \in (0, 1]$, $1 < v \le 1.5$, with $\epsilon' \le \min\{C, D\}$, where

$$C = \epsilon \sum_{n=1}^{t} \frac{1}{\sqrt{n^v}}, \qquad D = \epsilon \sum_{n=1}^{t} \frac{e^{1/\sqrt{n^v}} - 1}{\sqrt{n^v}} + \sqrt{\epsilon \sum_{n=1}^{t} \frac{2 \ln \delta^{-1}}{n^v}}.$$

It is easy to get an upper bound for $C$ using the integral test inequality, giving:

$$C \le \epsilon \cdot \frac{t^{1 - v/2} - v/2}{1 - v/2}.$$

We now simplify $D$ using a standard approximation of the exponential function:

$$D = \epsilon \sum_{n=1}^{t} \frac{e^{1/\sqrt{n^v}} - 1}{\sqrt{n^v}} + \sqrt{\epsilon \sum_{n=1}^{t} \frac{2 \ln \delta^{-1}}{n^v}} \qquad (B.7)$$
$$\le \epsilon \sum_{t=1}^{\infty} \frac{1}{\sqrt{t^v}}\left(e^{1/\sqrt{t^v}} - 1\right) + \sqrt{\epsilon \sum_{t=1}^{\infty} \frac{2}{t^v} \ln(1/\delta)} \qquad (B.8)$$
$$\le \epsilon \sum_{t=1}^{\infty} 2\left(\frac{1}{\sqrt{t^v}}\right)^2 + \sqrt{\epsilon \sum_{t=1}^{\infty} \frac{2}{t^v} \ln(1/\delta)} \qquad (B.9)$$
$$= 2\epsilon\zeta(v) + \sqrt{2\epsilon\zeta(v) \ln(1/\delta)} \qquad (B.10)$$

where $\zeta$ is the Riemann zeta function.

B.4 Proof of Corollary 3.2

The proof is immediate from Corollary 3.1; it is obtained by inverting the term containing the Riemann zeta function in Corollary 3.1.

B.5 Proof of Theorem 3.5

Fact B.2 (Fact 3.7 in (Dwork and Roth 2013)). If $Y \sim \mathrm{Lap}(b)$ then

$$\Pr\left(|Y| \ge b \ln\frac{1}{\gamma}\right) = \gamma. \qquad (B.11)$$

First, let us ignore the effect of computing the means per interval and assume that we compute the mean at each time step, after playing each arm $f_0$ times, where $f_0 \approx \frac{1}{\epsilon}$ is the first value of the series $f$ defined in Lemma B.1. As much of the proof is quite similar to that of Theorem 3.2, we shall omit some steps.

Similarly to the proof of Theorem 3.2, we will take a bad arm when one of the following three events happens:

$$X^*_n \le \mu^* - c_{t,n} \qquad (B.12)$$
$$X_{a,n_a} \ge \mu_a + c_{t,n_a} \qquad (B.13)$$
$$\mu^* < \mu_a + 2c_{t,n} \qquad (B.14)$$

Since we are adding Laplace noise, we can use Fact B.2 to show that

$$\Pr\left\{ |Y - X| \ge \log\left(\frac{1}{\gamma}\right) n^{v/2 - 1} \right\} \le \gamma. \qquad (B.15)$$

Let $h_n = \log(\frac{1}{\gamma})\, n^{v/2 - 1}$. Now we can bound the probability of event (B.13) using equation (B.15), the union bound, and the Chernoff-Hoeffding bound. To be precise, we use $X_{a,n_a}$ for $\hat{x}_a$ and $Y$ for the empirical mean $\bar{x}_a$ at time $t$, while $c_{t,n_a} = \sqrt{2 \log(t)/n_a}$ as before.

$$\Pr(B.13) = \Pr(X_{a,n_a} \ge \mu_a + c_{t,n_a})$$
$$= \Pr(X_{a,n_a} \ge Y_{a,n_a} + h_{n_a} \vee Y_{a,n_a} \ge \mu_a + c_{t,n_a} - h_{n_a})$$
$$\le \Pr(X_{a,n_a} \ge Y_{a,n_a} + h_{n_a}) + \Pr(Y_{a,n_a} \ge \mu_a + c_{t,n_a} - h_{n_a}) \qquad (B.16)$$
$$\le \gamma + \exp(-2 n_a (c_{t,n_a} - h_{n_a})^2) \qquad (B.17)$$
$$\le \gamma + t^{-3.5} \qquad (B.18)$$
$$\le 2t^{-3.5} \qquad (B.19)$$

In the above, we choose $h_{n_a} \le \lambda_4 c_{t,n_a}$ with $\lambda_4 = 1 - \frac{\sqrt{3.5}}{2}$ and $\gamma = t^{-3.5}$.

Similarly, we prove a bound on the probability of (B.12):

$$\Pr(B.12) = \Pr(X^*_n \le \mu^* - c_{t,n}) \le \Pr(X^*_n \le Y^*_n - h_{n}) + \Pr(Y^*_n \le \mu^* - c_{t,n} + h_{n}) \le \gamma + t^{-3.5} = 2t^{-3.5}. \qquad (B.20)$$

Now, let us compute the minimum value for $n_a$ which is consistent with our choice of $\lambda_4$ and $\gamma$. It is easy to show that $n_a \ge \left( \frac{2 - \sqrt{3.5}}{3.5 \sqrt{2 \log t}} \right)^{\frac{2}{v-1}}$ suffices, and this number converges to 0 as $t$ increases.

The final event (B.14) is exactly the standard UCB event, and (B.14) will be false for all $n_a \ge \frac{8}{\Delta_a^2} \log t$.

Now, we can easily see that the effect of the interval in the algorithm will not change this number. Indeed, because we only update the mean every $f$ steps, the number $n_a$ should be divided by $f$. But after updating the mean, we play this arm for a number of steps lower than or equal to $f$; so we should multiply $n_a$ by $f$, which cancels out the effect of the division.

In summary (and similarly to Theorem 3.2),

$$T_a(s) \le f_0 + \left\lceil \max\left\{ \left( \frac{2 - \sqrt{3.5}}{3.5 \sqrt{2 \log t}} \right)^{\frac{2}{v-1}},\ \frac{8}{\Delta_a^2} \log t \right\} \right\rceil + \sum_{t=1}^{\infty} \sum_{n=1}^{t-1} \sum_{n_a=\ell}^{t-1} 4t^{-3.5}$$
$$\le f_0 + 1 + \frac{8}{\Delta_a^2} \log t + \sum_{t=1}^{\infty} 4t^{-1.5}$$
$$\le f_0 + 1 + \frac{8}{\Delta_a^2} \log t + 4\zeta(1.5). \qquad (B.21)$$

Lemma B.2 (Improved (Chan, Shi, and Song 2010) bound). For any $\gamma \le n^{-b}$, where $b > 0$, Chan's hybrid mechanism is $\epsilon$-differentially private and has an error bounded, with probability at least $1 - \gamma$, by $\frac{\sqrt{8}}{\epsilon} \log(\frac{4}{\gamma}) \log n + \frac{\sqrt{8}}{\epsilon} \log(\frac{4}{\gamma})$.

Proof. Our improvement is based on two observations. Firstly, if $\gamma \le n^{-b}$ then we can use the alternative concentration bound for the sum of Laplace distributions. Secondly, we note that, given all the previously released sums, the current sum is only drawn from one of the two mechanisms; thus there is no need to apply a composition theorem.
