Learning Modular Safe Policies in the Bandit Setting with Application to Adaptive Clinical Trials

Hossein Aboutalebi*¹, Doina Precup¹, and Tibor Schuster²

¹Department of Computer Science, McGill University; Mila Quebec AI Institute
²Department of Family Medicine, McGill University

*hossein.aboutalebi@mail.mcgill.ca

Abstract

The stochastic multi-armed bandit problem is a well-known model for studying the exploration-exploitation trade-off. It has significant possible applications in adaptive clinical trials, which allow for dynamic changes in the treatment allocation probabilities of patients. However, most bandit learning algorithms are designed with the goal of minimizing the expected regret. While this approach is useful in many areas, in clinical trials it can be sensitive to outlier data, especially when the sample size is small. In this paper, we define and study a new robustness criterion for bandit problems. Specifically, we consider optimizing a function of the distribution of returns as a regret measure. This provides practitioners more flexibility to define an appropriate regret measure. The learning algorithm we propose to solve this type of problem is a modification of the BESA algorithm [Baransi et al., 2014], which considers a more general version of regret.

We present a regret bound for our approach and evaluate it empirically both on synthetic problems as well as on a dataset from the clinical trial literature. Our approach compares favorably to a suite of standard bandit algorithms. Finally, we provide a web application where users can create their desired synthetic bandit environment and compare the performance of different bandit algorithms online.

Introduction

The multi-armed bandit is a standard model for investigating the exploration-exploitation trade-off; see e.g. [Baransi et al., 2014; Auer et al., 2002; Sani et al., 2012a; Chapelle and Li, 2011; Sutton and Barto, 1998]. One of the main advantages of the multi-armed bandit problem is its simplicity, which allows for a higher level of theoretical study.

The multi-armed bandit problem consists of a set of arms, each of which generates a stochastic reward from a fixed but unknown distribution associated with it. Consider a series of arm pulls (or steps) $t = 1, \ldots, T$, where a specific arm $a \in \mathcal{A}$ is selected at each step, i.e. $a(t) = a_t$. The standard goal


in the multi-armed bandit setting is to find the arm $\star$ which has the maximum expected reward $\mu_\star$ (or equivalently, minimum expected regret). The expected regret after $T$ steps, $R_T$, is defined as the sum of the expected differences between the mean reward under $\{a_t\}$ and the expected reward of the optimal arm $\star$:

$$R_T = \mathbb{E}\left[\sum_{t=1}^{T} (\mu_\star - \mu_{a_t})\right]$$
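As a concrete illustration of this objective (ours, not from the paper), the following minimal Python sketch simulates a Bernoulli bandit and accumulates the expected regret of a uniformly random baseline policy; the arm means are arbitrary placeholders:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Bernoulli bandit; the arm means below are placeholders.
mu = np.array([0.4, 0.55, 0.6])
mu_star = mu.max()

T = 1000
expected_regret = 0.0
for t in range(T):
    a = rng.integers(len(mu))             # uniformly random baseline policy
    reward = float(rng.random() < mu[a])  # Bernoulli reward (unused by this policy)
    expected_regret += mu_star - mu[a]    # accumulate the per-step regret gap

print(f"expected regret of the random policy after {T} steps: {expected_regret:.1f}")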

While this objective is very popular, there are practical applications, for example in medical research and AI safety [García and Fernández, 2015], where maximizing the expected value is not sufficient and it would be better to have an algorithm that is also sensitive to the variability of the outcomes of a given arm. For example, consider multi-arm clinical trials where the objective is to find the most promising treatment among a pool of available treatments. Due to heterogeneity in patients' treatment responses, considering only the expected mean may not be of interest [Austin, 2011]. Specifically, since the mean is usually sensitive to outliers and does not provide information about the dispersion of individual responses, the expected reward has only limited value in achieving a clinical trial's objective. Because of these problems, previous contributions such as [Sani et al., 2012a] include the variance of rewards in the regret definition and develop algorithms to solve this slightly enhanced problem. While these modified approaches try to account for variability in the response of arms, they introduce new problems, because the variance is not necessarily a good measure of variability for a distribution: it equally penalizes responses that are above and below the mean response. Other articles, such as [Galichet et al., 2013], use the conditional value at risk to obtain a better regret definition. Although the conditional value at risk may address the problem we faced with including the variance, it may not reflect the amount of variability observable for a distribution over its entire domain. All in all, the consistency of treatments among patients is essential, with the ideal treatment usually defined as one which has a high positive response rate while showing low variability in response among patients. Thus, the notions of consistency and safety are to some extent subjective and problem dependent. As a result, it is necessary to develop an algorithm which can work with an arbitrary definition of consistency for a distribution.


This kind of system design, which allows the separation of different parts of a system (here, the regret function and the learning algorithm), has already been explored in modular programming. Modular programming emphasizes splitting the entire system into independent modules whose composition builds the system. This design principle is necessary when dealing with changing customer demands, where the system must adapt to new requirements. Here, we follow the same paradigm by making the regret definition independent of the learning algorithm. As a result, we allow more flexibility in defining a regret function that is capable of incorporating problem-specific demands.

Finally, we achieve the aforementioned goals by extending one of the recent algorithms in the bandit literature called BESA (Best Empirical Sampled Average) [Baransi et al., 2014]. One of the main advantages of BESA compared to other existing bandit algorithms is that it does not involve many hyper-parameters. This is especially useful when one has no, or insufficient, prior knowledge about the different arms at the beginning. This feature also makes it easier to introduce a modular design by using McDiarmid's Lemma [El-Yaniv and Pechyony, 2009].

Key contributions: We provide a modular definition of regret, called safety-aware regret, which allows higher flexibility in defining the risk for multi-armed bandit problems. We propose a new algorithm called BESA+ which solves this category of problems. We derive upper bounds on its safety-aware regret for two-armed and multi-armed bandits. In the experiments, we compare our approach with several notable earlier works and show that BESA+ performs well. In the last experiment, we evaluate the performance of our algorithm on a real clinical dataset and illustrate that it is capable of solving the problem with a user-defined safety-aware regret. Finally, to the best of our knowledge for the first time, we provide a web application which allows users to create their own custom environment and compare our algorithm with other methods.

Background and Notation

We consider the standard bandit setting with action (arm) set $\mathcal{A}$, where each action $a \in \mathcal{A}$ is characterized by a reward distribution $\varphi_a$. The distribution for action $a$ has mean $\mu_a$ and variance $\sigma_a^2$. Let $X_{a,i} \sim \varphi_a$ denote the $i$-th reward sampled from the distribution of action $a$. All actions and samples are independent. The bandit problem is described as an iterative game where, on each step (round) $t$, the player (an algorithm) selects action (arm) $a_t$ and observes sample $X_{a,N_{a,t}}$, where $N_{a,t} = \sum_{s=1}^{t} \mathbb{I}\{a_s = a\}$ denotes the number of samples observed for action $a$ up to time $t$ (inclusively). A policy is a distribution over $\mathcal{A}$. In general, stochastic policies are necessary during the learning stage, in order to identify the best arm. We discuss the exact notion of "best" below.

We define $I_S(m, j)$ as the set obtained by sub-sampling without replacement $j$ elements from the set $S$ of size $m$. Let $X_{a,t}$ denote the history of observations (records) obtained from action (arm) $a$ up to time $t$ (inclusively), such that $|X_{a,t}| = N_{a,t}$. The notation $X_{a,t}(I)$ indicates the set of sub-samples from $X_{a,t}$, where the sub-sample index set $I \subset \{1, 2, \ldots, N_{a,t}\}$.
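The sub-sampling operator $I_S(m, j)$ is the core primitive used by BESA-style algorithms. A minimal sketch of how it might be implemented (the function name subsample is our own, not from the paper):

import numpy as np

rng = np.random.default_rng(42)

def subsample(history, j, rng=rng):
    """Return j elements drawn without replacement from `history`
    (the operator I_S(m, j) applied to a reward history X_{a,t})."""
    history = np.asarray(history)
    idx = rng.choice(len(history), size=j, replace=False)
    return history[idx]

# Example: arm `a` has 7 recorded rewards, arm `b` has 3;
# both are compared on sub-samples of the smaller size.
x_a = rng.random(7)
x_b = rng.random(3)
n = min(len(x_a), len(x_b))
print(subsample(x_a, n), x_b)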

The multi-armed bandit was first presented in the seminal work of Robbins [Robbins, 1985]. It has been shown that, under certain conditions [Burnetas and Katehakis, 1996; Lai and Robbins, 1985], the cumulative regret of any policy grows at least logarithmically:

$$\liminf_{t \to \infty} \frac{R_t}{\log(t)} \ge \sum_{a : \mu_a < \mu_\star} \frac{\mu_\star - \mu_a}{K_{\inf}(r_a; r_\star)}$$

where $K_{\inf}(r_a; r_\star)$ is the Kullback-Leibler divergence between the reward distributions of the respective arms. Policies that attain this bound are called admissible.

Several algorithms have been shown to produce admissible policies, including UCB1 [Auer et al., 2002], Thompson sampling [Chapelle and Li, 2011; Agrawal and Goyal, 2013] and BESA [Baransi et al., 2014]. However, theoretical bounds are not always matched by empirical results. For example, it has been shown in [Kuleshov and Precup, 2014] that two algorithms which do not produce admissible policies, ε-greedy and Boltzmann exploration [Sutton and Barto, 1998], behave better than UCB1 on certain problems. Both BESA and Thompson sampling were shown to have comparable performance with Softmax and ε-greedy.

While the expected regret is a natural and popular measure of performance which allows the development of theoretical results, some recent papers have explored other definitions of regret. For example, [Sani et al., 2012b] consider a linear combination of variance and mean as the definition of regret for a learning algorithm $\mathcal{A}$:

$$\widehat{MV}_t(\mathcal{A}) = \hat{\sigma}_t^2(\mathcal{A}) - \rho\,\hat{\mu}_t(\mathcal{A}) \qquad (1)$$

where $\hat{\mu}_t$ is the estimate of the average of the observed rewards up to time step $t$ and $\hat{\sigma}_t^2$ is a biased estimate of the variance of the rewards up to time step $t$. The regret is then defined as:

$$\mathcal{R}_t(\mathcal{A}) = \widehat{MV}_t(\mathcal{A}) - \widehat{MV}_{\star,t}(\mathcal{A}),$$

where $\star$ is the optimal arm. According to [Maillard, 2013], however, this definition penalizes the algorithm if it switches between optimal arms. Instead, [Maillard, 2013] devises a new definition of regret which controls the lower tail of the reward distribution. However, the algorithm that solves the corresponding objective function seems time-consuming, and the optimization to be performed may be intricate. Finally, in [Galichet et al., 2013], the authors use the notion of conditional value at risk in order to define the regret.

Measure of regret

Unlike previous works, we now give a formal definition of a class of functions which can be used as a separate module inside the learning algorithm to measure the regret. We call this class of functions "safety value functions" and define them formally below. Assume we have $k$ arms ($|\mathcal{A}| = k$) with reward distributions $\varphi_1, \varphi_2, \ldots, \varphi_k$.

Definition 0.1 (safety value function). Let $\mathcal{D}$ denote the set of all possible reward distributions over a given interval. A safety value function $v : \mathcal{D} \to \mathbb{R}$ provides a score for a given distribution.

The optimal arm $\star$ under this value function is defined as

$$\star \in \arg\max_{a \in \mathcal{A}} v(\varphi_a) \qquad (2)$$

The regret corresponding to the safety value function up to time $T$ is defined as:

$$R_{T,v} = \mathbb{E}\left[\sum_{t=1}^{T} \left(v(\varphi_\star) - v(\varphi_{a_t})\right)\right] \qquad (3)$$

We call (3) the safety-aware regret.
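To make the modular design concrete, here is a small illustrative sketch (our own, not from the paper): the learning algorithm only ever sees a callable that maps a sample of rewards to a score, so any safety value estimator $\hat{v}$ can be plugged in without changing the algorithm itself.

import numpy as np
from typing import Callable

# A safety value estimator maps an array of observed rewards to a score;
# the bandit algorithm is written against this interface only.
SafetyValueEstimator = Callable[[np.ndarray], float]

def mean_estimator(rewards: np.ndarray) -> float:
    """Classical objective: the empirical mean reward."""
    return float(np.mean(rewards))

# Other choices (mean-variance, conditional value at risk) are sketched in
# the examples that follow; they all satisfy the same interface.
registry: dict[str, SafetyValueEstimator] = {"mean": mean_estimator}
print(registry["mean"](np.array([0.2, 0.7, 0.5])))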

When the context is clear, we usually drop the subscript $v$ and write only $R_T$ for ease of notation.

Definition 0.2 (well-behaved safety value function). Given a reward distribution $\varphi_a$ over the interval $[0,1]$, a safety value function $v$ for this distribution is called well-behaved if there exists an unbiased estimator $\hat{v}$ of $v$ such that, for any set of observations $\{x_1, x_2, \ldots, x_n\}$ sampled from $\varphi_a$ and for some constant $\gamma$, we have:

$$\sup_{\hat{x}_i} \left|\hat{v}(x_1, \ldots, x_i, \ldots, x_n) - \hat{v}(x_1, \ldots, \hat{x}_i, \ldots, x_n)\right| < \frac{\gamma}{n} \qquad (4)$$

If (4) holds for any reward distribution $\varphi$ over the interval $[0,1]$, we call $v$ a well-behaved safety value function.
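Condition (4) is precisely a bounded-differences condition, so McDiarmid's inequality (a standard concentration result; we spell out its consequence here only for intuition, it is not stated at this point in the paper) implies that the estimator concentrates around its expectation at a rate governed by $\gamma$:

$$\Pr\left(\left|\hat{v}(x_1,\ldots,x_n) - \mathbb{E}[\hat{v}(x_1,\ldots,x_n)]\right| \ge \varepsilon\right) \le 2\exp\left(-\frac{2n\varepsilon^2}{\gamma^2}\right)$$

since each coordinate changes $\hat{v}$ by at most $\gamma/n$, so the sum of squared bounded differences is $\gamma^2/n$.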

Example 1: For a given arm $a$ whose reward distribution is limited to the interval $[0,1]$, consider the safety value function $\mu_a - \rho\sigma_a^2$, which measures the balance between the mean and the variance of the reward distribution of arm $a$; $\rho$ is a constant hyper-parameter adjusting the balance between the variance and the mean. This is a well-behaved safety value function if we use the following estimators for the empirical mean and variance:

$$\hat{\mu}_{a,t} = \frac{1}{N_{a,t}} \sum_{i=1}^{N_{a,t}} r_{a,i} \qquad (5)$$

$$\hat{\sigma}_{a,t}^2 = \frac{1}{N_{a,t} - 1} \sum_{i=1}^{N_{a,t}} (r_{a,i} - \hat{\mu}_{a,t})^2 \qquad (6)$$

where $r_{a,i}$ is the $i$-th reward obtained from pulling arm $a$. It should be clear that the unbiased estimator $\hat{\mu}_{a,t} - \rho\hat{\sigma}_{a,t}^2$ satisfies (4).
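A short sketch (ours) of the estimator in (5)-(6), together with a crude numerical check of the bounded-difference condition (4); the perturbation grid and sample size are only indicative:

import numpy as np

def mean_variance_estimate(rewards, rho=1.0):
    """Mean-variance estimator: mu_hat - rho * sigma_hat^2 (eqs. 5-6)."""
    rewards = np.asarray(rewards, dtype=float)
    mu_hat = rewards.mean()
    sigma2_hat = rewards.var(ddof=1)  # unbiased sample variance
    return mu_hat - rho * sigma2_hat

rng = np.random.default_rng(0)
x = rng.random(50)                    # rewards restricted to [0, 1]
base = mean_variance_estimate(x)

# Perturb a single observation over a grid and record the largest change;
# for rewards in [0, 1] this stays of order gamma / n for a modest gamma.
worst = 0.0
for new_value in np.linspace(0.0, 1.0, 101):
    y = x.copy()
    y[0] = new_value
    worst = max(worst, abs(mean_variance_estimate(y) - base))

print(f"n = {len(x)}, largest single-sample change = {worst:.4f}")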

Other types of well-behaved safety value functions can be defined as functions of the standard deviation or of the conditional value at risk, similarly to the previous example. In the next section, we develop an algorithm which optimizes the safety-aware regret.

Proposed Algorithm

In order to optimize the safety-aware regret, we build on the BESA algorithm, which we now briefly review. As discussed in [Baransi et al., 2014], BESA is a non-parametric (hyper-parameter-free) approach for finding the optimal arm according to the expected mean regret criterion. Consider a two-armed bandit with actions $a$ and $\star$, where $\mu_\star > \mu_a$, and assume that $N_{a,t} < N_{\star,t}$ at time step $t$. In order to select the next arm for time step $t+1$, BESA first sub-samples $s_\star = I_\star(N_{\star,t}, N_{a,t})$ from the observation history (records) of arm $\star$, and similarly sub-samples $s_a = I_a(N_{a,t}, N_{a,t}) = X_{a,t}$ from the records of arm $a$. If $\hat{\mu}_{s_a} > \hat{\mu}_{s_\star}$, BESA chooses arm $a$; otherwise it chooses arm $\star$.

The main reason behind the sub-sampling is that it gives a similar opportunity to both arms. Consequently, the effect of having a small sample size, which may bias the estimates, diminishes. When there are more than two arms, BESA runs a tournament on the arms [Baransi et al., 2014].

Finally, it is worth mentioning that the proof of the regret bound of BESA uses a non-trivial lemma for which the authors did not provide a formal proof. In this paper, we avoid using this lemma when proving the soundness of our proposed algorithm for a more general regret family. We also extend the proof to the multi-armed case, which was not provided in [Baransi et al., 2014].

We are now ready to outline our proposed approach, which we call BESA+. As in [Baransi et al., 2014], we focus on the two-armed bandit. For more than two arms, a tournament can be set up in our case as well.

Algorithm BESA+ (two-action case)

Input: safety-aware value function $v$ and its estimator $\hat{v}$
Parameters: current time step $t$, actions $a$ and $b$. Initially $N_{a,0} = 0$, $N_{b,0} = 0$.

1: if $N_{a,t-1} = 0 \lor N_{a,t-1} < \log(t)$ then
2:    $a_t = a$
3: else if $N_{b,t-1} = 0 \lor N_{b,t-1} < \log(t)$ then
4:    $a_t = b$
5: else
6:    $n_{t-1} = \min\{N_{a,t-1}, N_{b,t-1}\}$
7:    $I_{a,t-1} \leftarrow I_a(N_{a,t-1}, n_{t-1})$
8:    $I_{b,t-1} \leftarrow I_b(N_{b,t-1}, n_{t-1})$
9:    Calculate $\tilde{v}_{a,t} = \hat{v}(X_{a,t-1}(I_{a,t-1}))$ and $\tilde{v}_{b,t} = \hat{v}(X_{b,t-1}(I_{b,t-1}))$
10:   $a_t = \arg\max_{i \in \{a,b\}} \tilde{v}_{i,t}$ (break ties by choosing the arm with fewer tries)
11: end if
12: return $a_t$
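For concreteness, here is an illustrative Python sketch (our own reading of the pseudocode above, not reference code from the authors) of a single BESA+ decision step for two arms, parameterized by any safety value estimator $\hat{v}$:

import math
import numpy as np

rng = np.random.default_rng(0)

def besa_plus_step(t, hist_a, hist_b, v_hat, rng=rng):
    """Choose arm 'a' or 'b' at time step t (t >= 1) following BESA+.

    hist_a, hist_b: lists of rewards observed so far for each arm.
    v_hat: estimator of the safety value function, e.g. the sample mean.
    """
    # Forced exploration: play any arm seen fewer than log(t) times.
    if len(hist_a) < max(1, math.log(t)):
        return "a"
    if len(hist_b) < max(1, math.log(t)):
        return "b"
    # Sub-sample both histories down to the size of the smaller one.
    n = min(len(hist_a), len(hist_b))
    sub_a = rng.choice(hist_a, size=n, replace=False)
    sub_b = rng.choice(hist_b, size=n, replace=False)
    va, vb = v_hat(sub_a), v_hat(sub_b)
    if va == vb:  # break ties toward the less-tried arm
        return "a" if len(hist_a) <= len(hist_b) else "b"
    return "a" if va > vb else "b"

# Usage: run BESA+ with the plain-mean objective on two Bernoulli arms.
means = {"a": 0.45, "b": 0.6}
hist = {"a": [], "b": []}
for t in range(1, 2001):
    arm = besa_plus_step(t, hist["a"], hist["b"], np.mean)
    hist[arm].append(float(rng.random() < means[arm]))
print({k: len(v) for k, v in hist.items()})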

If there is a strong prior belief that one arm is better than the other, then instead of the factor $\log(t)$ in Algorithm BESA+, one can use a factor $\alpha\log(t)$ (where $0 < \alpha < 1$ is a constant) to reduce the final regret.

The first major difference between BESA+ and BESA is the use of the safety-aware value function instead of the simple regret. A second important change is that BESA+ forces the selection of any arm that has been chosen fewer than $\log(t)$ times up to time step $t$. Essentially, this change is negligible in terms of the total expected regret, as no bound better than $\log(T)$ can be achieved, as shown by the lower bound of [Lai and Robbins, 1985]. This tweak also turns out to be vital in proving that the expected regret of the BESA+ algorithm is bounded by $\log(T)$ (a result which we present shortly).

To better understand why this modification is necessary, consider a two-armed scenario. The first arm gives a deterministic reward of $r \in [0, 0.5)$ and the second arm has a uniform distribution over the interval $[0,1]$, with expected reward 0.5. If we are only interested in the expected reward $\mu$, the algorithm should ultimately favor the second arm. However, there is a probability $r$ that the second arm's first pull falls below $r$, in which case the BESA algorithm keeps choosing the first arm forever. In contrast, BESA+ evades this problem by letting the second arm be selected enough times that it eventually becomes distinguishable from the first arm; the simulation sketch below illustrates this failure mode.
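An illustrative simulation (ours) of this failure mode, assuming plain BESA with one initial pull per arm and ties resolved in favor of the second arm; the run lengths are indicative only:

import numpy as np

rng = np.random.default_rng(1)

def run_plain_besa(r, T=1000):
    """Two arms: deterministic r vs Uniform[0,1]. Return pull counts."""
    hist = [[r], [float(rng.random())]]          # one initial pull each
    for _ in range(T):
        n = min(len(hist[0]), len(hist[1]))
        m0 = float(np.mean(rng.choice(hist[0], size=n, replace=False)))
        m1 = float(np.mean(rng.choice(hist[1], size=n, replace=False)))
        arm = 0 if m0 > m1 else 1                # ties go to the uniform (optimal) arm
        reward = r if arm == 0 else float(rng.random())
        hist[arm].append(reward)
    return len(hist[0]) - 1, len(hist[1]) - 1

r = 0.4
stuck = 0
runs = 100
for _ in range(runs):
    pulls_bad, pulls_good = run_plain_besa(r)
    if pulls_bad > pulls_good:                   # suboptimal arm dominated the run
        stuck += 1
print(f"runs where plain BESA favored the suboptimal arm: {stuck}/{runs}")
# The argument above implies this happens with probability at least r.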

We are now ready to state the main theoretical result for our proposed algorithm.

Theorem 0.1. Let $v$ be a well-behaved safety value function. Assume $\mathcal{A} = \{a, \star\}$ is a two-armed bandit with bounded rewards in $[0,1]$ and value gap $\Delta = v_\star - v_a$. Given the value $\gamma$, the expected safety-aware regret of Algorithm BESA+ up to time $T$ is upper bounded as follows:

$$R_T \le \zeta_{\Delta,\gamma}\log(T) + \theta_{\Delta,\gamma} \qquad (7)$$

where in (7), $\zeta_{\Delta,\gamma}$ and $\theta_{\Delta,\gamma}$ are constants which depend on the values of $\gamma$ and $\Delta$.

Proof. Due to the page limit, we cannot include the full proof here; we provide only a short overview. The proof mainly consists of two parts. The first part is similar to [Baransi et al., 2014], except that we use McDiarmid's Lemma [El-Yaniv and Pechyony, 2009; Tolstikhin, 2017]. For the second part, unlike [Baransi et al., 2014], we avoid the unproven lemma in their work and instead compute the upper bound directly by exploiting the log trick in our algorithm (this trick is further illustrated in the first experiment). The interested reader can visit here to see the full proof.

Theorem 0.2. Let $v$ be a well-behaved safety value function. Assume $\mathcal{A} = \{a_1, \ldots, a_{k-1}, \star\}$ is a $k$-armed bandit with bounded rewards in $[0,1]$. Without loss of generality, let the optimal arm be $\star$ and let the value gap for arm $a$ be $\Delta_a = v_\star - v_a$. Also let $\Delta_{\max} = \max_{a \in \mathcal{A}} \Delta_a$. Given the value $\gamma$, the expected safety-aware regret of Algorithm BESA+ up to time $T$ is upper bounded as follows:

$$R_T \le \Delta_{\max}\lceil\log k\rceil\,[\zeta_{\hat{a}}\log(T) + \theta_{\hat{a}}] + k\Delta_{\max}n \qquad (8)$$

where in (8), $\zeta_{\hat{a}}, \theta_{\hat{a}}$ are constants which depend on the values of $\gamma$ and $\Delta$. Moreover, $\hat{a}$ is defined as

$$\hat{a} = \arg\max_{a \in \mathcal{A}} \{\zeta_a\log(T) + \theta_a\}$$

for $T > n$.

Proof. We know that arm $\star$ has to play at most $\lceil\log k\rceil$ matches (games) in order to win a round. If it loses any of these $\lceil\log k\rceil$ games, we incur a regret at that round, and this regret is at most $\Delta_{\max}$. In the following, we use the notation $\mathbb{1}_{-a_\star,i}$ to denote the indicator of the event that arm $\star$ loses the $i$-th match ($1 \le i \le \lceil\log k\rceil$).

$$
\begin{aligned}
R_T &= \sum_{t=1}^{T}\sum_{i=1}^{k} \Delta_{a_i}\,\mathbb{E}[\mathbb{1}_{a_t = a_i}] \\
&\le \sum_{t=1}^{T}\sum_{i=1}^{\lceil\log k\rceil} \Delta_{\max}\,\mathbb{E}[\mathbb{1}_{-a_\star,i}] \\
&\le \sum_{t=1}^{T}\sum_{i=1}^{\lceil\log k\rceil} \Delta_{\max}\,\max_{i'}\{\mathbb{E}[\mathbb{1}_{-a_\star,i'}]\} \\
&\le \sum_{i=1}^{\lceil\log k\rceil} \Delta_{\max}\sum_{t=1}^{T}\max_{i'}\{\mathbb{E}[\mathbb{1}_{-a_\star,i'}]\} \\
&\le \Delta_{\max}\lceil\log k\rceil\sum_{t=n}^{T}\mathbb{E}[\mathbb{1}_{-a_\star,\hat{a}}] + k\Delta_{\max}n \\
&\le \Delta_{\max}\lceil\log k\rceil\,[\zeta_{\hat{a}}\log(T) + \theta_{\hat{a}}] + k\Delta_{\max}n \qquad (9)
\end{aligned}
$$

Empirical results

Empirical comparison of BESA and BESA+

As discussed in the previous section, BESA+ has some advantages over BESA. We illustrate the example discussed above through the results in Figures 1-3, for $r \in \{0.2, 0.3, 0.4\}$. Each experiment was repeated 200 times. Note that while BESA exhibits almost linear regret, BESA+ learns the optimal arm within the given time horizon and its expected accumulated regret is upper bounded by a logarithmic function. It is also easy to see that BESA+ converges faster than BESA. As $r$ gets closer to 0.5, the problem becomes harder; this phenomenon is a direct illustration of our theoretical result.

Figure 1: Accumulated expected regret for r = 0.4 (BESA vs. BESA+).


Figure 2: Accumulated expected regret for r = 0.3 (BESA vs. BESA+).

Figure 3: Accumulated expected regret for r = 0.2 (BESA vs. BESA+).

Conditional value at risk safety value function

As discussed in [Galichet et al., 2013], in some situations we need to limit the exploration of risky arms. Examples include financial investment, where investors may prefer a risk-averse strategy. Using the conditional value at risk as a risk measure is one approach to achieving this goal. Informally, the conditional value at risk at level $\alpha$ is defined as the expected value over the lower quantile of the reward distribution whose probability mass is at most $\alpha$. More formally:

$$CVaR_\alpha = \mathbb{E}[X \mid X < v_\alpha] \qquad (10)$$

where in (10), $v_\alpha = \arg\max_\beta\{P(X < \beta) \le \alpha\}$. To estimate (10), we use the estimator introduced by [Chen, 2007]. This estimator is also employed in [Galichet et al., 2013] to derive their MARAB algorithm. Here, we use it as the estimator of the conditional value at risk safety value function, which is the regret measure for this problem. Our environment consists of 20 arms, where each arm's reward distribution is a truncated Gaussian mixture consisting of four Gaussian components with equal probability. The rewards of the arms are restricted to the interval $[0,1]$. To make the environment more complex, the mean and standard deviation of each arm are sampled uniformly from the intervals $[0,1]$ and $[0.5,1]$, respectively. The experiments are carried out for $\alpha = 10\%$. For the MARAB algorithm, we used grid search and set the value $C = 1$. Figures 4 and 5 depict the results over ten experiments. It is noticeable that in both figures BESA+ has a lower variance across experiments.
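A minimal sketch (ours) of an empirical CVaR estimator in the spirit of the nonparametric estimator of [Chen, 2007]: average the lowest ⌈αn⌉ observed rewards. The exact estimator used in the paper may differ in its details.

import numpy as np

def empirical_cvar(rewards, alpha=0.10):
    """Empirical CVaR at level alpha: mean of the lowest ceil(alpha * n) rewards.

    Plays the role of the safety value estimator v_hat for CVaR-based
    safety-aware regret (a heavier lower tail yields a lower score)."""
    rewards = np.sort(np.asarray(rewards, dtype=float))
    m = max(1, int(np.ceil(alpha * len(rewards))))
    return float(rewards[:m].mean())

rng = np.random.default_rng(0)
safe_arm = rng.uniform(0.4, 0.6, size=500)    # narrow reward distribution
risky_arm = rng.uniform(0.0, 1.0, size=500)   # same mean, heavier lower tail
print(empirical_cvar(safe_arm), empirical_cvar(risky_arm))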

Figure 4: Accumulated regret (BESA+ vs. MARAB). The safety value function here is the conditional value at risk.

Figure 5: Percentage of optimal arm plays (BESA+ vs. MARAB). The safety value function here is the conditional value at risk.

Mean-variance safety value function

Next, we evaluated the performance of BESA+ with the regret definition provided by [Sani et al., 2012a]. Here, we used the same 20-arm Gaussian mixture environment described in the previous section, with $\rho = 1$ as the trade-off factor between variance and mean. The results of this experiment are depicted in Figures 6 and 7. The hyper-parameters used here for the MV-LCB and ExpExp algorithms are those suggested by [Sani et al., 2012a]. Again, we can see that BESA+ has a relatively small variance over 10 experiments.

Figure 6: Accumulated regret. The safety value function here is mean-variance.

Figure 7: Percentage of optimal arm plays. The safety value function here is mean-variance.

Real Clinical Trial Dataset

Finally, we examined the performance of BESA+ against other methods (BESA, UCB1, Thompson sampling, MV-LCB, and ExpExp) on a real clinical dataset. This dataset includes the survival times of patients who were suffering from lung cancer [Ripley et al., 2013]. Two different treatments (a standard treatment and a test treatment) were applied, and the outcomes record the number of days each patient survived after receiving one of the treatments. For the purpose of illustration and simplicity, we assumed non-informative censoring and equal follow-up times in both treatment groups. As the trial had already been conducted, each time a bandit algorithm selects a treatment we sample uniformly from the recorded outcomes of the patients who received that treatment and use the survival time as the reward signal. Figure 8 shows the distributions of treatments 1 and 2. We categorized the survival time into ten categories (category 1 indicating the minimum survival time). It is interesting to note that while treatment 2 has a higher mean than treatment 1 (due to the effect of outliers), it also has a higher variance than treatment 1. From Figure 8 it is easy to see that treatment 1 behaves more consistently than treatment 2 and that a larger number of patients who received treatment 2 died early. This is why treatment 1 may be preferred over treatment 2 when the safety value function described in Example 1 is used. In this regard, with $\rho = 1$, treatment 1 has lower expected mean-variance regret than treatment 2 and should ultimately be favored by the learning algorithm.
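A sketch (ours) of the resampling procedure described above: whenever the bandit algorithm picks a treatment, a recorded outcome from that treatment group is drawn uniformly at random and returned as the reward. The survival categories below are placeholders, not values from the actual dataset.

import numpy as np

rng = np.random.default_rng(0)

# Placeholder survival categories (1-10) per treatment group; the real values
# come from the lung-cancer dataset of [Ripley et al., 2013].
recorded_outcomes = {
    "treatment_1": np.array([4, 5, 5, 6, 4, 5, 6, 5]),
    "treatment_2": np.array([1, 1, 2, 9, 10, 1, 2, 10]),
}

def pull(treatment: str) -> float:
    """Simulate one patient outcome by resampling the recorded results."""
    outcomes = recorded_outcomes[treatment]
    return float(rng.choice(outcomes)) / 10.0   # rescale categories to [0, 1]

# Example: compare the mean-variance safety value of the two treatments.
rho = 1.0
for treatment in recorded_outcomes:
    samples = np.array([pull(treatment) for _ in range(1000)])
    score = samples.mean() - rho * samples.var(ddof=1)
    print(treatment, round(score, 3))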

Figure 9 illustrates the performance of the different bandit algorithms. It is easy to see that BESA+ performs better than all the others.

Figure 8: Distribution graph of survival-time categories for treatments 1 and 2.

Figure 9: Accumulated consistency-aware regret (UCB1, Thompson sampling, MV-LCB, ExpExp, BESA, BESA+).

Web Application Simulator

As discussed earlier, we have developed for this project a web application simulator for bandit problems, where users can create their own customized environment and run experiments online. Research papers usually provide only a limited set of experiments to validate their methods. We tried to overcome this limitation by developing this web application, in which the user can select the number of arms and change their reward distributions. The web application then sends the input to the web server and shows the results to the user, providing regret figures as well as additional figures describing how the algorithms chose arms over time. This software can be used as a benchmark for future bandit research and it is open-sourced for future extension. The link is provided here.

Conclusion and future work

In this paper, we developed a modular safety-aware regret definition which allows the function of interest to be used as a safety measure. We also modified the BESA algorithm and equipped it with new features to solve modular safety-aware regret bandit problems. We then analyzed the asymptotic regret of BESA+ and showed that it can perform like an admissible policy if the safety value function satisfies a mild assumption. Finally, we evaluated the performance of BESA+ on the regret definitions of previous works and showed that it performs better in most cases.

It remains interesting to investigate whether better bounds can be found for the BESA+ algorithm under the modular safety-aware regret definition. Another interesting direction would be to investigate whether a similar safety-aware regret definition can be formulated for broader reinforcement learning problems, including MDP environments.

Acknowledgment

We would like to thank Audrey Durand for her comments and insight on this project. We also thank the Department of Family Medicine of McGill University and CIHR for their generous support during this project.

References

[Agrawal and Goyal, 2013] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99-107, 2013.

[Auer et al., 2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

[Austin, 2011] Peter C. Austin. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3):399-424, 2011.

[Baransi et al., 2014] Akram Baransi, Odalric-Ambrym Maillard, and Shie Mannor. Sub-sampling for multi-armed bandits. In ECML-KDD, pages 115-131, 2014.

[Burnetas and Katehakis, 1996] Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122-142, 1996.

[Chapelle and Li, 2011] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249-2257, 2011.

[Chen, 2007] Song Xi Chen. Nonparametric estimation of expected shortfall. Journal of Financial Econometrics, 6(1):87-107, 2007.

[El-Yaniv and Pechyony, 2009] Ran El-Yaniv and Dmitry Pechyony. Transductive Rademacher complexity and its applications. Journal of Artificial Intelligence Research, 35(1):193, 2009.

[Galichet et al., 2013] Nicolas Galichet, Michele Sebag, and Olivier Teytaud. Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. In Asian Conference on Machine Learning, pages 245-260, 2013.

[García and Fernández, 2015] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437-1480, 2015.

[Kuleshov and Precup, 2014] Volodymyr Kuleshov and Doina Precup. Algorithms for multi-armed bandit problems. arXiv preprint arXiv:1402.6028, 2014.

[Lai and Robbins, 1985] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.

[Maillard, 2013] Odalric-Ambrym Maillard. Robust risk-averse stochastic multi-armed bandits. In ICML, pages 218-233, 2013.

[Ripley et al., 2013] Brian Ripley, Bill Venables, Douglas M. Bates, Kurt Hornik, Albrecht Gebhardt, and David Firth. Package 'MASS'. CRAN R, 2013.

[Robbins, 1985] Herbert Robbins. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pages 169-177. Springer, 1985.

[Sani et al., 2012a] Amir Sani, Alessandro Lazaric, and Rémi Munos. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3275-3283, 2012.

[Sani et al., 2012b] Amir Sani, Alessandro Lazaric, and Rémi Munos. Risk-aversion in multi-armed bandits. In NIPS, pages 3275-3283, 2012.

[Sutton and Barto, 1998] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

[Tolstikhin, 2017] I. O. Tolstikhin. Concentration inequalities for samples without replacement. Theory of Probability & Its Applications, 61(3):462-481, 2017.
