HAL Id: hal-01941220

https://hal.inria.fr/hal-01941220

Submitted on 30 Nov 2018


Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes

Ronan Fruit, Matteo Pirotta, Alessandro Lazaric

To cite this version:
Ronan Fruit, Matteo Pirotta, Alessandro Lazaric. Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes. 32nd Conference on Neural Information Processing Systems, Dec 2018, Montréal, Canada. hal-01941220.


Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes

Ronan Fruit, Sequel Team - Inria Lille, ronan.fruit@inria.fr
Matteo Pirotta, Sequel Team - Inria Lille, matteo.pirotta@inria.fr
Alessandro Lazaric, Facebook AI Research, lazaric@fb.com

Abstract

While designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable). This results in weakly-communicating or multi-chain MDPs. In this paper, we introduce TUCRL, the first algorithm able to perform efficient exploration-exploitation in any finite Markov Decision Process (MDP) without requiring any form of prior knowledge. In particular, for any MDP with $S^{\mathrm{C}}$ communicating states, $A$ actions and $\Gamma^{\mathrm{C}} \le S^{\mathrm{C}}$ possible communicating next states, we derive a $\widetilde{O}(D^{\mathrm{C}}\sqrt{\Gamma^{\mathrm{C}} S^{\mathrm{C}} A T})$ regret bound, where $D^{\mathrm{C}}$ is the diameter (i.e., the length of the longest shortest path between any two states) of the communicating part of the MDP. This is in contrast with existing optimistic algorithms (e.g., UCRL, Optimistic PSRL) that suffer linear regret in weakly-communicating MDPs, as well as posterior sampling or regularised algorithms (e.g., REGAL), which require prior knowledge on the bias span of the optimal policy to achieve sub-linear regret. We also prove that in weakly-communicating MDPs, no algorithm can ever achieve a logarithmic growth of the regret without first suffering a linear regret for a number of steps that is exponential in the parameters of the MDP. Finally, we report numerical simulations supporting our theoretical findings and showing how TUCRL overcomes the limitations of the state-of-the-art.

1 Introduction

Reinforcement learning (RL) [1] studies the problem of learning in sequential decision-making problems where the dynamics of the environment is unknown, but can be learnt by performing actions and observing their outcome in an online fashion. A sample-efficient RL agent must trade off the exploration needed to collect information about the environment, and the exploitation of the experience gathered so far to gain as much reward as possible. In this paper, we focus on the regret framework in infinite-horizon average-reward problems [2], where the exploration-exploitation performance is evaluated by comparing the rewards accumulated by the learning agent and an optimal policy. Jaksch et al. [2] showed that it is possible to efficiently solve the exploration-exploitation dilemma using the optimism in the face of uncertainty (OFU) principle. OFU methods build confidence intervals on the dynamics and reward (i.e., construct a set of plausible MDPs), and execute the optimal policy of the “best” MDP in the confidence region [e.g., 2, 3, 4, 5, 6]. An alternative approach is posterior sampling (PS) [7], which maintains a posterior distribution over MDPs and, at each step, samples an MDP and executes the corresponding optimal policy [e.g., 8, 9, 10, 11, 12].

Weakly-communicating MDPs and misspecified states. One of the main limitations of UCRL [2] and optimistic PSRL [12] is that they require the MDP to be communicating so that its diameter $D$ (i.e., the length of the longest path among all shortest paths between any pair of states) is finite. While assuming that all states are reachable may seem a reasonable assumption, it is rarely verified in practice. In fact, it requires a designer to carefully define a state space $\mathcal{S}$ that contains all reachable


Figure 1: Examples of non-communicating domains. Fig. 1a (Breakout) shows the initial state, one reachable state (middle) and an unreachable one (right). Fig. 1b (Mountain Car) represents a phase plane plot of the domain $(x,\dot{x}) \in [-1.2, 0.6] \times [-0.07, 0.07]$. The initial state is $(-0.5, 0)$ and the red area corresponds to non-reachable states from the initial state. Other non-reachable states may exist.

states (otherwise it may not be possible to learn the optimal policy), but excludes unreachable states (otherwise the resulting MDP would be non-communicating). This requires a considerable amount of prior knowledge about the environment. Consider a problem where we learn from images, e.g., the Atari Breakout game [13]. The state space is the set of “plausible” configurations of the brick wall, ball and paddle positions. The situation in which the wall has a hole in the middle is a valid state (e.g., as an initial state) but it cannot be observed/reached starting from a dense wall (see Fig. 1a). As such, it should be removed to obtain a “well-designed” state space. While it may be possible to design a suitable set of “reachable” states that define a communicating MDP, this is often a difficult and tedious task, sometimes even impossible. Now consider a continuous domain, e.g., the Mountain Car problem [14]. The state is described by the position $x$ and velocity $\dot{x}$ along the $x$-axis. The state space of this domain is usually defined as the Cartesian product $[-1.2, 0.6] \times [-0.07, 0.07]$. Unfortunately, this set contains configurations that are not physically reachable, as shown in Fig. 1b. The dynamics of the system is constrained by the evolution equations, so the car cannot go arbitrarily fast. At the leftmost position ($x = -1.2$) the speed $\dot{x}$ cannot exceed 0, since such a position can only be reached with velocity $\dot{x} \le 0$. To have a higher velocity, the car would need to acquire momentum from further left (i.e., $x < -1.2$), which is impossible by design ($-1.2$ is the left boundary of the position domain). The maximal speed reachable for $x > -1.2$ is attained by applying the maximum acceleration at every time step starting from the state $(x,\dot{x}) = (-1.2, 0)$. This identifies the curve reported in Fig. 1b, which marks the boundary of the unreachable region. Note that other states may not be reachable. Whenever the state space is misspecified or the MDP is weakly communicating (i.e., $D = +\infty$), OFU-based algorithms (e.g., UCRL) optimistically attribute large reward and non-zero probability to reaching states that have never been observed, and thus they tend to repeatedly attempt to explore unreachable states. This results in poor performance and linear regret. A first attempt to overcome this major limitation is REGAL.C [3] (Fruit et al. [6] recently proposed SCAL, an implementable efficient version of REGAL.C), which requires prior knowledge of an upper-bound $H$ on the span (i.e., range) of the optimal bias function $h^*$. The optimism of UCRL is then “constrained” to policies whose bias has span smaller than $H$. This implicitly “removes” non-reachable states, whose large optimistic reward would cause the span to become too large. Unfortunately, an accurate knowledge of the bias span may not be easier to obtain than designing a well-specified state space. Bartlett and Tewari [3] proposed an alternative algorithm, REGAL.D, that leverages the doubling trick [15] to avoid any prior knowledge on the span. Nonetheless, we recently noticed a major flaw in the proof of [3, Theorem 3] that questions the validity of the algorithm (see App. A for further details). PS-based algorithms also suffer from similar issues.¹ To the best of our knowledge, the only regret guarantees available in the literature for this setting are [17, 18, 19]. However, the counter-example of Osband and Roy [20] seems to invalidate the result of Abbasi-Yadkori and Szepesvári [17]. On the other hand, Ouyang et al. [18] and Theocharous et al. [19] present PS algorithms with expected Bayesian regret scaling linearly with $H$, where $H$ is an upper-bound on the optimal bias spans of all the MDPs that can be drawn from the prior distribution ([18, Asm. 1] and [19, Sec. 5]). In [18, Remark 1], the authors claim that their algorithm does not require the knowledge of $H$ to derive the regret bound. However, in App. B we show on a very simple

¹We notice that the problem of weakly-communicating MDPs and misspecified states does not hold in the more restrictive setting of finite horizon [e.g., 8] since exploration is directly tailored to the states that are reachable within the known horizon, or under the assumption of the existence of a recurrent state [e.g., 16].


example that for most continuous prior distributions (e.g., uninformative priors like Dirichlet), it is very likely that $H = +\infty$, implying that the regret bound may not hold (similarly for [19]). As a result, similarly to REGAL.C, the prior distribution should contain prior knowledge on the bias span to avoid poor performance.

In this paper, we present TUCRL, an algorithm designed to trade off exploration and exploitation in weakly-communicating and multi-chain MDPs (e.g., MDPs with misspecified states) without any prior knowledge and under the only assumption that the agent starts from a state in a communicating subset of the MDP (Sec. 3). In communicating MDPs, TUCRL eventually (after a finite number of steps) performs as UCRL, thus achieving problem-dependent logarithmic regret. When the true MDP is weakly-communicating, we prove that TUCRL achieves a $\widetilde{O}(\sqrt{T})$ regret with polynomial dependence on the MDP parameters. We also show that it is not possible to design an algorithm achieving logarithmic regret in weakly-communicating MDPs without having an exponential dependence on the MDP parameters (see Sec. 5). TUCRL is the first computationally tractable algorithm in the OFU literature that is able to adapt to the MDP nature without any prior knowledge. The theoretical findings are supported by experiments on several domains (see Sec. 4).

2 Preliminaries

We consider a finite weakly-communicating Markov decision process [21, Sec. 8.3] $M = \langle \mathcal{S}, \mathcal{A}, r, p \rangle$ with a set of states $\mathcal{S}$ and a set of actions $\mathcal{A} = \bigcup_{s\in\mathcal{S}} \mathcal{A}_s$. Each state-action pair $(s,a) \in \mathcal{S}\times\mathcal{A}_s$ is characterized by a reward distribution with mean $r(s,a)$ and support in $[0, r_{\max}]$, as well as a transition probability distribution $p(\cdot|s,a)$ over next states. In a weakly-communicating MDP, the state space $\mathcal{S}$ can be partitioned into two subspaces [21, Section 8.3.1]: a communicating set of states (denoted $\mathcal{S}^{\mathrm{C}}$ in the rest of the paper), with each state in $\mathcal{S}^{\mathrm{C}}$ accessible, with non-zero probability, from any other state in $\mathcal{S}^{\mathrm{C}}$ under some stationary deterministic policy; and a possibly empty set of states that are transient under all policies (denoted $\mathcal{S}^{\mathrm{T}}$). We also denote by $S = |\mathcal{S}|$, $S^{\mathrm{C}} = |\mathcal{S}^{\mathrm{C}}|$ and $A = \max_{s\in\mathcal{S}} |\mathcal{A}_s|$ the number of states and actions, and by $\Gamma^{\mathrm{C}} = \max_{s\in\mathcal{S}^{\mathrm{C}}, a\in\mathcal{A}_s} \|p(\cdot|s,a)\|_0$ the maximum support of all transition probabilities $p(\cdot|s,a)$ with $s\in\mathcal{S}^{\mathrm{C}}$. The sets $\mathcal{S}^{\mathrm{C}}$ and $\mathcal{S}^{\mathrm{T}}$ form a partition of $\mathcal{S}$, i.e., $\mathcal{S}^{\mathrm{C}} \cap \mathcal{S}^{\mathrm{T}} = \emptyset$ and $\mathcal{S}^{\mathrm{C}} \cup \mathcal{S}^{\mathrm{T}} = \mathcal{S}$. A deterministic policy $\pi: \mathcal{S} \to \mathcal{A}$ maps states to actions and has an associated long-term average reward (or gain) and a bias function defined as
$$ g^\pi_M(s) := \lim_{T\to\infty} \mathbb{E}\Big[\frac{1}{T}\sum_{t=1}^T r\big(s_t, \pi(s_t)\big)\Big]; \qquad h^\pi_M(s) := \underset{T\to\infty}{\text{C-}\lim}\ \mathbb{E}\Big[\sum_{t=1}^T \big(r(s_t,\pi(s_t)) - g^\pi_M(s_t)\big)\Big], $$

where the bias $h^\pi_M(s)$ measures the expected total difference between the rewards accumulated by $\pi$ starting from $s$ and the stationary reward in Cesaro-limit (denoted $\text{C-}\lim$). Accordingly, the difference of bias values $h^\pi_M(s) - h^\pi_M(s')$ quantifies the (dis-)advantage of starting in state $s$ rather than in $s'$. In the following, we drop the dependency on $M$ whenever clear from the context and denote by $\mathrm{sp}_{\mathcal{S}}\{h^\pi\} := \max_{s\in\mathcal{S}} h^\pi(s) - \min_{s\in\mathcal{S}} h^\pi(s)$ the span of the bias function. In weakly communicating MDPs, any optimal policy $\pi^* \in \arg\max_\pi g^\pi(s)$ has constant gain, i.e., $g^{\pi^*}(s) = g^*$ for all $s \in \mathcal{S}$.

Finally, we denote by $D$, resp. $D^{\mathrm{C}}$, the diameter of $M$, resp. the diameter of the communicating part of $M$ (i.e., restricted to the set $\mathcal{S}^{\mathrm{C}}$):
$$ D := \max_{(s,s')\in\mathcal{S}\times\mathcal{S},\, s\neq s'} \{\tau_M(s\to s')\}, \qquad D^{\mathrm{C}} := \max_{(s,s')\in\mathcal{S}^{\mathrm{C}}\times\mathcal{S}^{\mathrm{C}},\, s\neq s'} \{\tau_M(s\to s')\}, \qquad (1) $$
where $\tau_M(s\to s')$ is the expected time of the shortest path from $s$ to $s'$ in $M$.
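When the model is known, the diameters $D$ and $D^{\mathrm{C}}$ defined in (1) can be computed by solving one unit-cost stochastic shortest path problem per target state. The sketch below is a simple illustration under that assumption (a learning agent, of course, has no access to `P`); if some state cannot reach the target, the corresponding estimate keeps growing with the number of iterations, reflecting an infinite diameter.

```python
import numpy as np

def expected_hitting_times(P, target, n_iter=10_000, tol=1e-9):
    """tau_M(s -> target) for all s: value iteration on the unit-cost SSP
    where `target` is absorbing and every step costs 1."""
    S, A, _ = P.shape
    tau = np.zeros(S)
    for _ in range(n_iter):
        # Bellman backup: one step plus expected remaining time, minimized over actions
        q = 1.0 + np.einsum("sap,p->sa", P, tau)
        new_tau = q.min(axis=1)
        new_tau[target] = 0.0
        if np.max(np.abs(new_tau - tau)) < tol:
            return new_tau
        tau = new_tau
    return tau

def diameter(P, states=None):
    """Max over ordered pairs (s, s') in `states` of tau_M(s -> s')."""
    S = P.shape[0]
    states = list(range(S)) if states is None else list(states)
    D = 0.0
    for sp in states:
        tau = expected_hitting_times(P, sp)
        D = max(D, max(tau[s] for s in states if s != sp))
    return D

# D^C is obtained by restricting `states` to the communicating set S^C.
```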

Learning problem. Let $M^*$ be the true (unknown) weakly-communicating MDP. We consider the learning problem where $\mathcal{S}$, $\mathcal{A}$ and $r_{\max}$ are known, while the sets $\mathcal{S}^{\mathrm{C}}$ and $\mathcal{S}^{\mathrm{T}}$, the rewards $r$ and the transition probabilities $p$ are unknown and need to be estimated on-line. We evaluate the performance of a learning algorithm $\mathbb{A}$ after $T$ time steps by its cumulative regret $\Delta(\mathbb{A}, T) = T g^* - \sum_{t=1}^T r_t(s_t, a_t)$. Furthermore, we state the following assumption.

Assumption 1. The initial state $s_1$ belongs to the communicating set of states $\mathcal{S}^{\mathrm{C}}$.

While this assumption somehow restricts the scenario we consider, it is fairly common in practice. For example, all the domains that are characterized by the presence of a resetting distribution (e.g., episodic problems) satisfy this assumption (e.g., mountain car, cart pole, Atari games, taxi, etc.).


Multi-chain MDPs. While we consider weakly-communicating MDPs for ease of notation, all our results extend to the more general case of multi-chain MDPs.³ In this case, there may be multiple communicating and transient sets of states and the optimal gain $g^*$ is different in each communicating subset. In this case we define $\mathcal{S}^{\mathrm{C}}$ as the set of states that are accessible, with non-zero probability, from $s_1$ ($s_1$ included) under some stationary deterministic policy. $\mathcal{S}^{\mathrm{T}}$ is defined as the complement of $\mathcal{S}^{\mathrm{C}}$ in $\mathcal{S}$, i.e., $\mathcal{S}^{\mathrm{T}} := \mathcal{S}\setminus\mathcal{S}^{\mathrm{C}}$. With these new definitions of $\mathcal{S}^{\mathrm{C}}$ and $\mathcal{S}^{\mathrm{T}}$, Asm. 1 needs to be reformulated as follows:

Assumption 1 for Multi-chain MDPs. The initial state $s_1$ is accessible, with non-zero probability, from any other state in $\mathcal{S}^{\mathrm{C}}$ under some stationary deterministic policy. Equivalently, $\mathcal{S}^{\mathrm{C}}$ is a communicating set of states.

Note that the states belonging to $\mathcal{S}^{\mathrm{T}}$ can either be transient or belong to other communicating subsets of the MDP disjoint from $\mathcal{S}^{\mathrm{C}}$. It does not really matter, because the states in $\mathcal{S}^{\mathrm{T}}$ will never be visited by definition. As a result, the regret is still defined as before, where the learning performance is compared to the optimal gain $g^*(s_1)$ related to the communicating set of states $\mathcal{S}^{\mathrm{C}} \ni s_1$.
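Under the assumption that the transition supports are known, the set $\mathcal{S}^{\mathrm{C}}$ of this multi-chain reformulation is simply the forward-reachable set of $s_1$ in the directed graph whose edges are the pairs $(s,s')$ with $p(s'|s,a) > 0$ for some action $a$. A minimal sketch (the `support` function is a hypothetical helper, not from the paper):

```python
from collections import deque

def communicating_from(s1, support):
    """States accessible (with non-zero probability) from s1 under some policy.

    `support(s)` is assumed to return the set of states s' such that
    p(s'|s,a) > 0 for at least one action a.
    """
    reached = {s1}
    queue = deque([s1])
    while queue:
        s = queue.popleft()
        for sp in support(s):
            if sp not in reached:
                reached.add(sp)
                queue.append(sp)
    return reached   # S^C; its complement in S is S^T
```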

3 Truncated Upper-Confidence for Reinforcement Learning (TUCRL)

In this section we introduce Truncated Upper-Confidence for Reinforcement Learning (TUCRL), an optimistic online RL algorithm that efficiently balances exploration and exploitation to learn in non-communicating MDPs without prior knowledge (Fig. 2).

Similar to UCRL, at the beginning of each episode $k$, TUCRL constructs confidence intervals for the reward and the dynamics of the MDP. Formally, for any $(s,a)\in\mathcal{S}\times\mathcal{A}$ we define
$$B_{p,k}(s,a) = \Big\{\tilde p(\cdot|s,a)\in\mathcal{C} : \forall s'\in\mathcal{S},\ |\tilde p(s'|s,a) - \hat p(s'|s,a)| \le \beta^{sas'}_{p,k}\Big\}, \qquad (2)$$
$$B_{r,k}(s,a) := \big[\hat r_k(s,a) - \beta^{sa}_{r,k},\ \hat r_k(s,a) + \beta^{sa}_{r,k}\big]\cap[0, r_{\max}], \qquad (3)$$
where $\mathcal{C} = \{p\in\mathbb{R}^S \mid \forall s',\ p(s')\ge 0 \wedge \sum_{s'} p(s') = 1\}$ is the $(S-1)$-probability simplex, while the size of the confidence intervals is constructed using the empirical Bernstein's inequality [22, 23] as
$$\beta^{sa}_{r,k} := \sqrt{\frac{14\,\hat\sigma^2_{r,k}(s,a)\, b_{k,\delta}}{N^+_k(s,a)}} + \frac{49}{3}\frac{r_{\max}\, b_{k,\delta}}{N^\pm_k(s,a)}, \qquad \beta^{sas'}_{p,k} := \sqrt{\frac{14\,\hat\sigma^2_{p,k}(s'|s,a)\, b_{k,\delta}}{N^+_k(s,a)}} + \frac{49}{3}\frac{b_{k,\delta}}{N^\pm_k(s,a)},$$
where $N_k(s,a)$ is the number of visits to $(s,a)$ before episode $k$, $N^+_k(s,a) := \max\{1, N_k(s,a)\}$, $N^\pm_k(s,a) := \max\{1, N_k(s,a) - 1\}$, $\hat\sigma^2_{r,k}(s,a)$ and $\hat\sigma^2_{p,k}(s'|s,a)$ are the empirical variances of $r(s,a)$ and $p(s'|s,a)$, and $b_{k,\delta} = \ln(2SAt_k/\delta)$. The set of plausible MDPs associated with the confidence intervals is then $\mathcal{M}_k = \big\{M = \langle\mathcal{S}, \mathcal{A}, \tilde r, \tilde p\rangle : \tilde r(s,a)\in B_{r,k}(s,a),\ \tilde p(\cdot|s,a)\in B_{p,k}(s,a)\big\}$.
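For concreteness, here is a minimal sketch (not the authors' implementation) of how the empirical-Bernstein widths $\beta^{sa}_{r,k}$ and $\beta^{sas'}_{p,k}$ above could be computed from visit counts and empirical moments; the array shapes and variable names are illustrative assumptions.

```python
import numpy as np

def bernstein_bounds(N, r_var, p_var, t_k, delta, r_max=1.0):
    """Empirical-Bernstein confidence widths used to build B_{r,k} and B_{p,k}.

    N     : (S, A)    visit counts before episode k
    r_var : (S, A)    empirical reward variances
    p_var : (S, A, S) empirical transition variances p_hat * (1 - p_hat)
    """
    S, A = N.shape
    b = np.log(2 * S * A * t_k / delta)        # b_{k,delta}
    N_plus = np.maximum(1, N)                   # N^+_k
    N_pm = np.maximum(1, N - 1)                 # N^{+/-}_k
    beta_r = np.sqrt(14 * r_var * b / N_plus) + 49.0 / 3.0 * r_max * b / N_pm
    beta_p = (np.sqrt(14 * p_var * b / N_plus[:, :, None])
              + 49.0 / 3.0 * b / N_pm[:, :, None])
    return beta_r, beta_p
```

The reward interval of Eq. 3 is then $[\hat r_k - \beta_{r,k},\ \hat r_k + \beta_{r,k}]\cap[0,r_{\max}]$ componentwise.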

UCRL is optimistic w.r.t. the confidence intervals, so that for all states $s$ that have never been visited the optimistic reward $\tilde r(s,a)$ is set to $r_{\max}$, while all transitions to $s$ (i.e., $\tilde p(s|\cdot,\cdot)$) are set to the largest value compatible with $B_{p,k}(\cdot,\cdot)$. Unfortunately, some of the states with $N_k(s,a) = 0$ may actually be unreachable (i.e., $s\in\mathcal{S}^{\mathrm{T}}$) and UCRL would uniformly explore the policy space with the hope that at least one policy reaches those (optimistically desirable) states. TUCRL addresses this issue by first constructing empirical estimates of $\mathcal{S}^{\mathrm{C}}$ and $\mathcal{S}^{\mathrm{T}}$ (i.e., the sets of communicating and transient states in $M^*$) using the states that have been visited so far, that is $\mathcal{S}^{\mathrm{C}}_k := \big\{s\in\mathcal{S} \mid \sum_{a\in\mathcal{A}_s} N_k(s,a) > 0\big\}\cup\{s_{t_k}\}$ and $\mathcal{S}^{\mathrm{T}}_k := \mathcal{S}\setminus\mathcal{S}^{\mathrm{C}}_k$, where $t_k$ is the starting time of episode $k$.

In order to avoid optimistic exploration attempts towards unreachable states, we could simply execute UCRL on $\mathcal{S}^{\mathrm{C}}_k$, which is guaranteed to contain only states in the communicating set (since $s_1\in\mathcal{S}^{\mathrm{C}}$ by Asm. 1, we have that $\mathcal{S}^{\mathrm{C}}_k\subseteq\mathcal{S}^{\mathrm{C}}$). Nonetheless, this algorithm could under-explore state-action pairs that would allow discovering other states in $\mathcal{S}^{\mathrm{C}}$, thus getting stuck in a subset of the communicating states of the MDP and suffering linear regret. While the states in $\mathcal{S}^{\mathrm{C}}_k$ are guaranteed to be in the communicating subset, it is not possible to know whether states in $\mathcal{S}^{\mathrm{T}}_k$ are actually reachable from $\mathcal{S}^{\mathrm{C}}_k$ or not. Then TUCRL first “guesses” a lower bound on the probability of transition from states $s\in\mathcal{S}^{\mathrm{C}}_k$ to $s'\in\mathcal{S}^{\mathrm{T}}_k$ and, whenever the maximum transition probability from $s$ to $s'$ compatible with the confidence intervals (i.e., $\hat p_k(s'|s,a) + \beta^{sas'}_{p,k}$) is below the lower bound, it assumes that such a transition is not possible.

³In the case of misspecified states, we implicitly define a multi-chain MDP, where each non-reachable state has a self-loop dynamics and it defines a “singleton” communicating subset.


Input: Confidence $\delta\in]0,1[$, $r_{\max}$, $\mathcal{S}$, $\mathcal{A}$.
Initialization: Set $N_0(s,a) := 0$ for any $(s,a)\in\mathcal{S}\times\mathcal{A}$, $t := 1$ and observe $s_1$.
For episodes $k = 1, 2, \ldots$ do
  1. Set $t_k = t$ and episode counters $\nu_k(s,a) = 0$
  2. Compute estimates $\hat p_k(s'|s,a)$, $\hat r_k(s,a)$ and a set $\overline{\mathcal{M}}_k$
  3. Compute an $r_{\max}/\sqrt{t_k}$-approximation $\tilde\pi_k$ of Eq. 5
  4. While $t_k == t$ or $\big(\sum_{a\in\mathcal{A}_{s_t}} N_k(s_t,a) > 0$ and $\nu_k(s_t,\tilde\pi_k(s_t)) \le \max\{1, N_k(s_t,\tilde\pi_k(s_t))\}\big)$ do
     (a) Execute $a_t = \tilde\pi_k(s_t)$, obtain reward $r_t$, and observe $s_{t+1}$
     (b) Set $\nu_k(s_t,a_t) \mathrel{+}= 1$ and set $t \mathrel{+}= 1$
  5. Set $N_{k+1}(s,a) = N_k(s,a) + \nu_k(s,a)$

Figure 2: TUCRL algorithm.
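The control flow of Fig. 2 can be rendered schematically as follows (a sketch only: `env` is a hypothetical environment with a Gym-like `step`/`reset` interface and `truncated_evi` is a placeholder for the planner of step 3; neither is part of the paper).

```python
import numpy as np

def tucrl(env, S, A, r_max, delta, T_max, truncated_evi):
    N = np.zeros((S, A), dtype=int)        # N_k(s, a): visits before episode k
    t = 1
    s = env.reset()
    while t <= T_max:
        t_k = t
        nu = np.zeros((S, A), dtype=int)   # nu_k(s, a): visits within episode k
        # Steps 2-3: build estimates/confidence sets and compute an
        # (r_max / sqrt(t_k))-approximation of Eq. 5 (placeholder planner).
        pi_k = truncated_evi(N, t_k, delta, accuracy=r_max / np.sqrt(t_k))
        while t <= T_max:
            a = pi_k[s]
            s_next, r = env.step(a)[:2]    # 4.(a) execute, observe reward and next state
            nu[s, a] += 1                  # 4.(b) update in-episode counters
            t += 1
            s = s_next
            # Stopping condition of line 4: stop as soon as a previously unvisited
            # state is reached, or the in-episode visits exceed max{1, N_k}.
            if N[s].sum() == 0 or nu[s, pi_k[s]] > max(1, N[s, pi_k[s]]):
                break
        N += nu                            # 5. N_{k+1} = N_k + nu_k
```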

This strategy is based on the intuition that a transition either does not exist or it should have a sufficiently “big” mass. However, these transitions should be periodically reconsidered in order to avoid under-exploration issues. More formally, let $(\rho_t)_{t\in\mathbb{N}}$ be a non-increasing sequence to be defined later. For all $s'\in\mathcal{S}^{\mathrm{T}}_k$, $s\in\mathcal{S}^{\mathrm{C}}_k$ and $a\in\mathcal{A}_s$, the empirical mean $\hat p_k(s'|s,a)$ and variance $\hat\sigma^2_{p,k}(s'|s,a)$ are zero (i.e., this transition has never been observed so far), so the largest (most optimistic) probability of transition from $s$ to $s'$ through any action $a$ is $\tilde p^+_k(s'|s,a) = \frac{49}{3}\frac{b_{k,\delta}}{N^\pm_k(s,a)}$. TUCRL compares $\tilde p^+_k(s'|s,a)$ to $\rho_{t_k}$ and forces all transition probabilities below the threshold to zero, while the confidence intervals of transitions to states that have already been explored (i.e., in $\mathcal{S}^{\mathrm{C}}_k$) are preserved unchanged. This corresponds to constructing the alternative confidence interval
$$\overline{B}_{p,k}(s,a) = B_{p,k}(s,a)\cap\big\{\tilde p(\cdot|s,a)\in\mathcal{C} : \forall s'\in\mathcal{S}^{\mathrm{T}}_k \text{ such that } \tilde p^+_k(s'|s,a) < \rho_{t_k},\ \tilde p(s'|s,a) = 0\big\}. \qquad (4)$$

Given $\overline{B}_{p,k}$, TUCRL (implicitly) constructs the corresponding set of plausible MDPs $\overline{\mathcal{M}}_k$ and then solves the optimistic optimization problem
$$(\widetilde M_k, \tilde\pi_k) = \arg\max_{M\in\overline{\mathcal{M}}_k,\ \pi}\{g^\pi_M\}. \qquad (5)$$
The resulting algorithm follows the same structure as UCRL and is shown in Fig. 2. The episode stopping condition at line 4 is slightly modified w.r.t. UCRL. In fact, it guarantees that one action is always executed and it forces an episode to terminate as soon as a state previously in $\mathcal{S}^{\mathrm{T}}_k$ is visited (i.e., $N_k(s_t,a) = 0$). This minor change guarantees that $N_{k+1}(s,a) = 0$ for all the states $s\in\mathcal{S}^{\mathrm{T}}_k$ that were not reachable at the beginning of the episode. The algorithm also needs minor modifications to the extended value iteration (EVI) algorithm used to solve (5) to guarantee both efficiency and convergence. All technical details are reported in App. C.

In practice, we set $\rho_t = \frac{49\, b_{t,\delta}}{3}\sqrt{\frac{SA}{t}}$, so that the condition to remove a transition reduces to $N^\pm_k(s,a) > \sqrt{t_k/SA}$. This shows that only transitions from state-action pairs that have been poorly visited so far are enabled, while if the state-action pair has already been tried often and yet no transition to $s'\in\mathcal{S}^{\mathrm{T}}_k$ has been observed, then it is assumed that $s'$ is not reachable from $(s,a)$. When the number of visits in $(s,a)$ is big, the transitions to “unvisited” states should be discarded because, if the transition actually exists, it is most likely extremely small and so it is worth exploring other parts of the MDP first. Symmetrically, when the number of visits in $(s,a)$ is small, the transitions to “unvisited” states should be enabled because the transitions are quite plausible and the algorithm should try to explore the outcome of taking action $a$ in $s$ and possibly reach states in $\mathcal{S}^{\mathrm{T}}_k$. We denote the set of state-action pairs that are not sufficiently explored by $K_k = \big\{(s,a)\in\mathcal{S}^{\mathrm{C}}_k\times\mathcal{A} : N^\pm_k(s,a)\le\sqrt{t_k/SA}\big\}$.

3.1 Analysis of TUCRL

We prove that the regret of TUCRL is bounded as follows.

Theorem 1. For any weakly communicating MDP $M$, with probability at least $1-\delta$ it holds that for any $T > 1$, the regret of TUCRL is bounded as
$$\Delta(\mathrm{TUCRL}, T) = O\left( r_{\max} D^{\mathrm{C}} \sqrt{\Gamma^{\mathrm{C}} S^{\mathrm{C}} A T \ln\Big(\frac{SAT}{\delta}\Big)} + r_{\max} \big(D^{\mathrm{C}}\big)^2 S^3 A \ln^2\Big(\frac{SAT}{\delta}\Big) \right).$$


The first term in the regret shows the ability of TUCRL to adapt to the communicating part of the true MDP $M^*$ by scaling with the communicating diameter $D^{\mathrm{C}}$ and MDP parameters $S^{\mathrm{C}}$ and $\Gamma^{\mathrm{C}}$. The second term corresponds to the regret incurred in the early stage, where the regret grows linearly. When $M^*$ is communicating, we match the square-root term of UCRL (first term), while the second term is just slightly bigger than the one appearing in UCRL by a multiplicative factor $S^2/\Gamma^{\mathrm{C}}$ (ignoring logarithmic terms, see Sec. 5).

We now provide a sketch of the proof of Thm. 1 (the full proof is reported in App. D). In order to preserve readability, all following inequalities should be interpreted up to minor approximations and with high probability.

Let $\Delta_k := \sum_{s,a}\nu_k(s,a)\big(g^* - r(s,a)\big)$ be the regret incurred in episode $k$, where $\nu_k(s,a)$ is the number of visits to $(s,a)$ in episode $k$. We decompose the regret as
$$\Delta(\mathrm{TUCRL},T) \lesssim \sum_{k=1}^m \Delta_k \cdot \mathbb{1}\{M^* \in \mathcal{M}_k\} \lesssim \sum_{k=1}^m \Delta_k \cdot \mathbb{1}\{t_k < C(k)\} + \sum_{k=1}^m \Delta_k \cdot \mathbb{1}\{t_k \ge C(k)\},$$
where $C(k) = O\big((D^{\mathrm{C}})^2 S^3 A \ln^2(2SAt_k/\delta)\big)$ defines the length of a full exploratory phase, during which the agent may suffer linear regret.

Optimism. The first technical difficulty is that, whenever some transitions are disabled, the plausible set of MDPs $\overline{\mathcal{M}}_k$ may actually be biased and not contain the true MDP $M^*$. This requires proving that TUCRL (i.e., the gain of the solution returned by EVI) is always optimistic despite “wrong” confidence intervals. The following lemma helps to identify the possible scenarios that TUCRL can produce (see App. D.2).⁴

Lemma 1. Let episode $k$ be such that $M^*\in\mathcal{M}_k$, $\mathcal{S}^{\mathrm{T}}_k\ne\emptyset$ and $t_k\ge C(k)$. Then, either $\mathcal{S}^{\mathrm{T}}_k = \mathcal{S}^{\mathrm{T}}$ (case I) or $K_k\ne\emptyset$, i.e., $\exists(s,a)\in\mathcal{S}^{\mathrm{C}}_k\times\mathcal{A}$ for which transitions to $\mathcal{S}^{\mathrm{T}}_k$ are allowed (case II).

This result basically excludes the case where $\mathcal{S}^{\mathrm{T}}_k\supset\mathcal{S}^{\mathrm{T}}$ (i.e., some states have not been reached) and yet no transition from $\mathcal{S}^{\mathrm{C}}_k$ to them is enabled. We start by noticing that when $\mathcal{S}^{\mathrm{T}}_k = \emptyset$, the true MDP $M^*\in\overline{\mathcal{M}}_k = \mathcal{M}_k$ w.h.p. by construction of the confidence intervals. Similarly, if $\mathcal{S}^{\mathrm{T}}_k = \mathcal{S}^{\mathrm{T}}$ then $M^*\in\overline{\mathcal{M}}_k$ w.h.p., since TUCRL only truncates transitions that are indeed forbidden in $M^*$ itself. In both cases, we can use the same arguments as in [2] to prove optimism. In case II the gain of any state $s'\in\mathcal{S}^{\mathrm{T}}_k$ is set to $r_{\max}$ and, since there exists a path from $\mathcal{S}^{\mathrm{C}}_k$ to $\mathcal{S}^{\mathrm{T}}_k$, the gain of the solution returned by EVI is $r_{\max}$, which makes it trivially optimistic. As a result we can conclude that $\tilde g_k\gtrsim g^*$ (up to the precision of EVI).

Per-episode regret. After bounding the optimistic reward $\tilde r_k(s,a)$ w.r.t. $r(s,a)$, the only part left to bound in the per-episode regret $\Delta_k$ is the term $\widetilde\Delta_k = \sum_{s,a}\nu_k(s,a)(\tilde g_k - \tilde r_k(s,a))$. Similar to UCRL, we could use the (optimistic) optimality equation and rewrite $\widetilde\Delta_k$ as
$$\widetilde\Delta_k = \sum_{s\in\mathcal{S}}\nu_k\big(s,\tilde\pi_k(s)\big)\Big(\sum_{s'\in\mathcal{S}}\tilde p_k\big(s'|s,\tilde\pi_k(s)\big)\tilde h_k(s') - \tilde h_k(s)\Big) = \nu_k^{\mathsf{T}}\big(\widetilde P_k - I\big)w_k \qquad (6)$$
where $w_k := \tilde h_k - \min_{s\in\mathcal{S}}\{\tilde h_k\}e$ is a shifted version of the vector $\tilde h_k$ returned by EVI at episode $k$, and then proceed by bounding the difference between $\widetilde P_k$ and $P_k$ using standard concentration inequalities. Nonetheless, we would be left with the problem of bounding the $\ell_\infty$-norm of $w_k$ (i.e., the range of the optimistic vector $\tilde h_k$) over the whole state space, i.e., $\|w_k\|_\infty = \mathrm{sp}_{\mathcal{S}}\{\tilde h_k\} = \max_{s\in\mathcal{S}}\tilde h_k(s) - \min_{s\in\mathcal{S}}\tilde h_k(s)$. While in communicating MDPs it is possible to bound this quantity by the diameter of the MDP as $\mathrm{sp}_{\mathcal{S}}\{\tilde h_k\}\le D$ [2, Sec. 4.3], in weakly-communicating MDPs $D = +\infty$, thus making this result uninformative. As a result, we need to restrict our attention to the subset of communicating states $\mathcal{S}^{\mathrm{C}}$, where the diameter is finite. We then split the per-step regret over states depending on whether they are explored enough or not as $\Delta_k \lesssim \sum_{s,a}\nu_k(s,a)(\tilde g_k - \tilde r_k(s,a))\mathbb{1}\{(s,a)\notin K_k\} + r_{\max}\sum_{s,a}\nu_k(s,a)\mathbb{1}\{(s,a)\in K_k\}$. We start by focusing on the poorly visited state-action pairs, i.e., $(s,a)\in K_k$. In this case TUCRL may suffer the maximum per-step regret $r_{\max}$, but the number of times this event happens is cumulatively “small” (App. D.4.1):

⁴Notice that $M^*\in\mathcal{M}_k$


Figure 3: Cumulative regret in the taxi with misspecified states (left-top) and in the communicating taxi (left-bottom), and in the weakly communicating three-states domain with $D = +\infty$ (right). Confidence intervals $\beta_{r,k}$ and $\beta_{p,k}$ are shrunk by a factor 0.05 and 0.01 for the three-states domain and taxi, respectively. Results are averaged over 20 runs and 95% confidence intervals are reported.

Lemma 2. For any $T \ge 1$ and any sequence of states and actions $\{s_1, a_1, \ldots, s_T, a_T\}$ we have:
$$\sum_{k=1}^m \sum_{s,a} \nu_k(s,a)\,\mathbb{1}\Big\{\underbrace{N^\pm_k(s,a) \le \sqrt{t_k/SA}}_{(s,a)\in K_k}\Big\} \le \sum_{t=1}^T \mathbb{1}\Big\{N^\pm_{k_t}(s_t,a_t) \le \sqrt{t/SA}\Big\} \le 2\sqrt{S^{\mathrm{C}}AT} + S^{\mathrm{C}}A$$
When $(s,a)\notin K_k$ (i.e., $N^\pm_k(s,a) > \sqrt{t_k/SA}$ holds), $\sum_{s,a}\nu_k(s,a)(\tilde g_k - \tilde r_k(s,a))\cdot\mathbb{1}\{(s,a)\notin K_k\}$ can be bounded as in Eq. 6 but now restricted to $\mathcal{S}^{\mathrm{C}}_k$, so that
$$\nu_k\big(\widetilde P_k - I\big)\tilde h_k = \sum_{s\in\mathcal{S}^{\mathrm{C}}_k} \nu_k\big(s,\tilde\pi_k(s)\big)\Big(\sum_{s'\in\mathcal{S}^{\mathrm{C}}_k}\tilde p_k\big(s'|s,\tilde\pi_k(s)\big)w_k(s') - w_k(s)\Big).$$

Since the stopping condition guarantees that $\nu_k(s,\tilde\pi_k(s)) = 0$ for all $s\in\mathcal{S}^{\mathrm{T}}_k$, we can first restrict the outer summation to states in $\mathcal{S}^{\mathrm{C}}_k$. Furthermore, all state-action pairs $(s,a)\notin K_k$ are such that the optimistic transition probability $\tilde p_k(s'|s,a)$ is forced to zero for all $s'\in\mathcal{S}^{\mathrm{T}}_k$, thus reducing the inner summation. We are then left with providing a bound for the range of $w_k$ restricted to the states in $\mathcal{S}^{\mathrm{C}}_k$, i.e., $\mathrm{sp}_{\mathcal{S}^{\mathrm{C}}_k}\{w_k\} = \max_{s\in\mathcal{S}^{\mathrm{C}}_k}\{w_k\}$. We recall that EVI run on a set of plausible MDPs $\overline{\mathcal{M}}_k$ returns a function $\tilde h_k$ such that $\tilde h_k(s') - \tilde h_k(s) \le r_{\max}\cdot\tau_{\overline{\mathcal{M}}_k}(s\to s')$, for any pair $s,s'\in\mathcal{S}$, where $\tau_{\overline{\mathcal{M}}_k}(s\to s')$ is the expected shortest path in the extended MDP $\overline{\mathcal{M}}_k$. Furthermore, since $M^*\in\mathcal{M}_k$, for all $s,s'\in\mathcal{S}^{\mathrm{C}}_k$, $\tau_{\mathcal{M}_k}(s\to s')\le D^{\mathrm{C}}$. Unfortunately, since $M^*$ may not belong to $\overline{\mathcal{M}}_k$, the bound on the shortest path in $\mathcal{M}_k$ (i.e., $\tau_{\mathcal{M}_k}(s\to s')$) may not directly translate into a bound for the shortest path in $\overline{\mathcal{M}}_k$, thus preventing us from bounding the range of $\tilde h_k$ even on the subset of states in $\mathcal{S}^{\mathrm{C}}_k$. Nonetheless, in App. E we show that a minor modification to the confidence intervals of $\overline{\mathcal{M}}_k$ makes the shortest paths between any two states $s,s'\in\mathcal{S}^{\mathrm{C}}_k$ equivalent in both sets of plausible MDPs, thus providing the bound $\mathrm{sp}_{\mathcal{S}^{\mathrm{C}}_k}\{w_k\}\le D^{\mathrm{C}}$. The final regret in Thm. 1 is then obtained by combining all different terms.

4 Experiments

In this section, we present experiments to validate the theoretical findings of Sec. 3. We compare TUCRL against UCRL and SCAL.⁵ We first consider the taxi problem [24] implemented in OpenAI Gym [25]. Even such a simple domain contains misspecified states, since the state space is constructed as the outer product of the taxi position, the passenger position and the destination.

⁵To the best of our knowledge, there exists no implementable algorithm to solve the optimization step of REGAL and REGAL.D.



Figure 4: (a) Expected regret of UCRL (with known horizon $T$ given as input) as a function of $T$. (b), (c) Toy example illustrating the difficulty of learning non-communicating MDPs. We represent a family of possible MDPs $\mathcal{M} = (M_\varepsilon)_{\varepsilon\in[0,1]}$ where the probability $\varepsilon$ to go from $x$ to $y$ lies in $[0,1]$.

This leads to states that cannot be reached from any possible starting configuration (all the starting states belong to $\mathcal{S}^{\mathrm{C}}$). More precisely, out of 500 states in $\mathcal{S}$, 100 are non-reachable. On Fig. 3(left) we compare the regret of UCRL, SCAL and TUCRL when the misspecified states are present (top) and when they are removed (bottom). In the presence of misspecified states (top), the regret of UCRL clearly grows linearly with $T$ while TUCRL is able to learn as expected. On the other hand, when the MDP is communicating (bottom), TUCRL performs similarly to UCRL. The small loss in performance is most likely due to the initial exploration phase, during which the confidence intervals on the transition probabilities used by UCRL (see definition of $\mathcal{M}_k$) are tighter than those used by TUCRL (see definition of $\mathcal{M}^+_k$). TUCRL uses a “loose” bound on the $\ell_1$-norm while UCRL uses $S$ different bounds, one for every possible next state. Finally, SCAL outperforms TUCRL by exploiting prior knowledge on the bias span.

We further study the regret of TUCRL in the simple three-state domain introduced in [6] (see App. G for details) with different reward distributions (uniform instead of Bernoulli). The environment is composed of only three states ($s_0$, $s_1$ and $s_2$) and one action per state, except in $s_2$ where two actions are available. As a result, the agent only has the choice between two possible policies. Fig. 3(right) shows the cumulative regret achieved by TUCRL and SCAL (with different upper-bounds on the bias span) when the diameter is infinite, i.e., $\mathcal{S}^{\mathrm{C}} = \{s_0, s_2\}$ and $\mathcal{S}^{\mathrm{T}} = \{s_1\}$ (we omit UCRL, since it suffers linear regret). Both SCAL and TUCRL quickly achieve sub-linear regret as predicted by theory. However, SCAL and TUCRL seem to achieve different growth rates in regret: while SCAL appears to reach a logarithmic growth, the regret of TUCRL seems to grow as $\sqrt{T}$ with periodic “jumps” that are increasingly distant (in time) from each other. This can be explained by the way the algorithm works: while most of the time TUCRL is optimistic on the restricted state space $\mathcal{S}^{\mathrm{C}}$ (i.e., $\mathcal{S}^{\mathrm{C}}_k = \mathcal{S}^{\mathrm{C}}$), it periodically allows transitions to the set $\mathcal{S}^{\mathrm{T}}$ (i.e., it plans on the full state space $\mathcal{S}$), which is indeed not reachable. Enabling these transitions triggers aggressive exploration during an entire episode. The policy played is then sub-optimal, creating a “jump” in the regret. At the end of this exploratory episode, the planning space will be set again to $\mathcal{S}^{\mathrm{C}}$ and the regret will stop increasing until the condition $N^\pm_k\le\sqrt{t_k/SA}$ occurs again (the time between two consecutive exploratory episodes grows quadratically). The cumulative regret incurred during exploratory episodes can be bounded by the term plotted in green on Fig. 3(right). In Lem. 2 we proved that this term is always bounded by $O(\sqrt{S^{\mathrm{C}}AT})$. Therefore, it is not surprising to observe a $\sqrt{T}$ increase of both the green and red curves. Unfortunately, the growth rate of the regret will keep increasing as $\sqrt{T}$ and will never become logarithmic, unlike SCAL (or UCRL when the MDP is communicating). This is because the condition $N^\pm_k\le\sqrt{t_k/SA}$ will always be triggered $\Theta(\sqrt{T})$ times for any $T$. In Sec. 5 we show that this is not just a drawback specific to TUCRL, but rather an intrinsic limitation of learning in weakly-communicating MDPs.

5 Exploration-exploitation dilemma with infinite diameter

In this section we further investigate the empirical difference between SCAL and TUCRL and prove an impossibility result characterising the exploration-exploitation dilemma when the diameter is allowed to be infinite and no prior knowledge on the optimal bias span is available.


We first recall that the expected regret $\mathbb{E}[\Delta(\mathrm{UCRL}, M, T)]$ of UCRL (with input parameter $\delta = 1/3T$) after $T \ge 1$ time steps and for any finite MDP $M$ can be bounded in several ways:
$$\mathbb{E}[\Delta(\mathrm{UCRL}, M, T)] \le \begin{cases} r_{\max}\, T & \text{(by definition)}\\[2pt] C_1 \cdot r_{\max} D\sqrt{\Gamma S A T \ln(3T^2)} + \tfrac{1}{3} & \text{[2, Theorem 2]}\\[2pt] C_2 \cdot r_{\max} \dfrac{D^2 \Gamma S A}{\gamma}\ln(T) + C_3(M) & \text{[2, Theorem 4]} \end{cases} \qquad (7)$$
where $\gamma = g^*_M - \max_{s,\pi}\{g^\pi_M(s) : g^\pi_M(s) < g^*_M\}$ is the gap in gain, $C_1 := 34$ and $C_2 := 34^2$ are numerical constants independent of $M$, and $C_3(M) := O(\max_{\pi:\pi(s)=a} T_\pi)$ with $T_\pi$ a measure of the “mixing time” of policy $\pi$. The three different bounds lead to three different growth rates for the function $T \mapsto \mathbb{E}[\Delta(\mathrm{UCRL}, M, T)]$ (see Fig. 4a): 1) for $T^{\dagger}_M \ge T \ge 0$, the expected regret is linear in $T$; 2) for $T^*_M \ge T \ge T^{\dagger}_M$, the expected regret grows as $\sqrt{T}$; 3) finally, for $T \ge T^*_M$, the increase in regret is only logarithmic in $T$. These different “regimes” can be observed empirically (see [6, Fig. 5, 12]). Using (7), it is easy to show that the time it takes for UCRL to achieve sub-linear regret is at most $T^\dagger_M = \widetilde{O}(D^2\Gamma S A)$. We say that an algorithm is efficient when it achieves sublinear regret after a number of steps that is polynomial in the parameters of the MDP (i.e., UCRL is then efficient). We now show with an example that, without prior knowledge, any efficient learning algorithm must satisfy $T^*_M = +\infty$ when $M$ has infinite diameter (i.e., it cannot achieve logarithmic regret).

Example 1. We consider a family of weakly-communicating MDPs $\mathcal{M} = (M_\varepsilon)_{\varepsilon\in[0,1]}$ represented on Fig. 4(right). Every MDP instance in $\mathcal{M}$ is characterised by a specific value of $\varepsilon\in[0,1]$, which corresponds to the probability to go from $x$ to $y$. For $\varepsilon > 0$ (Fig. 4b), the optimal policy of $M_\varepsilon$ is such that $\pi^*(x) = b$ and the optimal gain is $g^*_\varepsilon = 1$, while for $\varepsilon = 0$ (Fig. 4c) the optimal policy is such that $\pi^*(x) = d$ and the optimal gain is $g^*_0 = 1/2$. We assume that the learning agent knows that the true MDP $M^*$ belongs to $\mathcal{M}$ but does not know the value $\varepsilon^*$ associated to $M^* = M_{\varepsilon^*}$. We assume that all rewards are deterministic and that the agent starts in state $x$ (coloured in grey).
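To make the difficulty concrete, the following small simulation (an illustration assuming the dynamics of Fig. 4b, where action $b$ in $x$ yields reward 0 and $y$ keeps yielding reward 1; it is not taken from the paper) shows that an agent always playing $b$ earns an average reward close to 0 until roughly $1/\varepsilon$ steps have elapsed, whereas playing $d$ guarantees $1/2$: before that horizon, $M_\varepsilon$ and $M_0$ are statistically indistinguishable.

```python
import numpy as np

def average_reward_playing_b(eps, T, rng):
    """Empirical average reward over T steps in M_eps when always playing b in x.

    Assumed dynamics (Fig. 4b): in x, action b yields reward 0 and moves to y
    with probability eps; once in y the agent keeps collecting reward 1
    (action d in x would instead self-loop with reward 1/2).
    """
    total, in_x = 0.0, True
    for _ in range(T):
        if in_x:
            if rng.random() < eps:
                in_x = False           # the transition x -> y is finally observed
        else:
            total += 1.0
    return total / T

rng = np.random.default_rng(0)
for eps in (0.0, 1e-4, 1e-2):
    for T in (10**3, 10**6):
        print(eps, T, average_reward_playing_b(eps, T, rng))
# For T much smaller than 1/eps the average stays near 0 (below 1/2, the gain of d),
# which is why no algorithm can tell M_eps and M_0 apart before ~1/eps steps.
```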

Lemma 3. Let $C_1, C_2, \alpha, \beta > 0$ be positive real numbers and $f$ a function defined for all $\varepsilon\in]0,1]$ by $f(\varepsilon) = C_1(1/\varepsilon)^\alpha$. There exists no learning algorithm $\mathbb{A}_T$ (with known horizon $T$) satisfying both

1. for all $\varepsilon\in]0,1]$, there exists $T^\dagger_\varepsilon \le f(\varepsilon)$ such that $\mathbb{E}[\Delta(\mathbb{A}_T, M_\varepsilon, x, T)] < 1/6\cdot T$ for all $T\ge T^\dagger_\varepsilon$,

2. and there exists $T^*_0 < +\infty$ such that $\mathbb{E}[\Delta(\mathbb{A}_T, M_0, x, T)] \le C_2(\ln(T))^\beta$ for all $T \ge T^*_0$.

Note that point 1 in Lem. 3 formalizes the concept of “efficient learnability” introduced by Sutton and Barto [26, Section 11.6], i.e., “learnable within a polynomial rather than exponential number of time steps”. All the MDPs in $\mathcal{M}$ share the same number of states $S = 2 \ge \Gamma$, number of actions $A = 2$, and gap in average reward $\gamma = 1/2$. As a result, any function of $S$, $\Gamma$, $A$ and $\gamma$ will be considered as constant. For $\varepsilon > 0$, the diameter coincides with the optimal bias span of the MDP and $D = \mathrm{sp}_{\mathcal{S}}\{h^*\} = 1/\varepsilon < +\infty$, while for $\varepsilon = 0$, $D = +\infty$ but $\mathrm{sp}_{\mathcal{S}}\{h^*\} = 1/2$. As shown in Eq. 7 and Thm. 1, UCRL and TUCRL satisfy property 1. of Lem. 3 with $\alpha = 2$ and $C_1 = O(S^2A)$ but do not satisfy 2. On the other hand, SCAL satisfies 2. with $\beta = 1$ and $C_2 = O(H^2SA/\gamma)$ (although this result is not available in the literature, it is straightforward to adapt the proof of UCRL [2, Theorem 4] to SCAL), but since [6, Theorem 12] holds only when $H\ge\mathrm{sp}_{\mathcal{S}}\{h^*\}$, SCAL only satisfies 1. for $\varepsilon\ge 1/H$ and $\varepsilon = 0$ (not for $\varepsilon\in]0,1/H[$). Lem. 3 proves that no algorithm can actually achieve both 1. and 2. As a result, since TUCRL satisfies 1., it cannot satisfy 2. This matches the empirical results presented in Sec. 4, where we observed that when the diameter is infinite, the growth rates of the regret of SCAL and TUCRL were respectively logarithmic and of order $\Theta(\sqrt{T})$. An algorithm that does not satisfy 1. could potentially satisfy 2. but, by definition of 1., it would suffer linear regret for a number of steps that is more than polynomial in the parameters of the MDP (more precisely, $e^{D^{1/\beta}}$). This is not a very desirable property and we claim that an efficient learning algorithm should always prefer finite time guarantees (1.) over asymptotic guarantees (2.) when they cannot be accommodated.

6 Conclusion

We introduced TUCRL, an algorithm that efficiently balances exploration and exploitation in weakly-communicating and multi-chain MDPs, when the starting state $s_1$ belongs to a communicating set (Asm. 1). We showed that TUCRL achieves a square-root regret bound and that, in the general case, it is not possible to design an algorithm with logarithmic regret and polynomial dependence on the MDP parameters. Several questions remain open: 1) relaxing Asm. 1 by considering a transient initial state (i.e., $s_1\in\mathcal{S}^{\mathrm{T}}$); 2) refining the lower bound of Jaksch et al. [2] to finally understand whether it is possible to scale with $\mathrm{sp}_{\mathcal{S}}\{h^*\}$ (at least in communicating MDPs) instead of $D$ without any prior knowledge.


Acknowledgments

This research was supported in part by French Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council and French National Research Agency (ANR) under project ExTra-Learn (n.ANR-14-CE24-0010-01).

References

[1] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press Cambridge, 1998.

[2] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

[3] Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In UAI, pages 35–42. AUAI Press, 2009.

[4] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Emma Brunskill. Regret minimization in mdps with options without prior knowledge. In NIPS, pages 3169–3179, 2017.

[5] Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in mdps. In ALT, volume 83 of Proceedings of Machine Learning Research, pages 770–805. PMLR, 2018.

[6] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. CoRR, abs/1802.04020, 2018.

[7] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.

[8] Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. In NIPS, pages 3003–3011, 2013.

[9] Yasin Abbasi-Yadkori and Csaba Szepesvári. Bayesian optimal control of smoothly parameterized systems. In UAI, pages 1–11. AUAI Press, 2015.

[10] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In ICML, volume 70 of Proceedings of Machine Learning Research, pages 2701–2710. PMLR, 2017.

[11] Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown markov decision processes: A thompson sampling approach. In NIPS, pages 1333–1342, 2017.

[12] Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In NIPS, pages 1184–1194, 2017.

[13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[14] Andrew William Moore. Efficient memory-based learning for robot control. Technical report, University of Cambridge, 1990.

[15] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322–331, Oct 1995. doi: 10.1109/SFCS.1995.492488.

[16] Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized markov decision processes. In COLT, volume 40 of JMLR Workshop and Conference Proceedings, pages 861–898. JMLR.org, 2015.

(12)

[17] Yasin Abbasi-Yadkori and Csaba Szepesvári. Bayesian optimal control of smoothly parameterized systems. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI'15, pages 2–11, Arlington, Virginia, United States, 2015. AUAI Press. ISBN 978-0-9966431-0-8.

[18] Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown markov decision processes: A thompson sampling approach. In Advances in Neural Information Processing Systems 30, pages 1333–1342. Curran Associates, Inc., 2017.

[19] Georgios Theocharous, Zheng Wen, Yasin Abbasi-Yadkori, and Nikos Vlassis. Posterior sampling for large scale reinforcement learning. CoRR, abs/1711.07979, 2017.

[20] Ian Osband and Benjamin Van Roy. Posterior sampling for reinforcement learning without episodes. CoRR, abs/1608.02731, 2016.

[21] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994. ISBN 0471619779.

[22] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Tuning bandit algorithms in stochastic environments. In Algorithmic Learning Theory, pages 150–165, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.

[23] Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample-variance penalization. In COLT, 2009.

[24] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artif. Intell. Res., 13:227–303, 2000.

[25] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. CoRR, abs/1606.01540, 2016.

[26] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. Adaptive computation and machine learning. MIT Press, second edition, 2018. ISBN 9780262039246.

[27] Odalric-Ambrym Maillard, Phuong Nguyen, Ronald Ortner, and Daniil Ryabko. Optimal regret bounds for selecting the state representation in reinforcement learning. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 543–551, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.


A Mistake in the regret bound of REGAL.D

A.1 Regularized optimistic RL (REGAL)

In weakly communicating MDPs, to avoid the over-optimism of UCRL, Bartlett and Tewari [3] proposed to penalise the optimism on $g^*$ by the optimal bias span $\mathrm{sp}_{\mathcal{S}}\{h^*\}$. Formally, at each episode $k$, their algorithm, REGAL, solves the following optimization problem:
$$\widetilde M_k = \arg\max_{M\in\mathcal{M}_k}\big\{g^*_M - C_k\cdot\mathrm{sp}_{\mathcal{S}}\{h^*_M\}\big\} \qquad (8)$$
where $C_k \ge 0$ is a regularisation coefficient. Note that such an optimization requires to first compute the optimal policy for a given MDP $M\in\mathcal{M}_k$ and then evaluate the regularized gain. Implicitly, this defines the optimistic policy $\tilde\pi_k = \arg\max_{\pi\in\Pi^{\mathrm{SD}}}\{g^\pi_{\widetilde M_k}\}$. The term $\mathrm{sp}_{\mathcal{S}}\{h^*\}$ can be interpreted as a measure of the complexity of the environment: the bigger $\mathrm{sp}_{\mathcal{S}}\{h^*\}$, the more difficult it is to achieve the stationary reward $g^*$ by following the optimal policy. In supervised learning, regularisation is often used to penalise the objective function by a measure of the complexity of the model so as to avoid overfitting. It is thus reasonable to expect that over-optimism in online RL can also be avoided through regularisation.

The regret bound of REGAL holds only when $C_k$ is set to $\Theta\big(1/\sqrt{\sum_{s,a}\nu_k(s,a)}\big)$. This means that REGAL requires the knowledge of (future) visit counts $\nu_k(s,a)$ before episode $k$ begins in order to tune the regularisation coefficient $C_k$. Unfortunately, an episode stops when the number of visits in a state-action pair $(s,a)\in\mathcal{S}\times\mathcal{A}$ has doubled, and it is not possible to predict the future sequence of states of a given policy for two reasons: 1) the true MDP $M^*$ is unknown and 2) what is observed is a random sampled trajectory (as opposed to an expected one). As a result, REGAL is not implementable. Bartlett and Tewari [3] proposed an alternative algorithm, REGAL.D, that leverages the doubling trick to guess the length of episode $k$ (i.e., $\sum_{s,a}\nu_k(s,a)$) and proved a slightly worse regret bound than for REGAL. REGAL.D divides an episode $k$ into sub-iterations where it applies the doubling trick technique. At each sub-iteration $j$, REGAL.D guesses that the length of the episode will be at most $2^j$ and it solves problem (8) with $C_{k,j}\propto 1/\sqrt{2^j}$. Then, it executes the optimistic policy $\tilde\pi_{k,j}$ on the true MDP until the UCRL stopping condition is reached or $2^j$ steps are performed. In the first case the episode $k$ ends since the guess was correct, while, in the second case, a new sub-iteration $j+1$ is started. This implies that for any $k, j$:
$$\sum_{s,a}\nu_{k,j}(s,a) \le 2^j, \qquad (9)$$
where $\nu_{k,j}(s,a)$ denotes the number of visits to $(s,a)$ during episode $k$ and sub-iteration $j$.

A.2 The doubling trick issue

The mistake in REGAL.D is located in the proof of the regret [3, Theorem 3] (see Sec. 6.3). Let $\tilde h_{k,j}$ denote the optimistic bias at episode $k$ and sub-iteration $j$ induced by the doubling trick. At a high level, the mistake comes from the attempt to upper-bound the term $x\cdot\sum_{s,a}\nu_{k,j}(s,a)$ by $x\cdot 2^j$ (for a given $x$) using the fact that $\sum_{s,a}\nu_{k,j}(s,a)\le 2^j$. Unfortunately, this is possible only under the assumption that $x\ge 0$, which does not hold in the case of REGAL.D.

Formally, while bounding $\sum_{k\in G}\Delta_k$, the authors have to deal with the term (derived by the combination of [3, Eq. 15] and [3, Lem. 11] with [3, Eq. 14]):
$$U := \sum_{k\in G}\sum_j \mathrm{sp}_{\mathcal{S}}\big\{\tilde h_{k,j}\big\}\Big(c\sqrt{\sum_{s,a}\nu_{k,j}(s,a)} - C_{k,j}\sum_{s,a}\nu_{k,j}(s,a)\Big)$$
where $c := 2S\sqrt{12\ln(2AT/\delta)} + \sqrt{2\ln(1/\delta)} \ge 0$ and recall that $\sum_{s,a}\nu_{k,j}(s,a)$ denotes the actual length of episode $k$ at sub-iteration $j$. In the REGAL.D proof the authors directly replaced the actual length of the episode with the guessed length $2^j := \ell_{k,j}$, showing that the first term can be upper-bounded by $c\cdot\sqrt{\sum_{s,a}\nu_{k,j}(s,a)} \le c\cdot\sqrt{2^j}$ (due to Eq. 9). Concerning the second term, they write $-C_{k,j}\sum_{s,a}\nu_{k,j}(s,a)\le -C_{k,j}2^j$. Since $-C_{k,j} := -c/\sqrt{2^j}$, the two terms would then cancel out. Unfortunately, this last inequality is not true and the reverse inequality holds instead (using Eq. 9): $-C_{k,j}\sum_{s,a}\nu_{k,j}(s,a)\ge -C_{k,j}2^j$.

Therefore, it is not possible to guarantee that $U\le 0$ as claimed by Bartlett and Tewari [3] (the authors probably didn't pay attention to the sign). To do this, we would need to lower-bound $\sum_{s,a}\nu_{k,j}(s,a)$. Unfortunately, the only lower bound with probability 1 available for that term is $\min_{s,a}\{N_k(s,a)\} + 2$. This is not big enough to cancel the term $c\sqrt{\sum_{s,a}\nu_{k,j}(s,a)}$, and $C_{k,j}$ needs to be increased. As a result, the term $\mathrm{sp}_{\mathcal{S}}\{h^*\}\sum_{k\in G}\sum_j C_{k,j}\sqrt{\sum_{s,a}\nu_{k,j}(s,a)}$ becomes too big and all the proof collapses.

Notice that a similar mistake is contained in the work by Maillard et al. [27], where a regularized approach is used to learn a state representation in the online setting. Similarly to [3], the authors have to bound the term $\sum_{s,a}\nu_{k,j}(s,a)(g^* - \tilde g_{k,j})$. By exploiting the fact that $g^* - \tilde g_{k,j}\le\alpha$ (we omit the full expression of $\alpha$ for the sake of clarity) [27, Eq. 17, Sec. 5.2], the authors derived the bound $\sum_{s,a}\nu_{k,j}(s,a)(g^* - \tilde g_{k,j})\le 2^j\cdot\alpha$ [27, Eq. 18]. The difference $g^* - \tilde g_{k,j}$ might be negative, in which case the result does not hold. Actually, in the case where there is no regularization ($C_{k,j} = 0$), $g^*\le\tilde g_{k,j}$, which is what is used in the regret proof of UCRL. Therefore, it is very likely that the sign of $g^* - \tilde g_{k,j}$ can sometimes be negative.

In conclusion, it seems unavoidable to use a lower-bound (and not an upper-bound) on $\sum_{s,a}\nu_{k,j}(s,a)$ to derive a correct regret bound for REGAL.D. As already mentioned, given the current stopping condition of an episode, the only reasonable lower bound is $\min_{s,a}\{N_k(s,a)\} + 2$, and it does not seem sufficient to derive a sensible regret bound. Another research direction could be to change the stopping condition. However, one of the terms in the regret bound of REGAL (and of REGAL.D) scales as $m\sqrt{T}\log_2(T)$, where $m$ is the number of episodes. The term $m$ is highly sensitive to the stopping condition and there is very little margin if we want to avoid $m\sqrt{T}\log_2(T)$ becoming the leading term in the regret bound. All the efforts we put in this direction were unsuccessful. We conjecture that regularising by the optimal bias span might not allow learning MDPs with infinite diameter.

B Unbounded optimal bias span with continuous Bayesian priors/posteriors

Recently, Ouyang et al. [18] and Theocharous et al. [19] proposed posterior sampling algorithms and proved bounds on the expected Bayesian regret. The regret bounds that they derive scale linearly with $H$, where $H$ is the highest optimal bias span of all the MDPs that can be drawn from the prior/posterior distribution. Formally, let $f(\theta)$ be the density function of the prior/posterior distribution over the family of MDPs $(M_\theta)$ parametrised by $\theta$. Then:
$$H := \sup_{\theta: f(\theta)>0}\big\{\mathrm{sp}_{\mathcal{S}}\{h^*_\theta\}\big\}.$$
In this section we present an example where $H$ is infinite and argue that it is probably the case for most priors/posteriors used in practice.

Example 2 (Unbounded optimal bias span with continuous prior/posterior). Consider the example of Fig. 5. There is only one action in every state and so one optimal policy. The (unique) action that can be played in state $x$ loops on $x$ with probability $1-\theta$ and goes to $y$ with probability $\theta$. The reward associated to this action is 0. Symmetrically, the (unique) action that can be played in state $y$ loops on $y$ with probability $1-\theta$ and goes to $x$ with probability $\theta$. The reward associated to this action is 1. This MDP is characterised by the parameter $\theta$ and we denote it by $M_\theta$. For any $\theta\in[0,1]$, we denote by $g^*_\theta$ (resp. $h^*_\theta$) the optimal gain (resp. bias) of $M_\theta$. Observe that when $\theta>0$, $M_\theta$ is ergodic

Figure 5: The two-state MDP $M_\theta$: from $x$ the action loops with probability $1-\theta$ (reward 0) and moves to $y$ with probability $\theta$; from $y$ it loops with probability $1-\theta$ (reward 1) and moves to $x$ with probability $\theta$.


and therefore the optimal gain $g^*_\theta = 1/2$ is state-independent, whereas when $\theta = 0$, $M_\theta$ is multichain and the optimal gain does depend on the initial state: $g^*_0(x) = 0 < 1 = g^*_0(y)$.

Let's assume that the prior/posterior distribution we use on $M_\theta$ is characterised by a probability density function $f$ satisfying $f(\theta) > 0$ for all $\theta > 0$ and $f(0) = 0$. Note that this assumption does not constrain the “smoothness” of $f$, e.g., $f$ can have continuous derivatives of all orders. Under this assumption, $f$ is non-zero only for ergodic MDPs. It goes without saying that for all $\theta\in[0,1]$ (0 included), $\mathrm{sp}_{\mathcal{S}}\{h^*_\theta\} < +\infty$ by definition (the optimal bias span is always finite). More precisely we have:
$$g^*_\theta = \begin{cases}[1/2,\ 1/2]^{\mathsf{T}} & \text{if } \theta > 0\\ [0,\ 1]^{\mathsf{T}} & \text{if } \theta = 0\end{cases} \qquad\text{and}\qquad \mathrm{sp}_{\mathcal{S}}\{h^*_\theta\} = \begin{cases}\dfrac{1}{2\theta} & \text{if } \theta > 0\\ 0 & \text{if } \theta = 0.\end{cases}$$
As a result, although $\mathrm{sp}_{\mathcal{S}}\{h^*_\theta\}$ is always finite, i.e., $\forall\theta\in[0,1]$, $\mathrm{sp}_{\mathcal{S}}\{h^*_\theta\} < +\infty$, it is unbounded on the set of plausible MDPs $\theta\in]0,1]$ satisfying $f(\theta)>0$, i.e.,
$$H := \sup_{\theta\in]0,1]}\big\{\mathrm{sp}_{\mathcal{S}}\{h^*_\theta\}\big\} = \lim_{\theta\to 0^+}\frac{1}{2\theta} = +\infty.$$

strictly positive. We therefore conjecture that for this specific choice of priors/posteriors, the regret proof in [18, 19] cannot be fixed without major changes and new arguments. More generally, let’s imagine that we have a prior/posterior distributionf satisfying:

• there existsθ0such thatMθ0has non-constant gain i.e.,spS

 g∗θ0

> 0,

• there exists an open neighbourhood ofθ0denotedΘ0such that∀θ ∈ Θ0,Mθhas constant

gain (e.g.,Mθis weakly-communicating) andf (θ) > 0.

In this case we will face the same problem as in Ex. 2 i.e., sup

θ: f (θ)>0{spS{h ∗

θ}} = +∞ and ∀C ∈ [0, +∞[, P(spS{h∗θ} ≥ C) > 0

When the set of plausible MDPs is finite, this problem cannot occur. But most priors/posteriors used in practice are continuous distributions. For instance, a Dirichlet distribution will most likely satisfy the above assumptions.

C Algorithmic Details

For technical reasons (see App. E), we consider a slight relaxation of the optimization problem (5) in which $\mathcal{M}_k$ is replaced by a relaxed extended MDP $\mathcal{M}^+_k\supseteq\mathcal{M}_k$ defined by using $\ell_1$-norm concentration inequalities for $p(\cdot|s,a)$. Let $B^+_{p,k}(s,a) = \{\tilde p(\cdot|s,a)\in\mathcal{C} : \|\tilde p(\cdot|s,a) - \hat p(\cdot|s,a)\|_1 \le \sum_{s'}\beta^{sas'}_{p,k}\}$ (resp. $\overline B^+_{p,k}$) be the relaxed confidence interval; then $\mathcal{M}^+_k$ (resp. $\overline{\mathcal{M}}^+_k$) is the corresponding (relaxed) set of plausible MDPs. This relaxed optimistic optimization problem is solved by running extended value iteration (EVI) on $\overline{\mathcal{M}}^+_k$ (up to accuracy $\epsilon_k = r_{\max}/\sqrt{t_k}$). Technically, we restrict EVI to work on the set of states $\mathcal{S}^{\mathrm{EVI}}_k$ that are optimistically reachable from the communicating set $\mathcal{S}^{\mathrm{C}}_k$. In practice, $\mathcal{S}^{\mathrm{EVI}}_k = \mathcal{S}^{\mathrm{C}}_k$ when $K_k = \emptyset$, since all the transitions to $\mathcal{S}^{\mathrm{T}}_k$ are forbidden; otherwise $\mathcal{S}^{\mathrm{EVI}}_k = \mathcal{S}$. Alg. 1 shows this variation of EVI, which we name Truncated EVI. Then, at each episode $k$, TUCRL runs TEVI with the following parameters: $(\tilde g_k, \tilde h_k, \tilde\pi_k) = \mathrm{TEVI}(0, \overline{\mathcal{M}}^+_k, \mathcal{S}^{\mathrm{EVI}}_k, \epsilon_k)$. Starting from an initial vector $v_0 = 0$, TEVI iteratively applies (on the subset $\mathcal{S}^{\mathrm{EVI}}_k$ of states) the optimal Bellman operator $\widetilde L_{\overline{\mathcal{M}}^+_k}$ associated to the (extended) MDP $\overline{\mathcal{M}}^+_k$, defined as
$$\forall v\in\mathbb{R}^S,\quad \widetilde L_{\overline{\mathcal{M}}^+_k} v(s) := \max_{a\in\mathcal{A}_s}\Big\{\max_{\tilde r\in B_{r,k}(s,a)}\tilde r + (\tilde p_{sa})^{\mathsf{T}} v\Big\}, \qquad (10)$$
where $\tilde p_{sa} = \arg\max_{\tilde p\in B^+_{p,k}(s,a)}\{\tilde p^{\mathsf{T}} v\}$ can be solved using [2, Fig. 2], except for $(s,a)\notin K_k$, for which we force $\tilde p_{sa}(s') := 0$ for any $s'\in\mathcal{S}^{\mathrm{T}}_k$.


Algorithm 1 TRUNCATED EXTENDED VALUE ITERATION (TEVI)
Input: value vector $v_0$, extended MDP $\mathcal{M}$, set of states $\mathcal{S}$, accuracy $\epsilon$
Output: $g_n$, $v_n$, $\pi_n$
$n := 0$
$v_1(s) := \widetilde L_{\mathcal{M}} v_0(s) := \max_{a\in\mathcal{A}_s}\big\{\max_{\tilde r\in B_r(s,a)}\tilde r + \max_{\tilde p\in B_p(s,a)}\tilde p^{\mathsf{T}} v_0\big\}$, $\forall s\in\mathcal{S}$ (see App. C)
while $\max_{s\in\mathcal{S}}\{v_{n+1}(s) - v_n(s)\} - \min_{s\in\mathcal{S}}\{v_{n+1}(s) - v_n(s)\} > \epsilon$ do
  $n := n + 1$
  $v_{n+1}(s) := \widetilde L_{\mathcal{M}} v_n(s)$, $\forall s\in\mathcal{S}$
end while
$g_n := \frac{1}{2}\big(\max_{s\in\mathcal{S}}\{v_{n+1}(s) - v_n(s)\} + \min_{s\in\mathcal{S}}\{v_{n+1}(s) - v_n(s)\}\big)$
$\pi_n(s) \in \arg\max_{a\in\mathcal{A}_s}\big\{\max_{\tilde r\in B_r(s,a)}\tilde r + \max_{\tilde p\in B_p(s,a)}\tilde p^{\mathsf{T}} v_n\big\}$, $\forall s\in\mathcal{S}$

When TEVI is stopped at the first iteration $n$ for which $\mathrm{sp}_{\mathcal{S}^{\mathrm{EVI}}_k}\{v_{n+1} - v_n\}\le\epsilon_k$ and the true MDP is sufficiently explored, then the greedy policy $\tilde\pi_k := \pi_n$ w.r.t. $v_n$ is $\epsilon_k$-optimistic, i.e., $\tilde g_k := g_n \ge g^*_{M^*} - \epsilon_k$ (see Sec. 3.1 for details). The policy $\tilde\pi_k$ is then executed until the number of visits to a state-action pair is doubled or a new state is “discovered” (i.e., $s_t\in\mathcal{S}^{\mathrm{T}}_{k_t}$). Note that the condition $\mathrm{sp}_{\mathcal{S}^{\mathrm{EVI}}_k}\{v_{n+1}-v_n\}\le\epsilon_k$ is always met after a finite number of steps, since the extended MDP $\mathcal{M}^+_k$ is communicating on the restricted state space $\mathcal{S}^{\mathrm{EVI}}_k$. Finally, notice that when the true MDP $M^*$ is communicating, there exists an episode $\bar k$ s.t. for all $k\ge\bar k$, $\mathcal{S}^{\mathrm{T}}_k = \emptyset$ and TUCRL can be reduced to UCRL by considering $\mathcal{M}_k$ in place of $\mathcal{M}^+_k$.
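A compact sketch of Alg. 1 is given below (a simplified re-implementation for illustration, not the authors' code). The inner maximization over the $\ell_1$ ball $B^+_{p,k}(s,a)$ follows the standard procedure of [2, Fig. 2] (move probability mass from low-value states to the best admissible state), and the TUCRL truncation is applied by masking the forbidden next states.

```python
import numpy as np

def inner_max_l1(p_hat, budget, v, forbidden):
    """max over p in the l1-ball of radius `budget` around p_hat of p^T v,
    with p(s') = 0 enforced on `forbidden` states (Eq. 4). Assumes p_hat is
    already zero on `forbidden` (those transitions were never observed)."""
    p = p_hat.copy()
    allowed = ~forbidden
    best = int(np.argmax(np.where(allowed, v, -np.inf)))
    p[best] = min(1.0, p[best] + budget / 2.0)    # push mass towards the best state
    for s in np.argsort(v):                       # remove mass from the worst states
        if s == best:
            continue
        excess = p.sum() - 1.0
        if excess <= 1e-12:
            break
        p[s] = max(0.0, p[s] - excess)
    return float(p @ v)

def truncated_evi(r_tilde, p_hat, l1_budget, states, forbidden_next, accuracy):
    """Truncated extended value iteration (sketch of Alg. 1).

    r_tilde       : (S, A)    optimistic rewards (upper end of B_{r,k}, capped at r_max)
    p_hat         : (S, A, S) empirical transitions
    l1_budget     : (S, A)    radius sum_{s'} beta^{sas'}_{p,k}
    states        : list of states S^EVI_k on which iteration is performed
    forbidden_next: (S, A, S) boolean mask of transitions forced to zero (Eq. 4)
    """
    S, A = r_tilde.shape
    idx = np.asarray(states)
    v = np.zeros(S)
    while True:
        v_new = v.copy()
        q = np.zeros((S, A))
        for s in states:
            for a in range(A):
                q[s, a] = r_tilde[s, a] + inner_max_l1(
                    p_hat[s, a], l1_budget[s, a], v, forbidden_next[s, a])
            v_new[s] = q[s].max()
        diff = v_new[idx] - v[idx]
        if diff.max() - diff.min() <= accuracy:   # span stopping condition
            g = 0.5 * (diff.max() + diff.min())
            return g, v_new, q.argmax(axis=1)
        v = v_new
```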

D Regret of TUCRL

We follow the proof structure of Jaksch et al. [2], Fruit et al. [6] and use similar notations. Nonetheless, several parts of the proof significantly differ from [2, 6]:

• in Sec. D.2 we prove that after a finite number of steps, TUCRL is gain-optimistic (which is not as straightforward as in the case of UCRL),

• in Sec. D.3 we show that the sums taken over the whole state space $\mathcal{S}$ that appear in the main term of the regret decomposition of UCRL can be restricted to sums over $\mathcal{S}^{\mathrm{C}}_k$, thanks to the new stopping condition used for episodes and the use of the condition $N^\pm_k(s,a) > \sqrt{t_k/SA}$ (see (18)),

• in Sec. D.4.1, we bound the number of time steps spent in “bad” state-action pairs $(s,a)$ satisfying $N^\pm_k(s,a)\le\sqrt{t_k/SA}$,

• in Sec. D.4.3, we bound the number of episodes with the new stopping condition.

D.1 Splitting into episodes

The regret of TUCRL after $T$ time steps is defined as: $\Delta(\mathrm{TUCRL}, T) := Tg^* - \sum_{t=1}^T r_t(s_t,a_t)$. Defining $\Delta_k = \sum_{s\in\mathcal{S},a\in\mathcal{A}}\nu_k(s,a)\big(g^* - r(s,a)\big)$ and using the same arguments as in [2, 6], it holds with probability $1 - \frac{\delta}{12 T^{4/5}}$ that:
$$\Delta(\mathrm{TUCRL}, T) \le \sum_{k=1}^m \Delta_k + r_{\max}\sqrt{\frac{5}{2}T\ln\Big(\frac{8T}{\delta}\Big)} \qquad (11)$$

D.2 Episodes with $M^*\in\mathcal{M}_k$

We now assume that $M^*\in\mathcal{M}_k$. As done in App. C, let's denote by $\tilde g_k$, $\tilde h_k$ and $\tilde\pi_k$ the outputs of TEVI (see Alg. 1) run with accuracy $\varepsilon_k := r_{\max}/\sqrt{t_k}$ on
$$\mathcal{S}^{\mathrm{EVI}}_k = \begin{cases}\mathcal{S}^{\mathrm{C}}_k & \text{if } K_k = \emptyset\\ \mathcal{S} & \text{otherwise,}\end{cases} \qquad\text{and the set of plausible MDPs}\quad \begin{cases}\mathcal{M}_k & \text{if } \mathcal{S}^{\mathrm{T}}_k = \emptyset\\ \overline{\mathcal{M}}^+_k & \text{otherwise.}\end{cases} \qquad (12)$$


TEVI returns an approximate solution of a slightly modified version of Problem 5: $(\widetilde{M}_k,\tilde{\pi}_k)=\arg\max_{M\in\mathcal{M}^{\mathrm{EVI}}_k,\,\pi}\{g^{\pi}_M\}$.

In order to bound $\Delta_k$ we first show that $\tilde{g}_k\gtrsim g^*$ (up to $r_{\max}/\sqrt{t_k}$-accuracy). If $\mathcal{S}^T_k=\emptyset$ then by definition $\mathcal{M}^{\mathrm{EVI}}_k=\mathcal{M}_k\ni M^*$ and so we can use the same argument as in [2, Sec. 4.3 & Thm. 7]. If $\mathcal{S}^T_k\neq\emptyset$, the true MDP $M^*$ might not be "included" in the extended MDP $\mathcal{M}^{+}_k$ considered by EVI and we cannot use the same argument. To overcome this problem we first assume that $t_k$ is big enough, which allows us to prove a useful lemma (Lem. 4):
\[
t_k \ge \frac{2401}{9}\,(D^C)^2 SA\left(S^T_k\ln\Big(\frac{2SAt_k}{\delta}\Big)\right)^2 =: C(k) \tag{13}
\]
where $S^T_k:=|\mathcal{S}^T_k|$ denotes the cardinality of $\mathcal{S}^T_k$.
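To get a feel for the magnitude of condition (13), the following short Python snippet (purely illustrative; the parameter values are made up) computes the smallest $t$ with $t\ge C(k)$ by iterating the condition, which is possible because $C(k)$ depends on $t_k$ only logarithmically:

import math

def smallest_t_satisfying_13(D_C, S, A, S_T, delta):
    """Smallest t with t >= (2401/9) * D_C^2 * S * A * (S_T * ln(2*S*A*t/delta))^2,
    found by iterating t <- C(t) (the iterates increase towards the fixed point)."""
    t = 1.0
    for _ in range(200):
        c = (2401.0 / 9.0) * D_C**2 * S * A * (S_T * math.log(2 * S * A * t / delta))**2
        if t >= c:
            break
        t = c
    return math.ceil(t)

# hypothetical values: D^C = 10, S = 20, A = 4, |S^T_k| = 5, delta = 0.05
print(smallest_t_satisfying_13(10, 20, 4, 5, 0.05))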

Lemma 4. Let episode $k$ be such that $M^*\in\mathcal{M}_k$, $\mathcal{S}^T_k\neq\emptyset$ and (13) holds. Then,
\[
\Big(\forall(s,a)\in\mathcal{S}^C_k\times\mathcal{A},\ N^{\pm}_k(s,a)>\sqrt{\frac{t_k}{SA}}\Big) \implies \mathcal{S}^T_k=\mathcal{S}^T.
\]

Proof. Assume that episode $k$ is such that (13) holds and that for any state-action pair $(s,a)\in\mathcal{S}^C_k\times\mathcal{A}$,
\[
N^{\pm}_k(s,a) > \sqrt{\frac{t_k}{SA}} \ge \frac{49}{3}\,D^C S^T_k\ln\Big(\frac{2SAt_k}{\delta}\Big).
\]
Since $\mathcal{S}^T_k\neq\emptyset$ and $M^*\in\mathcal{M}_k$, for any $(s,a,s')\in\mathcal{S}^C_k\times\mathcal{A}\times\mathcal{S}^T_k$,
\[
\underbrace{p(s'|s,a)}_{\text{transition probability in }M^*} \le \underbrace{\hat{p}_k(s'|s,a)}_{=0} + \beta^{sas'}_{p,k}
= \underbrace{\sqrt{\frac{14\,\hat{\sigma}^2_{p,k}(s'|s,a)\ln(2SAt_k/\delta)}{N^{+}_k(s,a)}}}_{=0} + \frac{49\ln(2SAt_k/\delta)}{3N^{\pm}_k(s,a)}
\le \frac{49\ln(2SAt_k/\delta)}{3N^{\pm}_k(s,a)} < \frac{1}{D^C S^T_k},
\]
where we have exploited the fact that $\hat{p}_k(s'|s,a)=0$ and $\hat{\sigma}^2_{p,k}(s'|s,a)=0$ for any state $s'\in\mathcal{S}^T_k$ (remember that $N_k(s,a,s')=0$).

We denote by $\tau_{M^*}(s\to s')$ the shortest path between any pair of states $(s,s')\in\mathcal{S}\times\mathcal{S}$ in the true MDP $M^*$. Fix an arbitrary target state $\bar{s}\in\mathcal{S}^T_k$ and denote by $\tau(s):=\tau_{M^*}(s\to\bar{s})$ and $\tau_{\min}:=\min_{s\in\mathcal{S}^C_k}\{\tau(s)\}$. We have $\tau(\bar{s})=0$ and, for all $s\in\mathcal{S}^C_k$,
\[
\tau(s) = 1 + \min_{a\in\mathcal{A}_s}\Big\{\sum_{s'\in\mathcal{S}}p(s'|s,a)\underbrace{\tau(s')}_{\ge 0}\Big\}
\ge 1 + \min_{a\in\mathcal{A}_s}\Big\{\sum_{s'\in\mathcal{S}^C_k}p(s'|s,a)\underbrace{\tau(s')}_{\ge\tau_{\min}}\Big\}
\ge 1 + \tau_{\min}\cdot\min_{a\in\mathcal{A}}\Big\{\sum_{s'\in\mathcal{S}^C_k}p(s'|s,a)\Big\}
= 1 + \tau_{\min}\cdot\min_{a\in\mathcal{A}}\Big\{1-\sum_{s'\in\mathcal{S}^T_k}\underbrace{p(s'|s,a)}_{<\frac{1}{D^C S^T_k}}\Big\}
> 1 + \tau_{\min}\Big(1-\sum_{s'\in\mathcal{S}^T_k}\frac{1}{D^C S^T_k}\Big)
= 1 + \tau_{\min}\Big(1-\frac{1}{D^C}\Big).
\]
Applying the above inequality to $\tilde{s}\in\mathcal{S}^C_k$ achieving $\tau(\tilde{s})=\tau_{\min}$ yields $\tau_{\min}>D^C$. This implies that the shortest path in $M^*$ between any state $s\in\mathcal{S}^C_k\subseteq\mathcal{S}^C$ and any state $\bar{s}\in\mathcal{S}^T_k$ is strictly bigger than $D^C$, but by definition $D^C$ is the longest shortest path between any pair of states in $\mathcal{S}^C$. Therefore, $\bar{s}\in\mathcal{S}^T$. Since $\bar{s}\in\mathcal{S}^T_k$ was chosen arbitrarily, $\mathcal{S}^T_k\subseteq\mathcal{S}^T$, and since $\mathcal{S}^T\subseteq\mathcal{S}^T_k$ by definition, we conclude that $\mathcal{S}^T_k=\mathcal{S}^T$.
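In detail, the step yielding $\tau_{\min}>D^C$ in the proof above follows from the displayed inequality by simple algebra:
\[
\tau_{\min}=\tau(\tilde{s}) > 1 + \tau_{\min}\Big(1-\frac{1}{D^C}\Big)
\;\Longleftrightarrow\;
\frac{\tau_{\min}}{D^C} > 1
\;\Longleftrightarrow\;
\tau_{\min} > D^C.
\]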


As a consequence of Lem. 4, under the assumptions that $M^*\in\mathcal{M}_k$, $\mathcal{S}^T_k\neq\emptyset$ and (13) holds, there are only two possible cases:
1. either $\mathcal{S}^T_k=\mathcal{S}^T$,
2. or $\exists(s,a)\in\mathcal{S}^C_k\times\mathcal{A} : N^{\pm}_k(s,a)\le\sqrt{\frac{t_k}{SA}}$.

Case 1: $\mathcal{S}^T_k=\mathcal{S}^T$ implies that $M^*\in\mathcal{M}^{+}_k$. This is because for any $(s,a,s')\in\mathcal{S}^C_k\times\mathcal{A}\times\mathcal{S}^T_k$ we have $p(s'|s,a)=\tilde{p}_k(s'|s,a)=0$, and for any $(s,a,s')\notin\mathcal{S}^C_k\times\mathcal{A}\times\mathcal{S}^T_k$ we have $|p(s'|s,a)-\hat{p}_k(s'|s,a)|\le\beta^{sas'}_{p,k}$, and so $p(\cdot|s,a)\in B^{+}_{p,k}(s,a)$. Since $M^*\in\mathcal{M}^{+}_k$, we can use the same argument as Jaksch et al. [2, Sec. 4.3 & Theorem 7] to prove $\tilde{g}_k\ge g^*-\frac{r_{\max}}{\sqrt{t_k}}$.

Case 2: For any $(s,a)\in\mathcal{S}^T_k\times\mathcal{A}$, $B^{+}_{p,k}(s,a)=\mathcal{C}$ is the $(S-1)$-simplex denoting the maximal uncertainty about the transition probabilities, and $B_{r,k}(s,a)=[0,r_{\max}]$. We will now construct an MDP $M'\in\mathcal{M}^{+}_k$ with optimal gain $r_{\max}$. For all $(s,a)\in\mathcal{S}^T_k\times\mathcal{A}$, we set the transitions to $p_{M'}(s|s,a)=1$ and the rewards to $r_{M'}(s,a)=r_{\max}$. Let $(\bar{s},\bar{a})\in\mathcal{S}^C_k\times\mathcal{A}$ be such that $N^{\pm}_k(\bar{s},\bar{a})\le\sqrt{\frac{t_k}{SA}}$ (which exists by assumption). We set $p_{M'}(s'|\bar{s},\bar{a})>0$ for all $s'\in\mathcal{S}^T_k$. This is possible because, by definition of $\mathcal{M}^{+}_k$, the support of $\tilde{p}(\cdot|\bar{s},\bar{a})$ is not restricted to $\mathcal{S}^C_k$. Finally, for all state-action pairs $(s,a)\in\mathcal{S}^C_k\times\mathcal{A}$, we set $p_{M'}(\bar{s}|s,a)>0$. This is possible because, by definition of $\mathcal{M}^{+}_k$, the support of $\tilde{p}(\cdot|s,a)$ is only restricted to $\mathcal{S}^C_k$ and $\bar{s}\in\mathcal{S}^C_k$. In $M'$, for all policies, all states in $\mathcal{S}^T_k$ are absorbing states (i.e., they loop on themselves with probability 1) with maximal reward $r_{\max}$ and all other states $s\in\mathcal{S}^C_k$ are transient. The optimal gain of $M'$ is thus $r_{\max}$ and, since $M'\in\mathcal{M}^{+}_k$, we conclude that $\tilde{g}_k\ge r_{\max}-\frac{r_{\max}}{\sqrt{t_k}}\ge g^*-\frac{r_{\max}}{\sqrt{t_k}}$.
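As a numerical sanity check of the Case 2 construction (a toy instance with made-up sizes, not part of the proof), the following Python snippet builds an MDP of the same form as $M'$ — states playing the role of $\mathcal{S}^T_k$ absorbing with reward $r_{\max}$, states playing the role of $\mathcal{S}^C_k$ reaching them with positive probability — and estimates its gain by relative value iteration, which indeed equals $r_{\max}$:

import numpy as np

r_max = 1.0
# 3 states, single action: states 0,1 mimic S^C_k, state 2 mimics S^T_k.
P = np.array([[0.5, 0.3, 0.2],   # positive probability of reaching state 2
              [0.4, 0.4, 0.2],
              [0.0, 0.0, 1.0]])  # state 2 is absorbing
r = np.array([0.0, 0.0, r_max])  # reward r_max only in the absorbing state

v = np.zeros(3)
for _ in range(1000):             # relative value iteration
    v_new = r + P.dot(v)
    gain = v_new[2] - v[2]        # gain read off at the reference state
    v = v_new - v_new[2]          # keep the iterates bounded
print(gain)                       # prints 1.0 = r_max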

In conclusion, TEVI always returns an optimistic policy when the assumptions of Lem. 4 hold. The regret $\Delta_k$ accumulated in episode $k$ can thus be upper-bounded as:
\[
\Delta_k = \sum_{s,a}\nu_k(s,a)\big(g^*-r(s,a)\big)
= \sum_{s,a}\nu_k(s,a)\Big(\underbrace{g^*}_{\le\tilde{g}_k+\frac{r_{\max}}{\sqrt{t_k}}}-\tilde{r}_k(s,a)\Big) + \sum_{s,a}\nu_k(s,a)\big(\tilde{r}_k(s,a)-r(s,a)\big)
\le \underbrace{\sum_{s,a}\nu_k(s,a)\big(\tilde{g}_k-\tilde{r}_k(s,a)\big)}_{:=\widetilde{\Delta}_k} + \sum_{s,a}\nu_k(s,a)\big(\tilde{r}_k(s,a)-r(s,a)\big) + r_{\max}\sum_{s,a}\frac{\nu_k(s,a)}{\sqrt{t_k}}.
\]
To bound the difference between the optimistic reward $\tilde{r}_k$ and the true reward $r$ we introduce the estimated reward $\hat{r}_k$: for all $(s,a)\in\mathcal{S}\times\mathcal{A}$,
\[
\tilde{r}_k(s,a)-r(s,a) = \underbrace{\tilde{r}_k(s,a)-\hat{r}_k(s,a)}_{\le\beta^{sa}_{r,k}\text{ by construction}} + \underbrace{\hat{r}_k(s,a)-r(s,a)}_{\le\beta^{sa}_{r,k}\text{ since }M^*\in\mathcal{M}_k} \le 2\beta^{sa}_{r,k},
\]
and so in conclusion:
\[
\Delta_k \le \widetilde{\Delta}_k + 2\sum_{s,a}\nu_k(s,a)\beta^{sa}_{r,k} + \underbrace{r_{\max}\sum_{s,a}\frac{\nu_k(s,a)}{\sqrt{t_k}}}_{:=U^1_k} \tag{14}
\]

D.3 Bounding $\widetilde{\Delta}_k$

The goal of this section is to bound the term $\widetilde{\Delta}_k := \sum_{s,a}\nu_k(s,a)(\tilde{g}_k-\tilde{r}_k(s,a))$. We start by discarding the state-action pairs $(s,a)\in K_k$ that have been poorly visited so far:
\[
\widetilde{\Delta}_k = \sum_{s,a}\nu_k(s,a)(\tilde{g}_k-\tilde{r}_k(s,a))\underbrace{\mathbb{1}\{(s,a)\notin K_k\}}_{:=\mathbb{1}_k(s,a)} + \sum_{s,a}\nu_k(s,a)\underbrace{(\tilde{g}_k-\tilde{r}_k(s,a))}_{\le r_{\max}}\mathbb{1}\{(s,a)\in K_k\}
\le \underbrace{\sum_{s,a}\nu_k(s,a)(\tilde{g}_k-\tilde{r}_k(s,a))\mathbb{1}_k(s,a)}_{:=\widetilde{\Delta}'_k} + r_{\max}\sum_{s,a}\nu_k(s,a)\mathbb{1}\{(s,a)\in K_k\} \tag{15}
\]


We will now bound the term $\widetilde{\Delta}'_k = \sum_{s}\nu_k(s,\tilde{\pi}_k(s))(\tilde{g}_k-\tilde{r}_k(s,\tilde{\pi}_k(s)))\mathbb{1}_k(s,\tilde{\pi}_k(s))$. We recall that the policy $\tilde{\pi}_k$ is obtained by executing $\mathrm{TEVI}(0,\mathcal{M}^{\mathrm{EVI}}_k,\mathcal{S}^{\mathrm{EVI}}_k,\varepsilon_k)$ (see Alg. 1), where $\varepsilon_k:=r_{\max}/\sqrt{t_k}$ and $\mathcal{S}^{\mathrm{EVI}}_k$ and $\mathcal{M}^{\mathrm{EVI}}_k$ are defined in (12). In all possible cases for both $\mathcal{S}^{\mathrm{EVI}}_k$ and $\mathcal{M}^{\mathrm{EVI}}_k$, this amounts to applying value iteration to a communicating MDP with finite state space $\mathcal{S}^{\mathrm{EVI}}_k$ and compact action space. By [21, Thm. 8.5.6], since the convergence criterion of value iteration is met, we have:
\[
\forall s\in\mathcal{S}^{\mathrm{EVI}}_k,\qquad \tilde{h}_k(s)+\tilde{g}_k-\tilde{r}_k(s,\tilde{\pi}_k(s))-\sum_{s'\in\mathcal{S}}\tilde{p}_k(s'|s,\tilde{\pi}_k(s))\tilde{h}_k(s') \le \frac{r_{\max}}{\sqrt{t_k}} \tag{16}
\]
For all $s\notin\mathcal{S}^C_k$, $\nu_k(s,\tilde{\pi}_k(s))=0$ due to the stopping condition of episode $k$. Therefore we can plug (16) into $\widetilde{\Delta}'_k$ and derive an upper bound restricted to the set $\mathcal{S}^C_k\subseteq\mathcal{S}^{\mathrm{EVI}}_k$. Before doing that, we further decompose $\widetilde{\Delta}'_k$ as:
\[
\widetilde{\Delta}'_k \le \sum_{s}\nu_k(s,\tilde{\pi}_k(s))\Big(\sum_{s'\in\mathcal{S}}\tilde{p}_k(s'|s,\tilde{\pi}_k(s))\tilde{h}_k(s')-\tilde{h}_k(s)+\frac{r_{\max}}{\sqrt{t_k}}\Big)\mathbb{1}_k(s,\tilde{\pi}_k(s))
= \nu'_k\big(\widetilde{P}_k-I\big)\tilde{h}_k + r_{\max}\sum_{s,a}\frac{\nu_k(s,a)}{\sqrt{t_k}}\mathbb{1}_k(s,a) \tag{17}
\]

where $\nu'_k=(\nu_k(s,\tilde{\pi}_k(s))\mathbb{1}_k(s,\tilde{\pi}_k(s)))_{s\in\mathcal{S}}$ is the vector of visit counts for each state and the corresponding action chosen by $\tilde{\pi}_k$, multiplied by the indicator function $\mathbb{1}_k$, $\widetilde{P}_k=(\tilde{p}_k(s'|s,\tilde{\pi}_k(s)))_{s,s'\in\mathcal{S}}$ is the transition matrix associated to $\tilde{\pi}_k$ in $\mathcal{M}^{+}_k$, and $I$ is the identity matrix. We now focus on the term $\nu'_k(\widetilde{P}_k-I)\tilde{h}_k$. Since the rows of $\widetilde{P}_k$ sum to 1, for all $\lambda\in\mathbb{R}$, $(\widetilde{P}_k-I)\tilde{h}_k=(\widetilde{P}_k-I)(\tilde{h}_k+\lambda e)$, where $e=(1,\dots,1)^{\mathsf T}$ is the vector of all ones. Let us take $\lambda:=-\min_{s\in\mathcal{S}^C_k}\{\tilde{h}_k(s)\}$ and define $w_k:=\tilde{h}_k+\lambda e$, so that for all $s\in\mathcal{S}^C_k$, $w_k(s)\ge 0$ and $\min_{s\in\mathcal{S}^C_k}\{w_k(s)\}=0$. We have:
\[
\nu'_k\big(\widetilde{P}_k-I\big)\tilde{h}_k = \sum_{s\in\mathcal{S}}\nu_k(s,\tilde{\pi}_k(s))\mathbb{1}_k(s,\tilde{\pi}_k(s))\Big(\sum_{s'\in\mathcal{S}}\tilde{p}_k(s'|s,\tilde{\pi}_k(s))w_k(s')-w_k(s)\Big)
\]
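The shift-invariance used above, $(\widetilde{P}_k-I)\tilde{h}_k=(\widetilde{P}_k-I)(\tilde{h}_k+\lambda e)$ for any $\lambda$, is a direct consequence of the rows of $\widetilde{P}_k$ summing to one, i.e., $(\widetilde{P}_k-I)e=0$; a quick numerical check in Python with arbitrary made-up numbers:

import numpy as np

P = np.array([[0.2, 0.8], [0.6, 0.4]])   # any row-stochastic matrix
h = np.array([3.0, -1.0])                # any bias-like vector
lam, e = 5.7, np.ones(2)                 # any shift
print(np.allclose((P - np.eye(2)).dot(h), (P - np.eye(2)).dot(h + lam * e)))  # True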

We denote by $k_t:=\sup\{k\ge 1 : t_k\le t\}$ the current episode at time $t$. Whenever $s_t\in\mathcal{S}^T_{k_t}$, episode $k_t$ stops before executing any action (see the stopping condition of TUCRL in Fig. 2), implying that $\forall s\in\mathcal{S}^T_k$, $\nu_k(s,\tilde{\pi}_k(s))=0$. Therefore we have:
\[
\nu'_k\big(\widetilde{P}_k-I\big)\tilde{h}_k = \sum_{s\in\mathcal{S}^C_k}\nu_k(s,\tilde{\pi}_k(s))\mathbb{1}_k(s,\tilde{\pi}_k(s))\Big(\sum_{s'\in\mathcal{S}}\tilde{p}_k(s'|s,\tilde{\pi}_k(s))w_k(s')-w_k(s)\Big)
\]
For all states $s$ such that $\mathbb{1}_k(s,\tilde{\pi}_k(s))=1$, i.e., satisfying $N^{\pm}_k(s,\tilde{\pi}_k(s))>\sqrt{t_k/SA}$, we force TEVI to set $\tilde{p}_k(s'|s,\tilde{\pi}_k(s))=0$, $\forall s'\in\mathcal{S}^T_k$, by construction of $\mathcal{M}^{+}_k$, so that:
\[
\nu'_k\big(\widetilde{P}_k-I\big)\tilde{h}_k = \sum_{s\in\mathcal{S}^C_k}\nu_k(s,\tilde{\pi}_k(s))\mathbb{1}_k(s,\tilde{\pi}_k(s))\Big(\sum_{s'\in\mathcal{S}^C_k}\tilde{p}_k(s'|s,\tilde{\pi}_k(s))w_k(s')-w_k(s)\Big) \tag{18}
\]

We can now introduce $p$:
\[
\sum_{s'\in\mathcal{S}^C_k}\tilde{p}_k(s'|s,\tilde{\pi}_k(s))w_k(s')-w_k(s)
= \sum_{s'\in\mathcal{S}^C_k}\big(\tilde{p}_k(s'|s,\tilde{\pi}_k(s))-p(s'|s,\tilde{\pi}_k(s))\big)w_k(s') \tag{19}
\]
\[
\qquad + \Big(\sum_{s'\in\mathcal{S}^C_k}p(s'|s,\tilde{\pi}_k(s))w_k(s')-w_k(s)\Big) \tag{20}
\]
By definition $\mathcal{S}^C_k\subseteq\mathcal{S}^C$ and, using the $(1,\infty)$-Hölder inequality, the term (19) can be bounded as $(19)\le\|\tilde{p}_k(\cdot|s,\tilde{\pi}_k(s))-p(\cdot|s,\tilde{\pi}_k(s))\|_{1,\mathcal{S}^C}\cdot\max_{s'\in\mathcal{S}^C_k}\{w_k(s')\}$, where for any vector $v\in\mathbb{R}^S$, $\|v\|_{1,\mathcal{S}^C}:=\sum_{s\in\mathcal{S}^C}|v(s)|$. Define $\bar{s}\in\arg\max_{s\in\mathcal{S}^C_k}\{w_k(s)\}$ and $\tilde{s}\in\arg\min_{s\in\mathcal{S}^C_k}\{w_k(s)\}$. By definition, $\bar{s},\tilde{s}\in\mathcal{S}^C_k$ and $w_k(\tilde{s})=\min_{s\in\mathcal{S}^C_k}\{w_k(s)\}=0$. By Lem. 7, we know that for all $s,s'\in\mathcal{S}^C_k$,

Figure captions:
Fig. (a) shows the initial state, one reachable state (middle) and an unreachable one (right).
Figure 2: TUCRL algorithm.
Figure 3: Cumulative regret in the taxi with misspecified states (left-top) and in the communicating taxi (left-bottom), and in the weakly communicating three-states domain with $D=+\infty$ (right).
Figure 4: (4a) Expected regret of UCRL (with known horizon $T$ given as input) as a function of $T$.
