• Aucun résultat trouvé

Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning

N/A
N/A
Protected

Academic year: 2021

Partager "Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning"

Copied!
39
0
0

Texte intégral

(1)

HAL Id: hal-01941206

https://hal.inria.fr/hal-01941206

Submitted on 30 Nov 2018

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

Efficient Bias-Span-Constrained

Exploration-Exploitation in Reinforcement Learning

Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, Ronald Ortner

To cite this version:

Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, Ronald Ortner. Efficient Bias-Span-Constrained

Exploration-Exploitation in Reinforcement Learning. ICML 2018 - The 35th International Conference

on Machine Learning, Jul 2018, Stockholm, Sweden. pp.1578-1586. �hal-01941206�

(2)

Efficient Bias-Span-Constrained Exploration-Exploitation

in Reinforcement Learning

Ronan Fruit* 1 Matteo Pirotta* 1 Alessandro Lazaric2 Ronald Ortner3

Abstract

We introduceSCAL, an algorithm designed to per-form efficient exploration-exploitation in any un-known weakly-communicatingMarkov decision process (MDP) for which an upper boundc on the span of the optimal bias function is known. For an MDP with S states, A actions and Γ ≤ S possible next states, we prove a regret bound of eO(c√ΓSAT ), which significantly improves over existing algorithms (e.g.,UCRLandPSRL), whose regret scales linearly with the MDP diam-eterD. In fact, the optimal bias span is finite and often much smaller thanD (e.g., D =∞ in non-communicating MDPs). A similar result was originally derived byBartlett and Tewari(2009) forREGAL.C, for which no tractable algorithm is available. In this paper, we relax the optimiza-tion problem at the core ofREGAL.C, we carefully analyze its properties, and we provide the first computationally efficient algorithmto solve it. Fi-nally, we report numerical simulations supporting our theoretical findings and showing howSCAL significantly outperformsUCRLin MDPs with largediameter and small span.

1. Introduction

While learning in an unknown environment, a reinforcement learning (RL) agent must trade off the exploration needed to collect information about the dynamics and reward, and the exploitation of the experience gathered so far to gain as much reward as possible. In this paper, we focus on the regret framework (Jaksch et al.,2010), which evaluates the exploration-exploitation performance by comparing the rewards accumulated by the agent and an optimal policy. A common approach to the exploration-exploitation dilemma is the optimism in face of uncertainty (OFU) principle: the agent maintains optimistic estimates of the value function

*

Equal contribution 1SequeL Team, INRIA Lille, France

2

Facebook AI Research, Paris, France3Montanuniversit¨at Leoben, Austria. Correspondence to: Ronan Fruit<[email protected]>. Proceedings of the35thInternational Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

and, at each step, it executes the policy with highest opti-mistic value (e.g.,Brafman and Tennenholtz,2002;Jaksch et al.,2010; Bartlett and Tewari, 2009). An alternative approach is posterior sampling (Thompson,1933), which maintains a Bayesian distribution over MDPs (i.e., dynam-ics and expected reward) and, at each step, samples an MDP and executes the corresponding optimal policy (e.g.,Osband et al.,2013;Abbasi-Yadkori and Szepesv´ari,2015;Osband and Roy,2017;Ouyang et al.,2017;Agrawal and Jia,2017). Given a finite MDP withS states, A actions, and diameter D (i.e., the time needed to connect any two states),Jaksch et al.

(2010) proved that no algorithm can achieve regret smaller thanΩ(√DSAT ). While recent work successfully closed the gap between upper and lower bounds w.r.t. the depen-dency on the number of states (e.g.,Agrawal and Jia,2017;

Azar et al.,2017), relatively little attention has been devoted to the dependency onD. While the diameter quantifies the number of steps needed to “recover” from a bad state in the worst case, the actual regret incurred while “recovering” is related to the difference in potential reward between “bad” and “good” states, which is accurately measured by the span (i.e., the range)sp{h

} of the optimal bias function h∗.

While the diameter is an upper bound on the bias span, it could be arbitrarily larger (e.g., weakly-communicating MDPs may have finite span and infinite diameter) thus sug-gesting that algorithms whose regret scales with the span may perform significantly better.1 Building on the idea that

the OFU principle should be mitigated by the bias span of the optimistic solution,Bartlett and Tewari(2009) proposed three different algorithms (referred to asREGAL) achieving

regret scaling withsp{h∗} instead of D. The first algorithm

defines a span regularized problem, where the regulariza-tion constant needs to be carefully tuned depending on the state-action pairs visited in the future, which makes it unfea-sible in practice. Alternatively, they propose a constrained variant, calledREGAL.C, where the regularized problem is replaced by a constraint on the span. Assuming that an upper-boundc on the bias span of the optimal policy is

1

The proof of the lower bound relies on the construction of an MDP whose diameter actually coincides with the bias span (up to a multiplicative numerical constant), thus leaving the open question whether the “actual” lower bound depends onD or the bias span. See (Osband and Roy,2016) for a more thorough discussion.

(3)

known (i.e.,sp{h

} ≤ c),REGAL.Cachieves regret upper-bounded by eO(min{D, c}S√AT ). Unfortunately, they do not propose any computationally tractable algorithm solving the constrained optimization problem, which may even be ill-posed in some cases. Finally,REGAL.Davoids the need of knowing the future visits by using a doubling trick, but still requires solving a regularized problem, for which no computationally tractable algorithm is known.

In this paper, we build on REGAL.C and propose a con-strained optimization problem for which we derive a com-putationally efficient algorithm, calledSCOPT. We identify conditions under which SCOPT converges to the optimal

solution and propose a suitable stopping criterion to achieve anε-optimal policy. Finally, we show that using a slightly modified optimistic argument, the convergence conditions are always satisfied and the learning algorithm obtained by integratingSCOPTinto aUCRL-like scheme (resulting into SCAL) achieves regret scaling as eO(min{D, c}√ΓSAT ) when an upper-boundc on the optimal bias span is available, thus providing the first computationally tractable algorithm that can solve weakly-communicating MDPs.

2. Preliminaries

We consider a finite weakly-communicating Markov deci-sion process (Puterman,1994, Sec. 8.3)M =hS, A, r, pi with a set of statesS and a set of actions A =S

s∈SAs.

Each state-action pair(s, a) ∈ S × As is characterized

by a reward distribution with mean r(s, a) and support in[0, rmax] as well as a transition probability distribution

p(·|s, a) over next states. We denote by S = |S| and A = maxs∈S|As| the number of states and actions, and

byΓ the maximum support of all transition probabilities. A Markov randomized decision ruled :S → P(A) maps states to distributions over actions. The corresponding set is denoted byDMR, while the subset of Markov deterministic

decision rules isDMD. A stationary policyπ = (d, d, . . .)

=: d∞ repeatedly applies the same decision ruled over

time. The set of stationary policies defined by Markov ran-domized (resp. deterministic) decision rules is denoted by ΠSR(M ) (resp. ΠSD(M )). The long-term average reward

(or gain) of a policyπ∈ ΠSR(M ) starting from s

∈ S is gπ M(s) := lim T →+∞EQ " 1 T T X t=1 r(st, at) # ,

where Q := P (·|at∼ π(st); s0= s; M ). Any stationary

policyπ∈ ΠSRhas an associated bias function defined as

hπM(s) := C- lim T →+∞EQ " T X t=1 r(st, at)− gπM(st)  # , that measures the expected total difference between the reward and the stationary reward in Cesaro-limit2

(de-2

For policies with an aperiodic chain, the standard limit exists.

notedC- lim). Accordingly, the difference of bias values hπ

M(s)− hπM(s0) quantifies the (dis-)advantage of starting

in states rather than s0. In the following, we drop the

depen-dency onM whenever clear from the context and denote bysp{hπ

} := maxshπ(s)− minshπ(s) the span of the

bias function. In weakly communicating MDPs, any op-timal policyπ∗

∈ arg maxπgπ(s) has constant gain, i.e.,

gπ∗

(s) = g∗for alls∈ S. Let P

d∈ RS×Sandrd∈ RSbe

the transition matrix and reward vector associated with de-cision ruled∈ DMR. We denote byL

dandL the Bellman

operator associated withd and optimal Bellman operator ∀v ∈ RS, Ldv := rd+ Pdv; Lv := max

d∈DMRrd+ Pdv .

For any policyπ = d∞

∈ ΠSR, the gaingπ and biashπ

satisfy the following system of evaluation equations gπ= Pdgπ; hπ= Ldhπ− gπ. (1)

Moreover, there exists a policyπ∗

∈ arg maxπgπ(s) for

which(g∗, h) = (gπ∗

, hπ∗

) satisfy the optimality equation h∗= Lh∗− g∗e, where e = (1, . . . , 1)|. (2)

Finally, we denote byD := max(s,s0)∈S×S,s6=s0{τM(s→

s0)

} the diameter of M, where τM(s→ s0) is the minimal

expected number of steps needed to reachs0froms in M .

Learning problem. LetM∗be the true unknown MDP. We

consider the learning problem whereS, A and rmax are

known, while rewardsr and transition probabilities p are unknownand need to be estimated on-line. We evaluate the performance of a learning algorithm A afterT time steps by its cumulative regret∆(A, T ) = T g∗−PT

t=1rt(st, at).

3. Optimistic Exploration-Exploitation

Since our proposed algorithmSCAL(Sec.6) is a tractable variant ofREGAL.Cand thus a modification ofUCRL, we first recall their common structure summarized in Fig.1.

3.1. Upper-Confidence Reinforcement Learning

UCRLproceeds through episodesk = 1, 2 . . . At the begin-ning of each episodek,UCRLcomputes a set of plausible MDPs defined asMk = M = hS, A,er,pei : er(s, a) ∈ Bk r(s, a), p(se 0|s, a) ∈ Bk p(s, a, s0), P s0p(se 0|s, a) = 1 , whereBk

r andBpkare high-probability confidence intervals

on the rewards and transition probabilities of the true MDP M∗, which guarantees thatM

∈ Mkw.h.p. We use

con-fidence intervals constructed using empirical Bernstein’s inequality (Audibert et al.,2007;Maurer and Pontil,2009)

βr,ksa := s 14bσ 2 r,k(s, a)bk,δ max{1, Nk(s, a)} + 49 3rmaxbk,δ max{1, Nk(s, a)− 1} , βp,ksas0 := s 14σb2 p,k(s0|s, a)bk,δ max{1, Nk(s, a)} + 49 3bk,δ max{1, Nk(s, a)− 1},

(4)

where Nk(s, a) is the number of visits in (s, a)

be-fore episode k, bσ2

r,k(s, a) and bσ

2

p,k(s0|s, a) are the

em-pirical variances of r(s, a) and p(s0

|s, a) and bk,δ =

ln(2SAtk/δ). Given the empirical averagesbrk(s, a) and b

pk(s0|s, a) of rewards and transitions, we define Mk by

Bk r(s, a) := [brk(s, a)−β sa r,k,brk(s, a)+β sa r,k]∩[0, rmax] and Bk p(s, a, s0) := [bpk(s 0|s, a) − βsas0 p,k ,pbk(s 0|s, a) + βsas0 p,k ]∩ [0, 1].

OnceMkhas been computed,UCRLfinds an approximate

solution( fM∗ k,eπ

k) to the optimization problem

( fMk∗,πe ∗ k)∈ arg max M ∈Mk,π∈ΠSD(M ) gMπ. (3) SinceM∗ ∈ Mk w.h.p., it holds thatg∗ f Mk ≥ g ∗ M∗. As

noticed byJaksch et al.(2010), problem (3) is equivalent to findingµe∈ arg max

µ∈ΠSD( fM k)g µ f Mk whereMfkis

the extended MDP (sometimes called bounded-parameter MDP) implicitly defined byMk. More precisely, in fMk

the (finite) action spaceA is “extended” to a compact ac-tion space eAk by considering every possible value of the

confidence intervalsBk

r(s, a) and Bkp(s, a, s0) as fictitious

actions. The equivalence between the two problems comes from the fact that for eachµe ∈ ΠSD( f

Mk) there exists a

pair ( fM ,eπ) such that the policiesπ ande µ induce the samee Markov reward process on respectively fM and fMk, and

conversely. Consequently, (3) can be solved by running so-called extended value iteration (EVI): starting from an initial vectoru0= 0,EVIrecursively computes

un+1(s) = max a,er,pe  e r(s, a) +p(e·|s, a)Tu n =Lue n(s), (4)

where eL is the optimistic optimal Bellman operator associ-ated to fMk. IfEVIis stopped whensp{un+1− un} ≤ εk,

then the greedy policyµek w.r.t.unis guaranteed to beεk

-optimal, i.e.,geµk f Mk ≥ g ∗ f Mk − εk ≥ g ∗ M∗− εk. Therefore,

the policyeπk associated toeµkis an optimisticεk-optimal policy, andUCRLexecutesπekuntil the end of episodek.

3.2. A first relaxation ofREGAL.C

REGAL.C follows the same steps asUCRLbut instead of solving problem (3), it tries to find the best optimistic model

f M∗

RC ∈ MRChaving constrained optimal bias span i.e.,

( fM∗ RC,πe ∗ RC) = arg max M ∈MRC,π∈ΠSD(M ) gπ M, (5)

whereMRC := {M ∈ Mk : sp{h∗M} ≤ c} is the set

of plausible MDPs with bias span of the optimal policy bounded byc. Under the assumption that sp{h

M∗} ≤ c,

REGAL.Cdiscards any MDPM ∈ Mkwhose optimal

pol-icy has a span larger thanc (i.e., sp{h

M} > c) and

other-wise looks for the MDP with highest optimal gaing∗(M ).

Unfortunately, there is no guarantee that all MDPs inMRC

are weakly communicating and thus have constant gain. As

Input: Confidenceδ∈]0, 1[, rmax,S, A, a constant c ≥ 0

For episodesk = 1, 2, ... do

1. Settk= t and episode counters νk(s, a) = 0.

2. Compute estimatespbk(s

0

|s, a),brk(s, a) and a confidence setMk(UCRL, REGAL.C), resp.Mk(SCAL).

3. Compute anrmax/√tk-approximationeπkof the solution of Eq.3(UCRL), resp. Eq.5(REGAL.C), resp. Eq.15(SCAL). 4. Sample actionat∼πek(·|st).

5. Whileνk(st, at)≤ max{1, Nk(st, at)} do

(a) Executeat, obtain rewardrt, and observe next statest+1.

(b) Setνk(st, at) += 1.

(c) Sample actionat+1∼πek(·|st+1) and set t += 1. 6. SetNk+1(s, a) = Nk(s, a) + νk(s, a).

Figure 1. The general structure of optimistic algorithms for RL. a result, we suspect this problem to be ill-posed (i.e., the maximum is most likely not well-defined). Moreover, even if it is well-posed, searching the spaceMRC seems to be

computationally intractable. Finally, for any M ∈ Mk,

there may be several optimal policies with different bias spans and some of them may not satisfy the optimality equa-tion (2) and are thus difficult to compute.

In this paper, we slightly modify problem (5) as follows: ( fMc∗,eπ

c)∈ arg max M ∈Mk,π∈Πc(M )

gπM, (6)

where the search space of policies is defined as

Πc(M ) :=π ∈ ΠSR: sp{hπM} ≤ c ∧ sp {gMπ } = 0 ,

and maxπ∈Πc(M ){g π

M} = −∞ if Πc(M ) = ∅.

Sim-ilarly to (3), problem (6) is equivalent to solving µe∗ c ∈ arg maxµ∈Π c( fMk)g µ f Mk

. Unlike (5), for every MDP in Mk(not just those inMRC), (6) considers all (stationary)

policies with constant gain satisfying the span constraint (not just the deterministic optimal policies).

Sincegπ

M andsp{hπM} are in general non-continuous

func-tions of (M , π), the argmax in (5) and (6) may not exist. Nevertheless, by reasoning in terms of supremum value, we can show that (6) is always a relaxation of (5) (where we enforce the additional constraint of constant gain).

Proposition 1. Define the following restricted set of MDPs Ek =MRC∩ {M ∈ Mk : sp{g∗M} = 0}. Then sup M ∈Ek,π∈ΠSD gπM ≤ sup M ∈Mk,π∈Πc(M ) gMπ.

Proof. The result follows from the fact thatEk ⊆ Mkand

∀M ∈ Ek,arg maxπ∈ΠSD{gMπ } ⊆ Πc(M ).

As a result, the optimism principle is preserved when moving from (5) to (6) and since the set of admissible MDPsMk

is the same, any algorithm solving (6) would enjoy the same regret guarantees asREGAL.C. In the following we further characterise problem (6), introduce a truncated value

(5)

s0 s1 a0 a1 a0 a1 r = 0 r = 0 r = 0.5 r = 1

Figure 2. Toy example with deterministic transitions and reward for all actions.

iteration algorithm to solve it, and finally integrate it into a UCRL-like scheme to recoverREGAL.Cregret guarantees.

4. The Optimization Problem

In this section we analyze some properties of the following optimization problem, of which (6) is an instance,

sup

π∈Πc(M )

{gπM} , (7)

where M is any MDP (with discrete or compact action space) s.t.Πc(M )6= ∅. Problem (7) aims at finding a policy

that maximizes the gaingπ

M within the set of randomized

policies with constant gain (i.e.,sp{gπ

M} = 0) and bias span

smaller thanc (i.e., sp{hπ

M} ≤ c). Since gπM ∈ [0, rmax]

the supremum always exists and we denote it byg∗ c(M ).

The set of maximizers is denoted byΠ∗

c(M ) ⊆ Πc(M ),

with elementsπ∗

c(M ) (if Π∗c(M ) is non-empty).

In order to give some intuition about the solutions of prob-lem (7), we introduce the following illustrative MDP. Example 1. Consider the two-states MDP depicted in Fig.2. For a generic stationary policyπ∈ ΠSRwith

deci-sion ruled∈ DMRwe have that

d =x 1 − x y 1− y  ; Pd=1 − x x y 1− y  , rd= 1−x 2 1− y  . We can compute the gaing = [g1, g2] and the bias h =

[h1, h2] by solving the linear system (1). For anyx > 0 or

y > 0, we obtain g1= g2=1 2 + x 1− 3y 2(x + y); h2− h1= 1 2 + 1− 3y 2(x + y), while forx = 0, y = 0, we have g1 = 1/2 and g2 = 1,

withh2 = h1 = 0. Note that 0 ≤ sp {hπ} ≤ 1 for any

π∈ ΠSR. In the following, we will use this example choosing

particular values forx, y, and c to illustrate some important properties of optimization problem (7).

Randomized policies. The following lemma shows that, unlike in unconstrained gain maximization where there al-ways exists an optimal deterministic policy, the solution of (7) may indeed be a randomized policy.

Lemma 2. There exists an MDPM and a scalar c ≥ 0, such thatΠ∗

c(M )6= ∅ and Π∗c(M )∩ ΠSD(M ) =∅.

Proof. Consider Ex.1with constraint1/2 < c < 1. The only deterministic policyπDwith constant gain and bias

span smaller thanc is defined by the decision rule with x = 0 and y = 1, which leads to gπD = 1/2 and sp{hπD} =

1/2. On the other hand, a randomized policy πRcan satisfy

the constraint and maximize the gain by takingx = 1 and y = (1− c)/(1 + c), which gives sp {hπR} = c and gπR =

c > gπD, thus proving the statement.

Constant gain. The following lemma shows that if we consider non-constant gain policies, the supremum in (7) may not be well defined, as no dominating policy exists. A policyπ∈ ΠSRis dominating if for any policyπ0

∈ ΠSR,

(s)≥ gπ0

(s) in all states s∈ S.

Lemma 3. There exists an MDPM and a scalar c ≥ 0, such that there exists no dominating policyπ in ΠSRwith

constrained bias span (i.e.,sp{hπ} ≤ c).

Proof. Consider Ex.1with constraint1/2 < c < 1. As shown in the proof of Lem.2, the optimal stationary policy πRwith constant gain hasgc∗ = [c, c]. On the other hand,

the only policyπ with non-constant gain is x = 0, y = 0, which hassp{hπ} = 0 < c and gπ(s

0) = 1/2 < c = g∗c

andgπ(s

1) = 1 > c = gc∗, thus proving the statement.

On the other hand, when the search space is restricted to policies with constant gain, the optimization problem is well posed. Whether problem (7) always admits a maximizer is left as an open question. The main difficulty comes from the fact that, in general,π7→ gπ is not a continuous map

andΠcis not a closed set. For instance in Ex.1, although

the maximum is attained, the pointx = 0, y = 0 does not belong toΠc(i.e.,Πcis not closed) andgπis not continuous

at this point. Notice that when the MDP is unichain ( Puter-man,1994, Sec. 8.3),Πcis compact,gπis continuous, and

we can prove the following lemma (see App.A): Lemma 4. IfM is unichain then Π∗

c(M )6= ∅.

We will later show that for the specific instances of (7) that are encountered by our algorithmSCAL, Lem.4holds.

5. Planning with

SCOPT

In this section, we introduceSCOPTand derive sufficient

conditions for its convergence to the solution of (7). In the next section, we will show that these assumptions always hold whenSCOPTis carefully integrated intoUCRL(while in App.Bwe show that they may not hold in general).

5.1. Span-constrained value and policy operators SCOPTis a version of (relative) value iteration (Puterman,

1994;Bertsekas,1995), where the optimal Bellman operator is modified to return value functions with span bounded byc, and the stopping condition is tailored to return a constrained greedypolicy with near-optimal gain. We first introduce a constrainedversion of the optimal Bellman operatorL.

(6)

Input: Initial vectorv0∈ RS, reference states∈ S, contractive

factorγ∈ (0, 1), accuracy ε ∈ (0, +∞) Output: Vectorvn∈ RS, policyπn= (Gcvn)∞

1. Initializen = 0 and v1= Tcv0− (Tcv0)(s)e,

2. Whilesp{vn+1− vn} + 2γ n 1−γsp{v1− v0} > ε do (a) n += 1. (b) vn+1= Tcvn− (Tcvn)(s)e. Figure 3. Algorithm SCOPT. Definition 1. Givenv∈ RSandc

≥ 0, we define the value operatorTc: RS → RSas Tcv = ( Lv(s) ∀s ∈ S(c, v), c + mins{Lv(s)} ∀s ∈ S \ S(c, v), (8) whereS(c, v) = {s ∈ S|Lv(s) ≤ mins{Lv(s)} + c}.

In other words, operator Tc applies a span truncation to

the one-step application ofL, that is, for any state s∈ S, Tcv(s) = min{Lv(s), minxLv(x) + c}, which guarantees

thatsp{Tcv} ≤ c. Unlike L, operator Tcis not always

as-sociated with a decision ruled s.t. Tcv = Ldv (see App.B).

We say thatTc is feasible atv ∈ RS ands ∈ S if there

exists a distributionδ+

v(s)∈ P(A) such that

Tcv(s) =

X

a∈As

δv+(s, a)r(s, a) + p(·|s, a)Tv. (9)

When a distributionδ+

v(s) exists in all states, we say that

Tcis globally feasible atv, and δ+v is its associated decision

rule, i.e.,Tcv = Lδ+

vv. In the following lemma, we identify

sufficient and necessary conditions for (global) feasibility. Lemma 5. OperatorTcisfeasible atv∈ RS ands∈ S if

and only if min a∈As{r(s, a) + p(·|s, a) Tv } ≤ mins0 {Lv(s 0) } + c. (10) Furthermore, let D(c, v) :=d ∈ DMR | sp {Ldv} ≤ c (11)

be the set of randomized decision rulesd whose associated operatorLdreturns a span-constrained value function when

applied tov. Then, Tcv is globally feasible if and only if

D(c, v)6= ∅, in which case we have Tcv = max δ∈D(c,v)Lδv, and δ + v ∈ arg max δ∈D(c,v) Lδv. (12)

The last part of this lemma shows that whenTcis globally

feasible atv (i.e., D(c, v)6= ∅), Tcv = Lδ+

vv is the

compo-nentwise maximalvalue function of the formLδv with

deci-sion ruleδ∈ DMRsatisfyingsp

{Lδv} ≤ c. Surprisingly,

even in the presence of a constraint on the one-step value span, such a componentwise maximum still exists (which is not as straightforward as in the case of the greedy oper-atorL). Therefore, whenever D(c, v) 6= ∅, optimization problem (12) can be seen as an LP-problem (see App.A.2).

Definition 2. Given v ∈ RS andc

≥ 0, let eS(c, v) be the set of states whereTcv is feasible (condition (10)) with

δ+

v(s) be the associated decision rule (Eq.9). We define the

operatorGc: RS → DMRas3 Gcv =    δ+ v(s) s∈ eS(C, v), arg min a∈As r(s, a) + p(·|s, a)Tv s∈ S \ eS(C, v). As a result, ifTc is globally feasible at v, by definition

Gcv = δ+v. Note that computingδv+ is not significantly

more difficult than computing a greedy policy (see App.C

for an efficient implementation).

We are now ready to introduce SCOPT(Fig. 3). Given a

vectorv0∈ RSand a reference states,SCOPTimplements

relative value iteration whereL is replaced by Tc, i.e.,

vn+1= Tcvn− Tcvn(s)e. (13)

Notice that the term (Tcvn)(s)e subtracted at any

itera-tionn prevents vnfrom increasing linearly withn and thus

avoids numerical instability. However, the subtraction can be dropped without affecting the convergence properties ofSCOPT. If the stopping condition is met at iterationn,

SCOPTreturns policyπn= dn∞wheredn= Gcvn.

5.2. Convergence and Optimality Guarantees

In order to derive convergence and optimality guarantees forSCOPTwe need to analyze the properties of operatorTc.

We start by proving that Tc preserves the one-step span

contractionproperties ofL.

Assumption 6. The optimal Bellman operatorL is a 1-step γ-span-contraction, i.e., there exists a γ < 1 such that for any vectorsu, v∈ RS,sp

{Lu − Lv} ≤ γsp {u − v}.4

Lemma 7. Under Asm.6,Tcis aγ-span contraction.

The proof of Lemma7relies on the fact that the truncation ofL in the definition of Tcis non-expansive in span

semi-norm. Details are given in App.D, where it is also shown thatTcpreserves other properties ofL such as monotonicity

and linearity. It then follows thatTcadmits a fixed point

solution to an optimality equation (similar toL) and thus SCOPT converges to the corresponding bias and gain, the

latter being an upper-bound on the optimal solution of (7). We formally state these results in Lem.8.

Lemma 8. Under Asm.6, the following properties hold:

1. Optimality equation and uniqueness: There exists a solution(g+, h+)

∈ R×RSto the optimality equation

Tch+= h++ g+e. (14)

3When there are several policies δ+

v achieving Tcv(s) =

Lδ+

vv(s) in state s∈ S, Gcchooses an arbitrary decision rule.

4

In the undiscounted setting, if the MDP is unichain,L is a J-stage contraction with S≥ J ≥ 1.

(7)

If(g, h)∈ R × RS is another solution of (14), then

g = g+and there existsλ

∈ R s.t. h = h++ λe.

2. Convergence: For any initial vector v0 ∈ RS, the

sequence(vn) generated bySCOPTconverges to a

so-lution vectorh+of the optimality equation(14), and

lim

n→+∞T n+1

c v0− Tcnv0= g+e.

3. Dominance: The gaing+ is an upper-bound on the

supremum of (7), i.e.,g+≥ g∗ c.

A direct consequence of point 2 of Lem.8(convergence) is thatSCOPTalways stops after a finite number of iterations. Nonetheless,Tcmay not always be globally feasible ath+

(see App.B) and thus there may be no policy associated to optimality equation (14). Furthermore, even when there is one, Lem.8provides no guarantee on the performance of the policy returned bySCOPT after a finite number of

iterations. To overcome these limitations, we introduce an additional assumption, which leads to stronger performance guarantees forSCOPT.

Assumption 9. OperatorTcis globally feasible at any

vec-torv∈ RS such thatsp{v} ≤ c.

Theorem 10. Assume Asm.6and9hold and letγ denote the contractive factor ofTc(Asm.6). For anyv0∈ RS such

thatsp{v0} ≤ c, any s ∈ S and any ε > 0, the policy πn

output bySCOPT(v0, s, γ, ε) is such thatkg+e−gπnk∞≤ ε.

Furthermore, if in addition the policyπ+ = (Gch+)∞is

unichain,g+is the solution to optimization problem(7) i.e.,

g+= g

c andπ+∈ Π∗c.

The first part of the theorem shows that the stopping condi-tion used in Fig.3ensures thatSCOPTreturns anε-optimal

policyπn. Notice that whilesp{h+} = sp {Tch+} ≤ c

by definition of Tc, in general when the policy π+ =

(Gch+)∞associated toh+is not unichain, we might have

sp{h+

} < sp{hπ+

}. On the other hand, Corollary 8.2.7. of Puterman (1994) ensures that ifπ+ is unichain then

sp{h+

} = sp{hπ+

}, hence the second part of the theorem. Notice also that even ifπ+is unichain, we cannot guarantee

thatπnsatisfies the span constraint, i.e.,sp{hπn} may be

arbitrary larger thanc. Nonetheless, in the next section, we show that the definition ofTcand Thm.10are sufficient to

derive regret bounds whenSCOPTis integrated intoUCRL.

6. Learning with

SCAL

In this section we introduceSCAL, an optimistic online RL algorithm that employsSCOPTto compute policies that

efficiently balance exploration and exploitation. We prove that the assumptions stated in Sec.5.2hold whenSCOPTis

integrated into the optimistic framework. Finally, we show thatSCALenjoys the same regret guarantees asREGAL.C, while being the first implementable and efficient algorithm to solve bias-span constrained exploration-exploitation.

Based on Def.1, we define eTcas the span truncation of the

optimal Bellman operator eL of the bounded-parameter MDP f

Mk (see Sec.3). Given the structure of problem (6), one

might consider applyingSCOPT(using eTc) to the extended

MDP fMk. Unfortunately, in general eL does not satisfy

Asm.6and9and thus eTc may not enjoy the properties of

Lem.8and Thm.10. To overcome this problem, we slightly modify fMkas described in Def.3.

Definition 3. Let fM be a bounded-parameter (extended) MDP. Let1 ≥ η > 0 and s ∈ S an arbitrary state. We define the “modified” MDP fM‡associated to fM by5

B‡r(s, a) = [0, max{Br(s, a)}], B‡ p(s, a, s0) = ( Bp(s, a, s0) if s06= s, Bp(s, a, s)∩ [η, 1] otherwise,

where we assume that η is small enough so that: Bp(s, a, s)∩ [η, 1] 6= ∅, Ps0∈Smin{Bp‡(s, a, s0)} ≤ 1,

andP

s0∈Smax{Bp‡(s, a, s0)} ≥ 1. We denote by eL‡ the

optimal Bellman operator of fM‡(cf. Eq.4) and by eT‡ c the

span truncation of eL‡(cf. Def.1).

By slightly perturbing the confidence intervalsBpof the

transition probabilities, we enforce that the “attractive” state s is reached with non-zero probability from any state-action pair(s, a) implying that the ergodic coefficient of fM‡

γ = 1 min s,u∈S, a,b∈A e p,q∈Be ‡ p      X j∈S

min{ep(j|s, a),eq(j|u, b)}

| {z } ≥η if j=s     

is smaller than1− η < 1, so that eL‡isγ-contractive (

Puter-man,1994, Thm. 6.6.6), i.e., Asm.6holds. Moreover, for any policyπ∈ ΠSR( f

M‡), state s necessarily belongs to all

recurrent classesofπ implying that π is unichain and so f

M‡is unichain. As is shown in Thm.11, theη-perturbation

ofBpintroduces a small biasηc in the final gain.

By augmenting (without perturbing) the confidence inter-valsBrof the rewards, we ensure two nice properties. First

of all, for any vectorv ∈ RS, eLv = eLv and thus by

def-inition eTcv = eTc‡v. Secondly, there exists a decision rule

δ∈ DMR( f M‡) such that ∀s ∈ S,re ‡ δ(s) = 0 meaning that sp{eL‡δv} = sp{ eP ‡ δv} ≤ sp {v} (Puterman,1994,

Propo-sition 6.6.1). Thus ifsp{v} ≤ c then sp{eL‡δv} ≤ c and soδ ∈ eD‡(c, v)6= ∅ which by Lem.5implies that eT

c is

globally feasible atv. Therefore, Asm.9holds in fM‡. When combining both the perturbation ofBpand the

aug-mentation ofBrwe obtain Thm.11(proof in App.E).

5

For any closed interval[a, b] ⊂ R, max{[a, b]} := b and min{[a, b]} := a

(8)

Theorem 11. Let fM be a bounded-parameter (extended) MDP and fM‡its “modified” counterpart (see Def.3). Then

1. eL‡ is aγ-span contraction with γ

≤ 1 − η < 1 (i.e., Asm.6holds) and thus Lem.8applies to eT‡

c. Denote

by(g+, h+) a solution to equation (14) for eT‡ c.

2. eT‡

c is globally feasible at anyv ∈ RSs.t.sp{v} ≤ c

(i.e., Asm.9holds) and fM‡is unichain implying that π+= G

ch+is unichain. Thus Thm.10applies to eTc‡.

3. ∀µ ∈ Πc( fM), g+= gc∗( fM‡)≥ gµ( fM) − ηc.

SCAL(cf. Fig.1) is a variant of UCRL that appliesSCOPT

(instead of EVI, see Eq. 4) on the bounded parameter MDP fMk (instead of fMk, cf. step 2 in Fig.1) in each

episodek to solve the optimization problem max

M ∈ fM‡k,π∈Πc(M )

gπM, (15)

whose maximum is denoted byg∗ c( fM

k). The intervals Bp‡

of fMkare constructed using parameter6η

k = rmax/(c· tk)

and an arbitrary attractive states ∈ S. SCOPT is run at

step 3 in Fig.1with an initial value functionv0 = 0, the

same reference states used for the construction of B‡ p,

con-traction factorγk = 1− ηk, and accuracyεk= rmax/√tk.

SCOPTfinally returns an optimistic (nearly) optimal policy

satisfying the span constraint. This policy is executed until the end of the episode.

Thm.11ensures that the specific instance of problem (6) forSCAL(i.e., problem (15)) is well defined and admits a maximizerπ∗

c( fM‡k) that can be efficiently computed using

SCOPT. Moreover, up to an accuracyηk· c = rmax/tk,

pol-icyπc∗( fM ‡

k) is still optimistic w.r.t. all policies in the set of

constrained policiesΠc( fMk) for the initial extended MDP.

Since the true (unknown) MDPM∗belongs to

Mk with

high probability, under the assumption thatsp{h

M∗} ≤ c,

g∗ c( fM

k)≥ g∗M∗− rmax/tk. As briefly mentioned in Sec.5,

in practiceSCOPTcan only output an approximationeµkof

π∗ c( fM

k) and we have no guarantees on sphµek . However,

the regret proof ofSCALonly uses the fact thatsp{vn} ≤ c

and this is always satisfied by definition of eT‡

c. We are now

ready to prove the following regret bound (see App.F). Theorem 12. For any weakly communicating MDPM such thatsp{h

M} ≤ c, with probability at least 1 − δ it

holds that for anyT ≥ 1, the regret of SCALis bounded as ∆(SCAL, T ) =O max{rmax, c}

s ΓSAT ln T δ ! , 6

Notice that given thatβp,ksa ≥ ηkfor all(s, a)∈ S × A (see

definition in Sec.3), the assumptions of Def.3hold trivially.

whereΓ = maxs∈S,a∈Akp(·|s, a)k0 ≤ S is the maximal

number of states that can be reached from any state. The previous bound shows that whenc ≤ rmaxD,SCAL

scales linearly with c, while UCRLscales linearly with rmaxD (all other terms being equal). Notice that the gap

betweensp{h

} and D can be arbitrarily large, and thus the improvement can be significant in many MDPs. As an extreme case, in weakly communicating MDPs the diameter can be infinite, leadingUCRLto suffer linear regret, while SCAL is still able to achieve sub-linear regret. However whenc > rmaxD, given that the true MDP M∗may not

be-long toMk, we cannot guarantee that the span of the value functionvnreturned bySCOPTis bounded byrmaxD.

Nev-ertheless, we can slightly modifySCALto address this case: at the beginning of any episodek, we run bothSCOPT(with

the same inputs) andEVI(as inUCRL) in parallel and pick the policy associated to the value with smallest span. With this modification,SCALenjoys the best of both worlds, i.e., the regret scales withmin{max{rmax, c}, rmaxD} instead

ofc. When c is wrongly chosen (c < sp{h

M∗}),SCAL

converges to a policy inΠ∗

c(M∗) which can be arbitrarily

worse than the true optimal policy inM∗. For this reason we

cannot prove a regret bound in this scenario. Finally, notice that the benefit ofSCALoverUCRLcomes at a negligible additional computational cost.

7. Numerical Experiments

In this section, we numerically validate our theoretical find-ings. The code is available onGitHub. In particular, we show that the regret ofUCRLindeed scales linearly with the diameter, whileSCALachieves much smaller regret that only depends on the span. This result is even more extreme in the case of non-communicating MDPs, whereD =∞. Consider the simple but descriptive three-state domain shown in Fig.4(a)(results in a more complex domain are reported in App.G). In this example, the learning agent only has to choose which action to play in states2(in all other

states there is only one action to play). The rewards are dis-tributed as Bernoulli with parameters shown in Fig.4(a)and rmax= 1. The optimal policy π∗is such thatπ∗(s2) = a1

with gaing∗ = 2 3 and biash ∗ =h−2−δ 3(1−δ), −1 1−δ, 0 i . Ifδ is small,sp{h∗ } = 1 1−δ ≈ 1, while D ≈ 1 δ. Fig.4(b)shows

that, as predicted by theory, the regret ofUCRL(for a fixed horizonT ) grows linearly with 1δ ≈ D. The optimal bias span however is roughly equal to1. Therefore, we expect SCALto clearly outperformUCRLon this example. In all the experiments, we noticed that perturbing the extended MDP was not necessary to ensure convergence ofSCOPT

and so we setηk = 0. We also set γk= 0 to speed-up the

execution ofSCOPT(see stopping condition in Fig.3). Communicating MDPs. We first setδ = 0.005 > 0,

(9)

giv-s0 s1 s2 a0 δ 1− δ a0 a0 δ 1− δ a1 r = 0 r ∼ Be 1 3  r ∼ Be 2 3  r ∼ Be 2 3  (a) 102 103 104 104 105 106 1/δ Cum ulativ e Regret (UCRL) (b)

Figure 4. (upper) Simple three-state domain. (lower)Cumulative regret incurred by UCRL afterT = 2.5· 107steps as a function of the

diameterD≈ 1/δ (averaged over 20 runs).

8 10 12 0 5 10 15 Log-Duration ln(T ) sp (vn ) 0 1 2 3 4 5 ·105 0 1 2 3 ·104 Duration T Cum ulativ e Regret ∆( T ) 8 10 12 14 16 0 2,000 4,000 6,000 8,000 Log-Duration ln(T ) sp (vn ) SCAL c = 2 SCAL c = 5 SCAL c = 10 SCAL c = 15 SCAL c = 20 UCRL 0 1 2 3 4 5 ·107 0 2 4 ·106 Duration T Cum ulativ e Regret ∆( T ) (a) (b)

Figure 5. Results in the three-states domain withδ = 0.005 (top) and δ = 0 (bottom). We report the span of the optimistic bias (left) and the cumulative regret (right) as a function ofT . Results are averaged over 20 runs and 95% confidence intervals are shown. ing a communicating MDP. With such a smallδ, visiting

states1is rather unlikely. Nonetheless, sinceUCRLis based

on the OFU principle, it keeps trying to visits1(i.e., play

a0ins2) until it collects enough samples to understand that

s1is actually a bad state (before that,UCRL“optimistically”

assumes that s1 is a highly rewarding state). Therefore,

UCRLplaysa0ins2for a long time and suffers large regret.

This problem is particularly challenging for any learning algorithm solely employing optimism likeUCRL(cf. ( Or-tner,2008) for a more detailed discussion on the intrinsic limitations of optimism in RL). In contrast,SCALis able to mitigate this issue when an appropriate constraintc is used. More precisely, whenevers1is believed to be the most

rewarding state, the value function (bias) is maximal ins1

andSCOPTapplies a “truncation” in that state and “mixes” deterministic actions. In other words,SCALleverages on the prior knowledge of the optimal bias span to understand thats1cannot be as good as predicted (from optimism). The

exploration of the MDP is greatly affected asSCALquickly discovers that actiona0 ins2 is suboptimal. Therefore,

SCALis always performing better than UCRL (Fig.5(a)) and the smallerc, the better the regret. Surprisingly the actualpolicy played by SCALin this particular MDP is always deterministic.SCOPTmixes actions ins1where only

one true action is available but the mixing happens in the extendedMDP fMkwhere the action set is compact. The policy thatSCOPToutputs is thus stochastic in the extended MDP but deterministic in the true MDP.

Infinite Diameter. By selecting δ = 0 the diameter be-comes infinite (D = +∞) but the MDP is still weakly

communicating (with transient states1). UCRL is not able

to handle this setting and suffers linear regret. On the con-trary,SCALis able to quickly recover the optimal policy (see Fig.5(b) and App.G).

8. Conclusion

In this paper we introducedSCAL, aUCRL-like algorithm that is able to efficiently balance exploration and exploita-tion in any weakly communicating MDP for which a finite boundc on the optimal bias span sp{h

} is known. While UCRLexclusively relies on optimism and usesEVIto com-pute the exploratory policy,SCALleverages the knowledge ofc through the use ofSCOPT, a new planning algorithm

specifically designed to handle constraints on the bias span. We showed both theoretically and empirically thatSCAL achieves smaller regret thanUCRL. AlthoughSCALwas inspired byREGAL.C, it is the only implementable approach so far. Therefore, this paper answers the long-standing open question of whether it is actually possible to design an algo-rithmthat does not scale with the diameterD in the worst case. Moreover,SCALpaves the way for implementable algorithms able to learn in an MDP with continuous state space. Indeed, existing algorithms achieving regret guaran-tees in this framework (Ortner and Ryabko,2012; Laksh-manan et al.,2015) all rely onREGAL.C. We also believe that our approach can easily be extended to optimisticPSRL (Agrawal and Jia,2017) to achieve an even better regret bound of eOmin{c, rmaxD}

SAT, i.e., drop the depen-dency inΓ. Finally, we leave it as an open question whether the assumption thatc is known can be relaxed.

(10)

Acknowledgements

This research was supported in part by French Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council and French National Research Agency (ANR) under project ExTra-Learn (n.ANR-14-CE24-0010-01). Furthermore, this work was supported in part by the Austrian Science Fund (FWF): I 3437-N33 in the framework of the CHIST-ERA ERA-NET (DELTA project).

References

Yasin Abbasi-Yadkori and Csaba Szepesv´ari. Bayesian optimal control of smoothly parameterized systems. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, UAI 2015, pages 1–11. AUAI Press, 2015.

Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems 30, NIPS 2017, pages 1184–1194, 2017. Jean-Yves Audibert, R´emi Munos, and Csaba Szepesv´ari.

Tuning bandit algorithms in stochastic environments. In Algorithmic Learning Theory, 18th International Con-ference, ALT 2007, volume 4754 of Lecture Notes in Computer Science, pages 150–165. Springer, 2007. Mohammad Gheshlaghi Azar, Ian Osband, and R´emi Munos.

Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Ma-chine Learning, ICML 2017, volume 70 of Proceedings of Machine Learning Research, pages 263–272, 2017. Peter L. Bartlett and Ambuj Tewari. REGAL: A

regulariza-tion based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Confer-ence on Uncertainty in Artificial IntelligConfer-ence, UAI 2009, pages 35–42. AUAI Press, 2009.

Dimitri P Bertsekas. Dynamic programming and optimal control. Vol II. Athena Scientific, 1995.

Ronen I. Brafman and Moshe Tennenholtz. R-MAX - A general polynomial time algorithm for near-optimal re-inforcement learning. Journal of Machine Learning Re-search, 3:213–231, 2002.

Christoph Dann and Emma Brunskill. Sample complex-ity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems 28, NIPS 2015, pages 2818–2826, 2015.

Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

Donald E. Knuth. The Art of Computer Programming, Vol-ume 2: SeminVol-umerical Algorithms. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 3rd edition, 1997.

K. Lakshmanan, Ronald Ortner, and Daniil Ryabko. Im-proved regret bounds for undiscounted continuous rein-forcement learning. In Proceedings of the 32nd Inter-national Conference on Machine Learning, ICML 2015, volume 37 of Proceedings of Machine Learning Research, pages 524–532, 2015.

Andreas Maurer and Massimiliano Pontil. Empirical Bern-stein bounds and sample-variance penalization. In COLT 2009 - The 22nd Conference on Learning Theory, 2009. Ronald Ortner. Optimism in the face of uncertainty should be refutable. Minds and Machines, 18(4):521–526, 2008. Ronald Ortner and Daniil Ryabko. Online regret bounds

for undiscounted continuous reinforcement learning. In Advances in Neural Information Processing Systems 25, NIPS 2012, pages 1772–1780, 2012.

Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. CoRR, abs/1608.02732, 2016.

Ian Osband and Benjamin Van Roy. Why is posterior sam-pling better than optimism for reinforcement learning? In Proceedings of the 34th International Conference on Ma-chine Learning, ICML 2017, volume 70 of Proceedings of Machine Learning Research, pages 2701–2710, 2017. Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems 26, NIPS 2013, pages 3003–3011, 2013.

Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In Advances in Neural Information Processing Systems 30, NIPS 2017, pages 1333–1342, 2017.

Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994.

Eugene Seneta. Sensitivity of finite Markov chains under perturbation. Statistics & Probability Letters, 17(2):163– 168, 1993.

William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.

(11)

Index of the Appendix

We start providing a brief recap of the content of the appendix: • App.A

– Proof of Lem.4: IfM is unichain then Π∗

c(M )6= ∅. As a starting point the continuity of gain g and span h w.r.t.

the policy is proved (see Lem.13).

– The policy associated toTcv can be interpreted as a solution of an LP problem (see App.A.2)

• App.B

– Shows the limitations ofSCOPT(Tc): non-feasibilityB.1and non-convergenceB.2.

• App.C

– We show how to compute the policy associated to the operatorTcv when Tcis feasible inv. We consider both

MDPs and extended MDPs. • App.D

– Proof of Lem.5i.e., when operatorTcis feasible (see App.D.1).

– Proof of Lem.7i.e., that under Asm.6,Tcis a span contraction (see App.D.2).

– Proof of Lem.8i.e., existence and uniqueness of the optimality equation forTc, convergence ofSCOPTand gain

dominance (see App.D.3)

– Proof of Thm.10i.e., stopping condition forSCOPTand approximation guarantees (see App.D.4).

• App.E

– A formal definition of perturbed extended MDP (see Lem.19) and span contraction property for eL in the perturbed MDP.

– A formal definition of (reward) augmented extended MDP (see Lem.20), equality of the operator in the original and augmented extend MDPs, and non-emptiness ofD(c, v) in the augmented MDP when sp{v} ≤ c.

– Proof of Thm.11. We prove existence and uniqueness of the optimality equation forTc, convergence ofSCOPT

and gain dominance for the perturbed and augmented extended MDP (i.e.,MK) (see Thm.21). • AppF

– Proof of Thm.12i.e., the regret ofSCOPT.

• AppG

– We testSCALandUCRLon a larger and more challenging domain (the knight quest)

A. Optimization with bias span constraint

A.1. Existence of gain optimal policies under bias-span constraint: the unichain case (proof of Lem.4) In this section we provide a formal proof of Lem.4.

In unichain MDPs, all policiesπ∈ ΠSRhave a constant gaingπ(Puterman,1994, section 8.4), thus the search space reduces

toΠc={π ∈ ΠSR: sp{hπ} ≤ c}. We assume that Πc6= ∅. We first prove the following lemma.

Lemma 13. In a unichain MDP,g : π 7→ gπ andh : π

7→ hπ are continuous maps fromΠSRto R and ΠSR to RS

respectively.

Proof. Let’s consider two stationary policiesπ = d∞

∈ ΠSRand

b π = bd∞

∈ ΠSR. Denote byP and bP the transition

matrices associated tod and bd respectively. Since the MDP is unichain by assumption, the Markov Chains characterized byP and bP each have a unique stationary distribution µ andbµ respectively. We express the gap µ−µ using the sameb decomposition asSeneta(1993) (µ| b µ|)(I− bP + e b µ|) = µ|(P − bP ) =⇒ (µ| b µ|) = µ|(P− bP )H b P

(12)

whereI is the identity matrix, e = (1 . . . , 1)|is the vector of all 1’s andH b

P = (I− bP + ebµ

|)−1

− ebµ|is the Drazin

inverseofI− bP also known as the deviation matrix of bP (always well-defined, see Appendix A of (Puterman,1994)). The above equality implies that

kµ| b µ|k 1≤ kµ|k1 | {z } =1 kP − bPk∞,1kHPbk∞,∞ =kP − bPk∞,1kHPbk∞,∞

wherekAk∞,1:= maxiPj|Aij| and kAk∞,∞ := maxi,j|Aij|. As a consequence of the above inequality, when P → bP

we haveµµ. Moreover, when db → bd we have P → bP by linearity and thus by composition: lim

d→ bd

µ =µb

Denote byr andbr the reward functions associated to d and bd respectively. We have gπ = µ|r and gbπ=

b µ|

b

r and since r is linear (hence continuous) ind we conclude that

lim

π→bπg π= g

or in other words,g : π7→ gπis continuous at

b

π and sincebπ was chosen arbitrarily, g is continuous everywhere. Similarly, hπ= H

Pr and P 7→ HPis continuous inP (the computation of HP involves only continuous operations ofP and µ like

addition, multiplication and inversion of matrices) and thereforehπis continuous too.

Note that Lem.13does not hold in general when the MDP is not unichain (see Ex.1whenx→ 0 and y → 0). Sincesp{·} is a semi-norm, it is a continuous map from ΠSRto R and so the function f : π

7→ sp {hπ

} is continuous by composition. Sincef is continuous, ΠSRis compact and R is a Hausdorff space, we know from basic topology that f is a

proper map i.e, the preimage of every compact set in R by f is compact in ΠSR. Since we can expressΠ

cas the preimage of

the compact interval[0, c] by f i.e., Πc= f−1([0, c]), it is clear that Πcis compact. As a result, sincegπis continuous inπ

andΠcis compact, by Weierstrass extreme value theorem the maximum ofgπis attained inΠcand soΠ∗c 6= ∅.

A.2. Greedy policy under bias span constraint: LP formulation

In this section, we show that the policy associated toTcv can be interpreted as the solution of a Linear Programming (LP)

problem.

As mentioned in Sec.5.1, a consequence of Lem.5(see proof in App.D) is that wheneverD(c, v) 6= ∅, there exists δ+v ∈ D(c, v) such that Lδ+vv≥ Ldv for all d∈ D(c, v) component-wise, and moreover Tcv = Lδv+v. As a result, we can

expressδ+

v as a maximizer of the following optimization problem

max

d∈D(c,v){(Ldv)

|e} (16)

wheree = (1 . . . 1)|is the vector of all 1’s. The maximum of (16) is then(T

cv)|e. Since d7→ rdandd7→ Pdare linear

maps, the function we maximized7→ (Ldv)|e is also linear in d. Moreover, the set D(c, v) can be expressed as a set of

S× (S − 1) linear constraints in Ldv:

Ldv(s)− Ldv(s0)≤ c, ∀s 6= s0

Therefore, optimization problem (16) can be formulated as an LP problem. But of course it is much easier to computeTcv

using Def.1. The policyδ+

v associated toTcv can also be computed efficiently without solving (16) (see App.Cfor more

details).

Remark. Recall that computing the maximal gain of an MDP can be done by solving the following primal LP problem (Puterman,1994, Section 8.8)

min

g∈R,h∈RS{g}

s.t. g + h(s)X

s0∈S

p(s0|s, a)h(s0)≥ r(s, a), ∀s ∈ S, ∀a ∈ As

(17)

(13)

s0 s1

a0,r = 0

a1,r = 0

a0,r = 1

a1,r = 1

Figure 6. Example showing thatTcmight

not always be feasible even whenΠ∗c 6= ∅.

s0 s1 s2 a0,r = α a1,r = δ a0,r = β a0,r = 0

Figure 7. Example showing thatTcmight

not always be feasible at its fixed-pointh+.

s0 s1 s2 r = 1 1− δ δ r = 0 r = 0

Figure 8. Example showing that the se-quence(Tc)nv0 might not converge even

when Πc 6= ∅ and all policies are both

unichain and aperiodic.

primal formulation with the addition ofS× (S − 1) linear constraints in h (as we did above for Ldv):

h(s)− h(s0)

≤ c, ∀s 6= s0

Unfortunately it is not that simple. Indeed, the validity of LP problem(17) is a consequence of the following two properties (Puterman,1994, Theorem 8.4.1):

1) ge + h− Lh ≥ 0 =⇒ g ≥ g∗ 2) ge + h− Lh = 0 =⇒ g = g∗

In general, these properties no longer hold for operatorTcand optimal bias-span-constrained gaingc∗. Therefore, the LP

approach fails (one can easily try to solve the constrained LP on a simple MDP and observe the solution is incorrect). Using thedual formulation is also tricky because the span constraint is not linear in the dual variables. Whether it is possible to formulate problem(7) as an LP problem is left as an open question.

B. Limitations of

SCOPT

In this section, we illustrate the limitations of operatorTcon some simple examples. For convenience, we introduce notation

Ncto denote the (value) operator associated to policyGcv as

Ncv = L(Gcv)v. (18)

Trivially, ifTcv is globally feasible, then Ncv = Tcv and Gcv = δv+.

B.1. Non-feasibility ofTc

The following example shows that operatorTcmay generate vectors that do not correspond to a one-step policy evaluation,

i.e., there may not existsδ+

v ∈ DM Rsuch thatTcv = Lδ+

vv, even if sp{v} ≤ c and/or if Πc6= ∅.

Example 2. Consider the simple MDP provided in Fig.6. Letv = [0, 0] and c = 1/2. In this case the policy π playing a0in both states matches the constraintc (i.e., sp{hπ} ≤ c =⇒ π ∈ Πc =⇒ Πc6= ∅) and moreover sp {v} ≤ c, but

clearly there exists no policyδ+

v achievingLδ+

vv = Tcv and moreover

Tcv = [0, 1/2]6= Ncv = [0, 1]

The previous example shows that there may not exists a policy associated to the one-step application of operatorTc. Instead,

Ex.3shows thatTcmay also not be feasible at convergence. In particular, we show that, surprising as it may seem,SCOPT

can sometimes converge to a valueh+that is not associated to any policy, even whenΠ

c6= ∅ and even when Π∗c 6= ∅.

Example 3. Consider the simple MDPM of Fig.7where we assume thatβ < δ < α and we set c = α + β. The MDP is unichain and all gains are equal to0. The set of randomized decision rules can be parametrized by the probability p of

(14)

playinga1ins1and the associated set of bias functions is H =      α + (1− p) · β + p · δ (1− p) · β + p · δ 0  : p∈ [0, 1]    .

Let’s denote byh(p) the bias associated to a policy parameterized by p, then sp{h(p)} = α + (1 − p) · β + p · δ > c for all p > 0. So there exists only one policy π = d∞achieving the span constraint which playsa

0ins1(i.e.,p = 0) implying that

Πc(M ) ={d∞} = Π∗c(M )6= ∅ =⇒ π∗c = d∞. It is easy to verify that:

Fixed point of Tc : h+=   α + β δ 0  ∈ H/ Fixed point of Nc: h#=   α + δ δ 0  ∈ H but sph# > c,

AlthoughTcadmits a fixed pointh+, it is not globally feasible ath+. On the other hand, whileNcis globally feasible at its

fixed pointh#by definition,h#does not satisfy the bias constraint. One might be tempted to think that the problem in this

example arises from the fact thatΠc(M ) is a singleton but it is actually more subtle than that. Indeed, if we assume that

β > 0 and if we add an action a2ins1that goes tos2with probability1 and gives a reward 0, we face the same problem

but this timeΠc(M ) contains an infinite number of policies (since we include stochastic policies). The problem is actually

coming from the fact that the action played in the only state achieving maximum bias (i.e.,s0) is deterministic while the

action played in states1(which achieves a lower bias thans0) is stochastic.Tcis unable to converge to such policies: by

definition, it can only converge to a policy that plays a stochastic action in the states with maximal bias (it can also converge to a bias that is not associated to any policy like in this example).

The issue presented in Ex.3can be overcome by duplicating all actions and adjusting the rewards of the duplicated actions. More formally, denote bya the action obtained by duplicating a. The probability of transition is not modified (i.e., p(·|s, a) = p(·|s, a)) but the reward is set to the minimal value (i.e., r(s, a) = rmin= 0). Denote by M↓this “augmented”

MDP. It is easy to verify thath+(M) = h+(M ) = (α + β, β, 0)|but unlikeM , Madmits a policy associated toh+(M)

(using duplicated actions). As this example shows, augmenting the MDP never modifies the fixed point ofTcbut always

makesTcglobally feasible at any vectorv satisfying sp{v} ≤ c. Since by definition the fixed point h+ofTcsatisfies the

span constraint,Tcis globally feasible ath+. This example gives an intuition whySCALuses a modified MDP fM‡kwith

augmented rewards (the confidence intervalsBk

rare “augmented” by below).

B.2. Non-convergence ofTn c

It is rather easy to design an MDP for which the stopping condition ofSCOPT(i.e.,sp{vn+1− vn} ≤ ε) is never met

although all policies are unichain and aperiodic andΠc6= ∅. In contrast, for the optimal Bellman operator L, unichain and

aperiodicity are sufficient conditions to ensure that the stopping conditionsp{vn+1− vn} ≤ ε is met after a finite number

of iterations (Puterman,1994, Theorem 8.5.7).

Example 4. Consider the simple MDPM provided in Fig.8where we assume that1 > δ > 0 and 1/2≥ c > 0. There is only one action available in every state and thus there is only one decision ruled. In that case L = Ld. The contraction

condition of (Puterman,1994, Theorem 8.5.3) holds forJ = 2, i.e., L is a 2-stage span contraction. More precisely we have: Pd =   δ 1− δ 0 0 0 1 1 0 0   =⇒ Pd2=   δ2 δ − δ2 1 − δ 1 0 1 δ 1− δ 0   =⇒ γd def = 1− min s,u∈S    X j∈S min{Pd2(j|s), Pd2(j|u)}    = δ < 1

whereγdis theergodic coefficient associated to Markov ChainPd. This implies thatL is a 2-stage γd-span contraction:

spL2n+1v

− L2nv ≤ γn

(15)

starting fromv0= 0 proceeds as follows v0=   0 0 0  , v1=   c 0 0  , v2=   c 0 c  , v3=   2c c c 

= v1+ ce, . . . , v2n= v2+ nce, v2n+1= v1+ nce

wheree = (1, . . . , 1)|denotes the vector of all 1’s. We see that unlikeLnv

0,(Tc)nv0is cycling with period 2 and the

quantitysp{v2n+1− v2n} = sp {v2− v1} does not converge to 0. Although Lem.7shows that whenL is a J-stage span

contraction withJ = 1 then Tcis also a span contraction (proof in App.D), surprisingly(Tc)nv0might not converge when

J > 1. Note that in this example Πc(M ) = ∅ and so one might wonder whether when Πc(M )6= ∅ the sequence Tcnv0

converges in span semi-norm. Unfortunately, it is not the case. Take the same MDP, duplicate the action ins0and assign a

reward of0 to this new action (the new action loops on s0with probabilityδ and goes to s1with probability1− δ as the

original action, but the reward is0 instead of 1). In that case, Πc(M )6= ∅ for all c ≥ 0 but we still have L = Ldwhered

plays the original action ins0. Therefore, we face exactly the same problem as before althoughΠc(M )6= ∅.

C. Policy δ

+v

associated to T

c

v

In this section, we provide a detailed description on how to efficiently compute a policyδ+

v associated toTcv when Tc

is feasible atv. As mentioned in Sec.4, we say thatTcis feasible atv ∈ RS ands∈ S when there exists a distribution

δ+

v(s)∈ P(A) such that

Tcv(s) =

X

a∈As

δ+

v(s, a)r(s, a) + p(·, s, a)Tv. (19)

We distinguish between two types of states:

• Greedy states. When Lv(s) ≤ min{Lv} + c i.e., s ∈ S(c, v) (see Def.1),δ+

v(s) plays a deterministic greedy action

a∈ arg maxa∈As{r(s, a) + p(·|s, a) Tv

}.

• Truncated states. When Lv(s) > min{Lv} + c i.e., s /∈ S(c, v) (see Def.1), by definition ofL there exists at least one actiona (e.g., any greedy action) such that r(s, a) + p(·|s, a)Tv > min

{Lv} + c. In addition, under the assumption thatTcis feasible atv and s, we know from condition (10) of Lem.5(proof in App.D) that there exists an actiona

such thatr(s, a) + p(·|s, a)Tv

≤ min{Lv} + c. By the intermediate value theorem, we know that there exists a convex combinationδv+(s) of actions a and a achieving exactly Lδv+v(s) = min{Lv} + c. Note that there may exist multiple

policies achieving this value (e.g., when there are multiple actions achieving higher or smaller values thanmin{Lv}+c). However, to simplify the implementation, we can simply setδ+

v(s) to play with non-zero probability only a greedy

action a ∈ arg maxa∈As{r(s, a) + p(·|s, a) Tv

} and a minimal action a ∈ arg mina∈As{r(s, a) + p(·|s, a) Tv

}. Formally, letv = mina∈As{r(s, a) + p(·|s, a)

Tv

} and v = maxa∈As{r(s, a) + p(·|s, a) Tv } = Lv(s). Then, δ+v(s, a) =      (v− min{Lv} − c)/(v − v) if a = a (min{Lv} + c − v)/(v − v) if a = a 0 otherwise (20)

Bounded-Parameter MDP. When we consider a bounded-parameter MDP fM, the only change is in the computation of the minimal and maximal actions. Define

e

Lv(s) = max

a∈As,er∈Br(s,a),p∈Be p(s,a)

 e r +peTv

e

Lv(s) = min

a∈As,er∈Br(s,a),p∈Be p(s,a)

 e r +peTv

Then,a and a are the actions associated to eLv(s) and e

Lv(s), respectively. Given a and a, the policy δ+

v(s) associated to

Tcv(s) is computed as in (20). The maximum and minimum ofr fore re∈ Br(s, a) are easy to compute (they correspond to the extreme values of the closed intervalBr(s, a)). The maximum ofpe

Tv for

e

p(s0)∈ Bp(s, a, s0) can be computed inO(S)

operations using the algorithm described in (Dann and Brunskill,2015, Appendix A). To compute the minimum ofpeTv, the

exact same algorithm can be used with input−v instead of v since min

e

p{ep|v} = − maxpe{pe

(16)

D. Properties of operators T

c

and G

c

D.1. Feasibility ofTc(proof of Lemma5)

In this section, we prove Lem.5.

We start by proving that for any decision ruled∈ D(c, v), we have Tcv≥ Ldv component-wise. By definition of Tcand

the optimal Bellman operatorL it holds that

∀s ∈ S(c, v), ∀d ∈ D(c, v), Tcv(s) = Lv(s)≥ Ldv(s).

Moreover, letm := mins{Lv(s)}, then

∀s ∈ S \ S(c, v), ∀d ∈ D(c, v), Tcv(s) = m + c (21)

≥ m + sp {Ldv} (22)

= m + max{Ldv} − min{Ldv} (23)

≥ m + Ldv(s)− m = Ldv(s), (24)

Inequality (22) is a consequence of the fact thatd∈ D(c, v) ⇔ sp {Ldv} ≤ c. Denote by ˆs ∈ arg maxs{Lv(s)} any state

achieving minimum value forLv(s). Inequality (24) follows by noticing thatm = Lv(ˆs)≥ Ldv(ˆs)≥ min{Ldv}. In

conclusion, for anyd∈ D(c, v) and any s ∈ S, Tcv(s)≥ Ldv(s). This immediately implies that whenever Tcis globally

feasible atv, Tcv = maxδ∈D(c,v)Lδv and δv+∈ arg maxδ∈D(c,v)Lδv.

Let’s now prove the equivalence between the feasibility ofTcatv∈ RSands∈ S and condition (10) i.e.,

min

a∈As{r(s, a) + p(·, s, a) T

v} ≤ mins {Lv(s)} + c.

If condition (10) holds, we can use the constructive procedure described in App. C to construct a stochastic action δ+

v(s)∈ P(A) such that Lδ+

vv(s) = mins{Lv(s)} + c = Tcv(s) and thus Tcis feasible atv and s. On the other hand, if

condition (10) does not hold i.e.,mina∈As{r(s, a) + p(·|s, a) Tv

} > mins{Lv(s)} + c then it is clear that any d(s) ∈ P(A)

will be such thatLdv(s) > mins{Lv(s)} + c and so Tcis not feasible atv and s. By contraposition, if Tcis feasible atv

ands then condition (10) holds thus proving the equivalence.

Finally,Tcis globally feasible atv if and only condition (10) holds in every states∈ S. Trivially, if Tcis globally feasible

thensp{Tcv} = sp n Lδ+ vv o ≤ c and so δ+

v ∈ D(c, v) implying that D(c, v) 6= ∅. If D(c, v) 6= ∅ then there exists

δ ∈ DMRsuch thatsp

{Lδv} ≤ c. Assume that condition (10) does not hold in at least one states ∈ S meaning that

mina∈As{r(s, a) + p(·|s, a) Tv

} > mins{Lv(s)} + c which implies that for any d ∈ DMR,Ldv(s) > mins{Lv(s)} + c.

This contradicts the fact that

sp{Lδv} ≤ c =⇒ Lδv(s)≤ max

s {Ldv(s)} ≤ mins {Ldv(s)} + c ≤ mins {Lv(s)} + c

where we used the definition of the spansp{u} := max{u} − min{u} and the fact that Ldv ≤ Lv component-wise

by definition ofL implying that mins{Ldv(s)} ≤ mins{Lv(s)}. Therefore, condition (10) must hold in every state. In

conclusion,Tcis globally feasible atv if and only D(c, v)6= ∅.

D.2. Contraction property ofTc(proof of Lemma7)

The purpose of this section is to prove Lem.7. We first reinterpret operatorTcas the composition of a projectionΓcand the

optimal Belmman operatorL (Tc= ΓcL) and we prove interesting properties for ΓcandTc.

Recall thatTccan be seen as the truncation of the optimal Bellman operator i.e.,Tcv(s) = min{Lv(s), minx{Lv(x)} + c}.

The following lemma shows that the truncation step is actually a projection in span semi-norm. LetVc={v : sp {v} ≤ c}

be the “semi-ball” of span constrained value functions. For any vectorv∈ RSand anyc

≥ 0, we define the truncation operatorΓc: RS → VcasΓcv(s) = min{v(s), minx{v(x)} + c}.

Lemma 14. For any vectorv∈ RSandc≥ 0, Γ

cv is the projection of v on the semi-ball Vcin span semi-norm i.e.,

Γcv = min z∈Vc

sp{z − v} .

Figure

Figure 2. Toy example with deterministic transitions and reward for all actions.
Figure 4. (upper) Simple three-state domain.
Figure 8. Example showing that the se- se-quence (T c ) n v 0 might not converge even when Π c 6 = ∅ and all policies are both unichain and aperiodic.
Figure 9. UCRL and SCAL behaviour with δ = 0.005 in the three-states MDP.
+3

Références

Documents relatifs

In this case the space X tends to be uniformly sampled, that potentially limits the areas on which the predictive model is mistaken. The larger the dimension of X , the more

SQL has the sample and computational complexity of a same order since it performs only one Q-value update per sample, the same argument also applies to the standard Q-learning,

Team-based learning advocates stress that the course format and structure should reflect the course goals and objectives for students; for this course, I determined that

While the previous approaches assume that in both source and target task the same solution representation is used (e.g., a tabu- lar approach), Taylor and Stone (2007) consider

The context of use acts like a filter, the home page context deals with public data of the module whereas a course or a group context filters elements in connection

Our case study TWINS, described in Section II, aims to achieve global coordination, thus although request scheduling happens independently in each I/O node, decisions about the

The re- sults obtained on the Cart Pole Balancing problem suggest that the inertia of a dynamic system might impact badly the context quality, hence the learning performance. Tak-

Three experiments were carried out to both explore the natural incline and decline of agents colliding into each other throughout a simulation, and find ways in which it can be