Fast active learning for pure exploration in reinforcement learning

HAL Id: hal-02906985

https://hal.inria.fr/hal-02906985v2

Submitted on 27 Jul 2020 (v2), last revised 16 Jul 2021 (v3)

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Emilie Kaufmann, Edouard Leurent, Michal Valko

To cite this version:

Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Emilie Kaufmann, Edouard Leurent, et al. Fast active learning for pure exploration in reinforcement learning. [Research Report] DeepMind. 2020. ⟨hal-02906985v2⟩

Pierre Ménard, Inria Lille, SequeL team

Omar Darwiche Domingues, Inria Lille, SequeL team

Anders Jonsson, Universitat Pompeu Fabra

Emilie Kaufmann, CNRS & ULille (CRIStAL), Inria Lille, SequeL team

Edouard Leurent, Renault & Inria Lille, SequeL team

Michal Valko, DeepMind Paris

Abstract

Realistic environments often provide agents with very limited feedback. When the environment is initially unknown, the feedback can be completely absent in the beginning, and the agents may first choose to devote all their effort to exploring efficiently. Exploration remains a challenge: it has been addressed with many hand-tuned heuristics with different levels of generality on one side, and a few theoretically-backed exploration strategies on the other. Many of them are incarnated by intrinsic motivation and in particular exploration bonuses. A common rule of thumb for exploration bonuses is to use a $1/\sqrt{n}$ bonus that is added to the empirical estimates of the reward, where n is the number of times this particular state (or state-action pair) was visited. We show that, surprisingly, for a pure-exploration objective of reward-free exploration, bonuses that scale with 1/n bring faster learning rates, improving the known upper bounds with respect to the dependence on the horizon H. Furthermore, we show that with an improved analysis of the stopping time, we can improve by a factor H the sample complexity in the best-policy identification setting, which is another pure-exploration objective, where the environment provides rewards but the agent is not penalized for its behavior during the exploration phase.

1 Introduction

In reinforcement learning (RL), an agent learns how to act by interacting with an environment, which provides feedback in the form of reward signals. The agent's objective is to maximize the sum of rewards. In this work, we study how to explore efficiently. In particular, we wish to compute near-optimal policies using the least possible amount of interactions with the environment (in the form of observed transitions). In general, we may be interested either in the performance of the agent during the learning phase or only in the performance of the learned policy. In the first setting, we can measure the performance of the agent by its cumulative regret, which is the difference between the total reward collected by an optimal policy and the total reward collected by the agent during learning. Therefore, the agent is encouraged to explore new policies but also to exploit its current knowledge [Bartlett and Tewari, 2009, Jaksch et al., 2010]. Another performance measure related to the regret consists in counting the number of times during learning that the value of the policy used by the agent is ε-far from the optimal one. The minimization of this count is formalized in the PAC-MDP setting introduced by Kakade [2003], see also Dann and Brunskill [2015] and Dann et al. [2017]. The second setting, which is our central focus in this paper, is called pure exploration: the agent is free to make mistakes during learning and can explore more vigorously [Fiechter, 1994, Kearns and Singh, 1998, Even-Dar et al., 2006]. We provide results for two pure-exploration settings when the environment is an episodic Markov decision process (MDP): reward-free exploration (RFE) and best-policy identification (BPI).

Best-policy identification In BPI, an agent interacts with the MDP, observing transitions and rewards, to output an ε-optimal policy with probability at least 1−δ [Fiechter, 1994]. Most of the work on BPI assumes that the agent has access to a generative model (oracle, Kearns and Singh, 1998). Having oracle access means that the agent can simulate a transition from any state-action pair. In particular, Azar et al. [2013] show that the optimal rate of the sample complexity, defined in this case as the number n of oracle calls for getting an ε-optimal policy with probability at least 1−δ, is of order² $\widetilde{O}(H^4SA\log(1/\delta)/\varepsilon^2)$, where S is the size of the state space, A is the size of the action space, and H is the horizon (see Table 1 and also Agarwal et al., 2020, Sidford et al., 2018).

The $\widetilde{O}$ notation hides terms that are poly-log in H, S, A, ε, and log(1/δ).

Even if oracle access is reasonable in some situations (games, physics simulators, etc.), we focus on the more challenging and practical setting where the agent only has access to a forward model, meaning that the agent can only sample trajectories from some predefined initial state. In this setting, the sample complexity τ is the number of trajectories that are necessary to output an ε-optimal policy with probability at least 1−δ (which leads to n = Hτ sampled transitions).

A straightforward but indirect approach to BPI, suggested for example by Jin et al. [2018], is to run a regret-minimizing algorithm (for instance, UCBVI of Azar et al., 2017) for a sufficiently large number K of episodes, and output a policy $\widehat{\pi}$ that is chosen uniformly at random among the K policies executed by the agent. Unfortunately, this indirect approach is sub-optimal with respect to the error probability δ. Indeed, the resulting sample complexity scales with $1/\delta^2$, instead of the expected log(1/δ), as can be seen in Table 1 and Section 4. Recently, Kaufmann et al. [2020] proposed RF-UCRL, which adapts an episodic version of a UCRL-type algorithm [Jaksch et al., 2010] to best-policy identification. In essence, they replace the random choice of the predicted policy by a data-dependent choice. This algorithm enjoys the correct dependence on δ prescribed by the lower bound of Dann and Brunskill [2015], but suffers from a sub-optimal dependence on S, the size of the state space, when ε is small, as well as a sub-optimal dependence on the horizon H (Table 1).

As an answer to the above sub-optimalities, we propose BPI-UCBVI, a new algorithm with a sample complexity of $\widetilde{O}(SAH^3\log(1/\delta)/\varepsilon^2)$, which is optimal in terms of S, A, ε, and δ, according to the lower bound of Dann and Brunskill [2015]. Moreover, we believe that the dependence on H cannot be improved when the transitions are non-stationary, see the discussion below. BPI-UCBVI is based on UCBVI of Azar et al. [2017]. It relies on a non-trivial upper bound on the simple regret of a UCBVI-type algorithm (Lemma 2) that shaves the extra S factor of RF-UCRL while keeping the right dependence on δ. The main feature of this upper bound is that it can be computed in the empirical MDP and is therefore accessible to the agent.

Reward-free exploration Efficient exploration is especially difficult when the reward signals are sparse, as the agent needs to interact with the environment while receiving almost no feedback. To address such situations, we also study reward-free exploration, introduced by Jin et al. [2020], where the interaction with the environment is split into two phases: (i) an exploration phase, in which the agent learns the transition model $\widehat{p}$ of the MDP by interacting with the environment for a given number of episodes (still with a forward model); and (ii) a planning phase, in which the agent receives a reward function r and computes the optimal policy for the MDP parameterized by $(r, \widehat{p})$. Given an accuracy parameter ε, we measure the performance of the agent by the number of trajectories required to compute a policy in the planning phase that is ε-optimal for any given reward function r with probability at least 1−δ.

Our interest in RFE has two major reasons. First, in some applications, it is necessary to compute good policies for a wide range of reward functions. In such cases, RFE satisfies this need with only a single exploration phase. Second, RFE gives us good strategies for exploring the environment, especially when the reward signal is very sparse or unknown.

One approach to pure exploration is to rely on known cumulative-regret minimization methods and their guarantees. This path is taken by RF-RL-Explore of Jin et al. [2020]. More precisely, RF-RL-Explore builds upon the EULER algorithm by Zanette and Brunskill [2019] by running one instance of this algorithm for each state s and each episode step h with a reward function incentivizing the visit of state s in step h. The leading term in their sample complexity bound scales with $\widetilde{O}(S^2AH^5\log(1/\delta)/\varepsilon^2)$ for MDPs with S states, A actions, and horizon H, which is sub-optimal in H (Table 1). Kaufmann et al. [2020] propose RF-UCRL, an alternative algorithm that is reminiscent of the original algorithm proposed by Fiechter [1994] for BPI, with an improved sample complexity of $\widetilde{O}(SAH^4(\log(1/\delta)+S)/\varepsilon^2)$. The main idea behind the algorithm of Fiechter [1994] is to build upper confidence bounds on the estimation error of the value function of any policy under any reward function, and then act greedily with respect to these upper bounds to minimize the estimation error. Using a similar approach, Wang et al. [2020] study reward-free exploration with a particular linear function approximation, providing an algorithm with a sample complexity of order $d^3H^6\log(1/\delta)/\varepsilon^2$, where d is the dimension of the feature space. Finally, Zhang et al. [2020] study a setting in which there are only N possible reward functions in the planning phase, for which they provide an algorithm whose sample complexity is $\widetilde{O}(H^5SA\log(N)\log(1/\delta)/\varepsilon^2)$. In this work, we present RF-Express with a sample complexity of $\widetilde{O}(SAH^3(\log(1/\delta)+S)/\varepsilon^2)$, which improves the bound of Kaufmann et al. [2020] by a factor of H. As discussed below, we believe that RF-Express has an optimal dependence on H for general, non-stationary transitions.

A standard path to get such an improved dependence is via confidence bonuses built using an empirical Bernstein inequality [Azar et al., 2012, 2017, Zanette and Brunskill, 2019], which makes a variance term appear, and then to sharply upper-bound these variance terms with a Bellman-type equation for the variances (see Appendix E.1 or Azar et al., 2012). However, this standard path is far less clear for RFE, as the agent does not observe the rewards and therefore cannot compute the empirical variance of the values. One of our main technical contributions is to tackle this challenge by introducing a new empirical Bernstein inequality derived from a control of the transition probabilities (Appendix E.3) and applying the Bellman-type equation for the variances to construct exploration bonuses that do not require a computation of empirical variances. Surprisingly, the bonuses used in RF-Express scale with 1/n(s, a), where n(s, a) is the number of times the state-action pair (s, a) was visited, instead of the usual $1/\sqrt{n(s,a)}$ bonus.

Remark 1 (on the optimality of BPI-UCBVI & RF-Express). As summarized in Table 1, the algorithms we propose for BPI and RFE are sub-optimal only by a factor of H. However, we would like to highlight that the lower bounds in both settings are proved for stationary transitions, that is, when the transition probabilities are the same in every stage of the episode, whereas our upper bounds hold for non-stationary transitions, i.e., when these probabilities can be different in each stage. In the non-stationary setting, we might expect these lower bounds to have an extra factor of H,¹ in which case our algorithms would be optimal. Notice that in Table 1, the upper bounds are for the non-stationary case, whereas the available lower bounds are for the stationary case.

Contributions To sum up, we highlight our major contributions.

• BPI: we provide BPI-UCBVI, with a sample complexity of $\widetilde{O}(H^3SA\log(1/\delta)/\varepsilon^2)$ when ε is small enough. Up to a factor of H and poly-log terms, it matches the lower bound of Dann and Brunskill [2015] and improves the dependence either on H, 1/δ, or S with respect to previous work.

• RFE: we provide RF-Express with a sample complexity of $\widetilde{O}(H^3SA(\log(1/\delta)+S)/\varepsilon^2)$. Up to a factor of H and poly-log terms, our rate matches simultaneously the lower bound $\Omega(H^2S^2A/\varepsilon^2)$ by Jin et al. [2020], effective when δ is fixed and ε goes to zero, and the lower bound $\Omega(H^2SA\log(1/\delta)/\varepsilon^2)$ by Dann and Brunskill [2015], effective when ε is fixed and δ goes to zero.

• Due to the absence of the rewards in RFE, known techniques to get the optimal dependence on the horizon H [Azar et al., 2012, Zanette and Brunskill, 2019] do not apply. We therefore develop a new analysis that relies on the use of exploration bonuses scaling with 1/n instead of the standard $1/\sqrt{n}$.

¹ Jin et al. [2018] prove that this is indeed the case for regret lower bounds.

² Azar et al. [2012] express both the upper and the lower bounds in the total number of calls to the generative model (instead of trajectories) and prove them for the γ-discounted infinite horizon. They are of order $SA(1-\gamma)^{-3}\varepsilon^{-2}\log(1/\delta)$. We translate them to the episodic setting by replacing $(1-\gamma)^{-1}$ by the horizon H. In particular, the term $1/(1-\gamma)^3$ translates to $H^3$ and we include an extra H factor in the upper bound due to the non-stationary transitions, i.e., when the transition probabilities depend on the stage $h \in [H]$.

³ See the last paragraph of Section 4.1 for a discussion.

⁴ We combined the $H^2S^2A/\varepsilon^2$ result of Jin et al. [2020] and the $H^2SA\log(1/\delta)/\varepsilon^2$ result of Dann and Brunskill [2015].

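| Algorithm | Setting | Upper bound (non-stationary case) | Lower bound (stationary case) |
|---|---|---|---|
| EmpiricalQVI [Azar et al., 2012]² | gen. model | $\frac{H^4SA}{\varepsilon^2}\log\frac{1}{\delta}$ | $\frac{H^3SA}{\varepsilon^2}\log\frac{1}{\delta}$ |
| UCBVI + random recomm.³ | BPI | $\frac{H^3SA}{\varepsilon^2}\frac{\log(1/\delta)}{\delta^2}$ | $\frac{H^2SA}{\varepsilon^2}\log\frac{1}{\delta}$ |
| BPI-UCRL [Kaufmann et al., 2020] | BPI | $\frac{H^4SA}{\varepsilon^2}\left(\log\frac{1}{\delta}+S\right)$ | $\frac{H^2SA}{\varepsilon^2}\log\frac{1}{\delta}$ |
| BPI-UCBVI (this work) | BPI | $\frac{H^3SA}{\varepsilon^2}\log\frac{1}{\delta}$ | $\frac{H^2SA}{\varepsilon^2}\log\frac{1}{\delta}$ |
| UCBZero [Zhang et al., 2020] | RFE, N tasks | $\frac{H^5SA}{\varepsilon^2}\log(N)\log\frac{1}{\delta}$ | $\frac{H^2SA\log(N)}{\varepsilon^2}$ |
| RF-RL-Explore [Jin et al., 2020] | RFE | $\frac{H^7S^2A}{\varepsilon}\log^3\frac{1}{\delta}+\frac{H^5S^2A}{\varepsilon^2}\log\frac{1}{\delta}$ | $\frac{H^2SA}{\varepsilon^2}\left(\log\frac{1}{\delta}+S\right)$⁴ |
| RF-UCRL [Kaufmann et al., 2020] | RFE | $\frac{H^4SA}{\varepsilon^2}\left(\log\frac{1}{\delta}+S\right)$ | $\frac{H^2SA}{\varepsilon^2}\left(\log\frac{1}{\delta}+S\right)$⁴ |
| RF-Express (this work) | RFE | $\frac{H^3SA}{\varepsilon^2}\left(\log\frac{1}{\delta}+S\right)$ | $\frac{H^2SA}{\varepsilon^2}\left(\log\frac{1}{\delta}+S\right)$⁴ |

Table 1: Best-policy identification (BPI) and reward-free exploration (RFE) algorithms with their respective upper bounds on the sample complexity, expressed in terms of the number of trajectories required by the algorithms.² The lower bound listed in each row is the one available for that row's setting. The factors and terms that are poly-log in S, A, H, ε, and log(1/δ) are omitted.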

2 Setting

We consider a finite episodic MDP $(\mathcal{S}, \mathcal{A}, H, \{p_h\}_{h\in[H]}, \{r_h\}_{h\in[H]})$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, H is the number of steps in one episode, $p_h(s'|s,a)$ is the probability of a transition from state s to state s' when taking action a at step h, and $r_h(s,a) \in [0,1]$ is the bounded deterministic reward received after taking action a in state s at step h. Note that we consider the general case of rewards and transition functions that are possibly non-stationary, i.e., that are allowed to depend on the decision step h in the episode. We denote by S and A the number of states and actions, respectively.

Learning problem The agent, to which the transitions are unknown, interacts with the environment in episodes of length H, with a fixed initial state $s_1$.⁵ At each step $h \in [H]$, the agent observes a state $s_h \in \mathcal{S}$, takes an action $a_h \in \mathcal{A}$ and makes a transition to a new state $s_{h+1}$ according to the probability distribution $p_h(s_h, a_h)$. In BPI, the agent receives a deterministic reward $r_h(s_h, a_h)$ at each step h, for fixed reward functions $r \triangleq \{r_h\}_{h\in[H]}$, and it is required to output an ε-optimal policy with respect to r. In RFE, no rewards are observed during exploration, and the agent is required to output an estimate of the transition probabilities which can be used afterwards to compute an ε-optimal policy for any reward function.

Policy & value functions A deterministic policy π is a collection of functions $\pi_h : \mathcal{S} \to \mathcal{A}$ for all $h \in [H]$, where every $\pi_h$ maps each state to a single action. The value functions of π, denoted by $V_h^\pi$, are defined as

$$V_h^\pi(s) \triangleq \mathbb{E}\left[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\middle|\, s_h = s\right],$$

where $a_{h'} \triangleq \pi_{h'}(s_{h'})$ and $s_{h'+1} \sim p_{h'}(s_{h'}, a_{h'})$ for $h \in [H]$. The optimal value functions are defined as $V_h^\star(s) \triangleq \max_\pi V_h^\pi(s)$. Both $V_h^\pi$ and $V_h^\star$ satisfy the Bellman equations [Puterman, 1994], which are expressed using the Q-value functions $Q_h^\pi$ and $Q_h^\star$ in the following way,

$$V_h^\pi(s) = \pi_h Q_h^\pi(s), \quad\text{with}\quad Q_h^\pi(s,a) \triangleq r_h(s,a) + p_h V_{h+1}^\pi(s,a),$$

$$V_h^\star(s) = \max_a Q_h^\star(s,a), \quad\text{with}\quad Q_h^\star(s,a) \triangleq r_h(s,a) + p_h V_{h+1}^\star(s,a),$$

where by definition $V_{H+1}^\star \triangleq V_{H+1}^\pi \triangleq 0$. Furthermore, $p_h f(s,a) \triangleq \mathbb{E}_{s'\sim p_h(\cdot|s,a)}[f(s')]$ denotes the expectation operator with respect to the transition probabilities $p_h$, and $(\pi_h g)(s) \triangleq \pi_h g(s) \triangleq g(s, \pi_h(s))$ denotes the composition with the policy π at step h.

⁵ As explained by Fiechter [1994] and Kaufmann et al. [2020], if the first state is sampled randomly as $s_1 \sim p_0$, we can simply add an artificial first state $s_0$ such that for any action a, the transition probability is defined as the distribution $p_0(s_0, a) \triangleq p_0$.
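The Bellman equations for the optimal value functions can be solved by backward induction, starting from the zero function at step H+1. The following is a minimal sketch, assuming a tabular MDP stored as nested Python lists; the function name `optimal_values` and the argument layout `p[h][s][a][s_next]`, `r[h][s][a]` (with steps indexed from 0) are illustrative choices, not notation from the paper.

```python
def optimal_values(p, r, n_states, n_actions, horizon):
    """Backward induction for Q*_h and V*_1 in a finite episodic MDP.

    p[h][s][a] is a list of next-state probabilities (hypothetical layout),
    r[h][s][a] is the deterministic reward in [0, 1]; steps are 0-indexed.
    """
    v_next = [0.0] * n_states  # V*_{H+1} = 0 by definition
    q = [None] * horizon
    for h in reversed(range(horizon)):
        q[h] = [
            [
                # Q*_h(s, a) = r_h(s, a) + (p_h V*_{h+1})(s, a)
                r[h][s][a]
                + sum(p[h][s][a][s2] * v_next[s2] for s2 in range(n_states))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
        # V*_h(s) = max_a Q*_h(s, a)
        v_next = [max(q[h][s]) for s in range(n_states)]
    return q, v_next  # v_next now holds V*_1


# Toy MDP (made up for illustration): action a moves deterministically to state a,
# and the reward is 1 in state 1 and 0 in state 0.
step_p = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
step_r = [[0.0, 0.0], [1.0, 1.0]]
q_star, v1_star = optimal_values([step_p, step_p], [step_r, step_r], 2, 2, 2)
```

Replacing the maximum over actions by the action chosen by a fixed policy gives the value functions of that policy in the same way.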

Empirical MDP Let $(s_h^i, a_h^i, s_{h+1}^i)$ be the state, the action, and the next state observed by an algorithm at step h of episode i. For any step $h \in [H]$ and any state-action pair $(s,a) \in \mathcal{S}\times\mathcal{A}$, we let $n_h^t(s,a) \triangleq \sum_{i=1}^{t} \mathbb{1}\{(s_h^i, a_h^i) = (s,a)\}$ be the number of times the state-action pair (s, a) was visited in step h in the first t episodes and $n_h^t(s,a,s') \triangleq \sum_{i=1}^{t} \mathbb{1}\{(s_h^i, a_h^i, s_{h+1}^i) = (s,a,s')\}$. These definitions permit us to define the empirical transitions as

$$\widehat{p}_h^{\,t}(s'|s,a) \triangleq \frac{n_h^t(s,a,s')}{n_h^t(s,a)} \ \text{ if } n_h^t(s,a) > 0 \quad\text{and}\quad \widehat{p}_h^{\,t}(s'|s,a) \triangleq \frac{1}{S} \ \text{ otherwise.}$$
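The counts and the empirical transitions can be maintained with simple dictionaries. The sketch below is only an illustration of the definition above: the function name `empirical_transitions` and the episode format (a list of `(s, a, s_next)` triples, one per step, with steps indexed from 0) are assumptions made for this example.

```python
from collections import Counter


def empirical_transitions(episodes, n_states):
    """Return visit counts and a function computing the empirical transitions.

    Each episode is a list of (s, a, s_next) triples, one per step
    (hypothetical format; steps are 0-indexed).
    """
    horizon = len(episodes[0])
    n_sa = [Counter() for _ in range(horizon)]   # n_h^t(s, a)
    n_sas = [Counter() for _ in range(horizon)]  # n_h^t(s, a, s')
    for episode in episodes:
        for h, (s, a, s_next) in enumerate(episode):
            n_sa[h][(s, a)] += 1
            n_sas[h][(s, a, s_next)] += 1

    def p_hat(h, s, a):
        n = n_sa[h][(s, a)]
        if n == 0:
            # Unvisited pair: uniform distribution over the S states.
            return [1.0 / n_states] * n_states
        return [n_sas[h][(s, a, s2)] / n for s2 in range(n_states)]

    return n_sa, p_hat


# Two made-up episodes of length 2 on a 2-state, 2-action MDP.
toy_episodes = [[(0, 0, 1), (1, 0, 0)], [(0, 0, 0), (0, 1, 1)]]
toy_counts, toy_p_hat = empirical_transitions(toy_episodes, n_states=2)
```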

Based on the empirical transitions and on exploration bonuses, we introduce various data-dependent quantities that are useful for designing algorithms for either the BPI or the RFE objective. While the former allows the agent to access the reward function r during exploration, the latter does not. Therefore, in all data-dependent quantities introduced in Section 3 to design an RFE algorithm, we always make a possible dependency on r explicit. In particular, we denote by $\widehat{V}_h^{t,\pi}(s;r)$ and $\widehat{Q}_h^{t,\pi}(s,a;r)$ the value and the action-value functions of a policy π in the MDP with transition kernels $\widehat{p}^{\,t}$ and reward function r. In Section 4, in which the reward function is fixed, we drop the dependency on r and use the simpler notation $\widehat{V}_h^{t,\pi}(s)$ and $\widehat{Q}_h^{t,\pi}(s,a)$.

3 Reward-free exploration

In this section, we consider reward-free exploration (RFE), where the agent does not observe the rewards during the exploration phase. Again, as the value functions defined in Section 2 depend on a reward function r, we sometimes use the notation $V_h(s;r)$ and $Q_h(s,a;r)$ instead of $V_h(s)$ and $Q_h(s,a)$.

Reward-free exploration In RFE, the agent interacts with the MDP in the following way. At the beginning of episode t, the agent decides to follow a policy $\pi^t$, called the sampling rule, based only on the data collected up to episode t−1. Then, a reward-free episode $z^t \triangleq (s_1^t, a_1^t, s_2^t, a_2^t, \ldots, s_H^t, a_H^t)$ is generated starting from the initial state $s_1^t \triangleq s_1$ by taking actions $a_h^t = \pi_h^t(s_h^t)$ and, for h > 1, observing next states according to $s_h^t \sim p_{h-1}(s_{h-1}^t, a_{h-1}^t)$. This new trajectory is added to the dataset $\mathcal{D}_t \triangleq \mathcal{D}_{t-1} \cup \{z^t\}$. At the end of each episode, the agent can decide to stop collecting data, according to a random stopping time τ, and outputs an empirical transition kernel $\widehat{p}$ built with the dataset $\mathcal{D}_\tau$.

Any RFE agent is therefore made of a triple $((\pi^t)_{t\in\mathbb{N}^\star}, \tau, \widehat{p})$. Our goal is to design an agent that is (ε, δ)-PAC (probably approximately correct) according to the following definition, and for which the number of exploration episodes τ, i.e., the sample complexity, is as small as possible.

Definition 1 (PAC algorithm for RFE). An algorithm is (ε, δ)-PAC for reward-free exploration if

$$\mathbb{P}\Big(\text{for any reward function } r,\ \ V_1^\star(s_1;r) - V_1^{\widehat{\pi}_r^\star}(s_1;r) \le \varepsilon\Big) \ge 1-\delta,$$

where $\widehat{\pi}_r^\star$ is the optimal policy in the empirical MDP whose transitions are given by the transition kernel $\widehat{p}$ returned by the algorithm and whose reward function is r.

3.1 RF-Express

In this section, we present the RF-Express algorithm along with a high-probability bound on its sample complexity. RF-Express relies on upper bounds on the estimation error between the true value functions and their empirical counterparts. We start with the motivation for the design choices of RF-Express. We then introduce the quantities that encode these choices and which we subsequently use in the definition of the algorithm. In the algorithmic template, we proceed as Fiechter [1994] and Kaufmann et al. [2020] by upper bounding the estimation error for all policies, with the striking difference that we only upper bound it at the initial state. We finish this part by providing more intuition and discussion. In particular, we provide technical insights into what RF-Express is optimizing and then explain our 1/n versus $1/\sqrt{n}$ exploration bonuses, the reasons for choosing them, and the challenge of analyzing them.

Estimation error Given a policy π and an arbitrary reward function r, we define the estimation error as the absolute difference between the Q-value of π in the empirical MDP and its Q-value in the true MDP. Precisely, after episode t, for all (s, a, h), we define

$$\widehat{e}_h^{\,t,\pi}(s,a;r) \triangleq \big|\widehat{Q}_h^{t,\pi}(s,a;r) - Q_h^\pi(s,a;r)\big|.$$

To control the approximation error of the value of any policy for any reward function starting from the initial state $s_1$, we introduce the functions $W_h^t(s,a)$ defined inductively by $W_{H+1}^t(s,a) \triangleq 0$ for all $(s,a) \in \mathcal{S}\times\mathcal{A}$ and, for all $h \in [H]$ and $(s,a) \in \mathcal{S}\times\mathcal{A}$,

$$W_h^t(s,a) \triangleq \min\left(H,\ 9H^2\frac{\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)} + \Big(1+\frac{1}{H}\Big)\sum_{s'}\widehat{p}_h^{\,t}(s'|s,a)\max_{a'}W_{h+1}^t(s',a')\right), \tag{1}$$

where $\beta(n_h^t(s,a),\delta)$ is a threshold that depends on how we build the confidence sets for the transition probabilities. Notice that the $W_h^t$ are all independent of the reward function r. As shown in the next lemma, with proof in Appendix C, the function $W_1^t(s_1,a)$ can be used to upper bound the estimation error of any policy under any reward function in the initial state $s_1$.
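The recursion (1) can be evaluated by a single backward pass over the steps. The sketch below is a minimal illustration under stated assumptions: the names `beta` and `compute_w` and the nested-list layout `counts[h][s][a]`, `p_hat[h][s][a][s_next]` (steps indexed from 0) are hypothetical; the threshold uses the expression β(n, δ) = log(3SAH/δ) + S log(8e(n+1)) given in Theorem 1 of the paper; and an unvisited pair is assigned the clipped value H, which is how the minimum in (1) is read here when the bonus is infinite.

```python
import math


def beta(n, delta, S, A, H):
    """Threshold beta(n, delta) = log(3SAH/delta) + S log(8e(n+1)) from Theorem 1."""
    return math.log(3 * S * A * H / delta) + S * math.log(8 * math.e * (n + 1))


def compute_w(counts, p_hat, delta, S, A, H):
    """Evaluate W_h^t(s, a) from recursion (1) by backward induction.

    counts[h][s][a] plays the role of n_h^t(s, a) and p_hat[h][s][a] is the
    list of empirical next-state probabilities (hypothetical layout).
    """
    w_next_max = [0.0] * S  # max_a W_{H+1}^t(s, a) = 0
    w = [None] * H
    for h in reversed(range(H)):
        w[h] = [[0.0] * A for _ in range(S)]
        for s in range(S):
            for a in range(A):
                n = counts[h][s][a]
                if n == 0:
                    # Assumption: infinite bonus, so the minimum equals H.
                    w[h][s][a] = float(H)
                    continue
                bonus = 9 * H**2 * beta(n, delta, S, A, H) / n
                backup = (1 + 1 / H) * sum(
                    p_hat[h][s][a][s2] * w_next_max[s2] for s2 in range(S)
                )
                w[h][s][a] = min(float(H), bonus + backup)
        w_next_max = [max(w[h][s]) for s in range(S)]
    return w


# Single-state, single-action example with H = 2 (values made up for illustration).
trivial_p = [[[[1.0]]], [[[1.0]]]]
w_unvisited = compute_w([[[0]], [[0]]], trivial_p, 0.1, 1, 1, 2)
w_visited = compute_w([[[10**9]], [[10**9]]], trivial_p, 0.1, 1, 1, 2)
```

No reward appears anywhere in the computation, in line with the remark that the functions are independent of the reward function.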

Lemma 1. With probability at least 1−δ, for any episode t, policy π, and reward function r,

$$\widehat{e}_1^{\,t,\pi}(s_1,\pi_1(s_1);r) \le 2e\sqrt{\max_{a\in\mathcal{A}}W_1^t(s_1,a)} + \max_{a\in\mathcal{A}}W_1^t(s_1,a).$$

In particular, the above holds on the event $\mathcal{F}$ defined in Appendix A.

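With all the above definitions, we are now ready to outline our RF-Express algorithm.

• sampling rule: the policy $\pi^{t+1}$ is the greedy policy with respect to $W_h^t$,
$$\forall s\in\mathcal{S},\ \forall h\in[H],\quad \pi_h^{t+1}(s) = \arg\max_{a\in\mathcal{A}} W_h^t(s,a)$$

• stopping rule: $\tau = \inf\Big\{t\in\mathbb{N} : 2e\sqrt{\pi_1^{t+1}W_1^t(s_1)} + \pi_1^{t+1}W_1^t(s_1) \le \varepsilon/2\Big\}$

• prediction rule: output the empirical transition kernel $\widehat{p} = \widehat{p}^{\,\tau}$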
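Given the functions W of the current episode, the sampling and stopping rules reduce to an argmax and a threshold test. The sketch below illustrates only these two rules; the names `greedy_policy` and `should_stop` and the nested-list layout `w[h][s][a]` (steps indexed from 0) are assumptions made for the example, and ties in the argmax are broken in favor of the smallest action index, a choice the text does not specify.

```python
import math


def greedy_policy(w):
    """Sampling rule: pi_h(s) = argmax_a W_h(s, a), for w[h][s][a] nested lists."""
    return [
        [max(range(len(w_hs)), key=w_hs.__getitem__) for w_hs in w_h]
        for w_h in w
    ]


def should_stop(w, policy, s1, eps):
    """Stopping rule: 2e * sqrt(pi_1 W_1(s_1)) + pi_1 W_1(s_1) <= eps / 2."""
    w1 = w[0][s1][policy[0][s1]]  # (pi_1 W_1)(s_1) = W_1(s_1, pi_1(s_1))
    return 2 * math.e * math.sqrt(w1) + w1 <= eps / 2


# Made-up values of W at the first step, for illustration only.
w_early = [[[0.5, 0.2], [0.1, 0.3]]]
w_late = [[[1e-6, 1e-7]]]
policy_early = greedy_policy(w_early)
policy_late = greedy_policy(w_late)
```

Because of the square root, the stopping statistic is dominated by the first term once W is small, so W at the initial state has to fall to the order of ε² before the rule triggers.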

Next, we provide a bound on the sample complexity of RF-Express with a proof in Appendix C.

Theorem 1. For $\delta \in (0,1)$, $\varepsilon \in (0,1]$, RF-Express with threshold $\beta(n,\delta) \triangleq \log(3SAH/\delta) + S\log(8e(n+1))$ is (ε, δ)-PAC for reward-free exploration. Moreover, RF-Express stops after τ steps where, with probability at least 1−δ,

$$\tau \le \frac{H^3SA}{\varepsilon^2}\big(\log(3SAH/\delta)+S\big)\,C_1 + 1$$

and where $C_1 \triangleq 1650\,e^4\log\big(e^{18}(\log(3SAH/\delta)+S)H^3SA/\varepsilon\big)$.

As a consequence, the sample complexity of RF-Express is of order $\widetilde{O}(H^3SA(\log(1/\delta)+S)/\varepsilon^2)$ and matches the lower bound of $\Omega(H^2S^2A/\varepsilon^2)$ of Jin et al. [2020] up to a factor of H and poly-log terms. This lower bound is informative in the regime where δ is considered as fixed and ε tends to zero. Moreover, up to a factor H, our result also matches the lower bound of $\Omega(H^2SA\log(1/\delta)/\varepsilon^2)$ given by Dann and Brunskill [2015], which is informative in the regime where ε is fixed and δ tends to 0. In Remark 1, we discuss the optimality of RF-Express with respect to H. As we see in the next section, the quadratic dependence on S can be avoided for BPI.
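To get a feeling for the magnitude of the bound of Theorem 1, it can be evaluated numerically. The sketch below simply transcribes the expression for the bound on τ and for the constant C1 as stated above; the function name `rf_express_bound` and the example values of S, A, H, ε, and δ are arbitrary choices made for illustration.

```python
import math


def rf_express_bound(S, A, H, eps, delta):
    """Evaluate the upper bound on tau from Theorem 1 (as transcribed above)."""
    base = math.log(3 * S * A * H / delta) + S
    # log(e^18 * x) is computed as 18 + log(x).
    c1 = 1650 * math.e**4 * (18 + math.log(base * H**3 * S * A / eps))
    return H**3 * S * A / eps**2 * base * c1 + 1


# Arbitrary example values, chosen only to show how the bound scales with H.
bound_h2 = rf_express_bound(S=2, A=2, H=2, eps=0.5, delta=0.1)
bound_h4 = rf_express_bound(S=2, A=2, H=4, eps=0.5, delta=0.1)
```

Doubling the horizon multiplies the bound by slightly more than 8, which reflects the cubic dependence on H together with the logarithmic terms.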

What is RF-Express optimizing? Contrary to RF-UCRL of Kaufmann et al. [2020], RF-Express does not build upper bounds on all estimation errors $\widehat{e}_h^{\,t,\pi}(s,a;r)$ for all $h \in [H]$ but only for the one at the initial state, $\widehat{e}_1^{\,t,\pi}(s_1,\pi_1(s_1);r)$. Moreover, the upper bound is not $W_1^t(s_1,a)$ itself, but a function of this quantity, as can be seen in Lemma 1. Hence, if RF-Express actually follows the optimism-in-the-face-of-uncertainty principle, what quantities are the $W_h^t$ upper bounding? To answer this question and provide an intuition on the sampling rule of RF-Express, fix a policy π and let $P^\pi$ be the probability distribution governing a random trajectory $(s_1, a_1, s_2, a_2, \ldots, s_H, a_H) \sim P^\pi$ in the MDP. Next, let $\widehat{P}^{t,\pi}$ be the probability distribution of a trajectory $(s_1^t, a_1^t, s_2^t, a_2^t, \ldots, s_H^t, a_H^t) \sim \widehat{P}^{t,\pi}$ in the empirical MDP built using the dataset $\mathcal{D}_t$ at episode t. Assuming that all the state-action pairs have been visited at least once at time t, using the chain rule (see Garivier et al., 2019) we can compute the Kullback-Leibler divergence between these two probability distributions as

$$\mathrm{KL}(\widehat{P}^{t,\pi}, P^\pi) = \sum_{h=1}^{H}\sum_{s,a}\widehat{p}_h^{\,t,\pi}(s,a)\,\mathrm{KL}\big(\widehat{p}_h^{\,t}(s,a), p_h(s,a)\big),$$

where $\widehat{p}_h^{\,t,\pi}(s,a)$ is the probability to reach state-action (s, a) at step h under policy π in the empirical MDP in episode t. Notice now that the bonus of the form β(n, δ)/n used to define $W^t$ is chosen by design to be an upper-confidence bound on the Kullback-Leibler divergence between the empirical transition probability and the transition probability. Indeed, in Appendix A we show that with high probability, for all $(s,a) \in \mathcal{S}\times\mathcal{A}$ and $h \in [H]$,

$$\mathrm{KL}\big(\widehat{p}_h^{\,t}(s,a), p_h(s,a)\big) \le \frac{\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)}.$$

Therefore, omitting the clipping to H in (1), we have that

$$\max_\pi \mathrm{KL}(\widehat{P}^{t,\pi}, P^\pi) \lesssim H^2\,\pi_1^{t+1}W_1^t(s_1).$$

Therefore, RF-Express can be interpreted as an algorithm minimizing an upper-confidence bound on the loss $\max_\pi \mathrm{KL}(\widehat{P}^{t,\pi}, P^\pi)$, which requires bonuses of the form β(n, δ)/n instead of $\sqrt{\beta(n,\delta)/n}$. Notice that this loss is of the same flavor as the one introduced by Hazan et al. [2019].

Bonuses of 1/n versus $1/\sqrt{n}$ Our approach differs from the bonuses typically used in regret minimization (e.g., Azar et al., 2017) and in prior work on reward-free exploration [Kaufmann et al., 2020, Zhang et al., 2020], which use bonuses proportional to $\sqrt{1/n(s,a)}$. Intuitively, since 1/n-bonuses decay faster with n, our algorithm is more exploratory: once a state-action pair (s, a) has been visited, the bonus associated with it is reduced more strongly than if we used $\sqrt{1/n}$-bonuses, and the algorithm tends to visit other state-action pairs before returning to (s, a) again.

Technically, this might seem very surprising. Indeed, if we want to estimate the mean µ of a random variable X with an estimator $\widehat{\mu}_n$ computed with n i.i.d. samples from X, the error $|\mu-\widehat{\mu}_n|$ scales with $\sqrt{1/n}$ by Hoeffding's inequality, which explains the shape of the bonuses used in previous works. However, instead of bounding the error $|\mu-\widehat{\mu}_n|$, our concentration inequalities based on the KL divergence give us a bound on the quadratic term $(\mu-\widehat{\mu}_n)^2$, which scales with 1/n. This allows us to use a Bellman-type equation for the variance of the value functions and reduce the sample complexity by a factor of H, similarly to previous work on regret minimization [Azar et al., 2017]. The main challenge in our case is that, in reward-free exploration, we need to upper bound the sum of variances for any possible value function, which makes this technique considerably more challenging to analyze than for regret minimization.
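The difference between the two scalings can be checked exactly in the simplest case of a Bernoulli mean. The sketch below is an illustration chosen here, not an experiment from the paper: the function name `error_moments` is hypothetical, and the expected absolute and squared errors of the empirical mean are computed by summing over the binomial distribution of the number of successes.

```python
import math


def error_moments(p, n):
    """Exact E|mu_hat - mu| and E(mu_hat - mu)^2 for the mean of n Bernoulli(p) samples."""
    abs_err = 0.0
    sq_err = 0.0
    for k in range(n + 1):
        prob = math.comb(n, k) * p**k * (1 - p) ** (n - k)  # P(k successes)
        diff = k / n - p
        abs_err += prob * abs(diff)
        sq_err += prob * diff**2
    return abs_err, sq_err


abs_100, sq_100 = error_moments(0.5, 100)
abs_400, sq_400 = error_moments(0.5, 400)
```

Multiplying the number of samples by 4 divides the expected squared error by 4 (it equals p(1−p)/n exactly) but divides the expected absolute error only by about 2, which is the gap between the 1/n and the $1/\sqrt{n}$ bonuses.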

3.2 Proof sketch

We first sketch the proof of Lemma 1. We begin as in the analysis of Kaufmann et al. [2020]. For a fixed policy π and an arbitrary reward function r, we decompose the estimation error of the Q-value function of π at the state-action pair (s, a) as

$$\widehat{e}_h^{\,t,\pi}(s,a;r) \le \big|\widehat{Q}_h^{t,\pi}(s,a;r) - Q_h^\pi(s,a;r)\big| \le \big|(\widehat{p}_h^{\,t}-p_h)V_{h+1}^\pi(s,a)\big| + \widehat{p}_h^{\,t}\big|\widehat{V}_{h+1}^{t,\pi}-V_{h+1}^\pi\big|(s,a) = \big|(\widehat{p}_h^{\,t}-p_h)V_{h+1}^\pi(s,a)\big| + \widehat{p}_h^{\,t}\pi_{h+1}\widehat{e}_{h+1}^{\,t,\pi}(s,a;r).$$

Similarly to Azar et al. [2017] and Zanette and Brunskill [2019], to obtain the optimal dependency with respect to the horizon H, we would like to apply the Bernstein inequality to control the first term. Since we need to do it for all value functions of all policies, we could use a covering of this function space and conclude with a union bound, see Domingues et al. [2020]. Instead we show, via Lemma 10 in Appendix E.3, that from a control of the deviations of the empirical transition probabilities such that

$$\mathrm{KL}\big(\widehat{p}_h^{\,t}(s,a), p_h(s,a)\big) \le \frac{\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)}$$

with high probability, we deduce an empirical Bernstein inequality,

$$\widehat{e}_h^{\,t,\pi}(s,a;r) \le 2\sqrt{\frac{\mathrm{Var}_{\widehat{p}_h^{\,t}}(\widehat{V}_{h+1}^{t,\pi})(s,a;r)}{H^2}\left(\frac{H^2\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)}\wedge 1\right)} + 9H^2\frac{\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)} + \Big(1+\frac{1}{H}\Big)\widehat{p}_h^{\,t}\pi_{h+1}\widehat{e}_{h+1}^{\,t,\pi}(s,a;r),$$

where the variance of $\widehat{V}_{h+1}^{t,\pi}$ with respect to $\widehat{p}_h^{\,t}(\cdot|s,a)$ is defined as

$$\mathrm{Var}_{\widehat{p}_h^{\,t}}(\widehat{V}_{h+1}^{t,\pi})(s,a;r) = \sum_{s'}\widehat{p}_h^{\,t}(s'|s,a)\Big(\widehat{V}_{h+1}^{t,\pi}(s';r) - \mathbb{E}_{z\sim\widehat{p}_h^{\,t}(\cdot|s,a)}\big[\widehat{V}_{h+1}^{t,\pi}(z;r)\big]\Big)^2.$$

Therefore, defining $Z_{H+1}^{t,\pi}(s,a;r) \triangleq 0$ and recursively the functions

$$Z_h^{t,\pi}(s,a;r) \triangleq \min\left(H,\ 2\sqrt{\frac{\mathrm{Var}_{\widehat{p}_h^{\,t}}(\widehat{V}_{h+1}^{t,\pi})(s,a;r)}{H^2}\left(\frac{H^2\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)}\wedge 1\right)} + 9H^2\frac{\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)} + \Big(1+\frac{1}{H}\Big)\widehat{p}_h^{\,t}\pi_{h+1}Z_{h+1}^{t,\pi}(s,a;r)\right),$$

we prove by induction that for all (s, a, h),

$$\widehat{e}_h^{\,t,\pi}(s,a;r) \le Z_h^{t,\pi}(s,a;r). \tag{2}$$

We now split $Z^{t,\pi}$ into two terms. The first term is the one with the bonus in $\sqrt{1/n}$ and the second one is the one with the bonus in 1/n. Precisely, for all (s, a), we define recursively two other quantities, $Y_{H+1}^{t,\pi}(s,a;r) \triangleq W_{H+1}^{t,\pi}(s,a) \triangleq 0$ and

$$Y_h^{t,\pi}(s,a;r) \triangleq 2\sqrt{\frac{\mathrm{Var}_{\widehat{p}_h^{\,t}}(\widehat{V}_{h+1}^{t,\pi})(s,a;r)}{H^2}\left(\frac{H^2\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)}\wedge 1\right)} + \Big(1+\frac{1}{H}\Big)\widehat{p}_h^{\,t}\pi_{h+1}Y_{h+1}^{t,\pi}(s,a;r),$$

$$W_h^{t,\pi}(s,a) \triangleq \min\left(H,\ 9H^2\frac{\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)} + \Big(1+\frac{1}{H}\Big)\widehat{p}_h^{\,t}\pi_{h+1}W_{h+1}^{t,\pi}(s,a)\right).$$

We then prove by induction (Appendix C, Step 2 of the proof of Lemma 1) that for all h, s, a,

$$Z_h^{t,\pi}(s,a;r) \le Y_h^{t,\pi}(s,a;r) + W_h^{t,\pi}(s,a). \tag{3}$$

Note that although $Z_h^{t,\pi}(\cdot;r)$ is a high-probability upper bound on $\widehat{e}_h^{\,t,\pi}(\cdot;r)$, we cannot use it to build a sampling rule reducing the errors, as it still depends on the reward function r through the empirical variance term, and this knowledge is only available in the planning phase. To obtain an upper bound on $Z_h^{t,\pi}(\cdot;r)$ which does not depend on r, we now further upper-bound $Y^{t,\pi}(\cdot;r)$. The key tool for this purpose is the Bellman equation for the variances, see Appendix E.1.

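We denote by $\widehat{p}_h^{\,t,\pi}(s,a)$ the probability of reaching the state-action pair (s, a) at step h under the policy π in the empirical MDP at time t. Using the Cauchy-Schwarz inequality, Lemma 7 in Appendix E.1, and the fact that the variance of the sum of rewards is upper bounded by $H^2$, we get

$$\pi_1Y_1^{t,\pi}(s_1;r) = 2\sum_{s,a}\sum_{h=1}^{H}\widehat{p}_h^{\,t,\pi}(s,a)\Big(1+\frac{1}{H}\Big)^{h-1}\sqrt{\frac{\mathrm{Var}_{\widehat{p}_h^{\,t}}(\widehat{V}_{h+1}^{t,\pi})(s,a;r)}{H^2}\left(\frac{H^2\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)}\wedge 1\right)}$$

$$\le 2e\sqrt{\sum_{s,a}\sum_{h=1}^{H}\widehat{p}_h^{\,t,\pi}(s,a)\frac{\mathrm{Var}_{\widehat{p}_h^{\,t}}(\widehat{V}_{h+1}^{t,\pi})(s,a;r)}{H^2}}\ \sqrt{\sum_{s,a}\sum_{h=1}^{H}\widehat{p}_h^{\,t,\pi}(s,a)\left(\frac{H^2\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)}\wedge 1\right)}$$

$$\le 2e\sqrt{\frac{1}{H^2}\mathbb{E}_{\pi,\widehat{p}^{\,t}}\left[\Big(\sum_{h=1}^{H}r_h(s_h,a_h) - \widehat{V}_1^{t,\pi}(s_1;r)\Big)^2\right]}\ \sqrt{\sum_{s,a}\sum_{h=1}^{H}\widehat{p}_h^{\,t,\pi}(s,a)\left(\frac{H^2\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)}\wedge 1\right)}$$

$$\le 2e\sqrt{\sum_{s,a}\sum_{h=1}^{H}\widehat{p}_h^{\,t,\pi}(s,a)\left(\frac{H^2\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)}\wedge 1\right)} \le 2e\sqrt{\pi_1W_1^{t,\pi}(s_1)},$$

where the last inequality is proved in Appendix C (Step 3 of the proof of Lemma 1).

Combining the inequality $\pi_1Y_1^{t,\pi}(s_1;r) \le 2e\sqrt{\pi_1W_1^{t,\pi}(s_1)}$ with (2) and (3) yields that, for all π,

$$\widehat{e}_1^{\,t,\pi}(s_1,\pi_1(s_1);r) \le 2e\sqrt{\pi_1W_1^{t,\pi}(s_1)} + \pi_1W_1^{t,\pi}(s_1).$$

Finally, we note that by construction, $\pi_1W_1^{t,\pi}(s_1) \le \max_{a\in\mathcal{A}}W_1^t(s_1,a)$, which allows us to conclude the proof of Lemma 1.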

Next, we sketch the proof of Theorem 1. The fact that RF-Express is $(\varepsilon,\delta)$-PAC is a simple consequence of Lemma 1. Indeed, on an event of probability at least $1-\delta$, if the algorithm stops at time $\tau$, we know that for every policy $\pi$ and every reward function $r$,

$$\frac{\varepsilon}{2}\ge 2e\sqrt{\max_{a\in\mathcal{A}}W_1^{\tau}(s_1,a)}+\max_{a\in\mathcal{A}}W_1^{\tau}(s_1,a)\ge\hat e_1^{\tau,\pi}(s_1,\pi_1(s_1);r)=\left|\hat V_1^{\tau,\pi}(s_1;r)-V_1^{\pi}(s_1;r)\right|.$$

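Therefore, still on the same event, it holds that

$$\begin{aligned}
V_1^{\star}(s_1;r)-V_1^{\hat\pi_r^{\star}}(s_1;r) &= V_1^{\star}(s_1;r)-\hat V_1^{\tau,\pi_r^{\star}}(s_1;r)+\hat V_1^{\tau,\pi_r^{\star}}(s_1;r)-\hat V_1^{\tau,\hat\pi_r^{\star}}(s_1;r)+\hat V_1^{\tau,\hat\pi_r^{\star}}(s_1;r)-V_1^{\hat\pi_r^{\star}}(s_1;r)\\
&\le\left|V_1^{\pi_r^{\star}}(s_1;r)-\hat V_1^{\tau,\pi_r^{\star}}(s_1;r)\right|+\left|\hat V_1^{\tau,\hat\pi_r^{\star}}(s_1;r)-V_1^{\hat\pi_r^{\star}}(s_1;r)\right|\le\varepsilon,
\end{aligned}$$

where the middle difference is dropped because it is nonpositive when $\hat\pi_r^{\star}$ maximizes the empirical value $\hat V_1^{\tau,\cdot}(s_1;r)$.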

The proof of the bound on the sample complexity is close to that of a regret bound. We fix a time $t<\tau$ and start by proving an upper bound on $W_1^{t}(s_1,\pi^{t+1}(s_1))$. To this end, using again the empirical Bernstein inequality of Lemma 10, with high probability, it holds that

$$W_h^{t}(s,a)\le 14H^2\left(\frac{\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)}\wedge 1\right)+\left(1+\frac{3}{H}\right)p_h\pi_{h+1}^{t+1}W_{h+1}^{t}(s,a).$$

We denote by $p_h^{t}(s,a)$ the probability of reaching the state-action pair $(s,a)$ at step $h$ under policy $\pi^{t}$ in the true MDP. Unfolding the previous inequality and switching to the pseudo-counts, defined by $\bar n_h^{t}(s,a)\triangleq\sum_{\ell=1}^{t}p_h^{\ell}(s,a)$, we get by Lemma 8, proved in Appendix E.2,

$$\pi_1^{t+1}W_1^{t}(s_1)\le 48e^3H^2\sum_{h=1}^{H}\sum_{s,a}p_h^{t+1}(s,a)\frac{\beta(\bar n_h^{t}(s,a),\delta)}{\bar n_h^{t}(s,a)\vee 1}. \tag{4}$$

Since $t<\tau$, we know from the stopping rule that

$$\varepsilon\le 2e\sqrt{\pi_1^{t+1}W_1^{t}(s_1)}+\pi_1^{t+1}W_1^{t}(s_1).$$

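Summing the previous inequalities for $0\le t<\tau$ and then using the Cauchy-Schwarz inequality, we obtain

$$\tau\varepsilon\le\sum_{t=0}^{\tau-1}\left(2e\sqrt{\pi_1^{t+1}W_1^{t}(s_1)}+\pi_1^{t+1}W_1^{t}(s_1)\right)\le 2e\sqrt{\tau\sum_{t=0}^{\tau-1}\pi_1^{t+1}W_1^{t}(s_1)}+\sum_{t=0}^{\tau-1}\pi_1^{t+1}W_1^{t}(s_1).$$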

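We now upper-bound the sum that appears on the right-hand side. Using successively (4), the fact that $\beta(\cdot,\delta)$ is increasing, and Lemma 9 of Appendix E.2, we have

$$\begin{aligned}
\sum_{t=0}^{\tau-1}\pi_1^{t+1}W_1^{t}(s_1) &\le 48e^3H^2\sum_{t=0}^{\tau-1}\sum_{h=1}^{H}\sum_{s,a}p_h^{t+1}(s,a)\frac{\beta(\bar n_h^{t}(s,a),\delta)}{\bar n_h^{t}(s,a)\vee 1}\\
&\le 48e^3H^2\beta(\tau-1,\delta)\sum_{h=1}^{H}\sum_{s,a}\sum_{t=0}^{\tau-1}\frac{\bar n_h^{t+1}(s,a)-\bar n_h^{t}(s,a)}{\bar n_h^{t}(s,a)\vee 1}\\
&\le 192e^3H^3SA\log(\tau+1)\beta(\tau-1,\delta).
\end{aligned}$$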
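The last step above controls a sum of pseudo-count increments divided by the running pseudo-count. The Python sketch below checks a bound of that type numerically: for increments $a_t\in[0,1]$ with running sums $A_t$, it compares $\sum_t a_{t+1}/(A_t\vee 1)$ with $4\log(A_T+1)$. The constant 4 is an assumption inferred from the ratio $192/48$ in the display above, not a restatement of Lemma 9, and `increment_sum` is a hypothetical helper name.

```python
import math
import random


def increment_sum(increments):
    """Return (sum_t a_{t+1} / max(A_t, 1), A_T), where A_t = a_1 + ... + a_t and A_0 = 0."""
    total, running = 0.0, 0.0
    for a in increments:
        total += a / max(running, 1.0)
        running += a
    return total, running


rng = random.Random(0)
checks = []
for _ in range(200):
    # Increments in [0, 1], like the reaching probabilities p_h^{t+1}(s, a).
    incs = [rng.random() for _ in range(rng.randint(1, 500))]
    lhs, final = increment_sum(incs)
    checks.append(lhs <= 4.0 * math.log(final + 1.0))
all_hold = all(checks)
```

Since each pseudo-count is at most $\tau$ after $\tau$ episodes, a bound of this form yields a $\log(\tau+1)$ factor for each of the $HSA$ triples $(h,s,a)$.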

Therefore, combining the above inequality with the previous one, we get

$$\tau\varepsilon\le 30e^3\sqrt{\tau H^3SA\log(\tau+1)\beta(\tau-1,\delta)}+192e^3H^3SA\log(\tau+1)\beta(\tau-1,\delta).$$

Using Lemma 12, we invert the inequality above and obtain an upper bound on $\tau$, which allows us to conclude the proof of the theorem.
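To get a feel for what inverting this inequality means, one can locate numerically a point beyond which it stops holding. The Python sketch below is not Lemma 12: it is a doubling-then-bisection search, and the threshold `beta` is an assumption borrowed from the form $\log(3SAH/\delta)+S\log(8e(n+1))$ stated in Theorem 2. The names `slack` and `crossing_point` are hypothetical.

```python
import math


def beta(n, delta, S, A, H):
    # Assumed threshold shape: log(3SAH/delta) + S * log(8e(n + 1)).
    return math.log(3 * S * A * H / delta) + S * math.log(8 * math.e * (n + 1))


def slack(tau, eps, delta, S, A, H):
    """Right-hand side minus left-hand side of the inequality, for an integer tau >= 1."""
    x = H**3 * S * A * math.log(tau + 1) * beta(tau - 1, delta, S, A, H)
    return 30 * math.e**3 * math.sqrt(tau * x) + 192 * math.e**3 * x - tau * eps


def crossing_point(eps, delta, S, A, H):
    """Find tau with slack(tau) >= 0 > slack(tau + 1) by doubling, then bisection."""
    lo, hi = 1, 2
    if slack(lo, eps, delta, S, A, H) < 0:
        raise ValueError("the inequality already fails at tau = 1")
    while slack(hi, eps, delta, S, A, H) >= 0:
        lo, hi = hi, 2 * hi
    # Invariant: the inequality holds at lo and fails at hi.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if slack(mid, eps, delta, S, A, H) >= 0:
            lo = mid
        else:
            hi = mid
    return lo


tau_cross = crossing_point(eps=0.5, delta=0.1, S=2, A=2, H=2)
```

The search returns one crossing point; it coincides with the largest $\tau$ compatible with the inequality only if the inequality never holds again for larger $\tau$, which the code does not verify.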

4 Best-policy identification

Unlike in the previous section, we now consider a more standard setup in which there is a single reward function $r$ and in which the agent observes the reward at each step during the exploration phase. To ease the presentation, we drop the dependence on the reward $r$ in all data-dependent quantities introduced in this section.

Best-policy identification In BPI, the agent interacts with the MDP in the way described in Section 2. Notice that the difference from Section 3 is that the agent also observes the reward.

In each episode $t$, the agent follows a policy $\pi^{t}$ (the sampling rule) based only on the information collected up to and including episode $t-1$. At the end of each episode, the agent can decide to stop collecting data (we denote by $\tau$ its random stopping time) and outputs a guess $\hat\pi$ for the optimal policy.

A BPI algorithm is therefore made of a triple $((\pi^{t})_{t\in\mathbb{N}},\tau,\hat\pi)$. The goal is to build an $(\varepsilon,\delta)$-PAC algorithm according to the following definition, for which the sample complexity, that is the number of exploration episodes $\tau$, is as small as possible.

Definition 2 (PAC algorithm for BPI). An algorithm is $(\varepsilon,\delta)$-PAC for best policy identification if it returns a policy $\hat\pi$ after some number of episodes $\tau$ that satisfies

$$\mathbb{P}\left(V_1^{\star}(s_1)-V_1^{\hat\pi}(s_1)\le\varepsilon\right)\ge 1-\delta.$$

4.1 BPI-UCBVI

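Similarly to Azar et al. [2017] and Zanette and Brunskill [2019], we define upper confidence bounds on the optimal Q-value and value functions as

$$\begin{aligned}
\tilde Q_h^{t}(s,a) &\triangleq \min\Bigg(H,\; r_h(s,a)+2\sqrt{\mathrm{Var}_{\hat p_h^{t}}(\tilde V_{h+1}^{t})(s,a)\,\frac{\beta^{\star}(n_h^{t}(s,a),\delta)}{n_h^{t}(s,a)}}+10H^2\frac{\beta(n_h^{t}(s,a),\delta)}{n_h^{t}(s,a)}\\
&\qquad\qquad+\frac{1}{H}\,\hat p_h^{t}\big(\tilde V_{h+1}^{t}-\underaccent{\tilde}{V}_{h+1}^{t}\big)(s,a)+\hat p_h^{t}\tilde V_{h+1}^{t}(s,a)\Bigg),\\
\tilde V_h^{t}(s) &\triangleq \max_{a\in\mathcal{A}}\tilde Q_h^{t}(s,a),\qquad \tilde V_{H+1}^{t}(s)\triangleq 0,
\end{aligned}$$

where $\beta^{\star}$ is some exploration rate (that does not scale with the number of states $S$) and $\underaccent{\tilde}{V}^{t}$ is a lower confidence bound on the optimal value function; see Appendix B for a complete definition.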
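The backward induction behind these upper confidence bounds can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the correction term is read as $\frac{1}{H}\hat p_h^{t}(\tilde V_{h+1}^{t}-\text{lower bound})$, the lower bound `v_low` is passed in by the caller because its definition lives in Appendix B, unvisited pairs ($n=0$) are assigned the clipped value $H$, and the thresholds use the forms stated in Theorem 2. All function and variable names are hypothetical.

```python
import math


def optimistic_values(r, p_hat, n, v_low, H, beta, beta_star):
    """Backward induction for upper confidence bounds on Q and V.

    r[h][s][a], n[h][s][a], p_hat[h][s][a][y] use 0-based steps h = 0..H-1.
    v_low[h][s], h = 0..H, is a lower bound on the optimal value (caller-supplied).
    """
    S, A = len(r[0]), len(r[0][0])
    v_up = [[0.0] * S for _ in range(H + 1)]  # v_up[H] is the terminal value 0
    q_up = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    for h in reversed(range(H)):
        nxt = v_up[h + 1]
        for s in range(S):
            for a in range(A):
                count = n[h][s][a]
                if count == 0:
                    q_up[h][s][a] = float(H)  # infinite bonus, clipped at H
                    continue
                p = p_hat[h][s][a]
                mean = sum(p[y] * nxt[y] for y in range(S))
                var = max(sum(p[y] * nxt[y] ** 2 for y in range(S)) - mean**2, 0.0)
                gap = sum(p[y] * (nxt[y] - v_low[h + 1][y]) for y in range(S))
                bonus = (2 * math.sqrt(var * beta_star(count) / count)
                         + 10 * H**2 * beta(count) / count)
                q_up[h][s][a] = min(float(H), r[h][s][a] + bonus + gap / H + mean)
            v_up[h][s] = max(q_up[h][s])
    return q_up, v_up


# Toy instance (all numbers are made up): action 0 is well explored, action 1 never tried.
H, S, A, delta = 2, 2, 2, 0.1
beta = lambda k: math.log(3 * S * A * H / delta) + S * math.log(8 * math.e * (k + 1))
beta_star = lambda k: math.log(3 * S * A * H / delta) + math.log(8 * math.e * (k + 1))
r = [[[0.5, 0.2] for _ in range(S)] for _ in range(H)]
p_hat = [[[[0.5, 0.5] for _ in range(A)] for _ in range(S)] for _ in range(H)]
n = [[[10**7, 0] for _ in range(S)] for _ in range(H)]
v_low = [[0.0] * S for _ in range(H + 1)]
q_up, v_up = optimistic_values(r, p_hat, n, v_low, H, beta, beta_star)
```

In the toy instance the untried action keeps the maximal optimistic value $H$, so a greedy policy with respect to `q_up` would try it next.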

As in RFE, we need to build an upper confidence bound on the gap $V_1^{\star}(s_1)-V_1^{\pi^{t+1}}(s_1)$ between the value of the optimal policy and the value of the current policy, in order to define the stopping rule. We recursively define the functions $G^{t}$ by $G_{H+1}^{t}(s,a)\triangleq 0$ for all $(s,a)$ and, for all $(s,a)$ and all $h\le H$,

$$G_h^{t}(s,a)\triangleq\min\left(H,\;4\sqrt{\mathrm{Var}_{\hat p_h^{t}}(\tilde V_{h+1}^{t})(s,a)\,\frac{\beta^{\star}(n_h^{t}(s,a),\delta)}{n_h^{t}(s,a)}}+25H^2\frac{\beta(n_h^{t}(s,a),\delta)}{n_h^{t}(s,a)}+\left(1+\frac{3}{H}\right)\hat p_h^{t}\pi_{h+1}^{t+1}G_{h+1}^{t}(s,a)\right).$$

We prove the following result in Appendix D.

Lemma 2. With probability at least $1-\delta$, for all $t$,

$$V_1^{\star}(s_1)-V_1^{\pi^{t+1}}(s_1)\le\pi_1^{t+1}G_1^{t}(s_1).$$

In particular, it holds on the event $\mathcal{G}$ defined in Appendix A.

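We are now ready to define our BPI-UCBVI algorithm.

sampling rule: the policy $\pi^{t+1}$ is the greedy policy with respect to $\tilde Q_h^{t}$,

$$\forall s\in\mathcal{S},\ \forall h\in[H],\quad \pi_h^{t+1}(s)=\arg\max_{a\in\mathcal{A}}\tilde Q_h^{t}(s,a)$$

stopping rule: $\tau=\inf\left\{t\in\mathbb{N}:\pi_1^{t+1}G_1^{t}(s_1)\le\varepsilon\right\}$

prediction rule: $\hat\pi=\pi^{\tau+1}$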
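The stopping rule can be sketched in Python as a backward recursion for the gap bound followed by a threshold test. This is an illustrative sketch under assumptions rather than the authors' code: the upper bounds `v_up` and the greedy policy are taken as given inputs, $\pi_{h+1}G_{h+1}(y)$ is read as $G_{h+1}(y,\pi_{h+1}(y))$ for a deterministic policy, unvisited pairs are assigned the clipped value $H$, and all names are hypothetical.

```python
import math


def gap_bound(p_hat, n, v_up, policy, H, beta, beta_star):
    """Backward recursion for G; policy[h][s] is the action of pi^{t+1} at (h, s), 0-based."""
    S, A = len(n[0]), len(n[0][0])
    g = [[[0.0] * A for _ in range(S)] for _ in range(H + 1)]  # g[H] is identically 0
    for h in reversed(range(H)):
        # Value of G at the next step along the policy (0 after the last step).
        if h + 1 < H:
            g_next = [g[h + 1][y][policy[h + 1][y]] for y in range(S)]
        else:
            g_next = [0.0] * S
        nxt = v_up[h + 1]
        for s in range(S):
            for a in range(A):
                count = n[h][s][a]
                if count == 0:
                    g[h][s][a] = float(H)  # infinite bonus, clipped at H
                    continue
                p = p_hat[h][s][a]
                mean = sum(p[y] * nxt[y] for y in range(S))
                var = max(sum(p[y] * nxt[y] ** 2 for y in range(S)) - mean**2, 0.0)
                bonus = (4 * math.sqrt(var * beta_star(count) / count)
                         + 25 * H**2 * beta(count) / count)
                carry = (1 + 3 / H) * sum(p[y] * g_next[y] for y in range(S))
                g[h][s][a] = min(float(H), bonus + carry)
    return g


def should_stop(g, policy, s1, eps):
    """Stopping test: the gap bound at the initial state, along the policy, is at most eps."""
    return g[0][s1][policy[0][s1]] <= eps


# Toy instance with made-up numbers: the same model with many or few samples per pair.
H, S, A, delta = 2, 2, 2, 0.1
beta = lambda k: math.log(3 * S * A * H / delta) + S * math.log(8 * math.e * (k + 1))
beta_star = lambda k: math.log(3 * S * A * H / delta) + math.log(8 * math.e * (k + 1))
p_hat = [[[[0.5, 0.5] for _ in range(A)] for _ in range(S)] for _ in range(H)]
v_up = [[1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
policy = [[0, 0], [0, 0]]
many = [[[10**9] * A for _ in range(S)] for _ in range(H)]
few = [[[1] * A for _ in range(S)] for _ in range(H)]
stop_many = should_stop(gap_bound(p_hat, many, v_up, policy, H, beta, beta_star), policy, 0, 0.01)
stop_few = should_stop(gap_bound(p_hat, few, v_up, policy, H, beta, beta_star), policy, 0, 0.01)
```

With very large counts the bonuses vanish and the test passes; with a single sample per pair the bound is clipped at $H$ and exploration continues.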

We provide a sample complexity bound for BPI-UCBVI in the next theorem, which we prove in Appendix D.

Theorem 2. For $\delta\in(0,1)$, $\varepsilon\in(0,1/S^2]$, BPI-UCBVI using thresholds $\beta(n,\delta)\triangleq\log(3SAH/\delta)+S\log(8e(n+1))$ and $\beta^{\star}(n,\delta)\triangleq\log(3SAH/\delta)+\log(8e(n+1))$ is $(\varepsilon,\delta)$-PAC for best policy identification. Moreover, with probability $1-\delta$,

$$\tau\le\frac{H^3SA}{\varepsilon^2}\left(\log(3SAH/\delta)+1\right)C_1+1,$$

where $C_1\triangleq 147e^{22}\log\left(e^{25}(\log(3SAH/\delta)+S)H^3SA/\varepsilon\right)$.

Therefore, the rate of BPI-UCBVI is of order $\tilde O\left(H^3SA\log(1/\delta)/\varepsilon^2\right)$ when $\varepsilon$ is small enough and matches the lower bound of $\Omega\left(H^2SA\log(1/\delta)/\varepsilon^2\right)$ by Dann and Brunskill [2015] up to a factor of $H$ and poly-log terms. To the best of our knowledge, BPI-UCBVI is the first algorithm for BPI whose sample complexity has an optimal dependence on $S$, $A$, $\varepsilon$, and $\delta$. In Remark 1, we discuss the optimality of BPI-UCBVI with respect to $H$.

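From regret-minimization to BPI The main difficulty in converting a regret-minimizing algorithm to BPI lies in the high-probability prediction of an $\varepsilon$-optimal policy. Indeed, assume that a regret-minimizing algorithm generates a sequence of policies $(\pi^{t})_{t\in[T]}$ for a number of episodes $T\in\mathbb{N}$ large enough, with a controlled regret such that, with probability at least $1-\delta_0$,

$$\sum_{t=1}^{T}\left(V_1^{\star}(s_1)-V_1^{\pi^{t}}(s_1)\right)\le C\sqrt{H^3SA\log(1/\delta_0)\,T},$$

for some constant $C$. Then a straightforward choice for the prediction is to choose $\hat\pi$ at random among the sequence $(\pi^{t})_{t\in[T]}$, as suggested by Jin et al. [2018]. Markov's inequality implies that

$$\mathbb{P}\left(V_1^{\star}(s_1)-V_1^{\hat\pi}(s_1)\ge\varepsilon\right)\le\frac{1}{\varepsilon}\,\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\left(V_1^{\star}(s_1)-V_1^{\pi^{t}}(s_1)\right)\right]\le\frac{1}{\varepsilon}\left(C\sqrt{\frac{H^3SA}{T}\log(1/\delta_0)}+\delta_0H\right).$$

Therefore, if we choose $\delta_0\triangleq\frac{\varepsilon}{H}\cdot\frac{\delta}{2}$ and

$$T\triangleq\frac{2H^3SA}{\varepsilon^2\delta^2}\log\left(\frac{2H}{\varepsilon\delta}\right),$$

we get $\mathbb{P}\left(V_1^{\star}(s_1)-V_1^{\hat\pi}(s_1)\ge\varepsilon\right)\le\delta$, which implies that the described algorithm is $(\varepsilon,\delta)$-PAC for BPI. Unfortunately, the choice of $T$ above gives a sample complexity that scales with $1/\delta^2$, whereas we prefer and expect $\log(1/\delta)$.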
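A quick arithmetic comparison shows how costly the $1/\delta^2$ dependence is. The Python sketch below evaluates the horizon $T$ chosen above against a reference horizon of order $H^3SA\log(1/\delta)/\varepsilon^2$; the reference uses a constant equal to 1, which is an assumption made only for illustration, and both function names are hypothetical.

```python
import math


def horizon_random_pick(eps, delta, H, S, A):
    """T = 2 H^3 S A / (eps^2 delta^2) * log(2H / (eps delta))."""
    return 2 * H**3 * S * A / (eps**2 * delta**2) * math.log(2 * H / (eps * delta))


def horizon_target(eps, delta, H, S, A):
    """Reference scaling H^3 S A log(1/delta) / eps^2, with the constant set to 1 (assumption)."""
    return H**3 * S * A * math.log(1 / delta) / eps**2


# Made-up problem sizes; only the dependence on delta matters here.
H, S, A, eps = 5, 10, 4, 0.1
ratios = {
    d: horizon_random_pick(eps, d, H, S, A) / horizon_target(eps, d, H, S, A)
    for d in (0.1, 0.01, 0.001)
}
```

The ratio between the two horizons grows roughly like $1/\delta^2$ as $\delta$ decreases, which is the overhead that the random-pick argument pays.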

Another natural choice for the prediction is to return the average policy. Precisely, let $\bar p_h^{t}(s,a)\triangleq\frac{1}{t}\sum_{\ell=1}^{t}p_h^{\ell}(s,a)$ be the average state-action distribution. We define the average policy as the one whose state-action distribution is precisely $\bar p^{t}$,

$$\bar\pi_h^{t}(a|s)\triangleq\begin{cases}\dfrac{\bar p_h^{t}(s,a)}{\sum_{b\in\mathcal{A}}\bar p_h^{t}(s,b)} & \text{if } \sum_{b\in\mathcal{A}}\bar p_h^{t}(s,b)>0,\\[2ex] 1/A & \text{otherwise.}\end{cases}$$

Note that the average policy $\bar\pi_h^{t}(a|s)$ is not the average of the policies $\sum_{\ell=1}^{t}\pi_h^{\ell}(a|s)/t$. The interest of $\bar\pi^{t}$ is that the simple regret of the policy $\bar\pi^{T}$ is equal to the average regret at time $T$,

$$V_1^{\star}(s_1)-V_1^{\bar\pi^{T}}(s_1)=\frac{1}{T}\sum_{t=1}^{T}\left(V_1^{\star}(s_1)-V_1^{\pi^{t}}(s_1)\right)\le C\sqrt{\frac{H^3SA}{T}\log(1/\delta_0)},$$

with probability $1-\delta_0$. Choosing $\delta_0\triangleq\delta$ and $T\triangleq CH^3SA\log(1/\delta)/\varepsilon^2$ and predicting the policy $\bar\pi^{T}$ would lead to an $(\varepsilon,\delta)$-PAC algorithm for BPI with a minimax optimal sample complexity. However, since the agent does not know the transition kernel, it cannot compute the average policy $\bar\pi^{T}$! Therefore, BPI-UCBVI rather relies on an upper bound on the simple regret (see Lemma 2), computable by the agent, to detect whether the policy currently used by the sampling rule is $\varepsilon$-optimal.
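In a simulation where the true state-action distributions are available (which is exactly what the agent lacks), the average policy defined above is easy to compute. The Python sketch below is illustrative: `average_policy` is a hypothetical helper, and the occupancy numbers are made up to show that the average policy differs from the average of the policies.

```python
def average_policy(occupancies):
    """occupancies[l][h][s][a] is the reaching probability of (s, a) at step h under policy l.

    Returns pi_bar[h][s][a], with the uniform distribution on states that are never reached.
    """
    T = len(occupancies)
    H, S, A = len(occupancies[0]), len(occupancies[0][0]), len(occupancies[0][0][0])
    pi_bar = []
    for h in range(H):
        rows = []
        for s in range(S):
            p_bar = [sum(occupancies[l][h][s][a] for l in range(T)) / T for a in range(A)]
            mass = sum(p_bar)
            rows.append([x / mass for x in p_bar] if mass > 0 else [1.0 / A] * A)
        pi_bar.append(rows)
    return pi_bar


# One step, three states, two actions (made-up numbers).
# Policy 0 reaches state 0 with probability 0.9 and plays action 0 there;
# policy 1 reaches state 0 with probability 0.1 and plays action 1 there.
# State 2 is never reached by either policy.
occ = [
    [[[0.9, 0.0], [0.1, 0.0], [0.0, 0.0]]],
    [[[0.0, 0.1], [0.9, 0.0], [0.0, 0.0]]],
]
pi_bar = average_policy(occ)
```

In state 0 the average policy plays action 0 with probability about 0.9, whereas averaging the two deterministic policies would give 0.5: the average policy weights each policy by how often it actually visits the state. The normalization of $\bar p$ cancels in the ratio, so only relative masses matter.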

Acknowledgments The research presented was supported by European CHIST-ERA project DELTA, French Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council, French National Research Agency project BOLD (ANR19-CE23-0026-04). Anders Jonsson is partially supported by the Spanish grants TIN2015-67959 and PCIN-2017-082.

References

Alekh Agarwal, Sham Kakade, and Lin F Yang. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, 2020.

Mohammad Gheshlaghi Azar, Rémi Munos, and Bert Kappen. On the sample complexity of reinforcement learning with a generative model. In International Conference on Machine Learning, 2012.

Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349, 2013.

Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, 2017.

Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Uncertainty in Artificial Intelligence, 2009.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities. Oxford University Press, 2013.

Thomas M. Cover and Joy A. Thomas. Elements of information theory. John Wiley & Sons, 2006.

Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Neural Information Processing Systems, 2015.

Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Neural Information Processing Systems, 2017.
