Fast active learning for pure exploration in reinforcement learning


HAL Id: hal-02906985

https://hal.inria.fr/hal-02906985v2

Submitted on 27 Jul 2020 (v2), last revised 16 Jul 2021 (v3)



Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Emilie Kaufmann, Edouard Leurent, Michal Valko

To cite this version:

Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Emilie Kaufmann, Edouard Leurent, et al. Fast active learning for pure exploration in reinforcement learning. [Research Report] DeepMind. 2020. ⟨hal-02906985v2⟩


Pierre Ménard, Inria Lille, SequeL team

Omar Darwiche Domingues, Inria Lille, SequeL team

Anders Jonsson, Universitat Pompeu Fabra

Emilie Kaufmann, CNRS & ULille (CRIStAL), Inria Lille, SequeL team

Edouard Leurent, Renault & Inria Lille, SequeL team

Michal Valko, DeepMind Paris

Abstract

Realistic environments often provide agents with very limited feedback. When the environment is initially unknown, the feedback can be completely absent in the beginning, and the agents may first choose to devote all their effort to exploring efficiently. Exploration remains a challenge: it has been addressed with many hand-tuned heuristics with different levels of generality on one side, and with a few theoretically backed exploration strategies on the other. Many of them are incarnated by intrinsic motivation and, in particular, exploration bonuses. A common rule of thumb for exploration bonuses is to use a $1/\sqrt{n}$ bonus that is added to the empirical estimates of the reward, where $n$ is the number of times this particular state (or state-action pair) was visited. We show that, surprisingly, for a pure-exploration objective of reward-free exploration, bonuses that scale with $1/n$ bring faster learning rates, improving the known upper bounds with respect to the dependence on the horizon $H$. Furthermore, we show that with an improved analysis of the stopping time, we can improve by a factor $H$ the sample complexity in the best-policy identification setting, which is another pure-exploration objective, where the environment provides rewards but the agent is not penalized for its behavior during the exploration phase.

1 Introduction

In reinforcement learning (RL), an agent learns how to act by interacting with an environment, which provides feedback in the form of reward signals. The agent's objective is to maximize the sum of rewards. In this work, we study how to explore efficiently. In particular, we wish to compute near-optimal policies using the least possible number of interactions with the environment (in the form of observed transitions). In general, we may be interested either in the performance of the agent during the learning phase or only in the performance of the learned policy. In the first setting, we can measure the performance of the agent by its cumulative regret, which is the difference between the total reward collected by an optimal policy and the total reward collected by the agent during learning. Therefore, the agent is encouraged to explore new policies but also to exploit its current knowledge [Bartlett and Tewari, 2009, Jaksch et al., 2010]. Another performance measure related to the regret consists in counting the number of times during learning that the value of the policy used by the agent is $\varepsilon$-far from the optimal one. The minimization of this count is formalized in the PAC-MDP setting introduced by Kakade [2003]; see also Dann and Brunskill [2015] and Dann et al. [2017]. The second setting, which is our central focus in this paper, is called pure exploration: the agent is free to make mistakes during learning and can explore more vigorously [Fiechter, 1994, Kearns and Singh, 1998, Even-Dar et al., 2006]. We provide results for two pure-exploration settings when the environment is an episodic Markov decision process (MDP): reward-free exploration (RFE) and best-policy identification (BPI).


Best-policy identification In BPI, an agent interacts with the MDP, observing transitions and rewards, to output an $\varepsilon$-optimal policy with probability at least $1-\delta$ [Fiechter, 1994]. Most of the work on BPI assumes that the agent has access to a generative model (oracle, Kearns and Singh, 1998). Having oracle access means that the agent can simulate a transition from any state-action pair. In particular, Azar et al. [2013] show that the optimal rate of the sample complexity, defined in this case as the number $n$ of oracle calls needed to get an $\varepsilon$-optimal policy with probability at least $1-\delta$, is of order² $\tilde{O}\big(H^4 SA \log(1/\delta)/\varepsilon^2\big)$, where $S$ is the size of the state space, $A$ is the size of the action space, and $H$ is the horizon (see Table 1 and also Agarwal et al., 2020, Sidford et al., 2018). The $\tilde{O}$ notation hides terms that are poly-log in $H$, $S$, $A$, $\varepsilon$, and $\log(1/\delta)$.

Even if oracle access is reasonable in some situations (games, physics simulators, ...), we focus on the more challenging and practical setting where the agent only has access to a forward model, meaning that the agent can only sample trajectories from some predefined initial state. In this setting, the sample complexity $\tau$ is the number of trajectories that are necessary to output an $\varepsilon$-optimal policy with probability at least $1-\delta$ (which leads to $n = H\tau$ sampled transitions).

A straightforward but indirect approach to BPI, suggested for example by Jin et al. [2018], is to run a regret-minimizing algorithm (for instance, UCBVI of Azar et al., 2017) for a sufficiently large number $K$ of episodes, and to output a policy $\hat\pi$ that is chosen uniformly at random among the $K$ policies executed by the agent. Unfortunately, this indirect approach is sub-optimal with respect to the error probability $\delta$. Indeed, the resulting sample complexity scales with $1/\delta^2$ instead of the expected $\log(1/\delta)$, as can be seen in Table 1 and Section 4. Recently, Kaufmann et al. [2020] proposed RF-UCRL, which adapts an episodic version of a UCRL-type algorithm [Jaksch et al., 2010] to best-policy identification. In essence, they replace the random choice of the predicted policy by a data-dependent choice. This algorithm enjoys the correct dependence on $\delta$ prescribed by the lower bound of Dann and Brunskill [2015], but suffers from a sub-optimal dependence on $S$, the size of the state space, when $\varepsilon$ is small, as well as a sub-optimal dependence on the horizon $H$ (Table 1).

As an answer to the above sub-optimalities, we propose BPI-UCBVI, a new algorithm with a sample complexity of $\tilde{O}\big(SAH^3\log(1/\delta)/\varepsilon^2\big)$, which is optimal in terms of $S$, $A$, $\varepsilon$, and $\delta$, according to the lower bound of Dann and Brunskill [2015]. Moreover, we believe that the dependence on $H$ cannot be improved when the transitions are non-stationary; see the discussion below. BPI-UCBVI is based on UCBVI of Azar et al. [2017]. It relies on a non-trivial upper bound on the simple regret of a UCBVI-type algorithm (Lemma 2) that shaves the extra $S$ factor of RF-UCRL while keeping the right dependence on $\delta$. The main feature of this upper bound is that it can be computed in the empirical MDP and is therefore accessible to the agent.

Reward-free exploration Efficient exploration is especially difficult when the reward signals are sparse, as the agent needs to interact with the environment while receiving almost no feedback. To address such situations, we also study reward-free exploration, introduced by Jin et al. [2020], where the interaction with the environment is split into two phases: (i) an exploration phase, in which the agent learns the transition model $\hat p$ of the MDP by interacting with the environment for a given number of episodes (still with a forward model); and (ii) a planning phase, in which the agent receives a reward function $r$ and computes the optimal policy for the MDP parameterized by $(r, \hat p)$. Given an accuracy parameter $\varepsilon$, we measure the performance of the agent by the number of trajectories required to compute a policy in the planning phase that is $\varepsilon$-optimal for any given reward function $r$ with probability at least $1-\delta$.

Our interest in RFE has two major reasons. First, in some applications, it is necessary to compute good policies for a wide range of reward functions. In such cases, RFE satisfies this need with a single exploration phase. Second, RFE gives us good strategies for exploring the environment, especially when the reward signal is very sparse or unknown.

One approach to pure exploration is to rely on known cumulative-regret minimization methods and their guarantees. This path is taken by RF-RL-Explore of Jin et al. [2020]. More precisely, RF-RL-Explore builds upon the EULER algorithm by Zanette and Brunskill [2019] by running one instance of this algorithm for each state $s$ and each episode step $h$ with a reward function incentivizing the visit of state $s$ in step $h$. The leading term in their sample complexity bound scales with $\tilde{O}\big(S^2AH^5\log(1/\delta)/\varepsilon^2\big)$ for MDPs with $S$ states, $A$ actions, and horizon $H$, which is sub-optimal in $H$ (Table 1). Kaufmann et al. [2020] propose RF-UCRL, an alternative algorithm that is reminiscent of the original algorithm proposed by Fiechter [1994] for BPI, with an improved sample complexity of $\tilde{O}\big(SAH^4(\log(1/\delta)+S)/\varepsilon^2\big)$. The main idea behind the algorithm of Fiechter [1994] is to build upper confidence bounds on the estimation error of the value function of any policy under any reward function, and then to act greedily with respect to these upper bounds to minimize the estimation error. Using a similar approach, Wang et al. [2020] study reward-free exploration with a particular linear function approximation, providing an algorithm with a sample complexity of order $d^3H^6\log(1/\delta)/\varepsilon^2$, where $d$ is the dimension of the feature space. Finally, Zhang et al. [2020] study a setting in which there are only $N$ possible reward functions in the planning phase, for which they provide an algorithm whose sample complexity is $\tilde{O}\big(H^5SA\log(N)\log(1/\delta)/\varepsilon^2\big)$. In this work, we present RF-Express with a sample complexity of $\tilde{O}\big(SAH^3(\log(1/\delta)+S)/\varepsilon^2\big)$, which improves the bound of Kaufmann et al. [2020] by a factor of $H$. As discussed below, we believe that RF-Express has an optimal dependence on $H$ for general, non-stationary transitions.

A standard path to such an improved dependence is to build confidence bonuses with an empirical Bernstein inequality [Azar et al., 2012, 2017, Zanette and Brunskill, 2019], which makes a variance term appear, and then to sharply upper bound these variance terms with a Bellman-type equation for the variances (see Appendix E.1 or Azar et al., 2012). However, this standard path is far less clear for RFE, as the agent does not observe the rewards and therefore cannot compute the empirical variance of the values. One of our main technical contributions is to tackle this challenge by introducing a new empirical Bernstein inequality derived from a control of the transition probabilities (Appendix E.3) and by applying the Bellman-type equation for the variances to construct exploration bonuses that do not require computing empirical variances. Surprisingly, the bonuses used in RF-Express scale with $1/n(s,a)$, where $n(s,a)$ is the number of times the state-action pair $(s,a)$ was visited, instead of the usual $1/\sqrt{n(s,a)}$ bonus.

Remark 1 (on the optimality of BPI-UCBVI & RF-Express). As summarized in Table 1, the algorithms we propose for BPI and RFE are sub-optimal only by a factor of $H$. However, we would like to highlight that the lower bounds in both settings are proved for stationary transitions, that is, when the transition probabilities are the same in every stage of the episode, whereas our upper bounds hold for non-stationary transitions, i.e., when these probabilities can be different in each stage. In the non-stationary setting, we might expect these lower bounds to have an extra factor of $H$,¹ in which case our algorithms would be optimal. Notice that in Table 1, the upper bounds are for the non-stationary case, whereas the available lower bounds are for the stationary case.

Contributions To sum up, we highlight our major contributions.

• BPI: we provide BPI-UCBVI, with a sample complexity of $\tilde{O}\big(H^3SA\log(1/\delta)/\varepsilon^2\big)$ when $\varepsilon$ is small enough. Up to a factor of $H$ and poly-log terms, it matches the lower bound of Dann and Brunskill [2015] and improves the dependence on either $H$, $1/\delta$, or $S$ with respect to previous work.

• RFE: we provide RF-Express, with a sample complexity of $\tilde{O}\big(H^3SA(\log(1/\delta)+S)/\varepsilon^2\big)$. Up to a factor of $H$ and poly-log terms, our rate matches simultaneously the lower bound $\Omega(H^2S^2A/\varepsilon^2)$ by Jin et al. [2020], effective when $\delta$ is fixed and $\varepsilon$ goes to zero, and the lower bound $\Omega(H^2SA\log(1/\delta)/\varepsilon^2)$ by Dann and Brunskill [2015], effective when $\varepsilon$ is fixed and $\delta$ goes to zero.

• Due to the absence of rewards in RFE, known techniques to get the optimal dependence on the horizon $H$ [Azar et al., 2012, Zanette and Brunskill, 2019] do not apply. We therefore develop a new analysis that relies on exploration bonuses scaling with $1/n$ instead of the standard $1/\sqrt{n}$.

¹ Jin et al. [2018] prove that this is indeed the case for regret lower bounds.

² Azar et al. [2012] express both the upper and the lower bounds in the total number of calls to the generative model (instead of trajectories) and prove them for the $\gamma$-discounted infinite horizon. They are of order $SA(1-\gamma)^{-3}\varepsilon^{-2}\log(1/\delta)$. We translate them to the episodic setting by replacing $(1-\gamma)^{-1}$ by the horizon $H$. In particular, the term $1/(1-\gamma)^3$ translates to $H^3$, and we include an extra $H$ factor in the upper bound due to the non-stationary transitions, i.e., when the transition probabilities depend on the stage $h \in [H]$.

³ See the last paragraph of Section 4.1 for a discussion.

⁴ We combined the $\Omega(H^2S^2A/\varepsilon^2)$ result of Jin et al. [2020] and the $\Omega(H^2SA\log(1/\delta)/\varepsilon^2)$ result of Dann and Brunskill [2015].


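| Algorithm | Setting | Upper bound (non-stationary case) | Lower bound (stationary case) |
|---|---|---|---|
| EmpiricalQVI [Azar et al., 2012]² | gen. model | $\frac{H^4SA}{\varepsilon^2}\log\frac{1}{\delta}$ | $\frac{H^3SA}{\varepsilon^2}\log\frac{1}{\delta}$ |
| UCBVI + random recomm.³ | BPI | $\frac{H^3SA}{\varepsilon^2}\,\frac{\log(1/\delta)}{\delta^2}$ | $\frac{H^2SA}{\varepsilon^2}\log\frac{1}{\delta}$ |
| BPI-UCRL [Kaufmann et al., 2020] | BPI | $\frac{H^4SA}{\varepsilon^2}\big(\log\frac{1}{\delta}+S\big)$ | $\frac{H^2SA}{\varepsilon^2}\log\frac{1}{\delta}$ |
| BPI-UCBVI (this work) | BPI | $\frac{H^3SA}{\varepsilon^2}\log\frac{1}{\delta}$ | $\frac{H^2SA}{\varepsilon^2}\log\frac{1}{\delta}$ |
| UCBZero [Zhang et al., 2020] | RFE, $N$ tasks | $\frac{H^5SA}{\varepsilon^2}\log(N)\log\frac{1}{\delta}$ | $\frac{H^2SA\log(N)}{\varepsilon^2}$ |
| RF-RL-Explore [Jin et al., 2020] | RFE | $\frac{H^7S^2A}{\varepsilon}\log^3\big(\frac{1}{\delta}\big)+\frac{H^5S^2A}{\varepsilon^2}\log\frac{1}{\delta}$ | $\frac{H^2SA}{\varepsilon^2}\big(\log\frac{1}{\delta}+S\big)$⁴ |
| RF-UCRL [Kaufmann et al., 2020] | RFE | $\frac{H^4SA}{\varepsilon^2}\big(\log\frac{1}{\delta}+S\big)$ | $\frac{H^2SA}{\varepsilon^2}\big(\log\frac{1}{\delta}+S\big)$⁴ |
| RF-Express (this work) | RFE | $\frac{H^3SA}{\varepsilon^2}\big(\log\frac{1}{\delta}+S\big)$ | $\frac{H^2SA}{\varepsilon^2}\big(\log\frac{1}{\delta}+S\big)$⁴ |

Table 1: Best-policy identification (BPI) and reward-free exploration (RFE) algorithms with their respective upper bounds on the sample complexity, expressed in terms of the number of trajectories required by the algorithms.² The factors and terms that are poly-log in $S$, $A$, $H$, $\varepsilon$, and $\log(1/\delta)$ are omitted. Within each setting, the lower bound is shared by all listed algorithms.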

2 Setting

We consider a finite episodic MDP $\big(\mathcal{S}, \mathcal{A}, H, \{p_h\}_{h\in[H]}, \{r_h\}_{h\in[H]}\big)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $H$ is the number of steps in one episode, $p_h(s'|s,a)$ is the probability of transitioning from state $s$ to state $s'$ by taking action $a$ at step $h$, and $r_h(s,a) \in [0,1]$ is the bounded deterministic reward received after taking action $a$ in state $s$ at step $h$. Note that we consider the general case of rewards and transition functions that are possibly non-stationary, i.e., that are allowed to depend on the decision step $h$ in the episode. We denote by $S$ and $A$ the number of states and actions, respectively.

Learning problem The agent, to which the transitions are unknown, interacts with the environment in episodes of length $H$, with a fixed initial state $s_1$.⁵ At each step $h \in [H]$, the agent observes a state $s_h \in \mathcal{S}$, takes an action $a_h \in \mathcal{A}$, and makes a transition to a new state $s_{h+1}$ according to the probability distribution $p_h(s_h, a_h)$. In BPI, the agent receives a deterministic reward $r_h(s_h, a_h)$ at each step $h$, for fixed reward functions $r \triangleq \{r_h\}_{h\in[H]}$, and it is required to output an $\varepsilon$-optimal policy with respect to $r$. In RFE, no rewards are observed during exploration, and the agent is required to output an estimate of the transition probabilities which can be used afterwards to compute an $\varepsilon$-optimal policy for any reward function.

Policy & value functions A deterministic policy $\pi$ is a collection of functions $\pi_h : \mathcal{S} \to \mathcal{A}$ for all $h \in [H]$, where every $\pi_h$ maps each state to a single action. The value functions of $\pi$, denoted by $V_h^\pi$, are defined as

$$V_h^\pi(s) \triangleq \mathbb{E}\left[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\middle|\, s_h = s\right],$$

where $a_{h'} \triangleq \pi_{h'}(s_{h'})$ and $s_{h'+1} \sim p_{h'}(s_{h'}, a_{h'})$, for $h \in [H]$. The optimal value functions are defined as $V_h^\star(s) \triangleq \max_\pi V_h^\pi(s)$. Both $V_h^\pi$ and $V_h^\star$ satisfy the Bellman equations [Puterman, 1994], which are expressed using the Q-value functions $Q_h^\pi$ and $Q_h^\star$ in the following way,

$$V_h^\pi(s) = \pi_h Q_h^\pi(s), \quad \text{with } Q_h^\pi(s,a) \triangleq r_h(s,a) + p_h V_{h+1}^\pi(s,a),$$

$$V_h^\star(s) = \max_a Q_h^\star(s,a), \quad \text{with } Q_h^\star(s,a) \triangleq r_h(s,a) + p_h V_{h+1}^\star(s,a),$$

⁵ As explained by Fiechter [1994] and Kaufmann et al. [2020], if the first state is sampled randomly as $s_1 \sim p_0$, we can simply add an artificial first state $s_0$ such that for any action $a$, the transition probability is defined as the distribution $p_0(s_0, a) \triangleq p_0$.


where by definition, $V_{H+1}^\star \triangleq V_{H+1}^\pi \triangleq 0$. Furthermore, $p_h f(s,a) \triangleq \mathbb{E}_{s' \sim p_h(\cdot|s,a)}[f(s')]$ denotes the expectation operator with respect to the transition probabilities $p_h$, and $(\pi_h g)(s) \triangleq \pi_h g(s) \triangleq g(s, \pi_h(s))$ denotes the composition with the policy $\pi$ at step $h$.
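As an illustration of the Bellman equations for $V_h^\star$ and $Q_h^\star$, the following minimal Python sketch computes the optimal values by backward induction. The function name `optimal_values` and the nested-list layout `p[h][s][a][s']`, `r[h][s][a]` (with 0-indexed steps) are assumptions made for this example and are not part of the paper.

```python
def optimal_values(p, r):
    """Backward induction on a finite episodic MDP (illustrative sketch).

    p[h][s][a] is the list of next-state probabilities p_h(.|s, a),
    r[h][s][a] is the deterministic reward r_h(s, a) in [0, 1].
    Returns (V, policy): V[h][s] is the optimal value at step h
    (0-indexed, with V[H] = 0) and policy[h][s] is a greedy action.
    """
    H = len(p)
    S = len(p[0])
    A = len(p[0][0])
    V = [[0.0] * S for _ in range(H + 1)]  # V[H] is identically zero
    policy = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            # Q_h(s, a) = r_h(s, a) + sum_{s'} p_h(s'|s, a) V_{h+1}(s')
            q = [
                r[h][s][a] + sum(p[h][s][a][s2] * V[h + 1][s2] for s2 in range(S))
                for a in range(A)
            ]
            policy[h][s] = max(range(A), key=lambda a: q[a])
            V[h][s] = q[policy[h][s]]
    return V, policy
```

The same routine can serve in a planning phase: feeding it an estimated kernel in place of `p` returns a policy that is optimal for the estimated MDP.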

Empirical MDP Let $(s_h^i, a_h^i, s_{h+1}^i)$ be the state, the action, and the next state observed by an algorithm at step $h$ of episode $i$. For any step $h \in [H]$ and any state-action pair $(s,a) \in \mathcal{S} \times \mathcal{A}$, we let $n_h^t(s,a) \triangleq \sum_{i=1}^{t} \mathbb{1}\{(s_h^i, a_h^i) = (s,a)\}$ be the number of times the state-action pair $(s,a)$ was visited in step $h$ in the first $t$ episodes, and $n_h^t(s,a,s') \triangleq \sum_{i=1}^{t} \mathbb{1}\{(s_h^i, a_h^i, s_{h+1}^i) = (s,a,s')\}$. These definitions permit us to define the empirical transitions as

$$\hat p_h^t(s'|s,a) \triangleq \frac{n_h^t(s,a,s')}{n_h^t(s,a)} \ \text{ if } n_h^t(s,a) > 0, \quad \text{and} \quad \hat p_h^t(s'|s,a) \triangleq \frac{1}{S} \ \text{ otherwise.}$$
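The counts and empirical transitions defined above can be computed as in the following minimal sketch. The function name `empirical_transitions` and the data layout (each episode stored as a list of $H$ triples `(s, a, s_next)` with 0-indexed steps, states, and actions) are assumptions made for this example.

```python
def empirical_transitions(episodes, H, S, A):
    """Visit counts n[h][s][a] and empirical kernel p_hat[h][s][a][s'].

    episodes: list of trajectories, each a list of H triples (s, a, s_next).
    Unvisited pairs get the uniform distribution 1/S, as in the definition.
    """
    counts = [[[[0] * S for _ in range(A)] for _ in range(S)] for _ in range(H)]
    for trajectory in episodes:
        for h, (s, a, s_next) in enumerate(trajectory):
            counts[h][s][a][s_next] += 1
    n = [[[sum(counts[h][s][a]) for a in range(A)] for s in range(S)]
         for h in range(H)]
    p_hat = [
        [
            [
                [
                    counts[h][s][a][s2] / n[h][s][a] if n[h][s][a] > 0 else 1.0 / S
                    for s2 in range(S)
                ]
                for a in range(A)
            ]
            for s in range(S)
        ]
        for h in range(H)
    ]
    return n, p_hat
```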

Based on the empirical transitions and on exploration bonuses, we introduce various data-dependent quantities that are useful for designing algorithms for either the BPI or the RFE objective. While the former allows the agent to access the reward function $r$ during exploration, the latter does not. Therefore, in all data-dependent quantities introduced in Section 3 to design an RFE algorithm, we always make a possible dependency on $r$ explicit. In particular, we denote by $\hat V_h^{t,\pi}(s;r)$ and $\hat Q_h^{t,\pi}(s,a;r)$ the value and the action-value functions of a policy $\pi$ in the MDP with transition kernels $\hat p^t$ and reward function $r$. In Section 4, in which the reward function is fixed, we drop the dependency on $r$ and use the simpler notation $\hat V_h^{t,\pi}(s)$ and $\hat Q_h^{t,\pi}(s,a)$.

3 Reward-free exploration

In this section, we consider reward-free exploration (RFE), where the agent does not observe the rewards during the exploration phase. Again, as the value functions defined in Section 2 depend on a reward function $r$, we sometimes use the notation $V_h(s;r)$ and $Q_h(s,a;r)$ instead of $V_h(s)$ and $Q_h(s,a)$.

Reward-free exploration In RFE, the agent interacts with the MDP in the following way. At the beginning of episode $t$, the agent decides to follow a policy $\pi^t$, called the sampling rule, based only on the data collected up to episode $t-1$. Then, a reward-free episode $z^t \triangleq (s_1^t, a_1^t, s_2^t, a_2^t, \dots, s_H^t, a_H^t)$ is generated starting from the initial state $s_1^t \triangleq s_1$ by taking actions $a_h^t = \pi_h^t(s_h^t)$ and, for $h > 1$, observing next states according to $s_h^t \sim p_{h-1}(s_{h-1}^t, a_{h-1}^t)$. This new trajectory is added to the dataset $\mathcal{D}_t \triangleq \mathcal{D}_{t-1} \cup \{z^t\}$. At the end of each episode, the agent can decide to stop collecting data, according to a random stopping time $\tau$, and outputs an empirical transition kernel $\hat p$ built with the dataset $\mathcal{D}_\tau$.

Any RFE agent is therefore made of a triple $\big((\pi^t)_{t \in \mathbb{N}^\star}, \tau, \hat p\big)$. Our goal is to design an agent that is $(\varepsilon, \delta)$-PAC (probably approximately correct) according to the following definition, and for which the number of exploration episodes $\tau$, i.e., the sample complexity, is as small as possible.

Definition 1 (PAC algorithm for RFE). An algorithm is $(\varepsilon, \delta)$-PAC for reward-free exploration if

$$\mathbb{P}\Big(\text{for any reward function } r,\ \ V_1^\star(s_1;r) - V_1^{\hat\pi_r^\star}(s_1;r) \le \varepsilon\Big) \ge 1-\delta,$$

where $\hat\pi_r^\star$ is the optimal policy in the empirical MDP whose transitions are given by the transition kernel $\hat p$ returned by the algorithm and whose reward function is $r$.
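The interaction protocol with a forward model can be sketched as follows: one reward-free episode is sampled by following a deterministic policy from the fixed initial state. The function name `sample_episode`, the nested-list kernel `p[h][s][a]`, and the choice of storing each step as a triple `(s, a, s_next)` (which also records the state reached after the last action) are assumptions made for this example.

```python
import random


def sample_episode(p, policy, s1, rng):
    """Sample one reward-free episode from a forward model (sketch).

    p[h][s][a]: list of next-state probabilities at step h (0-indexed).
    policy[h][s]: action taken by the sampling rule in state s at step h.
    rng: a random.Random instance, passed in for reproducibility.
    Returns a list of H triples (s, a, s_next).
    """
    trajectory = []
    s = s1
    for h in range(len(p)):
        a = policy[h][s]
        weights = p[h][s][a]
        s_next = rng.choices(range(len(weights)), weights=weights)[0]
        trajectory.append((s, a, s_next))
        s = s_next
    return trajectory
```

An exploration loop would call such a routine once per episode with the current sampling rule and append the result to the dataset until the stopping rule fires.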

3.1 RF-Express

In this section, we present the RF-Express algorithm along with a high-probability bound on its sample complexity. RF-Express relies on upper bounds on the estimation error between the true value functions and their empirical counterparts. We start with the motivation for the design choices of RF-Express. We then introduce the quantities that encode these choices and that we subsequently use in the definition of the algorithm. In the algorithmic template, we proceed as Fiechter [1994] and Kaufmann et al. [2020] by upper bounding the estimation error for all policies, with the striking difference that we only upper bound it at the initial state. We finish this part by providing more intuition and discussion; in particular, we provide technical insights into what RF-Express is optimizing and then explain our $1/n$ versus $1/\sqrt{n}$ exploration bonuses, the reasons for choosing them, and the challenge of analysing them.


Estimation error Given a policy $\pi$ and an arbitrary reward function $r$, we define the estimation error as the absolute difference between the Q-value of $\pi$ in the empirical MDP and its Q-value in the true MDP. Precisely, after episode $t$, for all $(s,a,h)$, we define

$$\hat e_h^{t,\pi}(s,a;r) \triangleq \big|\hat Q_h^{t,\pi}(s,a;r) - Q_h^\pi(s,a;r)\big|.$$

To control the approximation error of the value of any policy for any reward function starting from the initial state $s_1$, we introduce the functions $W_h^t(s,a)$ defined inductively by $W_{H+1}^t(s,a) \triangleq 0$ for all $(s,a) \in \mathcal{S} \times \mathcal{A}$ and, for all $h \in [H]$ and $(s,a) \in \mathcal{S} \times \mathcal{A}$,

$$W_h^t(s,a) \triangleq \min\left(H,\ 9H^2\frac{\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)} + \Big(1+\frac{1}{H}\Big)\sum_{s'} \hat p_h^t(s'|s,a)\max_{a'} W_{h+1}^t(s',a')\right), \qquad (1)$$

where $\beta(n_h^t(s,a),\delta)$ is a threshold that depends on how we build the confidence sets for the transition probabilities. Notice that the $W_h^t$ are all independent of the reward function $r$. As shown in the next lemma, whose proof is in Appendix C, the function $W_1^t(s_1,a)$ can be used to upper bound the estimation error of any policy under any reward function in the initial state $s_1$.

Lemma 1. With probability at least $1-\delta$, for any episode $t$, policy $\pi$, and reward function $r$,

$$\hat e_1^{t,\pi}(s_1, \pi_1(s_1); r) \le 2e\sqrt{\max_{a\in\mathcal{A}} W_1^t(s_1,a)} + \max_{a\in\mathcal{A}} W_1^t(s_1,a).$$

In particular, the above holds on the event $\mathcal{F}$ defined in Appendix A.

With all the above definitions, we are now ready to outline our RF-Express algorithm.

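• sampling rule: the policy $\pi^{t+1}$ is the greedy policy with respect to $W_h^t$,

$$\forall s \in \mathcal{S},\ \forall h \in [H], \quad \pi_h^{t+1}(s) = \arg\max_{a\in\mathcal{A}} W_h^t(s,a);$$

• stopping rule: $\tau = \inf\Big\{t \in \mathbb{N} : 2e\sqrt{\pi_1^{t+1} W_1^t(s_1)} + \pi_1^{t+1} W_1^t(s_1) \le \varepsilon/2\Big\}$;

• prediction rule: output the empirical transition kernel $\hat p = \hat p^\tau$.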
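The three rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a reference implementation: the names `compute_W`, `greedy_policy`, and `should_stop` are hypothetical, steps are 0-indexed, `beta` is passed as a callable of the count only (with $\delta$ already fixed), and unvisited pairs are given an infinite bonus so that $W$ is clipped to $H$ there.

```python
import math


def compute_W(n, p_hat, beta, H):
    """Backward induction for W[h][s][a] following equation (1).

    n[h][s][a]: visit counts; p_hat[h][s][a]: empirical next-state law;
    beta: callable mapping a count to the threshold value.
    """
    S, A = len(n[0]), len(n[0][0])
    W = [[[0.0] * A for _ in range(S)] for _ in range(H + 1)]  # W[H] = 0
    for h in reversed(range(H)):
        for s in range(S):
            for a in range(A):
                count = n[h][s][a]
                # Assumption: an unvisited pair has bonus +inf, hence W = H.
                bonus = 9 * H**2 * beta(count) / count if count > 0 else math.inf
                future = sum(
                    p_hat[h][s][a][s2] * max(W[h + 1][s2]) for s2 in range(S)
                )
                W[h][s][a] = min(H, bonus + (1 + 1 / H) * future)
    return W


def greedy_policy(W, H):
    """Sampling rule: act greedily with respect to W at every step and state."""
    return [[max(range(len(row)), key=row.__getitem__) for row in W[h]]
            for h in range(H)]


def should_stop(W, s1, eps):
    """Stopping rule: 2e*sqrt(w) + w <= eps/2 with w = max_a W_1(s1, a)."""
    w = max(W[0][s1])
    return 2 * math.e * math.sqrt(w) + w <= eps / 2
```

Since the greedy policy picks the maximizing action, `max(W[0][s1])` coincides with $\pi_1^{t+1} W_1^t(s_1)$ in the stopping rule.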

Next, we provide a bound on the sample complexity of RF-Express, with a proof in Appendix C.

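Theorem 1. For $\delta \in (0,1)$ and $\varepsilon \in (0,1]$, RF-Express with threshold $\beta(n,\delta) \triangleq \log(3SAH/\delta) + S\log(8e(n+1))$ is $(\varepsilon,\delta)$-PAC for reward-free exploration. Moreover, RF-Express stops after $\tau$ episodes where, with probability at least $1-\delta$,

$$\tau \le \frac{H^3SA}{\varepsilon^2}\big(\log(3SAH/\delta) + S\big)C_1 + 1,$$

and where $C_1 \triangleq 1650e^4\log\big(e^{18}(\log(3SAH/\delta)+S)H^3SA/\varepsilon\big)$.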

As a consequence, the sample complexity of RF-Express is of order $\tilde{O}\big(H^3SA(\log(1/\delta)+S)/\varepsilon^2\big)$, which matches the lower bound of $\Omega(H^2S^2A/\varepsilon^2)$ of Jin et al. [2020] up to a factor of $H$ and poly-log terms. This lower bound is informative in the regime where $\delta$ is considered as fixed and $\varepsilon$ tends to zero. Moreover, up to a factor $H$, our result also matches the lower bound of $\Omega\big(H^2SA\log(1/\delta)/\varepsilon^2\big)$ given by Dann and Brunskill [2015], which is informative in the regime where $\varepsilon$ is fixed and $\delta$ tends to $0$. In Remark 1, we discuss the optimality of RF-Express with respect to $H$. As we see in the next section, the quadratic dependence on $S$ can be avoided for BPI.
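To get a feeling for the constants in Theorem 1, the threshold and the upper bound on $\tau$ can be evaluated numerically. The sketch below assumes that the logarithm in $C_1$ applies to the whole product $e^{18}(\log(3SAH/\delta)+S)H^3SA/\varepsilon$; the function names `beta` and `sample_complexity_bound` are hypothetical.

```python
import math


def beta(n, delta, S, A, H):
    """Threshold of Theorem 1: log(3SAH/delta) + S*log(8e(n+1))."""
    return math.log(3 * S * A * H / delta) + S * math.log(8 * math.e * (n + 1))


def sample_complexity_bound(eps, delta, S, A, H):
    """Upper bound on tau from Theorem 1, under the reading of C_1 stated above."""
    base = math.log(3 * S * A * H / delta) + S
    # log(e^18 * x) = 18 + log(x), which avoids forming e^18 explicitly.
    c1 = 1650 * math.e**4 * (18 + math.log(base * H**3 * S * A / eps))
    return H**3 * S * A / eps**2 * base * c1 + 1
```

For instance, halving `eps` multiplies the bound by slightly more than four, since the $1/\varepsilon^2$ factor is accompanied by a logarithmic term in $C_1$.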

What is RF-Express optimizing? Contrary to RF-UCRL of Kaufmann et al. [2020], RF-Express does not build upper bounds on all estimation errors $\hat e_h^{t,\pi}(s,a;r)$ for all $h \in [H]$, but only on the one at the initial state, $\hat e_1^{t,\pi}(s_1, \pi_1(s_1); r)$. Moreover, the upper bound is not $W_1^t(s_1,a)$ itself, but a function of this quantity, as can be seen in Lemma 1. Hence, if RF-Express actually follows the optimism-in-the-face-of-uncertainty principle, what quantities are the $W_h^t$ upper bounding? To answer this question and provide an intuition on the sampling rule of RF-Express, fix a policy $\pi$ and let $P^\pi$ be the probability distribution governing a random trajectory $(s_1, a_1, s_2, a_2, \dots, s_H, a_H) \sim P^\pi$ in the MDP. Next, let $\hat P^{t,\pi}$ be the probability distribution of a trajectory $(s_1^t, a_1^t, s_2^t, a_2^t, \dots, s_H^t, a_H^t) \sim \hat P^{t,\pi}$ in the empirical MDP built using the dataset $\mathcal{D}_t$ at episode $t$. Assuming that all the state-action pairs have been visited at least once at time $t$, using the chain rule (see Garivier et al., 2019) we can compute the Kullback-Leibler divergence between these two probability distributions as

$$\mathrm{KL}\big(\hat P^{t,\pi}, P^\pi\big) = \sum_{h=1}^{H}\sum_{s,a} \hat p_h^{t,\pi}(s,a)\,\mathrm{KL}\big(\hat p_h^t(s,a), p_h(s,a)\big),$$

where $\hat p_h^{t,\pi}(s,a)$ is the probability of reaching the state-action pair $(s,a)$ at step $h$ under policy $\pi$ in the empirical MDP in episode $t$. Notice now that the bonus of the form $\beta(n,\delta)/n$ used to define $W^t$ is by design chosen to be an upper confidence bound on the Kullback-Leibler divergence between the empirical transition probability and the true transition probability. Indeed, in Appendix A we show that with high probability, for all $(s,a) \in \mathcal{S} \times \mathcal{A}$ and $h \in [H]$,

$$\mathrm{KL}\big(\hat p_h^t(s,a), p_h(s,a)\big) \le \frac{\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)}.$$

Therefore, omitting the clipping to $H$ in (1), and since each bonus in (1) carries a factor $9H^2$, we have that

$$\max_\pi \mathrm{KL}\big(\hat P^{t,\pi}, P^\pi\big) \lesssim \frac{1}{H^2}\,\pi_1^{t+1} W_1^t(s_1).$$

Therefore, RF-Express can be interpreted as an algorithm minimizing an upper confidence bound on the loss $\max_\pi \mathrm{KL}(\hat P^{t,\pi}, P^\pi)$, which requires bonuses of the form $\beta(n,\delta)/n$ instead of $\sqrt{\beta(n,\delta)/n}$. Notice that this loss is of the same flavor as the one introduced by Hazan et al. [2019].
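The chain-rule decomposition of the trajectory-level divergence can be checked numerically on small examples. The sketch below propagates the state occupancy of a deterministic policy in the empirical MDP and accumulates the per-step divergences, following the displayed formula with the sum running over all $H$ steps. The function names `kl_divergence` and `trajectory_kl` and the nested-list kernels are assumptions made for this example.

```python
import math


def kl_divergence(p, q):
    """KL(p, q) for discrete distributions given as lists, with 0*log(0) = 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            total += math.inf if qi == 0 else pi * math.log(pi / qi)
    return total


def trajectory_kl(p_hat, p, policy, s1):
    """Sum over steps of occupancy-weighted KL(p_hat_h(s, a), p_h(s, a)).

    The occupancy is the probability of each state at the current step
    under the policy in the empirical MDP, starting from state s1.
    """
    H, S = len(p), len(p[0])
    occupancy = [0.0] * S
    occupancy[s1] = 1.0
    total = 0.0
    for h in range(H):
        next_occupancy = [0.0] * S
        for s in range(S):
            if occupancy[s] == 0.0:
                continue
            a = policy[h][s]
            total += occupancy[s] * kl_divergence(p_hat[h][s][a], p[h][s][a])
            for s2 in range(S):
                next_occupancy[s2] += occupancy[s] * p_hat[h][s][a][s2]
        occupancy = next_occupancy
    return total
```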

Bonuses of $1/n$ versus $1/\sqrt{n}$ Our approach differs from the bonuses typically used in regret minimization (e.g., Azar et al., 2017) and in prior work on reward-free exploration [Kaufmann et al., 2020, Zhang et al., 2020], which use bonuses proportional to $\sqrt{1/n(s,a)}$. Intuitively, since $1/n$ bonuses decay faster with $n$, our algorithm is more exploratory: once a state-action pair $(s,a)$ has been visited, the bonus associated with it is reduced more strongly than if we used $\sqrt{1/n}$ bonuses, and the algorithm tends to visit other state-action pairs before returning to $(s,a)$.

Technically, this might seem very surprising. Indeed, if we want to estimate the mean $\mu$ of a random variable $X$ with an estimator $\hat\mu_n$ computed with $n$ i.i.d. samples from $X$, the error $|\mu - \hat\mu_n|$ scales with $\sqrt{1/n}$ by Hoeffding's inequality, which explains the shape of the bonuses used in previous works. However, instead of bounding the error $|\mu - \hat\mu_n|$, our concentration inequalities based on the KL divergence give us a bound on the quadratic term $(\mu - \hat\mu_n)^2$, which scales with $1/n$. This allows us to use a Bellman-type equation for the variance of the value functions and reduce the sample complexity by a factor of $H$, similarly to previous work on regret minimization [Azar et al., 2017]. The main challenge in our case is that, in reward-free exploration, we need to upper bound the sum of variances for any possible value function, which makes this technique considerably more challenging to analyze than for regret minimization.
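One simple way to see why a KL-based confidence region controls the squared error at rate $1/n$ is Pinsker's inequality, which for Bernoulli distributions reads $\mathrm{KL}(p,q) \ge 2(p-q)^2$: a bound $\mathrm{KL}(\hat\mu_n, \mu) \le \beta/n$ then gives $(\mu-\hat\mu_n)^2 \le \beta/(2n)$. This is only an illustration; the analysis in the paper relies on a sharper Bernstein-type argument. The helper names `kl_bernoulli` and `bonuses` below are hypothetical.

```python
import math


def kl_bernoulli(p, q):
    """KL between Bernoulli(p) and Bernoulli(q), for q strictly in (0, 1)."""
    def term(x, y):
        # Convention 0 * log(0 / y) = 0.
        return 0.0 if x == 0 else x * math.log(x / y)
    return term(p, q) + term(1 - p, 1 - q)


def bonuses(n, beta=1.0):
    """Return the pair (beta/n, sqrt(beta/n)) of candidate bonus shapes."""
    return beta / n, math.sqrt(beta / n)
```

After `n = 100` visits with `beta = 1`, the first bonus is already ten times smaller than the second, which illustrates the faster decay discussed above.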

3.2 Proof sketch

We first sketch the proof of Lemma 1. We begin as in the analysis of Kaufmann et al. [2020]. For a fixed policy $\pi$ and an arbitrary reward function $r$, we decompose the estimation error of the Q-value function of $\pi$ at the state-action pair $(s,a)$ as

$$\begin{aligned}
\hat e_h^{t,\pi}(s,a;r) &\le \big|\hat Q_h^{t,\pi}(s,a;r) - Q_h^\pi(s,a;r)\big| \\
&\le \big|(\hat p_h^t - p_h)V_{h+1}^\pi(s,a)\big| + \hat p_h^t\big|\hat V_{h+1}^{t,\pi} - V_{h+1}^\pi\big|(s,a) \\
&= \big|(\hat p_h^t - p_h)V_{h+1}^\pi(s,a)\big| + \hat p_h^t\pi_{h+1}\hat e_{h+1}^{t,\pi}(s,a;r).
\end{aligned}$$

Similarly to Azar et al. [2017] and Zanette and Brunskill [2019], to obtain the optimal dependency with respect to the horizon $H$, we would like to apply the Bernstein inequality to control the first term. Since we need to do it for all value functions of all policies, we could use a covering of this function space and conclude with a union bound; see Domingues et al. [2020]. Instead we show, via Lemma 10 in Appendix E.3, that from a control of the deviations of the empirical transition probabilities such that

$$\mathrm{KL}\big(\hat p_h^t(s,a), p_h(s,a)\big) \le \frac{\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)}$$

with high probability, we deduce an empirical Bernstein inequality,

$$\hat e_h^{t,\pi}(s,a;r) \le 2\sqrt{\frac{\mathrm{Var}_{\hat p_h^t}(\hat V_{h+1}^{t,\pi})(s,a;r)}{H^2}\left(\frac{H^2\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)} \wedge 1\right)} + 9H^2\frac{\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)} + \Big(1+\frac{1}{H}\Big)\hat p_h^t\pi_{h+1}\hat e_{h+1}^{t,\pi}(s,a;r),$$

where the variance of $\hat V_{h+1}^{t,\pi}$ with respect to $\hat p_h^t(\cdot|s,a)$ is defined as

$$\mathrm{Var}_{\hat p_h^t}(\hat V_{h+1}^{t,\pi})(s,a;r) = \sum_{s'} \hat p_h^t(s'|s,a)\Big(\hat V_{h+1}^{t,\pi}(s';r) - \mathbb{E}_{z\sim\hat p_h^t(\cdot|s,a)}\big[\hat V_{h+1}^{t,\pi}(z;r)\big]\Big)^2.$$

Therefore, defining $Z_{H+1}^{t,\pi}(s,a;r) \triangleq 0$ and recursively the functions

$$Z_h^{t,\pi}(s,a;r) \triangleq \min\left(H,\ 2\sqrt{\frac{\mathrm{Var}_{\hat p_h^t}(\hat V_{h+1}^{t,\pi})(s,a;r)}{H^2}\left(\frac{H^2\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)} \wedge 1\right)} + 9H^2\frac{\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)} + \Big(1+\frac{1}{H}\Big)\hat p_h^t\pi_{h+1}Z_{h+1}^{t,\pi}(s,a;r)\right),$$

we prove by induction that for all $(s,a,h)$,

$$\hat e_h^{t,\pi}(s,a;r) \le Z_h^{t,\pi}(s,a;r). \qquad (2)$$

We now split $Z^{t,\pi}$ into two terms. The first term is the one with the bonus in $\sqrt{1/n}$ and the second one is the one with the bonus in $1/n$. Precisely, for all $(s,a)$, we define recursively two other quantities, with $Y_{H+1}^{t,\pi}(s,a;r) \triangleq W_{H+1}^{t,\pi}(s,a) \triangleq 0$ and

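$$Y_h^{t,\pi}(s,a;r) \triangleq 2\sqrt{\frac{\mathrm{Var}_{\hat p_h^t}(\hat V_{h+1}^{t,\pi})(s,a;r)}{H^2}\left(\frac{H^2\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)} \wedge 1\right)} + \Big(1+\frac{1}{H}\Big)\hat p_h^t\pi_{h+1}Y_{h+1}^{t,\pi}(s,a;r),$$

$$W_h^{t,\pi}(s,a) \triangleq \min\left(H,\ 9H^2\frac{\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)} + \Big(1+\frac{1}{H}\Big)\hat p_h^t\pi_{h+1}W_{h+1}^{t,\pi}(s,a)\right).$$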

We then prove by induction (Appendix C, Step 2 of the proof of Lemma 1) that for all $h, s, a$,

$$Z_h^{t,\pi}(s,a;r) \le Y_h^{t,\pi}(s,a;r) + W_h^{t,\pi}(s,a). \qquad (3)$$

Note that although $Z_h^{t,\pi}(\cdot;r)$ is a high-probability upper bound on $\hat e_h^{t,\pi}(\cdot;r)$, we cannot use it to build a sampling rule reducing the errors, as it still depends on the reward function $r$ through the empirical variance term, and this knowledge is only available in the planning phase. To obtain an upper bound on $Z_h^{t,\pi}(\cdot;r)$ which does not depend on $r$, we now further upper bound $Y^{t,\pi}(\cdot;r)$. The key tool for this purpose is the Bellman equation for the variances; see Appendix E.1.

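We denote by $\hat p_h^{t,\pi}(s,a)$ the probability of reaching the state-action pair $(s,a)$ at step $h$ under the policy $\pi$ in the empirical MDP at time $t$. Using the Cauchy-Schwarz inequality, Lemma 7 in Appendix E.1, and the fact that the variance of the sum of rewards is upper bounded by $H^2$, we get

$$\begin{aligned}
\pi_1 Y_1^{t,\pi}(s_1;r) &= 2\sum_{s,a}\sum_{h=1}^{H} \hat p_h^{t,\pi}(s,a)\Big(1+\frac{1}{H}\Big)^{h-1}\sqrt{\frac{\mathrm{Var}_{\hat p_h^t}(\hat V_{h+1}^{t,\pi})(s,a;r)}{H^2}\left(\frac{H^2\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)} \wedge 1\right)} \\
&\le 2e\sqrt{\sum_{s,a}\sum_{h=1}^{H} \hat p_h^{t,\pi}(s,a)\frac{\mathrm{Var}_{\hat p_h^t}(\hat V_{h+1}^{t,\pi})(s,a;r)}{H^2}}\ \sqrt{\sum_{s,a}\sum_{h=1}^{H} \hat p_h^{t,\pi}(s,a)\left(\frac{H^2\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)} \wedge 1\right)} \\
&\le 2e\sqrt{\frac{1}{H^2}\mathbb{E}_{\pi,\hat p^t}\left[\Big(\sum_{h=1}^{H} r_h(s_h,a_h) - \hat V_1^{t,\pi}(s_1;r)\Big)^2\right]}\ \sqrt{\sum_{s,a}\sum_{h=1}^{H} \hat p_h^{t,\pi}(s,a)\left(\frac{H^2\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)} \wedge 1\right)} \\
&\le 2e\sqrt{\sum_{s,a}\sum_{h=1}^{H} \hat p_h^{t,\pi}(s,a)\left(\frac{H^2\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)} \wedge 1\right)} \\
&\le 2e\sqrt{\pi_1 W_1^{t,\pi}(s_1)},
\end{aligned}$$

where the last inequality is proved in Appendix C (Step 3 of the proof of Lemma 1).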

Combining the inequality $\pi_1 Y_1^{t,\pi}(s_1;r) \le 2e\sqrt{\pi_1 W_1^{t,\pi}(s_1)}$ with (2) and (3) yields that, for all $\pi$,

$$\hat e_1^{t,\pi}(s_1, \pi_1(s_1); r) \le 2e\sqrt{\pi_1 W_1^{t,\pi}(s_1)} + \pi_1 W_1^{t,\pi}(s_1).$$

Finally, we note that by construction, $\pi_1 W_1^{t,\pi}(s_1) \le \max_{a\in\mathcal{A}} W_1^t(s_1,a)$, which allows us to conclude the proof of Lemma 1.

Next, we sketch the proof of Theorem 1. The fact that RF-Express is $(\varepsilon,\delta)$-PAC is a simple consequence of Lemma 1. Indeed, on an event of probability at least $1-\delta$, if the algorithm stops at time $\tau$, we know that for every policy $\pi$ and every reward function $r$,

$$\frac{\varepsilon}{2} \ge 2e\sqrt{\max_{a\in\mathcal{A}} W_1^{\tau}(s_1,a)} + \max_{a\in\mathcal{A}} W_1^{\tau}(s_1,a) \ge \hat e_1^{\tau,\pi}(s_1,\pi_1(s_1);r) = \left|\hat V_1^{\tau,\pi}(s_1;r) - V_1^{\pi}(s_1;r)\right|.$$

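Therefore, still on the same event, it holds that

$$\begin{aligned}
V_1^\star(s_1;r) - V_1^{\hat\pi_r^\star}(s_1;r) &= V_1^\star(s_1;r) - \hat V_1^{\tau,\pi_r^\star}(s_1;r) + \hat V_1^{\tau,\pi_r^\star}(s_1;r) - \hat V_1^{\tau,\hat\pi_r^\star}(s_1;r) + \hat V_1^{\tau,\hat\pi_r^\star}(s_1;r) - V_1^{\hat\pi_r^\star}(s_1;r)\\
&\le \left|V_1^{\pi_r^\star}(s_1;r) - \hat V_1^{\tau,\pi_r^\star}(s_1;r)\right| + \left|\hat V_1^{\tau,\hat\pi_r^\star}(s_1;r) - V_1^{\hat\pi_r^\star}(s_1;r)\right| \le \varepsilon,
\end{aligned}$$

where the middle difference is non-positive because $\hat\pi_r^\star$ is optimal for $r$ in the empirical MDP at time $\tau$, and each absolute value is at most $\varepsilon/2$ by the previous display.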

The proof of the bound on the sample complexity is close to that of a regret bound. We fix a time $t < \tau$ and start by proving an upper bound on $W_1^t(s_1,\pi^{t+1}(s_1))$. Using again the empirical Bernstein inequality of Lemma 10, with high probability it holds that

$$W_h^t(s,a) \le 14H^2\left(\frac{\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)}\wedge 1\right) + \left(1+\frac{3}{H}\right)p_h\pi_{h+1}^{t+1}W_{h+1}^t(s,a).$$

We denote by $p_h^t(s,a)$ the probability of reaching the state-action pair $(s,a)$ at step $h$ under policy $\pi^t$ in the true MDP. Unfolding the previous inequality and switching to the pseudo-counts, defined by $\bar n_h^t(s,a) \triangleq \sum_{\ell=1}^{t} p_h^\ell(s,a)$, we get by Lemma 8 (proved in Appendix E.2)

$$\pi_1^{t+1}W_1^t(s_1) \le 48e^3H^2\sum_{h=1}^{H}\sum_{s,a} p_h^{t+1}(s,a)\frac{\beta(\bar n_h^t(s,a),\delta)}{\bar n_h^t(s,a)\vee 1}. \tag{4}$$

Since $t < \tau$, the stopping rule implies that

$$\varepsilon \le 2e\sqrt{\pi_1^{t+1}W_1^t(s_1)} + \pi_1^{t+1}W_1^t(s_1).$$

Summing the previous inequalities for $0 \le t < \tau$ and then using the Cauchy-Schwarz inequality, we obtain

$$\tau\varepsilon \le \sum_{t=0}^{\tau-1}\left(2e\sqrt{\pi_1^{t+1}W_1^t(s_1)} + \pi_1^{t+1}W_1^t(s_1)\right) \le 2e\sqrt{\tau\sum_{t=0}^{\tau-1}\pi_1^{t+1}W_1^t(s_1)} + \sum_{t=0}^{\tau-1}\pi_1^{t+1}W_1^t(s_1).$$
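The Cauchy-Schwarz step can be made explicit as follows. The shorthand $x_t \triangleq \pi_1^{t+1}W_1^t(s_1)$ is introduced here only for this worked step and is not notation from the paper.

```latex
\sum_{t=0}^{\tau-1} \sqrt{x_t}
  = \sum_{t=0}^{\tau-1} 1 \cdot \sqrt{x_t}
  \le \sqrt{\sum_{t=0}^{\tau-1} 1^2}\ \sqrt{\sum_{t=0}^{\tau-1} x_t}
  = \sqrt{\tau \sum_{t=0}^{\tau-1} x_t}.
```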

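We now upper-bound the sum that appears on the right-hand side. Using successively (4), the fact that $\beta(\cdot,\delta)$ is increasing, and Lemma 9 of Appendix E.2, we have

$$\begin{aligned}
\sum_{t=0}^{\tau-1}\pi_1^{t+1}W_1^t(s_1) &\le 48e^3H^2\sum_{t=0}^{\tau-1}\sum_{h=1}^{H}\sum_{s,a} p_h^{t+1}(s,a)\frac{\beta(\bar n_h^t(s,a),\delta)}{\bar n_h^t(s,a)\vee 1}\\
&\le 48e^3H^2\beta(\tau-1,\delta)\sum_{h=1}^{H}\sum_{s,a}\sum_{t=0}^{\tau-1}\frac{\bar n_h^{t+1}(s,a)-\bar n_h^t(s,a)}{\bar n_h^t(s,a)\vee 1}\\
&\le 192e^3H^3SA\log(\tau+1)\beta(\tau-1,\delta).
\end{aligned}$$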
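The last step relies on a logarithmic bound for sums of increments divided by running totals. The sketch below checks numerically an inequality of the form $\sum_t a_{t+1}/(S_t \vee 1) \le 4\log(S_T+1)$ for increments $a_t \in [0,1]$ with running total $S_t$. This specific form is an assumption suggested by the constants ($48 \times 4 = 192$); the exact statement of Lemma 9 is in Appendix E.2, and the helper name `pseudo_count_sum` is hypothetical.

```python
import math
import random

def pseudo_count_sum(increments):
    """Return (sum_t a_{t+1} / max(S_t, 1), S_T), where S_t is the running total before step t+1."""
    total, running = 0.0, 0.0
    for a in increments:
        total += a / max(running, 1.0)
        running += a
    return total, running

random.seed(0)
# Each increment lies in [0, 1), like a reach probability p_h^{t+1}(s, a).
increments = [random.random() for _ in range(1000)]
lhs, final_count = pseudo_count_sum(increments)
rhs = 4 * math.log(final_count + 1)  # assumed form of the logarithmic bound
print(lhs <= rhs)
```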

Therefore, combining the above inequality with the previous one, we get

$$\tau\varepsilon \le 30e^3\sqrt{\tau H^3SA\log(\tau+1)\beta(\tau-1,\delta)} + 192e^3H^3SA\log(\tau+1)\beta(\tau-1,\delta).$$

Using Lemma 12, we invert the inequality above and obtain an upper bound on $\tau$, which allows us to conclude the proof of the theorem.
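The inversion can also be carried out numerically, which gives a feel for the size of the resulting bound. The sketch below searches for the largest integer $\tau$ that still satisfies the inequality on $\tau\varepsilon$. It assumes a threshold of the form $\beta(n,\delta) = \log(3SAH/\delta) + S\log(8e(n+1))$, as stated in Theorem 2; whether Theorem 1 uses exactly this threshold is an assumption here, and the function names are hypothetical. It is not a substitute for the analytic inversion of Lemma 12.

```python
import math

def beta(n, delta, S, A, H):
    # Assumption: threshold of the form stated in Theorem 2.
    return math.log(3 * S * A * H / delta) + S * math.log(8 * math.e * (n + 1))

def satisfies(tau, eps, delta, S, A, H):
    """True if tau * eps is at most the right-hand side of the inequality on tau."""
    x = H**3 * S * A * math.log(tau + 1) * beta(tau - 1, delta, S, A, H)
    return tau * eps <= 30 * math.e**3 * math.sqrt(tau * x) + 192 * math.e**3 * x

def largest_tau(eps, delta, S, A, H):
    """Largest integer tau >= 1 satisfying the inequality (0 if none), by doubling then bisection."""
    hi = 1
    while satisfies(hi, eps, delta, S, A, H):
        hi *= 2
    lo = hi // 2  # the inequality holds at lo, or lo == 0 if it already fails at tau = 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if satisfies(mid, eps, delta, S, A, H):
            lo = mid
        else:
            hi = mid
    return lo

tau_max = largest_tau(eps=0.1, delta=0.1, S=2, A=2, H=2)
```

Bisection is valid here because, for this choice of $\beta$, the right-hand side divided by $\tau$ is decreasing in $\tau \ge 1$, so the set of integers satisfying the inequality is an interval starting at 1.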

4 Best-policy identification

Unlike in the previous section, we now consider a more standard setup in which there is a single reward function $r$ and in which the agent observes the reward at each step during the exploration phase. To ease the presentation, we drop the dependence on the reward $r$ in all data-dependent quantities introduced in this section.

Best-policy identification In BPI, the agent interacts with the MDP in a way described in Section 2. Notice that the difference from Section 3 is that the agent also observes the reward.

In each episode $t$, the agent follows a policy $\pi^t$ (the sampling rule) based only on the information collected up to and including episode $t-1$. At the end of each episode, the agent can decide to stop collecting data (we denote by $\tau$ its random stopping time) and output a guess $\hat\pi$ for the optimal policy.

A BPI algorithm is therefore made of a triple $((\pi^t)_{t\in\mathbb{N}}, \tau, \hat\pi)$. The goal is to build an $(\varepsilon,\delta)$-PAC algorithm according to the following definition, for which the sample complexity, that is, the number of exploration episodes $\tau$, is as small as possible.

Definition 2 (PAC algorithm for BPI). An algorithm is $(\varepsilon,\delta)$-PAC for best policy identification if it returns a policy $\hat\pi$ after some number of episodes $\tau$ that satisfies

$$\mathbb{P}\left(V_1^\star(s_1) - V_1^{\hat\pi}(s_1) \le \varepsilon\right) \ge 1-\delta.$$

4.1 BPI-UCBVI

Similarly to Azar et al. [2017] and Zanette and Brunskill [2019], we define upper confidence bounds on the optimal Q-value and value functions as

$$\widetilde Q_h^t(s,a) \triangleq \min\left(H,\ r_h(s,a) + 2\sqrt{\mathrm{Var}_{\hat p_h^t}(\widetilde V_{h+1}^t)(s,a)\frac{\beta^\star(n_h^t(s,a),\delta)}{n_h^t(s,a)}} + 10H^2\frac{\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)} + \frac{1}{H}\hat p_h^t\big(\widetilde V_{h+1}^t - \underset{\sim}{V}_{h+1}^t\big)(s,a) + \hat p_h^t\widetilde V_{h+1}^t(s,a)\right),$$

$$\widetilde V_h^t(s) \triangleq \max_{a\in\mathcal{A}}\widetilde Q_h^t(s,a), \qquad \widetilde V_{H+1}^t(s) \triangleq 0,$$

where $\beta^\star$ is some exploration rate (that does not scale with the number of states $S$) and $\underset{\sim}{V}^t$ is a lower confidence bound on the optimal value function; see Appendix B for a complete definition.

As in RFE, we need to build an upper confidence bound on the gap $V_1^\star(s_1) - V_1^{\pi^{t+1}}(s_1)$ between the value of the optimal policy and the value of the current policy, in order to define the stopping rule. We recursively define the functions $G^t$ as $G_{H+1}^t(s,a) \triangleq 0$ for all $(s,a)$, and for all $(s,a)$ and $h \le H$ as

$$G_h^t(s,a) \triangleq \min\left(H,\ 4\sqrt{\mathrm{Var}_{\hat p_h^t}(\widetilde V_{h+1}^t)(s,a)\frac{\beta^\star(n_h^t(s,a),\delta)}{n_h^t(s,a)}} + 25H^2\frac{\beta(n_h^t(s,a),\delta)}{n_h^t(s,a)} + \left(1+\frac{3}{H}\right)\hat p_h^t\pi_{h+1}^{t+1}G_{h+1}^t(s,a)\right).$$

We prove the following result in Appendix D.

Lemma 2. With probability at least $1-\delta$, for all $t$,

$$V_1^\star(s_1) - V_1^{\pi^{t+1}}(s_1) \le \pi_1^{t+1}G_1^t(s_1).$$

In particular, it holds on the event $\mathcal{G}$ defined in Appendix A.

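We are now ready to define our BPI-UCBVI algorithm.

sampling rule: the policy $\pi^{t+1}$ is the greedy policy with respect to $\widetilde Q_h^t$, that is, $\forall s\in\mathcal{S}, \forall h\in[H],\ \pi_h^{t+1}(s) = \arg\max_{a\in\mathcal{A}}\widetilde Q_h^t(s,a)$

stopping rule: $\tau = \inf\{t\in\mathbb{N} : \pi_1^{t+1}G_1^t(s_1) \le \varepsilon\}$

prediction rule: $\hat\pi = \pi^{\tau+1}$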
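The three rules can be sketched as one planning pass per episode: a backward induction that computes the optimistic Q-values, the greedy policy, and the gap bounds, followed by the stopping test. The sketch below is illustrative only. The lower confidence bound is passed in as the argument `V_lower` because its definition lives in Appendix B, unvisited pairs are handled by replacing $n$ with $\max(n,1)$, and the function name, argument names, and array layout are all assumptions rather than the authors' implementation.

```python
import numpy as np

def bpi_ucbvi_plan(r, p_hat, n, V_lower, delta, eps, s1=0):
    """One planning pass: returns (greedy policy, gap bounds G, stop flag).

    r: (H, S, A) rewards; p_hat: (H, S, A, S) empirical transitions;
    n: (H, S, A) counts; V_lower: (H + 1, S) lower confidence bound (hypothetical input).
    """
    H, S, A = r.shape
    log_term = np.log(3 * S * A * H / delta)
    beta = log_term + S * np.log(8 * np.e * (n + 1))    # beta(n, delta)
    beta_star = log_term + np.log(8 * np.e * (n + 1))   # beta_star(n, delta)
    n_safe = np.maximum(n, 1)  # assumption: avoids division by zero for unvisited pairs
    V = np.zeros((H + 1, S))       # upper bound on the optimal value, zero at step H + 1
    G = np.zeros((H + 1, S, A))    # gap bound, zero at step H + 1
    pi = np.zeros((H, S), dtype=int)
    states = np.arange(S)
    for h in reversed(range(H)):
        mean_next = p_hat[h] @ V[h + 1]                                  # shape (S, A)
        var_next = np.maximum(p_hat[h] @ V[h + 1] ** 2 - mean_next ** 2, 0.0)
        bernstein = np.sqrt(var_next * beta_star[h] / n_safe[h])
        correction = p_hat[h] @ (V[h + 1] - V_lower[h + 1]) / H
        Q = np.minimum(H, r[h] + 2 * bernstein + 10 * H**2 * beta[h] / n_safe[h]
                       + correction + mean_next)
        V[h] = Q.max(axis=1)
        pi[h] = Q.argmax(axis=1)                                         # sampling rule
        next_G = G[h + 1, states, pi[h + 1]] if h + 1 < H else np.zeros(S)
        G[h] = np.minimum(H, 4 * bernstein + 25 * H**2 * beta[h] / n_safe[h]
                          + (1 + 3 / H) * (p_hat[h] @ next_G))
    stop = bool(G[0, s1, pi[0, s1]] <= eps)                              # stopping rule
    return pi, G, stop
```

An outer loop would call this function after each episode, play the returned policy in the next episode, and return that policy as the prediction once `stop` is true.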

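We provide a sample complexity bound for BPI-UCBVI in the next theorem, which we prove in Appendix D.

Theorem 2. For $\delta\in(0,1)$ and $\varepsilon\in(0,1/S^2]$, BPI-UCBVI using thresholds

$$\beta(n,\delta) \triangleq \log(3SAH/\delta) + S\log\big(8e(n+1)\big) \quad\text{and}\quad \beta^\star(n,\delta) \triangleq \log(3SAH/\delta) + \log\big(8e(n+1)\big)$$

is $(\varepsilon,\delta)$-PAC for best policy identification. Moreover, with probability $1-\delta$,

$$\tau \le \frac{H^3SA}{\varepsilon^2}\big(\log(3SAH/\delta)+1\big)C_1 + 1, \quad\text{where } C_1 \triangleq 147e^{22}\log\big(e^{25}(\log(3SAH/\delta)+S)H^3SA/\varepsilon\big).$$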

Therefore, the rate of BPI-UCBVI is of order $\widetilde{\mathcal{O}}\big(H^3SA\log(1/\delta)/\varepsilon^2\big)$ when $\varepsilon$ is small enough, and matches the lower bound of $\Omega\big(H^2SA\log(1/\delta)/\varepsilon^2\big)$ by Dann and Brunskill [2015] up to a factor of $H$ and poly-log terms. To the best of our knowledge, BPI-UCBVI is the first algorithm for BPI whose sample complexity has an optimal dependence on $S$, $A$, $\varepsilon$, and $\delta$. In Remark 1, we discuss the optimality of BPI-UCBVI with respect to $H$.

From regret-minimization to BPI The main difficulty in converting a regret-minimizing algorithm into a BPI algorithm lies in the high-probability prediction of an $\varepsilon$-optimal policy. Indeed, assume that a regret-minimizing algorithm generates a sequence of policies $(\pi^t)_{t\in[T]}$ for a number of episodes $T\in\mathbb{N}$ large enough, with a controlled regret such that, with probability at least $1-\delta'$,

$$\sum_{t=1}^{T} V_1^\star(s_1) - V_1^{\pi^t}(s_1) \le C\sqrt{H^3SA\log(1/\delta')T},$$

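for some constant $C$. Then a straightforward choice for the prediction is to choose $\hat\pi$ uniformly at random among the sequence $(\pi^t)_{t\in[T]}$, as suggested by Jin et al. [2018]. Markov's inequality implies that

$$\mathbb{P}\left(V_1^\star(s_1) - V_1^{\hat\pi}(s_1) \ge \varepsilon\right) \le \frac{1}{\varepsilon}\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} V_1^\star(s_1) - V_1^{\pi^t}(s_1)\right] \le \frac{1}{\varepsilon}\left(C\sqrt{\frac{H^3SA}{T}\log(1/\delta')} + \delta' H\right).$$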
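The second inequality above can be unpacked by splitting the expectation according to whether the regret bound holds. In the worked step below, $\mathcal{E}$ denotes the event of probability at least $1-\delta'$ on which the regret bound holds; this name is introduced here for illustration, and the step assumes rewards in $[0,1]$ so that each gap $V_1^\star(s_1) - V_1^{\pi^t}(s_1)$ is at most $H$.

```latex
\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\Big(V_1^\star(s_1)-V_1^{\pi^t}(s_1)\Big)\right]
\le \frac{1}{T}\,C\sqrt{H^3SA\log(1/\delta')\,T}\ \mathbb{P}(\mathcal{E}) + H\,\mathbb{P}(\mathcal{E}^c)
\le C\sqrt{\frac{H^3SA}{T}\log(1/\delta')} + \delta' H .
```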

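Therefore, if we choose $\delta' \triangleq \frac{\varepsilon}{H}\cdot\frac{\delta}{2}$ and

$$T \triangleq \frac{2H^3SA}{\varepsilon^2\delta^2}\log\left(\frac{2H}{\varepsilon\delta}\right),$$

we get $\mathbb{P}\big(V_1^\star(s_1) - V_1^{\hat\pi}(s_1) \ge \varepsilon\big) \le \delta$, which implies that the described algorithm is $(\varepsilon,\delta)$-PAC for BPI. Unfortunately, this choice of $T$ gives a sample complexity that scales with $1/\delta^2$, whereas we prefer and expect $\log(1/\delta)$.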
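To see how costly the $1/\delta^2$ dependence is, the following sketch compares the number of episodes $T$ given above with the target rate $H^3SA\log(1/\delta)/\varepsilon^2$, with constants and poly-log factors dropped in the latter. The numerical values of $\varepsilon$, $\delta$, $S$, $A$, $H$ and the function names are arbitrary choices for illustration.

```python
import math

def episodes_random_choice(eps, delta, S, A, H):
    """T for the random prediction: 2 H^3 S A / (eps^2 delta^2) * log(2 H / (eps delta))."""
    return 2 * H**3 * S * A / (eps**2 * delta**2) * math.log(2 * H / (eps * delta))

def episodes_log_rate(eps, delta, S, A, H):
    """Target rate H^3 S A log(1 / delta) / eps^2, constants and poly-log factors dropped."""
    return H**3 * S * A * math.log(1 / delta) / eps**2

params = dict(eps=0.1, delta=0.01, S=10, A=4, H=5)
ratio = episodes_random_choice(**params) / episodes_log_rate(**params)
print(f"ratio of episode counts: {ratio:.0f}")
```

With these values the ratio is about $4\times 10^4$, and it grows like $1/\delta^2$ as $\delta$ decreases.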

Another natural choice for the prediction is to return the average policy. Precisely, let $\bar p_h^t(s,a) \triangleq \frac{1}{t}\sum_{\ell=1}^{t} p_h^\ell(s,a)$ be the average state-action distribution. We define the average policy as the one whose state-action distribution is precisely $\bar p^t$,

$$\bar\pi_h^t(a|s) \triangleq \begin{cases} \dfrac{\bar p_h^t(s,a)}{\sum_{b\in\mathcal{A}}\bar p_h^t(s,b)} & \text{if } \sum_{b\in\mathcal{A}}\bar p_h^t(s,b) > 0,\\[2mm] 1/A & \text{otherwise.}\end{cases}$$
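The definition can be illustrated with a small sketch that builds the average policy from the state-action distributions of the individual policies. The array layout and the function name `average_policy` are assumptions for illustration. The overall normalization of the averaged distribution cancels in the ratio, so using a mean or a sum over episodes yields the same policy.

```python
import numpy as np

def average_policy(occupancies):
    """occupancies: (t, H, S, A) array with occupancies[l, h, s, a] = p_h^l(s, a).

    Returns pi_bar of shape (H, S, A), where pi_bar[h, s, a] is the probability of a given s.
    """
    p_bar = occupancies.mean(axis=0)                  # average state-action distribution
    state_mass = p_bar.sum(axis=-1, keepdims=True)    # sum over actions b of p_bar_h(s, b)
    num_actions = p_bar.shape[-1]
    uniform = np.full_like(p_bar, 1.0 / num_actions)  # used for states that are never reached
    safe_mass = np.where(state_mass > 0, state_mass, 1.0)
    return np.where(state_mass > 0, p_bar / safe_mass, uniform)

# Two policies, one step, three states, two actions (toy numbers).
occ = np.array([
    [[[0.9, 0.0], [0.1, 0.0], [0.0, 0.0]]],  # policy 1 always plays action 0
    [[[0.0, 0.1], [0.0, 0.9], [0.0, 0.0]]],  # policy 2 always plays action 1
])
pi_bar = average_policy(occ)
```

In this toy example the average policy plays action 0 with probability 0.9 in the first state and 0.1 in the second, whereas a plain average of the two policies would be uniform in both states.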

Note that the average policy $\bar\pi_h^t(a|s)$ is not the average of the policies, $\sum_{\ell=1}^{t}\pi_h^\ell(a|s)/t$. The interest of $\bar\pi^t$ is that the average regret at time $T$ is equal to the simple regret of the policy $\bar\pi^T$,

$$V_1^\star(s_1) - V_1^{\bar\pi^T}(s_1) \le \frac{1}{T}\sum_{t=1}^{T} V_1^\star(s_1) - V_1^{\pi^t}(s_1) \le C\sqrt{\frac{H^3SA}{T}\log(1/\delta')},$$

with probability $1-\delta'$. Choosing $\delta' \triangleq \delta$ and $T \triangleq CH^3SA\log(1/\delta)/\varepsilon^2$ and predicting the policy $\bar\pi^T$ would lead to an $(\varepsilon,\delta)$-PAC algorithm for BPI with a minimax optimal sample complexity. However, since the agent does not know the transition kernel, it cannot compute the average policy $\bar\pi^T$! Therefore, BPI-UCBVI instead relies on an upper bound on the simple regret (see Lemma 2) that the agent can compute, in order to detect whether the policy currently used by the sampling rule is $\varepsilon$-optimal.

Acknowledgments The research presented was supported by European CHIST-ERA project DELTA, French Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council, French National Research Agency project BOLD (ANR19-CE23-0026-04). Anders Jonsson is partially supported by the Spanish grants TIN2015-67959 and PCIN-2017-082.

References

Alekh Agarwal, Sham Kakade, and Lin F Yang. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, 2020.

Mohammad Gheshlaghi Azar, Rémi Munos, and Bert Kappen. On the sample complexity of reinforcement learning with a generative model. In International Conference on Machine Learning, 2012.

Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349, 2013.

Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, 2017.

Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Uncertainty in Artificial Intelligence, 2009.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities. Oxford University Press, 2013.

Thomas M. Cover and Joy A. Thomas. Elements of information theory. John Wiley & Sons, 2006.

Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Neural Information Processing Systems, 2015.

Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Neural Information Processing Systems, 2017.