
HAL Id: hal-00590972 (https://hal.archives-ouvertes.fr/hal-00590972), submitted on 5 May 2011.


To cite this version: Victor Gabillon, Alessandro Lazaric, Mohammad Ghavamzadeh, Bruno Scherrer. Classification-based Policy Iteration with a Critic. 2011. hal-00590972.


Classification-based Policy Iteration with a Critic

Victor Gabillon victor.gabillon@inria.fr

Alessandro Lazaric alessandro.lazaric@inria.fr

Mohammad Ghavamzadeh mohammad.ghavamzadeh@inria.fr

INRIA Lille - Nord Europe, Team SequeL, FRANCE

Bruno Scherrer bruno.scherrer@inria.fr

INRIA Nancy - Grand Est, Team Maia, FRANCE

Abstract

In this paper, we study the effect of adding a value function approximation component (critic) to rollout classification-based policy iteration (RCPI) algorithms. The idea is to use the critic to approximate the return after we truncate the rollout trajectories. This allows us to control the bias and variance of the rollout estimates of the action-value function. Therefore, the introduction of a critic can improve the accuracy of the rollout estimates, and as a result, enhance the performance of the RCPI algorithm. We present a new RCPI algorithm, called direct policy iteration with critic (DPI-Critic), and provide its finite-sample analysis when the critic is based on LSTD and BRM methods. We empirically evaluate the performance of DPI-Critic and compare it with DPI and LSPI in two benchmark reinforcement learning problems.

1. Introduction

Policy iteration is a method of computing an optimal policy for any given Markov decision process (MDP). It is an iterative procedure that discovers a deterministic optimal policy by generating a sequence of monotonically improving policies. Each iteration $k$ of this algorithm consists of two phases: policy evaluation, in which the action-value function $Q^{\pi_k}$ of the current policy $\pi_k$ is computed, and policy improvement, in which the new (improved) policy $\pi_{k+1}$ is generated as the greedy policy w.r.t. $Q^{\pi_k}$, i.e., $\pi_{k+1}(x) = \arg\max_{a \in A} Q^{\pi_k}(x,a)$. Unfortunately, in MDPs with large (or continuous) state and/or action spaces, the policy evaluation problem cannot be solved exactly and approximation techniques are required. There have been two main approaches to deal with this issue in the literature. The most common approach is to find a good approximation of the action-value function of $\pi_k$ in a real-valued function space (see e.g., Lagoudakis & Parr 2003a). The second approach 1) replaces the policy evaluation step (approximating the action-value function over the entire state-action space) with computing rollout estimates of $Q^{\pi}$ over a finite number of states $D = \{x_i\}_{i=1}^N$, called the rollout set, and the entire action space, and 2) casts the policy improvement step as a classification problem to find a policy in a given hypothesis space that best predicts the greedy action at every state (see e.g., Lagoudakis & Parr 2003b; Fern et al. 2004; Lazaric et al. 2010b). Although whether selecting a suitable policy space is any easier than selecting a value function space is highly debatable, it may be argued that classification-based API methods can be advantageous in problems where good policies are easier to represent and learn than their value functions.

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).

As suggested by both theoretical and empirical analysis, the performance of classification-based API algorithms is closely related to the accuracy in estimating the greedy action at each state of the rollout set, which itself depends on the accuracy of the rollout estimates of the action-values. Thus, it is quite important to balance the bias and variance of the rollout estimates, $\hat{Q}^{\pi}$'s, both of which depend on the length $H$ of the rollout trajectories. While the bias in $\hat{Q}^{\pi}$, i.e., the difference between $\hat{Q}^{\pi}$ and the actual $Q^{\pi}$, decreases as $H$ becomes larger, its variance (due to stochastic MDP transitions and rewards) increases with the value of $H$. Although the bias and variance of the $\hat{Q}^{\pi}$ estimates may be optimized by the value of $H$, when the budget, i.e., the number of calls to the generative model, is limited, it may not be possible to find an $H$ that guarantees an accurate enough training set.


A possible approach to address this problem is to introduce a critic that provides an approximation of the value function. In this approach, we define each $\hat{Q}^{\pi}$ estimate as the average of the values returned by $H$-horizon rollouts plus the critic's prediction of the return from the time step $H$ on. This allows us to use small values of $H$, thus having a small estimation variance, and at the same time, to rely on the value function approximation provided by the critic to control the bias. The idea is similar to actor-critic methods (Barto et al., 1983) in which the variance of the gradient estimates in the actor is reduced using the critic's prediction of the value function.

In this paper, we introduce a new classification-based API algorithm, called DPI-Critic, obtained by adding a critic to the direct policy iteration (DPI) algorithm (Lazaric et al., 2010b). We provide a finite-sample analysis for DPI-Critic when the critic approximates the value function using least-squares temporal-difference (LSTD) learning (Bradtke & Barto, 1996). The finite-sample analysis of DPI-Critic with Bellman residual minimization is available in Section 7.1. We empirically evaluate the performance of DPI-Critic and compare it with DPI and LSPI (Lagoudakis & Parr, 2003a) on two benchmark reinforcement learning (RL) problems: mountain car and inverted pendulum. The results indicate that DPI-Critic can take advantage of both its components and improve over DPI and LSPI.

2. Preliminaries

In this section we set the notation used throughout the paper. For a measurable space with domain $\mathcal{X}$, we let $\mathcal{S}(\mathcal{X})$ and $\mathcal{B}(\mathcal{X}; L)$ denote the set of probability measures over $\mathcal{X}$, and the space of bounded measurable functions with domain $\mathcal{X}$ and bound $0 < L < \infty$, respectively. For a measure $\rho \in \mathcal{S}(\mathcal{X})$ and a measurable function $f : \mathcal{X} \to \mathbb{R}$, we define the $\ell_p(\rho)$-norm of $f$ as $\|f\|_{p,\rho}^p = \int |f(x)|^p \rho(dx)$. We consider the standard RL framework (Sutton & Barto, 1998) in which a learning agent interacts with a stochastic environment and this interaction is modeled as a discrete-time MDP. A discounted MDP is a tuple $\mathcal{M} = \langle \mathcal{X}, A, r, p, \gamma \rangle$, where the state space $\mathcal{X}$ is a subset of a Euclidean space $\mathbb{R}^d$, the set of actions $A$ is finite ($|A| < \infty$), the reward function $r : \mathcal{X} \times A \to \mathbb{R}$ is uniformly bounded by $R_{\max}$, the transition model $p(\cdot|x,a)$ is a distribution over $\mathcal{X}$, and $\gamma \in (0,1)$ is a discount factor. We define deterministic policies as mappings $\pi : \mathcal{X} \to A$. The value function of a policy $\pi$, $V^{\pi}$, is the unique fixed-point of the Bellman operator $T^{\pi} : \mathcal{B}(\mathcal{X}; V_{\max} = \frac{R_{\max}}{1-\gamma}) \to \mathcal{B}(\mathcal{X}; V_{\max})$ defined by

$(T^{\pi} V)(x) = r(x, \pi(x)) + \gamma \int_{\mathcal{X}} p(dy|x, \pi(x))\, V(y)$,

Input: policy space Π, state distribution ρ
Initialize: Let π0 ∈ Π be an arbitrary policy
for k = 0, 1, 2, . . . do
  Construct the rollout set $D_k = \{x_i\}_{i=1}^N$, $x_i \stackrel{iid}{\sim} \rho$
  • Critic:
    Construct the set $S_k$ of $n$ samples (e.g., by following a trajectory or by using the generative model)
    $\hat{V}^{\pi_k} \leftarrow$ VF-APPROX($S_k$)   (critic)
  • Rollout:
    for all states $x_i \in D_k$ and actions $a \in A$ do
      for j = 1 to M do
        Perform a rollout and return $R_j(x_i, a)$
      end for
      $\hat{Q}^{\pi_k}(x_i, a) = \frac{1}{M}\sum_{j=1}^M R_j(x_i, a)$
    end for
  $\pi_{k+1} = \arg\min_{\pi \in \Pi} \hat{L}_{\pi_k}(\hat{\rho}; \pi)$   (classifier)
end for

Figure 1. The pseudo-code of the DPI-Critic algorithm.

while the action-value function $Q^{\pi}$ is defined as

$Q^{\pi}(x,a) = r(x,a) + \gamma \int_{\mathcal{X}} p(dy|x,a)\, V^{\pi}(y)$.

Since the rewards are bounded by $R_{\max}$, all values and action-values are bounded by $q = \frac{R_{\max}}{1-\gamma}$. A policy $\pi$ is greedy w.r.t. an action-value function $Q$ if $\pi(x) \in \arg\max_{a \in A} Q(x,a)$, $\forall x \in \mathcal{X}$.

To approximate value functions, we use a linear approximation architecture with parameters $\alpha \in \mathbb{R}^d$ and basis functions $\varphi_j \in \mathcal{B}(\mathcal{X}; L)$, $j = 1, \ldots, d$. We denote by $\phi : \mathcal{X} \to \mathbb{R}^d$, $\phi(\cdot) = \big(\varphi_1(\cdot), \ldots, \varphi_d(\cdot)\big)^{\top}$ the feature vector, and by $\mathcal{F}$ the linear function space spanned by the features $\varphi_j$, i.e., $\mathcal{F} = \{f_{\alpha}(\cdot) = \phi(\cdot)^{\top}\alpha : \alpha \in \mathbb{R}^d\}$. Finally, we define the Gram matrix $G \in \mathbb{R}^{d \times d}$ w.r.t. a distribution $\rho \in \mathcal{S}(\mathcal{X})$ as

$G_{ij} = \int \varphi_i(x)\, \varphi_j(x)\, \rho(dx), \quad i,j = 1, \ldots, d$.
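To make the linear architecture concrete, the following sketch (our own illustration, not part of the paper) builds a feature map $\phi$ from Gaussian RBFs and estimates the Gram matrix $G$ by Monte Carlo from states drawn from $\rho$; the RBF form and the sampling-based approximation of the integral are our own assumptions.

import numpy as np

def make_rbf_features(centers, bandwidth):
    """Return phi: X -> R^d with components phi_j(x) = exp(-||x - c_j||^2 / (2*bandwidth^2))."""
    centers = np.asarray(centers)              # shape (d, state_dim)
    def phi(x):
        d2 = np.sum((centers - np.asarray(x)) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * bandwidth ** 2))
    return phi

def empirical_gram_matrix(phi, states):
    """Monte Carlo estimate of G_ij = int phi_i(x) phi_j(x) rho(dx) from states x ~ rho."""
    Phi = np.array([phi(x) for x in states])   # shape (n, d)
    return Phi.T @ Phi / len(states)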

3. The DPI-Critic Algorithm

In this section, we outline the algorithm we propose in this paper, called Direct Policy Iteration with Critic (DPI-Critic), which is an extension of the DPI algorithm (Lazaric et al., 2010b) obtained by adding a critic. As illustrated in Fig. 1, DPI-Critic starts with an arbitrary initial policy $\pi_0 \in \Pi$. At each iteration $k$, we build a set of $n$ samples $S_k$, called the critic training set. The critic uses $S_k$ in order to compute $\hat{V}^{\pi_k}$, an approximation of the value function of the current policy $\pi_k$. Then, a new policy $\pi_{k+1}$ is computed from $\pi_k$ as the best approximation of the greedy policy w.r.t. $Q^{\pi_k}$, by solving a cost-sensitive classification problem. Similar to DPI, DPI-Critic is based on the following loss function and expected error:

$\ell_{\pi_k}(x; \pi) = \max_{a \in A} Q^{\pi_k}(x,a) - Q^{\pi_k}(x, \pi(x)), \quad \forall x \in \mathcal{X},$

$L_{\pi_k}(\rho; \pi) = \int_{\mathcal{X}} \Big[ \max_{a \in A} Q^{\pi_k}(x,a) - Q^{\pi_k}(x, \pi(x)) \Big] \rho(dx)$.

In order to minimize this loss, a rollout set $D_k$ is built by sampling $N$ states i.i.d. from a distribution $\rho$. For each state $x_i \in D_k$ and each action $a \in A$, $M$ independent estimates $\{R_j^{\pi_k}(x_i,a)\}_{j=1}^M$ are computed, where

$R_j^{\pi_k}(x_i,a) = R_j^{\pi_k,H}(x_i,a) + \gamma^H \hat{V}^{\pi_k}(x_{i,j}^H)$,   (1)

in which $R_j^{\pi_k,H}(x_i,a)$ is the outcome of an $H$-horizon rollout, i.e.,

$R_j^{\pi_k,H}(x_i,a) = r(x_i,a) + \sum_{t=1}^{H-1} \gamma^t r\big(x_{i,j}^t, \pi_k(x_{i,j}^t)\big)$,   (2)

and $\hat{V}^{\pi_k}(x_{i,j}^H)$ is the critic's estimate of the value function at state $x_{i,j}^H$. In Eq. 2, $(x_i, x_{i,j}^1, x_{i,j}^2, \ldots, x_{i,j}^H)$ is the trajectory induced by taking action $a$ at state $x_i$ and following the policy $\pi_k$ afterwards, i.e., $x_{i,j}^1 \sim p(\cdot|x_i,a)$ and $x_{i,j}^t \sim p\big(\cdot|x_{i,j}^{t-1}, \pi_k(x_{i,j}^{t-1})\big)$ for $t \geq 2$. An estimate of the action-value function of the policy $\pi_k$ is then obtained by averaging the $M$ estimates as

$\hat{Q}^{\pi_k}(x_i,a) = \frac{1}{M} \sum_{j=1}^M R_j^{\pi_k}(x_i,a)$.   (3)

Given the action-value function estimates, the empirical loss and empirical error are defined as

$\hat{\ell}_{\pi_k}(x;\pi) = \max_{a\in A} \hat{Q}^{\pi_k}(x,a) - \hat{Q}^{\pi_k}(x,\pi(x)), \quad \forall x \in \mathcal{X},$

$\hat{L}_{\pi_k}(\hat{\rho};\pi) = \frac{1}{N} \sum_{i=1}^N \Big[\max_{a\in A} \hat{Q}^{\pi_k}(x_i,a) - \hat{Q}^{\pi_k}(x_i,\pi(x_i))\Big]$.   (4)

Finally, DPI-Critic makes use of a classifier which solves a multi-class cost-sensitive classification problem and returns a policy that minimizes the empirical error $\hat{L}_{\pi_k}(\hat{\rho};\pi)$ over the policy space $\Pi$.
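As an illustration of Eqs. 1-3, the sketch below estimates $\hat{Q}^{\pi_k}(x_i,a)$ by averaging $M$ truncated rollouts, each completed by the critic's prediction at the last state. The interfaces `step` (a one-step generative model returning a reward and a next state) and `v_hat` (the critic $\hat{V}^{\pi_k}$) are assumptions of ours, not part of the original paper.

import numpy as np

def rollout_return(step, policy, v_hat, x, a, H, gamma, rng):
    """One H-horizon rollout from (x, a) plus gamma^H * critic prediction at the last state (Eq. 1)."""
    r, x_next = step(x, a, rng)                 # first transition uses the evaluated action a
    ret, discount = r, 1.0
    for t in range(1, H):
        discount *= gamma
        r, x_next = step(x_next, policy(x_next), rng)
        ret += discount * r                     # gamma^t * r(x^t, pi(x^t)) as in Eq. 2
    return ret + gamma ** H * v_hat(x_next)     # critic completes the truncated trajectory

def q_estimate(step, policy, v_hat, x, a, H, M, gamma, rng):
    """Average of M independent rollout returns (Eq. 3)."""
    return np.mean([rollout_return(step, policy, v_hat, x, a, H, gamma, rng) for _ in range(M)])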

As can be seen from Eq. 1, the main difference between DPI-Critic and DPI is that after $H$ steps DPI rollouts are truncated and the return thereafter is implicitly set to 0, while in DPI-Critic an approximation of the value function learned by the critic is used to predict this return. Hence, with a fixed horizon $H$, even if the critic is a rough approximation of the value function, whenever its accuracy is higher than the implicit prediction of 0 in DPI, the rollouts in DPI-Critic are expected to be more accurate than those in DPI. Similarly, we expect DPI-Critic to obtain the same accuracy as DPI with a shorter horizon, and as a result, a smaller number of interactions with the generative model. In fact, while in DPI decreasing $H$ leads to a smaller variance and a larger bias, in DPI-Critic the increase in the bias is controlled by the critic. Finally, it is worth noting that DPI-Critic still benefits from the advantages of the classification-based approach to policy iteration compared to value-function-based API algorithms such as LSPI. This is due to the fact that DPI-Critic still relies on approximating the policy improvement step, and thus, similar to DPI, whenever approximating good policies is easier than approximating their value functions, DPI-Critic is expected to perform better than its value-function-based counterparts. Furthermore, while DPI-Critic only needs a rough approximation of the value function at certain states, value-function-based API methods, like LSPI, need an accurate approximation of the action-value function over the entire state-action space, and thus they usually require more samples than the critic in DPI-Critic.
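The pieces of Fig. 1 can be put together as in the sketch below. This is a simplified rendering under our own assumptions: `collect_critic_samples`, `fit_value_function` (e.g., the LSTD sketch given after Proposition 1 in Section 4), `q_estimate_fn` (a variant of the rollout sketch above with $H$, $M$, $\gamma$ and the generative model bound inside), and `fit_cost_sensitive_classifier` are hypothetical helpers standing in for the critic, rollout, and classifier components.

def dpi_critic(pi0, actions, sample_rollout_state, collect_critic_samples,
               fit_value_function, q_estimate_fn, fit_cost_sensitive_classifier,
               n_iterations, N, rng):
    """Illustrative rendering of the DPI-Critic loop of Fig. 1; all helpers are assumed interfaces."""
    policy = pi0
    for _ in range(n_iterations):
        # Critic: build the critic training set S_k and approximate V^{pi_k}
        S_k = collect_critic_samples(policy, rng)
        v_hat = fit_value_function(S_k)                       # e.g., LSTD or BRM
        # Rollout: build the rollout set D_k and the action-value estimates of Eq. 3
        D_k = [sample_rollout_state(rng) for _ in range(N)]   # x_i ~ rho, i.i.d.
        q_hat = {(i, a): q_estimate_fn(policy, v_hat, x_i, a)
                 for i, x_i in enumerate(D_k) for a in actions}
        # Classifier: minimize the empirical error of Eq. 4 over the policy space;
        # the helper is assumed to return the new policy as a callable state -> action
        policy = fit_cost_sensitive_classifier(D_k, actions, q_hat)
    return policy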

4. Theoretical analysis

In this section, we provide a finite-sample analysis of the error incurred at each iteration of DPI-Critic. The full analysis of the propagation is reported in Section 7.3.

In order to use the existing finite-sample bounds for pathwise-LSTD (Lazaric et al., 2010c), we introduce the following assumptions.

Assumption 1. At each iteration $k$ of DPI-Critic, the critic uses a linear function space $\mathcal{F}$ spanned by $d$ bounded basis functions (see Section 2). A data set $S_k = \{(X_i, R_i)\}_{i=1}^n$ is built, where the $X_i$'s are obtained by following a single trajectory generated by a stationary $\beta$-mixing process with parameters $\hat{\beta}, b, \kappa$, and a stationary distribution $\sigma_k$ equal to the stationary distribution of the Markov chain induced by policy $\pi_k$, and $R_i = r\big(X_i, \pi_k(X_i)\big)$.

Assumption 2. The rollout set sampling distribution $\rho$ is such that for any policy $\pi \in \Pi$ and any action $a \in A$, $\mu = \rho P^a (P^{\pi})^{H-1} \leq C\sigma$, where $C < \infty$ is a constant and $\sigma$ is the stationary distribution of $\pi$. The distribution $\mu$ is the distribution induced by starting at a state sampled from $\rho$, taking action $a$, and then following policy $\pi$ for $H-1$ steps.

Before stating the main results of this section, Lemma 1 and Theorem 1, we report the performance bound for pathwise-LSTD as in Lazaric et al. (2010c). Since all the following statements are true for any iteration $k$, in order to simplify the notation, we drop the dependency of all the variables on $k$.

Proposition 1 (Thm. 5 in Lazaric et al. 2010c). Let $n$ be the number of samples collected as in Assumption 1 and $\hat{V}^{\pi}$ be the approximation of the value function of policy $\pi$ returned by pathwise-LSTD truncated in the range $[-q, q]$. Then for any $\delta > 0$, we have

$\|V^{\pi} - \hat{V}^{\pi}\|_{2,\sigma} \leq \epsilon_{\mathrm{LSTD}} = \left[ \frac{2}{\sqrt{1-\gamma^2}}\Big( 2\sqrt{2}\, \inf_{f\in\mathcal{F}}\|V^{\pi} - f\|_{2,\sigma} + E_2 \Big) + \frac{2}{1-\gamma}\Big( \gamma q L \sqrt{\frac{8d}{\omega}}\Big( \sqrt{\frac{8\log(32|A|d/\delta)}{n}} + \frac{1}{n} \Big) + E_1 \Big) \right]$

with probability $1-\delta$ (w.r.t. the samples in $S$), where

(1) $E_1 = 24 q \sqrt{\frac{2\Lambda_1(n,d,\delta/4)}{n} \max\big\{\frac{\Lambda_1(n,d,\delta/4)}{b}, 1\big\}^{1/\kappa}}$, in which $\Lambda_1(n,d,\delta) = 2(d+1)\log n + \log\frac{e}{\delta} + \log^+\big(\max\{18(6e)^{2(d+1)}, \hat{\beta}\}\big)$,

(2) $E_2 = 12\big(q + L\|\alpha^*\|\big) \sqrt{\frac{2\Lambda_2(n,\delta/4)}{n} \max\big\{\frac{\Lambda_2(n,\delta/4)}{b}, 1\big\}^{1/\kappa}}$, in which $\Lambda_2(n,\delta) = \log\frac{e}{\delta} + \log\big(\max\{6, n\hat{\beta}\}\big)$ and $\alpha^* = \arg\min_{\alpha\in\mathbb{R}^d} \|V^{\pi} - f_{\alpha}\|_{2,\sigma}$,

(3) $\omega > 0$ is the smallest strictly positive eigenvalue of the Gram matrix w.r.t. the distribution $\sigma$.
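For concreteness, a minimal LSTD sketch for the critic is given below (our own illustration). It computes the standard single-trajectory LSTD solution $\hat{V}^{\pi} = f_{\hat\alpha}$ with $\hat\alpha$ obtained from the empirical system $\hat{A}\alpha = \hat{b}$ (a pseudo-inverse is used for robustness); it omits the pathwise correction and the truncation to $[-q,q]$ analyzed in Lazaric et al. (2010c), so it only illustrates the kind of estimator the bound refers to.

import numpy as np

def lstd_value_function(states, rewards, phi, gamma):
    """Fit f_alpha(x) = phi(x)^T alpha by LSTD from a single trajectory (X_1, R_1, ..., X_n, R_n)."""
    Phi = np.array([phi(x) for x in states])            # (n, d) features along the trajectory
    A = Phi[:-1].T @ (Phi[:-1] - gamma * Phi[1:])       # sum_i phi(X_i)(phi(X_i) - gamma*phi(X_{i+1}))^T
    b = Phi[:-1].T @ np.asarray(rewards)[:-1]           # sum_i phi(X_i) R_i
    alpha = np.linalg.pinv(A) @ b
    return lambda x: float(phi(x) @ alpha)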

In the following lemma, we derive a bound on the difference between the actual action-value function of policy $\pi$ and its estimate computed by DPI-Critic.

Lemma 1. Let Assumptions 1 and 2 hold and $D = \{x_i\}_{i=1}^N$ be the rollout set with $x_i \stackrel{iid}{\sim} \rho$. Let $Q^{\pi}$ be the true action-value function of policy $\pi$ and $\hat{Q}^{\pi}$ be its estimate computed by DPI-Critic using $M$ rollouts with horizon $H$ (Eqs. 1-3). Then for any $\delta > 0$,

$\max_{a \in A} \frac{1}{N}\sum_{i=1}^N \big( Q^{\pi}(x_i,a) - \hat{Q}^{\pi}(x_i,a) \big) \leq \epsilon_1 + \epsilon_2 + \epsilon_3 + \epsilon_4$,

with probability $1-\delta$ (w.r.t. the rollout estimates and the samples in the critic training set $S$), where

$\epsilon_1 = (1-\gamma^H)\, q \sqrt{\frac{2\log(4|A|/\delta)}{MN}}, \quad \epsilon_2 = \gamma^H q \sqrt{\frac{2\log(4|A|/\delta)}{MN}},$

$\epsilon_3 = 24\, \gamma^H q \sqrt{\frac{2\Lambda\big(N,d,\frac{\delta}{4|A|M}\big)}{N}}, \quad \epsilon_4 = 2\gamma^H \sqrt{C}\, \epsilon_{\mathrm{LSTD}},$

with $\Lambda(N,d,\delta) = \log\big(\frac{9e}{\delta}(12Ne)^{2(d+1)}\big)$.

Proof. We prove the following series of inequalities:

$\frac{1}{N}\sum_{i=1}^N \big(Q^{\pi}(x_i,a) - \hat{Q}^{\pi}(x_i,a)\big)$
$\stackrel{(a)}{=} \frac{1}{MN}\sum_{i=1}^N\sum_{j=1}^M \big(Q^{\pi}(x_i,a) - R_j^{\pi}(x_i,a)\big)$
$\stackrel{(b)}{\leq} \frac{1}{MN}\sum_{i=1}^N\sum_{j=1}^M \big(Q_H^{\pi}(x_i,a) - R_j^{\pi,H}(x_i,a)\big) + \frac{\gamma^H}{MN}\sum_{i=1}^N\sum_{j=1}^M \big|\hat{V}^{\pi}(x_{i,j}^H) - \mathbb{E}_{x\sim\nu_i}[V^{\pi}(x)]\big|$
$\stackrel{(c)}{\leq} \epsilon_1 + \frac{\gamma^H}{MN}\sum_{i=1}^N\sum_{j=1}^M \big|\hat{V}^{\pi}(x_{i,j}^H) - V^{\pi}(x_{i,j}^H)\big| + \frac{\gamma^H}{MN}\sum_{i=1}^N\sum_{j=1}^M \big(V^{\pi}(x_{i,j}^H) - \mathbb{E}_{x\sim\nu_i}[V^{\pi}(x)]\big)$   w.p. $1-\delta'$
$\stackrel{(d)}{\leq} \epsilon_1 + \epsilon_2 + \frac{\gamma^H}{M}\sum_{j=1}^M \|V^{\pi} - \hat{V}^{\pi}\|_{1,\hat{\mu}_j}$   w.p. $1-2\delta'$
$\stackrel{(e)}{\leq} \epsilon_1 + \epsilon_2 + \frac{\gamma^H}{M}\sum_{j=1}^M \|V^{\pi} - \hat{V}^{\pi}\|_{2,\hat{\mu}_j}$   w.p. $1-2\delta'$
$\stackrel{(f)}{\leq} \epsilon_1 + \epsilon_2 + \epsilon_3 + 2\gamma^H \|V^{\pi} - \hat{V}^{\pi}\|_{2,\mu}$   w.p. $1-3\delta'$
$\stackrel{(g)}{\leq} \epsilon_1 + \epsilon_2 + \epsilon_3 + 2\gamma^H\sqrt{C}\,\|V^{\pi} - \hat{V}^{\pi}\|_{2,\sigma}$
$\stackrel{(h)}{\leq} \epsilon_1 + \epsilon_2 + \epsilon_3 + 2\gamma^H\sqrt{C}\,\epsilon_{\mathrm{LSTD}}$   w.p. $1-4\delta'$

The statement of the lemma is obtained by setting $\delta' = \delta/4$ and taking a union bound over actions.

(a) We use Eq. 3 to replace $\hat{Q}^{\pi}(x_i,a)$.

(b) We replace $R_j^{\pi}(x_i,a)$ from Eq. 1 and use the fact that $Q^{\pi}(x_i,a) = Q_H^{\pi}(x_i,a) + \gamma^H \mathbb{E}_{x\sim\nu_i}[V^{\pi}(x)]$, where $Q_H^{\pi}(x_i,a) = \mathbb{E}\big[r(x_i,a) + \sum_{t=1}^{H-1}\gamma^t r\big(x_i^t, \pi(x_i^t)\big)\big]$ and $\nu_i = \delta(x_i) P^a (P^{\pi})^{H-1}$ is the distribution over states induced by starting at state $x_i$, taking action $a$, and then following the policy $\pi$ for $H-1$ steps. We split the sum using the triangle inequality.

(c) Using the Chernoff-Hoeffding inequality, with probability $1-\delta'$ (w.r.t. the samples used to build the rollout estimates), we have

$\frac{1}{MN}\sum_{i=1}^N\sum_{j=1}^M \big(Q_H^{\pi}(x_i,a) - R_j^{\pi,H}(x_i,a)\big) \leq \epsilon_1 = (1-\gamma^H)\, q \sqrt{\frac{2\log(1/\delta')}{MN}}$.

(d) Using the Chernoff-Hoeffding inequality, with probability $1-\delta'$ (w.r.t. the last state reached by the rollout trajectories), we have

$\frac{\gamma^H}{MN}\sum_{i=1}^N\sum_{j=1}^M \big(V^{\pi}(x_{i,j}^H) - \mathbb{E}_{x\sim\nu_i}[V^{\pi}(x)]\big) \leq \epsilon_2 = \gamma^H q \sqrt{\frac{2\log(1/\delta')}{MN}}$.

We also use the definition of the empirical $\ell_1$-norm and replace the second term with $\|V^{\pi} - \hat{V}^{\pi}\|_{1,\hat{\mu}_j}$, where $\hat{\mu}_j$ is the empirical distribution corresponding to the distribution $\mu = \rho P^a (P^{\pi})^{H-1}$. In fact, for any $1 \leq j \leq M$, the samples $x_{i,j}^H$ are i.i.d. from $\mu$.

(e) We move from ℓ1-norm to ℓ2-norm using the Cauchy-Schwarz inequality.

(f) Note that $\hat{V}$ is a random variable independent of the samples used to build the rollout estimates. Using Corollary 12 in Lazaric et al. (2010c), we have

$\|V^{\pi} - \hat{V}^{\pi}\|_{2,\hat{\mu}_j} \leq 2\|V^{\pi} - \hat{V}^{\pi}\|_{2,\mu} + \epsilon_3(\delta'')$

with probability $1-\delta''$ (w.r.t. the samples in $\hat{\mu}_j$) for any $j$, and $\epsilon_3(\delta'') = 24 q \sqrt{\frac{2\Lambda(N,d,\delta'')}{N}}$. By taking a union bound over all $j$'s and setting $\delta'' = \delta'/M$, we obtain the definition of $\epsilon_3$ in the final statement.

(g) Using Assumption 2, we have $\|V^{\pi} - \hat{V}\|_{2,\mu} \leq \sqrt{C}\,\|V^{\pi} - \hat{V}\|_{2,\sigma}$.

(h) We replace $\|V^{\pi} - \hat{V}\|_{2,\sigma}$ using Proposition 1.


Using the result of Lemma 1, we now prove a performance bound for a single iteration of DPI-Critic.

Theorem 1. Let $\Pi$ be a policy space with finite VC-dimension $h = VC(\Pi) < \infty$ and $\rho$ be a distribution over the state space $\mathcal{X}$. Let $N$ be the number of states in $D_k$ drawn i.i.d. from $\rho$, $H$ be the horizon of the rollouts, $M$ be the number of rollouts per state-action pair, and $\hat{V}^{\pi_k}$ be the estimate of the value function returned by the critic. Let Assumptions 1 and 2 hold and $\pi_{k+1} = \arg\min_{\pi \in \Pi} \hat{L}_{\pi_k}(\hat{\rho}; \pi)$ be the policy computed at the $k$-th iteration of DPI-Critic. Then, for any $\delta > 0$, we have

$L_{\pi_k}(\rho; \pi_{k+1}) \leq \inf_{\pi \in \Pi} L_{\pi_k}(\rho; \pi) + 2(\epsilon_0 + \epsilon_1 + \epsilon_2 + \epsilon_3 + \epsilon_4)$,   (5)

with probability $1-\delta$, where

$\epsilon_0 = 16 q \sqrt{\frac{2}{N}\Big( h \log\frac{eN}{h} + \log\frac{32}{\delta} \Big)}$.

The proof is reported in Section 7.2.

Remark 1. The terms in the bound of Theorem 1 are related to the performance at each iteration of DPI-Critic. The first term, $\inf_{\pi\in\Pi} L_{\pi_k}(\rho;\pi)$, is the approximation error of the policy space $\Pi$, i.e., the error of the best approximation of the greedy policy in $\Pi$. Since the classifier relies on a finite number of samples in its training set, it is not able to recover the optimal approximation and incurs an additional estimation error $\epsilon_0$ which decreases as $O(N^{-1/2})$. Furthermore, the training set of the classifier is built according to action-value estimates, whose accuracy is bounded by the remaining terms. The term $\epsilon_1$ accounts for the variance of the rollout estimates due to the limited number of rollouts for each state in the rollout set. While it decreases as $M$ and $N$ increase, it increases with $H$, because longer rollouts have a larger variance due to the stochasticity in the MDP dynamics. The terms $\epsilon_2$, $\epsilon_3$, and $\epsilon_4$ are related to the bias induced by truncating the rollouts. They all share a factor $\gamma^H$ decaying exponentially with $H$ and are strictly related to the critic's prediction of the return from time step $H$ on. While $\epsilon_3$ depends on the specific function approximation algorithm used by the critic (LSTD in our analysis) only through the dimension $d$ of the function space $\mathcal{F}$, $\epsilon_4$ is strictly related to LSTD's performance, which depends on the size $n$ of its training set and the accuracy of its function space, i.e., the approximation error $\inf_{f\in\mathcal{F}}\|V^{\pi} - f\|_{2,\sigma}$.

Remark 2. We now compare the result of Theorem 1 with the corresponding result for DPI in Lazaric et al. (2010b), which bounds the performance as

$L_{\pi_k}(\rho; \pi_{k+1}) \leq \inf_{\pi \in \Pi} L_{\pi_k}(\rho; \pi) + 2(\epsilon_0 + \epsilon_1 + \gamma^H q)$.   (6)

While the approximation error $\inf_{\pi\in\Pi} L_{\pi_k}(\rho;\pi)$ and the estimation errors $\epsilon_0$ and $\epsilon_1$ are the same in Eqs. 5 and 6, the difference in the way that these algorithms handle the rollouts after $H$ steps leads to the term $\gamma^H q$ in DPI and the terms $\epsilon_2$, $\epsilon_3$, and $\epsilon_4$ in DPI-Critic. The terms $\epsilon_2$, $\epsilon_3$, and $\epsilon_4$ have the term $\gamma^H q$ multiplied by a factor which decreases with the number of rollout states $N$, the number of rollouts $M$, and the size of the critic training set $n$. For large enough values of $N$ and $n$, this multiplicative factor is smaller than 1, thus making $\epsilon_2 + \epsilon_3 + \epsilon_4$ smaller than $\gamma^H q$ in DPI. Furthermore, since these $\epsilon$ values upper bound the difference between quantities bounded in $[-q, q]$, their values cannot exceed $\gamma^H q$. This comparison supports the idea that introducing a critic improves the accuracy of the truncated rollout estimates by reducing the bias with no increase in the variance.

Remark 3. Although Theorem 1 reveals the potential advantage of DPI-Critic w.r.t. DPI, the comparison in Remark 2 does not take into consideration that DPI-Critic uses $n$ more samples than DPI, thus making the comparison potentially unfair. We now analyze the case when the total budget (number of calls to the generative model) of DPI-Critic is fixed to $B$. The total budget is split in two parts: 1) $B_R = B(1-p)$, the budget available for the rollout estimates, and 2) $B_C = Bp = n$, the number of samples used by the critic, where $p \in (0,1)$ is the critic ratio of the total budget. By substituting $B_R$ and $B_C$ in the bound of Theorem 1 and setting $M = 1$, we note that for a fixed $H$, while increasing $p$ increases the estimation error terms $\epsilon_0$, $\epsilon_1$, $\epsilon_2$, and $\epsilon_3$ (the rollout set becomes smaller), it decreases the estimation error of LSTD, $\epsilon_4$ (the critic's training set becomes larger). This trade-off (later referred to as the critic trade-off) is optimized by a specific value $p = p^*$ which minimizes the expected error of DPI-Critic. By comparing the bounds of DPI and DPI-Critic, we first note that for any fixed $p$, DPI benefits from a larger number of samples to build the rollout estimates, and thus has smaller estimation errors $\epsilon_0$ and $\epsilon_1$ than DPI-Critic. However, as pointed out in Remark 2, the bias term $\gamma^H q$ in the DPI bound is always worse than the corresponding term in the DPI-Critic bound. As a result, whenever the advantage obtained by relying on the critic is larger than the loss in having a smaller number of rollouts, we expect DPI-Critic to outperform DPI. Whether this is the case depends on a number of factors such as the dimensionality and the approximation error of the space $\mathcal{F}$, the horizon $H$, and the size $N$ of the rollout set.

Remark 4. According to Assumption 1, the samples in the critic's training set are completely independent from those used in building the rollout estimates. A more data-efficient version of the algorithm can be devised as follows (see the sketch after this remark): we first simulate all the trajectories used in the computation of the rollouts and use the last few transitions of each to build the critic's training set $S_k$. Then, after the critic (LSTD) computes an estimate of the value function using the samples in $S_k$, the action-values of the states in the rollout set $D_k$ are estimated as in Eqs. 1-3. This way the function approximation step does not change the total budget. We call this version of the algorithm Combined DPI-Critic (CDPI-Critic). From a theoretical point of view, the main problem is that the samples in $S_k$ are no longer drawn from the stationary distribution $\sigma_k$ of the policy under evaluation $\pi_k$. However, the samples in $S_k$ are collected at the end of the rollout trajectories of length $H$ obtained by following $\pi_k$, and thus, they are drawn from the distribution $\mu = \rho P^a (P^{\pi_k})^{H-1}$ that approaches $\sigma_k$ as $H$ increases. Depending on the mixing rate of the Markov chain induced by $\pi_k$, the difference between $\mu$ and $\sigma_k$ could be relatively small, thus supporting the conjecture that CDPI-Critic may achieve a similar performance to DPI-Critic without the overhead of $n$ independent samples. While we leave a detailed theoretical analysis of CDPI-Critic as future work, we use it in the experiments of Section 5.

5. Experimental Results

In this section, we report the empirical evaluation of DPI-Critic with LSTD and compare it to DPI (built on truncated rollouts) and LSPI (built on value function approximation). In the experiments we show that DPI-Critic, by combining truncated rollouts and function approximation, can improve over DPI and LSPI.

5.1. Setting

We consider two standard goal-based RL problems: mountain car (MC) and inverted pendulum (IP). We use the formulation of MC in Dimitrakakis & Lagoudakis (2008) with action noise bounded in $[-1, 1]$ and $\gamma = 0.99$. The value function is approximated using a linear space spanned by a set of radial basis functions (RBFs) evenly distributed over the state space. The critic training set is built using one-step transitions from states drawn from a uniform distribution over the state space, while LSPI is trained off-policy using samples from a random policy. In IP, we use the same implementation, features, and critic training set as in Lagoudakis & Parr (2003a) with $\gamma = 0.95$. In both domains, the function space used to approximate the action-value function in LSPI is obtained by replicating the state features for each action, as suggested in Lagoudakis & Parr (2003a). Similar to Dimitrakakis & Lagoudakis (2008), the policy space $\Pi$ (classifier) is defined by a multi-layer perceptron with 10 hidden units, and is trained using stochastic gradient descent with a learning rate of 0.5 for 400 iterations. In the experiments, instead of directly solving the cost-sensitive multi-class classification step as in Fig. 1, we minimize the classification error. In fact, the classification error is an upper bound on the empirical error defined by Eq. 4. Finally, the rollout set is sampled uniformly over the state spaces.

Each DPI-based algorithm is run with the same fixed budget $B$ per iteration. As discussed in Remark 3, DPI-Critic splits the budget into a rollout budget $B_R = B(1-p)$ and a critic budget $B_C = Bp$, where $p \in (0,1)$ is the critic ratio. The rollout budget is divided into $M$ rollouts of length $H$ for each action in $A$ and each state in the rollout set $D$, i.e., $B_R = HMN|A|$. In CDPI-Critic the critic training set $S_k$ is built using all transitions in the rollout trajectories except the first one. LSPI is run off-policy (i.e., samples are collected once and reused through iterations) and, in order to have a fair comparison, it is run with a total number of samples equal to $B$ times the number of iterations (5 in the following experiments).
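The budget accounting in this paragraph can be made explicit with a small helper (our own illustration; the rounding choices are assumptions): given $B$, the critic ratio $p$, the horizon $H$, the number of rollouts $M$, and $|A|$, it returns the critic budget $B_C = pB$ and the rollout-set size $N$ compatible with $B_R = HMN|A|$.

def split_budget(B, p, H, M, num_actions):
    """Split the per-iteration budget B into critic and rollout parts (B_R = H*M*N*|A|)."""
    B_C = int(p * B)                      # samples for the critic training set (n = B_C)
    B_R = B - B_C                         # calls left for the rollout estimates
    N = B_R // (H * M * num_actions)      # number of states in the rollout set D_k
    return B_C, N

# Example with the mountain-car setting of Fig. 2 (B = 200, M = 1), assuming the standard
# three-action mountain car: split_budget(200, 0.0, 12, 1, 3) -> (0, 5),
# i.e., the DPI configuration H = 12, N = 5.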

In Fig. 2 and 3, we report the performance of DPI, DPI-Critic, CDPI-Critic, and LSPI. In MC, the performance is evaluated as the number of steps-to-go with a maximum of 300. In IP, the performance is the number of balancing steps with a maximum of 3000 steps. The performance of each run is computed as the best performance over 5 iterations of policy iteration. The results are averaged over 1000 runs. Although in the graphs we report the performance of DPI and LSPI at $p = 0$ and $p = 1$, respectively, DPI-Critic does not necessarily tend to the same performance as DPI and LSPI when $p$ approaches 0 or 1. In fact, values of $p$ close to 0 correspond to building a critic with very few samples (thus affecting the performance of the critic), while values of $p$ close to 1 correspond to a very small rollout set (thus affecting the performance of the classifier). We tested the performance of DPI and DPI-Critic on a wide range of parameters $(H, M, N)$, but we only report the performance of the best combination for DPI, and show the performance of DPI-Critic for the best choice of $M$ ($M = 1$ was the best choice in all the experiments) and different values of $H$.

5.2. Experiments

In both MC and IP, the reward function is constant everywhere except at the terminal state. Thus, rollouts are informative only if their trajectories reach the terminal state. Although this would suggest using large values for the horizon $H$, the size of the rollout set would correspondingly decrease as $N = O(B/H)$, thus decreasing the accuracy of the classifier (see $\epsilon_0$ in Thm. 1). This leads to a trade-off (referred to as the rollout trade-off) between long rollouts (which increase the chance of observing informative rewards) and the number of states in the rollout set. The solution to this trade-off strictly depends on the accuracy of the estimate of the return after a rollout is truncated. As discussed in Sec. 3, while in DPI this return is implicitly set to 0, in DPI-Critic it is set to the value returned by the critic. In this case, a very accurate critic would lead to solving the trade-off at small values of $H$, because the lack of informative rollouts is compensated by the critic. On the other hand, when the critic is inaccurate, $H$ should be selected in a way that guarantees a sufficient number of informative rollouts, and at the same time, a large enough rollout set.

Figure 2. Performance of the learned policies in mountain car with a 3 × 3 RBF grid (left) and a 2 × 2 RBF grid (right). The total budget B is set to 200. The objective is to minimize the number of steps to the goal. (Curves: averaged steps to the goal vs. the critic ratio p for DPI-Critic with horizons H in {1, 2, 6, 10, 20}, compared with DPI (H=12, N=5), LSPI, and CDPI-Critic (left panel: H=3, N=22; right panel: H=5, N=13).)

Fig. 2 shows the learning results in MC with budget $B = 200$. In the left panel, the function space for the critic consists of 9 RBFs distributed over a uniform grid. Such a space is rich enough for LSPI to learn nearly-optimal policies (about 80 steps to reach the goal). On the other hand, DPI achieves a poor performance of about 150 steps, which is obtained by solving the rollout trade-off at $H = 12$ and $N = 5$. We also report the performance of DPI-Critic for different values of $H$ and $p$. We note that, as discussed in Remark 3, for a fixed $H$, there exists an optimal value $p^*$ which optimizes the critic trade-off. For very small values of $p$, the critic has a very small training set and is likely to return a very poor approximation of the return. In this case, DPI-Critic performs similarly to DPI and the rollout trade-off is achieved by $H = 12$, which limits the effect of potentially inaccurate predictions without reducing too much the size of the rollout set. On the other hand, as $p$ increases the accuracy of the critic improves as well, and the best choice for $H$ rapidly reduces to 1, which corresponds to rollouts built almost entirely on the basis of the values returned by the critic. For $H = 1$ and $p \approx 0.8$, DPI-Critic achieves a slightly better performance than LSPI. Finally, the horizontal line represents the performance of CDPI-Critic (for the best choice of $H$), which improves over DPI without matching the performance of LSPI.

Although this experiment shows that the introduction of a critic in DPI compensates for the truncation of the rollouts and improves their accuracy, most of this advantage is due to the quality of $\mathcal{F}$ in approximating value functions (LSPI itself is nearly-optimal). In this case, the results would suggest the use of LSPI rather than any DPI-based algorithm. In the next experiment, we show that DPI-Critic is able to improve over both DPI and LSPI even if $\mathcal{F}$ has a lower accuracy. We define a new space $\mathcal{F}$ spanned by 4 RBFs distributed over a uniform grid. The results are reported in the right panel of Fig. 2. The performance of LSPI now worsens to 180 steps. Since the quality of the critic returned by LSTD in DPI-Critic is worse than in the case of 9 RBFs, $H = 1$ is no longer the best choice for the rollout trade-off. However, as soon as $p > 0.1$, the accuracy of the critic is still higher than the 0 prediction used in DPI, thus leading to the best horizon at $H = 6$ (instead of 12 as in DPI), which guarantees a large enough number of informative rollouts. At the same time, other effects might influence the choice of the best horizon $H$. As can be noticed, for $H = 6$ and $p \approx 0.5$, DPI-Critic successfully takes advantage of the critic to improve over DPI, and at the same time, it achieves a better performance than LSPI. Unlike LSPI, DPI-Critic computes its action-value estimates by combining informative rollouts and the critic's value function, thus obtaining estimates which cannot be represented by the action-value function space used by LSPI. Additionally, similar to DPI, DPI-Critic performs a policy approximation step which could lead to better policies w.r.t. those obtained by LSPI.


Figure 3. Performance of the learned policies in inverted pendulum. The budget is B = 1000. The goal is to keep the pendulum balanced, with a maximum of 3000 steps. (Curves: averaged balancing steps vs. the critic ratio p for DPI-Critic with horizons H in {1, 2, 6, 10, 20}, compared with DPI (H=6, N=55), LSPI, and CDPI-Critic (H=3, N=111).)

Finally, Fig. 3 displays the results of similar experiments in IP with $B = 1000$. In this case, although the function space is not accurate enough for LSPI to learn good policies, it is helpful in improving the accuracy of the rollouts w.r.t. DPI. When $p > 0.05$, $H = 1$ is the horizon which optimizes the rollout trade-off. In fact, since by following a random policy the pendulum falls after very few steps, rollouts of length one still allow collecting samples from the terminal state whenever the starting state is close enough to the horizontal line. Hence, with $H = 1$, action-values are estimated as a mix of both informative rollouts and the critic's prediction, and at the same time, the classifier is trained on a relatively large training set. Finally, it is interesting to note that in this case CDPI-Critic obtains the same nearly-optimal performance as DPI-Critic.

6. Conclusions

DPI-Critic adds value function approximation to the classification-based approach to policy iteration. The motivation behind DPI-Critic is two-fold. 1) In some settings (e.g., those with delayed reward), DPI action-value estimates suffer from either high variance or high bias (depending on $H$). Introducing a critic into the computation of the rollouts may significantly reduce the bias, which in turn allows for a shorter horizon and thus lower variance. 2) In value-based approaches (e.g., LSPI), it is often difficult to design a function space which accurately approximates action-value functions. In this case, integrating the rough approximation of the value function returned by the critic with the rollouts obtained by direct simulation of the generative model may improve the accuracy of the function approximation and lead to better policies.

In Sec. 4, we theoretically analyzed the performance of DPI-Critic and showed that depending on several factors (notably the function approximation error), DPI-Critic may achieve a better performance than DPI. This analysis is also supported by the experimental results of Sec. 5, which confirm the capability of DPI-Critic to take advantage of both rollouts and critic, and improve over both DPI and LSPI. Although further investigation of the performance of DPI-Critic in more challenging domains is needed, and in some settings either DPI or LSPI might still be the better choice, DPI-Critic seems to be a promising alternative that introduces additional flexibility in the design of the algorithm. Possible directions for future work include a complete theoretical analysis of CDPI-Critic, a more detailed comparison of DPI-Critic and LSPI, and finding optimal or good rollout allocation strategies.

Acknowledgments. Experiments presented in this paper were carried out using the Grid'5000 experimental testbed (https://www.grid5000.fr). This work was supported by the Ministry of Higher Education and Research, the Nord-Pas de Calais Regional Council and FEDER through the "contrat de projets état-région 2007-2013", and by the PASCAL2 Network of Excellence.

7. Appendix

7.1. DPI-Critic with Bellman Residual Minimization

In this section we bound the performance of each iteration of DPI-Critic when Bellman Residual Minimization (BRM) is used to train the critic.

Assumption 3. At each iteration $k$ of DPI-Critic, the critic uses a linear function space spanned by $d$ bounded basis functions (see Section 2). A data set $S_k = \{(X_i, R_i, Y_i, Y_i')\}_{i=1}^n$ is built, where $X_i \sim \tau$, $R_i = r\big(X_i, \pi_k(X_i)\big)$, and $Y_i$ and $Y_i'$ are two independent states drawn from $P^{\pi_k}(\cdot|X_i)$. Note that here in BRM (unlike LSTD) the sampling distribution $\tau$ can be any distribution over the state space.

Assumption 4. The rollout set sampling distribution $\rho$ is such that for any policy $\pi \in \Pi$ and any action $a \in A$, $\mu = \rho P^a (P^{\pi})^{H-1} \leq C_k \tau$, where $C_k < \infty$ is a constant. The distribution $\mu$ is the distribution induced by starting at a state sampled from $\rho$, taking action $a$, and then following policy $\pi$ for $H-1$ steps.
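A minimal sketch of a BRM critic under Assumption 3 is given below (our own illustration, not the authors' implementation). It uses the classic double-sample trick: with two independent next states $Y_i$ and $Y_i'$ per $X_i$, the empirical objective $\sum_i \big(f_\alpha(X_i) - R_i - \gamma f_\alpha(Y_i)\big)\big(f_\alpha(X_i) - R_i - \gamma f_\alpha(Y_i')\big)$ gives an unbiased estimate of the Bellman residual, and its minimizer solves a linear system.

import numpy as np

def brm_value_function(X, R, Y, Y2, phi, gamma):
    """Bellman Residual Minimization with double samples (X_i, R_i, Y_i, Y_i') as in Assumption 3."""
    PhiX = np.array([phi(x) for x in X])
    U = PhiX - gamma * np.array([phi(y) for y in Y])     # u_i  = phi(X_i) - gamma*phi(Y_i)
    U2 = PhiX - gamma * np.array([phi(y) for y in Y2])   # u_i' = phi(X_i) - gamma*phi(Y_i')
    R = np.asarray(R)
    # Normal equations of the objective sum_i (u_i^T a - R_i)(u_i'^T a - R_i)
    A = U.T @ U2 + U2.T @ U
    b = (U + U2).T @ R
    alpha = np.linalg.pinv(A) @ b
    return lambda x: float(phi(x) @ alpha)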

We first report the performance bound for BRM.

Proposition 2 (Thm. 7 in Maillard et al. 2010). Let $n$ samples be collected as in Assumption 3 and $\hat{V}^{\pi}$ be the approximation returned by BRM using the linear function space $\mathcal{F}$ defined in Section 2. Then for any $\delta > 0$, we have

$\|V^{\pi} - \hat{V}^{\pi}\|_{2,\tau} \leq \epsilon_{\mathrm{BRM}} = \|(I-\gamma P^{\pi})^{-1}\|_{\tau} \left( (1 + \gamma\|P^{\pi}\|_{\tau}) \inf_{f\in\mathcal{F}}\|V^{\pi} - f\|_{2,\tau} + c\left(\frac{2d\log(2) + 6\log(64|A|/\delta)}{n}\right)^{1/4} \right)$   (7)

with probability $1-\delta$, where

(1) $c = 12\left[\frac{2}{\xi}(1+\gamma)^2 L^2 + 1\right] R_{\max}$,

(2) $\xi = \frac{\omega}{\|(I-\gamma P^{\pi})^{-1}\|^2}$,

(3) $\omega > 0$ is the smallest strictly positive eigenvalue of the Gram matrix w.r.t. the distribution $\tau$.

In the following lemma, we bound the difference between the actual action-value function and the one estimated by DPI-Critic.

Lemma 2. Let Assumptions 3 and 4 hold and $\{x_i\}_{i=1}^N$ be the rollout set with $x_i \stackrel{iid}{\sim} \rho$. Let $Q^{\pi}$ be the true action-value function of policy $\pi$ and $\hat{Q}^{\pi}$ be its estimate computed by DPI-Critic using $M$ rollouts with horizon $H$. Then for any $\delta > 0$, we have

$\max_{a \in A} \frac{1}{N}\sum_{i=1}^N \big( Q^{\pi}(x_i,a) - \hat{Q}^{\pi}(x_i,a) \big) \leq \epsilon_1 + \epsilon_2 + \epsilon_3 + \epsilon_4$,

with probability $1-\delta$ (w.r.t. the random rollout estimates and the random samples in the critic's training set $S_k$), where

$\epsilon_1 = (1-\gamma^H)\, q \sqrt{\frac{2\log(4|A|/\delta)}{MN}}, \quad \epsilon_2 = \gamma^H q \sqrt{\frac{2\log(4|A|/\delta)}{MN}},$

$\epsilon_3 = 12\, \gamma^H B \sqrt{\frac{2\Lambda\big(N,d,\frac{\delta}{4|A|M}\big)}{N}}, \quad \epsilon_4 = 2\gamma^H \sqrt{C}\, \epsilon_{\mathrm{BRM}},$

with $\Lambda(N,d,\delta) = 2(d+1)\log(N) + \log\frac{e}{\delta} + \log\big(9(12e)^{2(d+1)}\big)$ and

$B = q\Big(1 + \frac{2(1-\gamma^2)L^2\,\|(I-\gamma P^{\pi_k})^{-1}\|_{\tau}^2}{\omega}\Big)$.

Proof. We prove the following series of inequalities:

$\frac{1}{N}\sum_{i=1}^N \big(Q^{\pi}(x_i,a) - \hat{Q}^{\pi}(x_i,a)\big)$
$\stackrel{(a)}{=} \frac{1}{MN}\sum_{i=1}^N\sum_{j=1}^M \big(Q^{\pi}(x_i,a) - R_j^{\pi}(x_i,a)\big)$
$\stackrel{(b)}{\leq} \frac{1}{MN}\sum_{i=1}^N\sum_{j=1}^M \big(Q_H^{\pi}(x_i,a) - R_j^{\pi,H}(x_i,a)\big) + \frac{\gamma^H}{MN}\sum_{i=1}^N\sum_{j=1}^M \big|\hat{V}^{\pi}(x_{i,j}^H) - \mathbb{E}_{x\sim\nu_i}[V^{\pi}(x)]\big|$
$\stackrel{(c)}{\leq} \epsilon_1 + \frac{\gamma^H}{MN}\sum_{i=1}^N\sum_{j=1}^M \big|\hat{V}^{\pi}(x_{i,j}^H) - V^{\pi}(x_{i,j}^H)\big| + \frac{\gamma^H}{MN}\sum_{i=1}^N\sum_{j=1}^M \big(V^{\pi}(x_{i,j}^H) - \mathbb{E}_{x\sim\nu_i}[V^{\pi}(x)]\big)$   w.p. $1-\delta'$
$\stackrel{(d)}{\leq} \epsilon_1 + \epsilon_2 + \frac{\gamma^H}{M}\sum_{j=1}^M \|V^{\pi} - \hat{V}^{\pi}\|_{1,\hat{\mu}_j}$   w.p. $1-2\delta'$
$\stackrel{(e)}{\leq} \epsilon_1 + \epsilon_2 + \frac{\gamma^H}{M}\sum_{j=1}^M \|V^{\pi} - \hat{V}^{\pi}\|_{2,\hat{\mu}_j}$   w.p. $1-2\delta'$
$\stackrel{(f)}{\leq} \epsilon_1 + \epsilon_2 + \epsilon_3 + 2\gamma^H \|V^{\pi} - \hat{V}^{\pi}\|_{2,\mu}$   w.p. $1-3\delta'$
$\stackrel{(g)}{\leq} \epsilon_1 + \epsilon_2 + \epsilon_3 + 2\gamma^H\sqrt{C}\,\|V^{\pi} - \hat{V}^{\pi}\|_{2,\tau}$
$\stackrel{(h)}{\leq} \epsilon_1 + \epsilon_2 + \epsilon_3 + 2\gamma^H\sqrt{C}\,\epsilon_{\mathrm{BRM}}$   w.p. $1-4\delta'$

The statement of the lemma is obtained by setting $\delta' = \delta/4$ and taking a union bound over actions.

(a) We use Eq. 3 to replace $\hat{Q}^{\pi}(x_i,a)$.

(b) We replace $R_j^{\pi}(x_i,a)$ from Eq. 1 and use the fact that $Q^{\pi}(x_i,a) = Q_H^{\pi}(x_i,a) + \gamma^H \mathbb{E}_{x\sim\nu_i}[V^{\pi}(x)]$, where $Q_H^{\pi}(x_i,a) = \mathbb{E}\big[r(x_i,a) + \sum_{t=1}^{H-1}\gamma^t r\big(x_i^t, \pi(x_i^t)\big)\big]$ and $\nu_i = \delta(x_i) P^a (P^{\pi})^{H-1}$ is the distribution over states induced by starting at state $x_i$, taking action $a$, and then following the policy $\pi$ for $H-1$ steps. We split the sum using the triangle inequality.

(c) Using the Chernoff-Hoeffding inequality, with probability $1-\delta'$ (w.r.t. the random samples used to build the rollout estimates), we have

$\frac{1}{MN}\sum_{i=1}^N\sum_{j=1}^M \big(Q_H^{\pi}(x_i,a) - R_j^{\pi,H}(x_i,a)\big) \leq \epsilon_1 = (1-\gamma^H)\, q \sqrt{\frac{2\log(1/\delta')}{MN}}$.

(d) Using the Chernoff-Hoeffding inequality, with probability $1-\delta'$ (w.r.t. the last state reached by the rollout trajectories), we have

$\frac{\gamma^H}{MN}\sum_{i=1}^N\sum_{j=1}^M \big(V^{\pi}(x_{i,j}^H) - \mathbb{E}_{x\sim\nu_i}[V^{\pi}(x)]\big) \leq \epsilon_2 = \gamma^H q \sqrt{\frac{2\log(1/\delta')}{MN}}$.

We also use the definition of the empirical $\ell_1$-norm and replace the second term with $\|V^{\pi} - \hat{V}^{\pi}\|_{1,\hat{\mu}_j}$, where $\hat{\mu}_j$ is the empirical distribution corresponding to the distribution $\mu = \rho P^a (P^{\pi})^{H-1}$. In fact, for any $1 \leq j \leq M$, the samples $x_{i,j}^H$ are i.i.d. from $\mu$.

(e) We move from the $\ell_1$-norm to the $\ell_2$-norm: for any $x \in \mathbb{R}^n$, using the Cauchy-Schwarz inequality, we obtain

$\|x\|_{1,\hat{\mu}} = \frac{1}{n}\sum_{i=1}^n |x_i| \leq \sqrt{\sum_{i=1}^n \frac{1}{n}}\, \sqrt{\sum_{i=1}^n \frac{1}{n}|x_i|^2} = \|x\|_{2,\hat{\mu}}$.


(f) We note that $\hat{V}$ is a random variable independent of the samples used to build the rollout estimates. Thus, applying Corollary 12 in Lazaric et al. (2010c), for any $j$ we have

$\|V^{\pi} - \hat{V}^{\pi}\|_{2,\hat{\mu}_j} \leq 2\|V^{\pi} - \hat{V}^{\pi}\|_{2,\mu} + \epsilon_3(\delta'')$

with probability $1-\delta''$ (w.r.t. the samples in $\hat{\mu}_j$), and $\epsilon_3(\delta'') = 12 B \sqrt{\frac{2\Lambda(N,d,\delta'')}{N}}$. Using the upper bound on the solutions returned by BRM, when the number of samples $n$ is large enough, then with high probability (see Corollary 5 in Maillard et al. 2010 for details)

$B = q\Big(1 + \frac{2(1-\gamma^2)L^2\,\|(I-\gamma P^{\pi_k})^{-1}\|_{\tau}^2}{\omega}\Big)$.

Finally, by taking a union bound over all $j$'s and setting $\delta'' = \delta'/M$, we obtain the definition of $\epsilon_3$ in the final statement.

(g) Using Assumption 4, we have $\|V^{\pi} - \hat{V}\|_{2,\mu} \leq \sqrt{C}\,\|V^{\pi} - \hat{V}\|_{2,\tau}$.

(h) Here, we simply replace $\|V^{\pi} - \hat{V}\|_{2,\tau}$ with the bound in Proposition 2.

7.2. Proof of Theorem 1

Since the proof of Theorem 1 does not depend on the specific critic employed in DPI-Critic, we report it in a general form where the $\epsilon$ terms depend on the specific lemma (Lemma 1 for LSTD or Lemma 2 for BRM) used in the proof.

Proof. The proof follows the same steps as in Thm. 1 in Lazaric et al. (2010a). We prove the following series of inequalities:

$L_{\pi_k}(\rho; \pi_{k+1})$
$\stackrel{(a)}{\leq} L_{\pi_k}(\hat{\rho}; \pi_{k+1}) + \epsilon_0$   w.p. $1-\delta'$
$= \frac{1}{N}\sum_{i=1}^N \big( Q^{\pi_k}(x_i, a^*) - Q^{\pi_k}(x_i, \pi_{k+1}(x_i)) \big) + \epsilon_0$
$\stackrel{(b)}{\leq} \frac{1}{N}\sum_{i=1}^N \big( Q^{\pi_k}(x_i, a^*) - \hat{Q}^{\pi_k}(x_i, \pi_{k+1}(x_i)) \big) + \epsilon_0 + \epsilon_1 + \epsilon_2 + \epsilon_3 + \epsilon_4$   w.p. $1-2\delta'$
$\stackrel{(c)}{\leq} \frac{1}{N}\sum_{i=1}^N \big( Q^{\pi_k}(x_i, a^*) - \hat{Q}^{\pi_k}(x_i, \pi^*(x_i)) \big) + \epsilon_0 + \epsilon_1 + \epsilon_2 + \epsilon_3 + \epsilon_4$
$\stackrel{(d)}{\leq} \frac{1}{N}\sum_{i=1}^N \big( Q^{\pi_k}(x_i, a^*) - Q^{\pi_k}(x_i, \pi^*(x_i)) \big) + \epsilon_0 + 2(\epsilon_1 + \epsilon_2 + \epsilon_3 + \epsilon_4)$   w.p. $1-3\delta'$
$= L_{\pi_k}(\hat{\rho}; \pi^*) + \epsilon_0 + 2(\epsilon_1 + \epsilon_2 + \epsilon_3 + \epsilon_4)$
$\stackrel{(e)}{\leq} L_{\pi_k}(\rho; \pi^*) + 2(\epsilon_0 + \epsilon_1 + \epsilon_2 + \epsilon_3 + \epsilon_4)$   w.p. $1-4\delta'$

The statement of the theorem follows by setting $\delta' = \delta/4$.

(a) This is an immediate application of Lemma 1 in Lazaric et al. (2010a).

(b) This is the result of Lemma 1 (for LSTD) or Lemma 2 (for BRM).

(c) From the definition of $\pi_{k+1}$ in the DPI-Critic algorithm we have

$\pi_{k+1} = \arg\min_{\pi \in \Pi} \hat{L}_{\pi_k}(\hat{\rho}; \pi) = \arg\max_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^N \hat{Q}^{\pi_k}\big(x_i, \pi(x_i)\big)$,

thus $-\frac{1}{N}\sum_{i=1}^N \hat{Q}^{\pi_k}\big(x_i, \pi_{k+1}(x_i)\big)$ can only increase when $\pi_{k+1}$ is replaced by any other policy, in particular by $\pi^* = \arg\min_{\pi \in \Pi} L_{\pi_k}(\rho; \pi)$.

(d)-(e) These steps follow from Lemma 1 (or Lemma 2) and Lemma 1 in Lazaric et al. (2010a).

7.3. Error Propagation

In this section, we first show how the expected error is propagated through the iterations of DPI-Critic. We then analyze the error between the value function of the policy obtained by DPI-Critic after $K$ iterations and the optimal value function in $\eta$-norm, where $\eta$ is a distribution over the states which might be different from the sampling distribution $\rho$. Let $P^{\pi}$ be the transition kernel for policy $\pi$, i.e., $P^{\pi}(dy|x) = p\big(dy|x, \pi(x)\big)$. It defines two related operators: a right-linear operator, $P^{\pi}\cdot$, which maps any $V \in \mathcal{B}(\mathcal{X}; q)$ to $(P^{\pi}V)(x) = \int V(y)\, P^{\pi}(dy|x)$, and a left-linear operator, $\cdot P^{\pi}$, which returns $(\eta P^{\pi})(dy) = \int P^{\pi}(dy|x)\, \eta(dx)$ for any distribution $\eta$ over $\mathcal{X}$.

From the definitions of $\ell_{\pi_k}$, $T^{\pi}$, and $T$, we have $\ell_{\pi_k}(\pi_{k+1}) = T V^{\pi_k} - T^{\pi_{k+1}} V^{\pi_k}$. We deduce the following pointwise inequalities:

$V^{\pi_k} - V^{\pi_{k+1}} = T^{\pi_k} V^{\pi_k} - T^{\pi_{k+1}} V^{\pi_k} + T^{\pi_{k+1}} V^{\pi_k} - T^{\pi_{k+1}} V^{\pi_{k+1}} \leq \ell_{\pi_k}(\pi_{k+1}) + \gamma P^{\pi_{k+1}} (V^{\pi_k} - V^{\pi_{k+1}})$,

which gives us $V^{\pi_k} - V^{\pi_{k+1}} \leq (I - \gamma P^{\pi_{k+1}})^{-1} \ell_{\pi_k}(\pi_{k+1})$. We also have

$V^* - V^{\pi_{k+1}} = T V^* - T V^{\pi_k} + T V^{\pi_k} - T^{\pi_{k+1}} V^{\pi_k} + T^{\pi_{k+1}} V^{\pi_k} - T^{\pi_{k+1}} V^{\pi_{k+1}} \leq \gamma P^*(V^* - V^{\pi_k}) + \ell_{\pi_k}(\pi_{k+1}) + \gamma P^{\pi_{k+1}}(V^{\pi_k} - V^{\pi_{k+1}})$,

which yields

$V^* - V^{\pi_{k+1}} \leq \gamma P^*(V^* - V^{\pi_k}) + \big(\gamma P^{\pi_{k+1}} (I - \gamma P^{\pi_{k+1}})^{-1} + I\big) \ell_{\pi_k}(\pi_{k+1}) = \gamma P^*(V^* - V^{\pi_k}) + (I - \gamma P^{\pi_{k+1}})^{-1} \ell_{\pi_k}(\pi_{k+1})$.


Finally, by defining the operator $E_k = (I - \gamma P^{\pi_{k+1}})^{-1}$, which is well defined since $P^{\pi_{k+1}}$ is a stochastic kernel and $\gamma < 1$, and by induction, we obtain

$V^* - V^{\pi_K} \leq (\gamma P^*)^K (V^* - V^{\pi_0}) + \sum_{k=0}^{K-1} (\gamma P^*)^{K-k-1} E_k\, \ell_{\pi_k}(\pi_{k+1})$.   (8)

Eq. 8 shows how the error at each iteration $k$ of DPI-Critic, $\ell_{\pi_k}(\pi_{k+1})$, is propagated through the iterations and appears in the final error of the algorithm, $V^* - V^{\pi_K}$. Since we are interested in bounding the final error in $\eta$-norm, which might be different from the sampling distribution $\rho$, we use one of the following assumptions:

Assumption 5. For any policy $\pi \in \Pi$ and any non-negative integers $s$ and $t$, there exists a constant $C_{\eta,\rho}(s,t) < \infty$ such that $\eta (P^*)^s (P^{\pi})^t \leq C_{\eta,\rho}(s,t)\, \rho$. We define $C_{\eta,\rho} = (1-\gamma)^2 \sum_{s=0}^{\infty}\sum_{t=0}^{\infty} \gamma^{s+t} C_{\eta,\rho}(s,t)$.

Assumption 6. For any $x \in \mathcal{X}$ and any $a \in A$, there exists a constant $C_{\rho} < \infty$ such that $p(\cdot|x,a) \leq C_{\rho}\, \rho(\cdot)$.

Note that concentrability coefficients similar to $C_{\eta,\rho}$ and $C_{\rho}$ were previously used in the $\ell_p$-analysis of fitted value iteration (Munos, 2007; Munos & Szepesvári, 2008) and approximate policy iteration (Antos et al., 2008). We now state our main result.

Theorem 2. Let $\Pi$ be a policy space with finite VC-dimension $h$ and $\pi_K$ be the policy generated by DPI-Critic after $K$ iterations. Let $M$ be the number of rollouts per state-action pair and $N$ be the number of samples drawn i.i.d. from a distribution $\rho$ over $\mathcal{X}$ at each iteration of DPI-Critic. Then, for any $\delta > 0$, we have

$\|V^* - V^{\pi_K}\|_{1,\eta} \leq \frac{2}{1-\gamma}\Big[ C_{\eta,\rho}\Big( \sup_{\pi \in \Pi}\inf_{\pi' \in \Pi} L_{\pi}(\rho; \pi') + 2(\epsilon_0+\epsilon_1+\epsilon_2+\epsilon_3+\epsilon_4) \Big) + \gamma^K R_{\max} \Big]$,   under Assumption 5,

$\|V^* - V^{\pi_K}\|_{\infty} \leq \frac{2}{1-\gamma}\Big[ C_{\rho}\Big( \sup_{\pi \in \Pi}\inf_{\pi' \in \Pi} L_{\pi}(\rho; \pi') + 2(\epsilon_0+\epsilon_1+\epsilon_2+\epsilon_3+\epsilon_4) \Big) + \gamma^K R_{\max} \Big]$,   under Assumption 6,

with probability $1-\delta$, where $\epsilon_4 = \epsilon_{\mathrm{LSTD}}$ when LSTD is used to approximate the critic and $\epsilon_4 = \epsilon_{\mathrm{BRM}}$ when BRM is used.

Proof. We have $C_{\eta,\rho} \leq C_{\rho}$ for any $\eta$. Thus, if the $\ell_1$-bound holds for any $\eta$, choosing $\eta$ to be a Dirac at each state implies that the $\ell_{\infty}$-bound holds as well. Hence, we only need to prove the $\ell_1$-bound. By taking the absolute value pointwise in Eq. 8 we obtain

$|V^* - V^{\pi_K}| \leq (\gamma P^*)^K |V^* - V^{\pi_0}| + \sum_{k=0}^{K-1} (\gamma P^*)^{K-k-1} (I - \gamma P^{\pi_{k+1}})^{-1} |\ell_{\pi_k}(\pi_{k+1})|$.

From the fact that $|V^* - V^{\pi_0}| \leq \frac{2}{1-\gamma} R_{\max} \mathbf{1}$, by integrating both sides w.r.t. $\eta$, and using Assumption 5, we have

$\|V^* - V^{\pi_K}\|_{1,\eta} \leq \gamma^K \frac{2}{1-\gamma} R_{\max} + \sum_{k=0}^{K-1} \gamma^{K-k-1} \sum_{t=0}^{\infty} \gamma^t C_{\eta,\rho}(K-k-1, t)\, \|\ell_{\pi_k}(\pi_{k+1})\|_{1,\rho}$.

The claim follows from the definition of $C_{\eta,\rho}$ and by bounding $L_{\pi_k}(\rho; \pi_{k+1})$ using Theorem 1 with a union bound argument over the $K$ iterations.

7.4. Accuracy of the Training Set

Figure 4. The accuracy of the training set in the inverted pendulum problem (accuracy percentage vs. the critic ratio p, for DPI-Critic with horizons H = 1, 4, 6).

Figure 4 shows the accuracy of the training set of DPI-Critic w.r.t. $p$ for different values of $H$. At $p = 0$ we report the performance of the DPI algorithm. The accuracy $acc$ is computed as the percentage of the states in the rollout set at which the true greedy action is correctly identified. Let $\pi$ be a fixed policy. With a rollout set containing $N$ states, $acc$ is computed as

$acc = \frac{1}{N}\sum_{i=1}^N \mathbb{I}\{a_i^* = \hat{a}_i\}$,

where $a_i^* = \arg\max_{a \in A} Q^{\pi}(x_i,a)$ and $\hat{a}_i = \arg\max_{a \in A} \hat{Q}^{\pi}(x_i,a)$ are the actual greedy action and the greedy action estimated by the algorithm at state $x_i \in D$, respectively.

The budget $B$ is set to 2000, $N$ is set to 20, and the values of $M$ are computed as

$M(p, H) = \frac{B}{N|A|} \times \frac{1-p}{H}$,

where the first term is a constant and the second one indicates that $M$ decreases linearly with $p$, with a coefficient inversely proportional to $H$. The results are averaged over 1000 runs. The policy $\pi$ is fixed as follows:

$\pi(\theta, \dot{\theta})$: with probability 0.2, a random action; with probability 0.8, left if π/26θ > $\dot{\theta}$, and right otherwise.

This policy has an average performance of 20 steps to balance the pendulum.
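The two quantities of this experiment can be written down directly; the sketch below is our own illustration, where `greedy_action_true` and `greedy_action_estimated` are assumed oracles for $a_i^*$ and $\hat{a}_i$.

import numpy as np

def training_set_accuracy(rollout_states, greedy_action_true, greedy_action_estimated):
    """acc = fraction of rollout states where the estimated greedy action matches the true one."""
    hits = [greedy_action_true(x) == greedy_action_estimated(x) for x in rollout_states]
    return float(np.mean(hits))

def num_rollouts(B, N, num_actions, p, H):
    """M(p, H) = (B / (N*|A|)) * (1 - p) / H, rounded down to an integer number of rollouts."""
    return int((B / (N * num_actions)) * (1.0 - p) / H)

# Example with the setting of Fig. 4: B = 2000, N = 20, |A| = 3,
# so M(0, 1) = 33 and M(0.5, 4) = 4.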

For $H = 1$ and $p = 0$, the horizon is not long enough to collect any informative reward (i.e., to reach a terminal state). Therefore, the accuracy of the training set is almost the same as for a training set in which the greedy action is selected at random (33%, as the domain contains 3 actions). For positive values of $p$, DPI-Critic adds an approximation of the value function to the rollout estimates. The benefit obtained by using the critic increases with the quality of the approximation (i.e., when $p$ increases). At the same time, when $p$ increases, the number of rollouts $M$ for each rollout state decreases, thus increasing the variance of the rollout estimates. For $H = 1$, this reduction has no effect on the accuracy of the training set, except when $M$ is forced to be 0 for values of $p$ close to 1. In fact, when $H = 1$, the variance is limited to the variance introduced by the noise in one single transition, and even a very small number of rollouts is enough to obtain accurate estimates of the action values.

The accuracy of DPI ($p = 0$) improves with horizon $H = 4$ and $H = 6$. For $p$ close to 0, the critic has a poor performance, which causes the rollouts to be less accurate than in DPI. On the other hand, when $p$ increases, the value function approximation is sufficiently accurate to make DPI-Critic improve over the accuracy of DPI. However, for $p > 0.5$, the accuracy of the training set starts decreasing. Indeed, as $H$ is large, the variance of the $\hat{Q}^{\pi}$ estimates is bigger than in the case $H = 1$. Moreover, for large values of $H$, $M$ is small. Therefore, with high variance and a small number of rollouts $M$, the $\hat{Q}^{\pi}$ estimates are likely to be inaccurate.

These results show that the introduction of a critic improves the accuracy of the training set for a value of $p$ which balances between having a sufficiently accurate approximation by the critic (large $p$ leads to an accurate critic) and having sufficiently many rollouts $M$ (small $p$ leads to rollout estimates with less variance). This still does not account for the complete critic trade-off, since the parameter $N$ is fixed. Indeed, in order to return a good approximation of the greedy policy, it is essential for the classifier to both have an accurate training set and a large number of states in the training set. The impact of the number of states in the training set and the propagation through iterations of the benefit obtained from the use of the critic are studied in the Experiments section.

References

Antos, A., Szepesvári, Cs., and Munos, R. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning Journal, 71:89–129, 2008.

Barto, A., Sutton, R., and Anderson, C. Neuron-like elements that can solve difficult learning control problems. IEEE Transaction on Systems, Man and Cybernetics, 13:835–846, 1983.

Bradtke, S. and Barto, A. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.

Dimitrakakis, C. and Lagoudakis, M. Rollout sampling approximate policy iteration. Machine Learning Journal, 72(3):157–171, 2008.

Fern, A., Yoon, S., and Givan, R. Approximate policy iteration with a policy language bias. In Proceedings of NIPS 16, 2004.

Gabillon, V., Lazaric, A., Ghavamzadeh, M., and Scherrer, B. Classification-based policy iteration with a critic. Technical Report 00482065, INRIA, 2011.

Lagoudakis, M. and Parr, R. Least-squares policy iteration. JMLR, 4:1107–1149, 2003a.

Lagoudakis, M. and Parr, R. Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings of ICML, pp. 424–431, 2003b.


Lazaric, A., Ghavamzadeh, M., and Munos, R. Analysis of a classification-based policy iteration algorithm. In Proceedings of the Twenty-Seventh International Conference on Machine Learning, pp. 607–614, 2010a.

Lazaric, A., Ghavamzadeh, M., and Munos, R. Analysis of a classification-based policy iteration algorithm. Technical Report 00482065, INRIA, 2010b.

Lazaric, A., Ghavamzadeh, M., and Munos, R. Finite-sample analysis of least-squares policy iteration. Technical Report inria-00528596, INRIA, 2010c.

Maillard, O., Munos, R., Lazaric, A., and Ghavamzadeh, M. Finite-sample analysis of Bellman residual minimization. In Proceedings of the Second Asian Conference on Machine Learning, pp. 299–314, 2010.

Munos, R. Performance bounds in Lp norm for approximate value iteration. SIAM Journal on Control and Optimization, 2007.

Munos, R. and Szepesvári, Cs. Finite time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, 2008.

Sutton, R. and Barto, A. Reinforcement Learning: An Introduction. MIT Press, 1998.

