Batch mode reinforcement learning based on the synthesis of artificial trajectories

(1)

Damien Ernst

University of Liège, Belgium

(2)

Batch mode Reinforcement Learning

≃

Learning a high-performance policy for a sequential decision problem where:

• a numerical criterion is used to define the performance of a policy. (An optimal policy is the policy that maximizes this numerical criterion.)

• “the only” (or “most of the”) information available on the sequential decision problem is contained in a set of trajectories.

Batch mode RL stands at the intersection of three worlds: optimization (maximization of the numerical criterion), system/control theory (sequential decision problem) and machine learning (inference from a set of trajectories).

(3)

A typical batch mode RL problem

Discrete-time dynamics:

$x_{t+1} = f(x_t, u_t, w_t)$, $t = 0, 1, \ldots, T-1$, where $x_t \in X$, $u_t \in U$ and $w_t \in W$. $w_t$ is drawn at every time step according to $P_w(\cdot)$.

Reward observed after each system transition: $r_t = \rho(x_t, u_t, w_t)$ where $\rho : X \times U \times W \to \mathbb{R}$ is the reward function.

Type of policies considered: $h : \{0, 1, \ldots, T-1\} \times X \to U$.

Performance criterion: expected sum of the rewards observed over the $T$-length horizon

$$PC^h(x) = J^h(x) = \mathbb{E}_{w_0, \ldots, w_{T-1}} \left[ \sum_{t=0}^{T-1} \rho(x_t, h(t, x_t), w_t) \right]$$

with $x_0 = x$ and $x_{t+1} = f(x_t, h(t, x_t), w_t)$.

Available information: a set of elementary pieces of trajectories $F_n = \{(x^l, u^l, r^l, y^l)\}_{l=1}^{n}$, where $y^l$ is the state reached after taking action $u^l$ in state $x^l$ and $r^l$ the instantaneous reward associated with the transition. The functions $f$, $\rho$ and $P_w$ are unknown.
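As an illustration of what the set of elementary pieces of trajectories looks like in practice, here is a minimal Python sketch. The names `Transition` and `collect_batch`, the scalar state and action types, and the idea of drawing the state-action pairs from a user-supplied sampler are all assumptions made for the example; the slides do not say how the set is generated.

```python
import random
from typing import Callable, List, NamedTuple, Tuple


class Transition(NamedTuple):
    """One elementary piece of trajectory (x, u, r, y)."""
    x: float  # state in which the action was taken
    u: float  # action taken
    r: float  # instantaneous reward rho(x, u, w)
    y: float  # state reached f(x, u, w)


def collect_batch(
    f: Callable[[float, float, float], float],
    rho: Callable[[float, float, float], float],
    sample_w: Callable[[random.Random], float],
    sample_xu: Callable[[random.Random], Tuple[float, float]],
    n: int,
    seed: int = 0,
) -> List[Transition]:
    """Build a batch of n one-step transitions (hypothetical generation scheme).

    f, rho and the disturbance sampler are only used to produce the data;
    a batch mode RL algorithm sees the returned list and nothing else.
    """
    rng = random.Random(seed)
    batch = []
    for _ in range(n):
        x, u = sample_xu(rng)   # where the system was probed
        w = sample_w(rng)       # disturbance, never stored in the batch
        batch.append(Transition(x, u, rho(x, u, w), f(x, u, w)))
    return batch
```

Note that the disturbance `w` is deliberately not stored: only the four components of each piece are available to the learner.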

(4)

Batch mode RL and function approximators

Training function approximators (radial basis functions, neural nets, trees, etc.) using the information contained in the set of trajectories is a key element of most of the resolution schemes for batch mode RL problems with state-action spaces having a large (infinite) number of elements.

Two typical uses of FAs for batch mode RL:

• the FAs model the sequential decision problem (in our typical problem, $f$, $\rho$ and $P_w$). The model is afterwards exploited as if it were the real problem to compute a high-performance policy.

• the FAs represent (state-action) value functions which are used in iterative schemes so as to converge to a (state-action) value function from which a high-performance policy can be computed. These iterative schemes are based on the dynamic programming principle (e.g., LSPI, FQI, Q-learning).

(5)

Why look beyond function approximators?

FA-based techniques: mature, can successfully solve many real-life problems, but:

1. not well adapted to risk sensitive performance criteria

2. may lead to unsafe policies - poor performance guarantees

3. may make suboptimal use of near-optimal trajectories

4. offer few clues about how to generate new experiments in an optimal way

(6)

1. not well adapted to risk sensitive performance criteria

An example of a risk-sensitive performance criterion:

$$PC^h(x) = \begin{cases} -\infty & \text{if } P\left( \sum_{t=0}^{T-1} \rho(x_t, h(t, x_t), w_t) < b \right) > c \\ J^h(x) & \text{otherwise.} \end{cases}$$

FAs with dynamic programming: very problematic because (state-action) value functions need to become functions that take as values “probability distributions of future rewards” and not “expected rewards”.

FAs with model learning: more likely to succeed; but what about the challenges of fitting the FAs to model the distribution of future states reached (rewards collected) by policies and not only an average behavior?

(7)

2. may lead to unsafe policies - poor performance guarantees

• Benchmark: puddle world

• RL algorithm: FQI with trees

• Trajectory set covering the puddle ⇒ optimal policy

• Trajectory set not covering the puddle ⇒ suboptimal (unsafe) policy

Typical performance guarantee in the deterministic case for FQI = (estimated return by FQI of the policy it outputs) minus constant × (“size” of the largest area of the state space not covered by the sample).

(8)

3. may make suboptimal use of near-optimal trajectories

Suppose a deterministic batch mode RL problem and that, in the set of trajectories, there is a trajectory:

$(x_0^{\text{opt. traj.}}, u_0, r_0, x_1, u_1, r_1, x_2, \ldots, x_{T-2}, u_{T-2}, r_{T-2}, x_{T-1}, u_{T-1}, r_{T-1}, x_T)$

where the $u_t$'s have been selected by an optimal policy.

Question: Which batch mode RL algorithms will output a policy which is optimal for the initial state $x_0^{\text{opt. traj.}}$ whatever the other trajectories in the set? Answer: Not that many and certainly not those using parametric FAs.

In my opinion: batch mode RL algorithms can only be successful on large-scale problems if (i) in the set of trajectories, many trajectories have been generated by (near-)optimal policies and (ii) the algorithms exploit very well the information contained in those (near-)optimal trajectories.

(9)

4. offer few clues about how to generate new experiments in an optimal way

Many real-life problems are variants of batch mode RL problems for which (a limited number of) additional trajectories can be generated (under various constraints) to enrich the initial set of trajectories.

Question: How should these new trajectories be generated? Many approaches based on the analysis of the FAs produced by batch mode RL methods have been proposed; results are mixed.

(10)

Rebuilding trajectories

We conjecture that mapping the set of trajectories into FAs generally leads to the loss of essential information for addressing these four issues ⇒ We have developed a new line of research for solving batch mode RL that does not use FAs at all.

Line of research articulated around the rebuilding of artificial (likely “broken”) trajectories by using the set of trajectories input of the batch mode RL problem; a rebuilt trajectory is defined by the elementary pieces of trajectory it is made of.

The rebuilt trajectories are analysed to compute various things: a high-performance policy, performance guarantees, where to sample, etc.

(11)

[Figure: a blue arrow denotes an elementary piece of trajectory. Panels: the set of trajectories given as input of the batch mode RL problem; examples of 5-length rebuilt trajectories.]

(12)

Model-Free Monte Carlo Estimator

Building an oracle that estimates the performance of a policy: important problem in batch mode RL.

Indeed, if an oracle is available, the problem of estimating a high-performance policy can be reduced to an optimization problem over a set of candidate policies.

If a model of the sequential decision problem is available, a Monte Carlo estimator (i.e., rollouts) can be used to estimate the performance of a policy.

We detail an approach that estimates the performance of a policy by rebuilding trajectories so as to mimic the behavior of the Monte Carlo estimator.

(13)

Context in which the approach is presented

Discrete-time dynamics:

$x_{t+1} = f(x_t, u_t, w_t)$, $t = 0, 1, \ldots, T-1$, where $x_t \in X$, $u_t \in U$ and $w_t \in W$. $w_t$ is drawn at every time step according to $P_w(\cdot)$.

Reward observed after each system transition: $r_t = \rho(x_t, u_t, w_t)$ where $\rho : X \times U \times W \to \mathbb{R}$ is the reward function.

Type of policies considered: $h : \{0, 1, \ldots, T-1\} \times X \to U$.

Performance criterion: expected sum of the rewards observed over the $T$-length horizon

$$PC^h(x) = J^h(x) = \mathbb{E}_{w_0, \ldots, w_{T-1}} \left[ \sum_{t=0}^{T-1} \rho(x_t, h(t, x_t), w_t) \right]$$

with $x_0 = x$ and $x_{t+1} = f(x_t, h(t, x_t), w_t)$.

Available information: a set of elementary pieces of trajectories $F_n = \{(x^l, u^l, r^l, y^l)\}_{l=1}^{n}$. $f$, $\rho$ and $P_w$ are unknown.

Approach aimed at estimating $J^h(x)$ from $F_n$.

(14)

Monte Carlo Estimator

Generate nbTraj $T$-length trajectories by simulating the system starting from the initial state $x_0$; for every trajectory compute the sum of rewards collected; average these sums of rewards over the nbTraj trajectories to get an estimate $MCE^h(x_0)$ of $J^h(x_0)$.

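[Figure: illustration with nbTraj = 3 and $T = 5$. Three simulated trajectories start from $x_0$; along each one, $w_t \sim P_w(\cdot)$, $r_t = \rho(x_t, h(t, x_t), w_t)$ and $x_{t+1} = f(x_t, h(t, x_t), w_t)$.]

sum rew. traj. 1 $= \sum_{i=0}^{4} r_i$

$MCE^h(x_0) = \frac{1}{3} \sum_{i=1}^{3}$ sum rew. traj. $i$

Bias $MCE^h(x_0) = \mathbb{E}_{\text{nbTraj} * T \text{ rand. var. } w \sim P_w(\cdot)} \left[ MCE^h(x_0) - J^h(x_0) \right] = 0$

Var. $MCE^h(x_0)$ ...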
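The Monte Carlo estimator described on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `monte_carlo_estimate`, the scalar states and actions, and the callable `sample_w` standing in for $P_w(\cdot)$ are choices made for the example. It requires access to $f$, $\rho$ and $P_w$, which is exactly what is not available in batch mode.

```python
import random


def monte_carlo_estimate(f, rho, sample_w, h, x0, T, nb_traj, seed=0):
    """Average the sum of rewards over nb_traj simulated T-length trajectories.

    f(x, u, w) -> next state, rho(x, u, w) -> reward,
    sample_w(rng) -> one disturbance drawn from P_w, h(t, x) -> action.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(nb_traj):
        x, sum_rewards = x0, 0.0
        for t in range(T):
            u = h(t, x)              # action chosen by the policy
            w = sample_w(rng)        # fresh disturbance at every time step
            sum_rewards += rho(x, u, w)
            x = f(x, u, w)
        total += sum_rewards
    return total / nb_traj
```

With a deterministic system (no dependence on `w`), a single trajectory already gives the exact return, which is the situation exploited later in the deterministic case.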

(15)

Description of Model-free Monte Carlo Estimator

(MFMC)

Principle: To rebuild nbTraj $T$-length trajectories using the elements of the set $F_n$ and to average the sum of rewards collected along the rebuilt trajectories to get an estimate $MFMCE^h(F_n, x_0)$ of $J^h(x_0)$.

Trajectories rebuilding algorithm: Trajectories are sequentially rebuilt; an elementary piece of trajectory can only be used once; trajectories are grown in length by selecting at every instant $t = 0, 1, \ldots, T-1$ the elementary piece of trajectory $(x, u, r, y)$ that minimizes the distance $\Delta((x, u), (x^{end}, h(t, x^{end})))$, where $x^{end}$ is the ending state of the already rebuilt part of the trajectory ($x^{end} = x_0$ if $t = 0$).
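The rebuilding procedure above can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the function name `mfmc_estimate`, the representation of $F_n$ as a list of `(x, u, r, y)` tuples, and the brute-force nearest-neighbour search are assumptions made for the example. The distance `dist` plays the role of $\Delta$.

```python
def mfmc_estimate(batch, h, x0, T, nb_traj, dist):
    """Model-free Monte Carlo estimate of the return of policy h from x0.

    batch: list of (x, u, r, y) elementary pieces of trajectory.
    dist((x, u), (x2, u2)): distance on the state-action space.
    Returns the estimate and the list of per-trajectory sums of rewards.
    """
    if nb_traj * T > len(batch):
        raise ValueError("need at least nb_traj * T elementary pieces")
    unused = list(range(len(batch)))      # each piece can be used only once
    returns = []
    for _ in range(nb_traj):
        x_end, sum_rewards = x0, 0.0
        for t in range(T):
            target = (x_end, h(t, x_end))
            # Pick the unused piece whose (x, u) is closest to the target pair.
            best = min(unused, key=lambda l: dist(batch[l][:2], target))
            unused.remove(best)
            _, _, r, y = batch[best]
            sum_rewards += r
            x_end = y                     # the rebuilt trajectory continues from y
        returns.append(sum_rewards)
    return sum(returns) / nb_traj, returns
```

A possible choice for scalar states and actions is `dist = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])`. The search here costs on the order of `nb_traj * T * len(batch)` distance evaluations; a spatial index would be a natural refinement.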

(16)

Remark: When sequentially selecting the pieces of trajectories, no information on the value of the disturbance $w$ “behind” the new elementary piece of trajectory $(x, u, r = \rho(x, u, w), y = f(x, u, w))$ that is going to be selected is given if only $(x, u)$ and the previous elementary pieces of trajectories selected are known. Important for having a meaningful estimator!

[Figure: illustration with nbTraj = 3, $T = 5$ and $F_{24} = \{(x^l, u^l, r^l, y^l)\}_{l=1}^{24}$. Three trajectories are rebuilt from $x_0$; one of the pieces used is $(x^{18}, u^{18}, r^{18}, y^{18})$.]

sum rew. re. traj. 1 $= r^3 + r^{18} + r^{21} + r^7 + r^9$

$MFMCE^h(F_n, x_0) = \frac{1}{3} \sum_{i=1}^{3}$ sum rew. re. traj. $i$

(17)

Analysis of the MFMC

Random set $\tilde{F}_n$ defined as follows: it is made of $n$ elementary pieces of trajectory where the first two components of an element, $(x^l, u^l)$, are given by the first two components of the $l$th element of $F_n$ and the last two are generated by drawing for each $l$ a disturbance signal $w^l$ at random from $P_w(\cdot)$ and taking $r^l = \rho(x^l, u^l, w^l)$ and $y^l = f(x^l, u^l, w^l)$. $F_n$ is a realization of the random set $\tilde{F}_n$.

Bias and variance of MFMCE defined as:

$$\text{Bias } MFMCE^h(\tilde{F}_n, x_0) = \mathbb{E}_{w^1, \ldots, w^n \sim P_w} \left[ MFMCE^h(\tilde{F}_n, x_0) - J^h(x_0) \right]$$

$$\text{Var. } MFMCE^h(\tilde{F}_n, x_0) = \mathbb{E}_{w^1, \ldots, w^n \sim P_w} \left[ \left( MFMCE^h(\tilde{F}_n, x_0) - \mathbb{E}_{w^1, \ldots, w^n \sim P_w} \left[ MFMCE^h(\tilde{F}_n, x_0) \right] \right)^2 \right]$$

We provide bounds on the bias and variance of this estimator.

(18)

Assumptions

1] The functions $f$, $\rho$ and $h$ are Lipschitz continuous: $\exists L_f, L_\rho, L_h \in \mathbb{R}^+ : \forall (x, x', u, u', w) \in X^2 \times U^2 \times W$;

$\|f(x, u, w) - f(x', u', w)\|_X \le L_f (\|x - x'\|_X + \|u - u'\|_U)$

$|\rho(x, u, w) - \rho(x', u', w)| \le L_\rho (\|x - x'\|_X + \|u - u'\|_U)$

$\|h(t, x) - h(t, x')\|_U \le L_h \|x - x'\|_X \quad \forall t \in \{0, 1, \ldots, T-1\}$.

2] The distance $\Delta$ is chosen such that: $\Delta((x, u), (x', u')) = (\|x - x'\|_X + \|u - u'\|_U)$.

(19)

Characterization of the bias and the variance

Theorem.

$$\text{Bias } MFMCE^h(\tilde{F}_n, x_0) \le C * \text{sparsity of } F_n(\text{nbTraj} * T)$$

$$\text{Var. } MFMCE^h(\tilde{F}_n, x_0) \le \left( \sqrt{\text{Var. } MCE^h(x_0)} + 2C * \text{sparsity of } F_n(\text{nbTraj} * T) \right)^2$$

with $C = L_\rho \sum_{t=0}^{T-1} \sum_{i=0}^{T-t-1} [L_f (1 + L_h)]^i$ and with the sparsity of $F_n(k)$ defined as the minimal radius $r$ such that all balls in $X \times U$ of radius $r$ contain at least $k$ state-action pairs $(x^l, u^l)$ of the set $F_n$.
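To make the two quantities in the theorem concrete, here is a hedged Python sketch. `bias_variance_constant` evaluates $C$ exactly as written above. `k_sparsity_on_grid` is only an approximation introduced for illustration: the true sparsity is a supremum over all of $X \times U$, whereas this helper takes the maximum over a finite, user-supplied list of query points, so it can underestimate the true value. Both function names are hypothetical.

```python
def bias_variance_constant(L_f, L_rho, L_h, T):
    """C = L_rho * sum_{t=0}^{T-1} sum_{i=0}^{T-t-1} [L_f (1 + L_h)]^i."""
    a = L_f * (1.0 + L_h)
    return L_rho * sum(a ** i for t in range(T) for i in range(T - t))


def k_sparsity_on_grid(pairs, queries, k, dist):
    """Approximate k-sparsity of a sample of state-action pairs.

    For each query point, take the distance to its k-th nearest pair;
    return the largest such distance over the queries (a lower estimate
    of the supremum over the whole state-action space).
    """
    if k > len(pairs):
        raise ValueError("k cannot exceed the number of pairs")
    worst = 0.0
    for q in queries:
        distances = sorted(dist(p, q) for p in pairs)
        worst = max(worst, distances[k - 1])
    return worst
```

Since $C$ grows geometrically with $T$ whenever $L_f (1 + L_h) > 1$, the bounds are informative mainly for short horizons or contractive dynamics.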

(20)

Test system

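Discrete-time dynamics: $x_{t+1} = \sin\left(\frac{\pi}{2}(x_t + u_t + w_t)\right)$ with $X = [-1, 1]$, $U = [-\frac{1}{2}, \frac{1}{2}]$, $W = [-\frac{0.1}{2}, \frac{0.1}{2}]$ and $P_w(\cdot)$ a uniform pdf.

Reward observed after each system transition: $r_t = \frac{1}{2\pi} e^{-\frac{1}{2}(x_t^2 + u_t^2)} + w_t$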

Performance criterion: Expected sum of the rewards observed over a 15-length horizon (T= 15).

We want to evaluate the performance of the policy

h(t, x ) = −x

(21)

[Figure: simulations for nbTraj = 10 and size of $F_n$ = 100, ..., 10000.]

(22)

[Figure: simulations for nbTraj = 1, ..., 100 and size of $F_n$ = 10,000.]

(23)

Remember what was said about RL + FAs:

1. not well adapted to risk sensitive performance criteria

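Suppose the risk-sensitive performance criterion:

$$PC^h(x) = \begin{cases} -\infty & \text{if } P\left( \sum_{t=0}^{T-1} \rho(x_t, h(t, x_t), w_t) < b \right) > c \\ J^h(x) & \text{otherwise} \end{cases}$$

where $J^h(x) = \mathbb{E}\left[ \sum_{t=0}^{T-1} \rho(x_t, h(t, x_t), w_t) \right]$.

MFMCE adapted to this performance criterion:

Rebuild nbTraj trajectories starting from $x_0$ using the set $F_n$ as done with the MFMC estimator. Let sum rew traj $i$ be the sum of rewards collected along the $i$th trajectory. Output as estimation of $PC^h(x_0)$:

$$\begin{cases} -\infty & \text{if } \dfrac{\sum_{i=1}^{\text{nbTraj}} I_{\{\text{sum rew traj } i < b\}}}{\text{nbTraj}} > c \\[2ex] \dfrac{\sum_{i=1}^{\text{nbTraj}} \text{sum rew traj } i}{\text{nbTraj}} & \text{otherwise} \end{cases}$$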
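The adapted estimator only needs the per-trajectory sums of rewards, so it can be sketched as a small post-processing step. The function name `risk_sensitive_estimate` is an assumption for the example; `returns` is meant to hold the sums of rewards of the rebuilt trajectories, however they were obtained.

```python
def risk_sensitive_estimate(returns, b, c):
    """Estimate the risk-sensitive criterion from per-trajectory returns.

    returns: sums of rewards collected along the rebuilt trajectories.
    Outputs -inf if the empirical fraction of returns below b exceeds c,
    and the average return otherwise.
    """
    nb_traj = len(returns)
    if nb_traj == 0:
        raise ValueError("at least one trajectory is required")
    fraction_below = sum(1 for g in returns if g < b) / nb_traj
    if fraction_below > c:
        return float("-inf")
    return sum(returns) / nb_traj
```

The empirical fraction is a coarse estimate of the probability when nbTraj is small: with nbTraj = 10, for instance, it can only take values that are multiples of 0.1.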

(24)

MFMCE in the deterministic case

We consider from now on that: $x_{t+1} = f(x_t, u_t)$ and $r_t = \rho(x_t, u_t)$.

One single trajectory is sufficient to compute exactly $J^h(x_0)$ by Monte Carlo estimation.

Theorem. Let $[(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T-1}$ be the trajectory rebuilt by the MFMCE when using the distance measure $\Delta((x, u), (x', u')) = \|x - x'\| + \|u - u'\|$. If $f$, $\rho$ and $h$ are Lipschitz continuous, we have

$$|MFMCE^h(x_0) - J^h(x_0)| \le \sum_{t=0}^{T-1} L_{Q_{T-t}} \Delta((y^{l_{t-1}}, h(t, y^{l_{t-1}})), (x^{l_t}, u^{l_t}))$$

where $y^{l_{-1}} = x_0$ and $L_{Q_N} = L_\rho \left( \sum_{t=0}^{N-1} [L_f (1 + L_h)]^t \right)$.

(25)

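The previous theorem extends to any rebuilt trajectory:

Theorem. Let $[(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T-1}$ be any rebuilt trajectory. If $f$, $\rho$ and $h$ are Lipschitz continuous, we have

$$\left| \sum_{t=0}^{T-1} r^{l_t} - J^h(x_0) \right| \le \sum_{t=0}^{T-1} L_{Q_{T-t}} \Delta((y^{l_{t-1}}, h(t, y^{l_{t-1}})), (x^{l_t}, u^{l_t}))$$

where $\Delta((x, u), (x', u')) = \|x - x'\| + \|u - u'\|$, $y^{l_{-1}} = x_0$ and $L_{Q_N} = L_\rho \left( \sum_{t=0}^{N-1} [L_f (1 + L_h)]^t \right)$.

[Figure: a trajectory generated by policy $h$ from $x_0$, with rewards $r_t = \rho(x_t, h(t, x_t))$, compared with a rebuilt trajectory with rewards $r^{l_t}$, for $T = 5$. With $\Delta_t = L_{Q_{T-t}} (\|y^{l_{t-1}} - x^{l_t}\| + \|h(t, y^{l_{t-1}}) - u^{l_t}\|)$, e.g. $\Delta_2 = L_{Q_3} (\|y^{l_1} - x^{l_2}\| + \|h(2, y^{l_1}) - u^{l_2}\|)$, we have $\left| \sum_{t=0}^{4} r_t - \sum_{t=0}^{4} r^{l_t} \right| \le \sum_{t=0}^{4} \Delta_t$.]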
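The right-hand side of the theorem is straightforward to evaluate for a given rebuilt trajectory. The Python sketch below assumes scalar states and actions (so norms are absolute values), a rebuilt trajectory given as a list of `(x, u, r, y)` tuples in time order, and Lipschitz constants supplied by the user; the function names are hypothetical.

```python
def lipschitz_q(N, L_f, L_rho, L_h):
    """L_{Q_N} = L_rho * sum_{t=0}^{N-1} [L_f (1 + L_h)]^t."""
    a = L_f * (1.0 + L_h)
    return L_rho * sum(a ** t for t in range(N))


def trajectory_error_bound(traj, h, x0, L_f, L_rho, L_h):
    """Bound on |sum of rewards of the rebuilt trajectory - J^h(x0)|.

    traj: rebuilt trajectory [(x, u, r, y), ...] of length T, in time order.
    Each term penalises the 'jump' between where the previous piece ended
    (and what h would do there) and where the next piece actually starts.
    """
    T = len(traj)
    y_prev, bound = x0, 0.0
    for t, (x, u, r, y) in enumerate(traj):
        delta = abs(y_prev - x) + abs(h(t, y_prev) - u)
        bound += lipschitz_q(T - t, L_f, L_rho, L_h) * delta
        y_prev = y
    return bound
```

When the rebuilt trajectory is an unbroken trajectory actually generated by $h$ from $x_0$, every jump is zero and so is the bound. Early jumps are weighted more heavily than late ones because $L_{Q_{T-t}}$ decreases with $t$.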

(26)

Computing a lower bound on a policy

From the previous theorem, we have for any rebuilt trajectory $[(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T-1}$:

$$J^h(x_0) \ge \sum_{t=0}^{T-1} r^{l_t} - \sum_{t=0}^{T-1} L_{Q_{T-t}} \Delta((y^{l_{t-1}}, h(t, y^{l_{t-1}})), (x^{l_t}, u^{l_t}))$$

This suggests finding the rebuilt trajectory that maximizes the right-hand side of the inequality to compute a tight lower bound on $h$. Let:

$$\text{lower bound}(h, x_0, F_n) \triangleq \max_{[(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T-1}} \sum_{t=0}^{T-1} r^{l_t} - \sum_{t=0}^{T-1} L_{Q_{T-t}} \Delta((y^{l_{t-1}}, h(t, y^{l_{t-1}})), (x^{l_t}, u^{l_t}))$$

(27)

A tight upper bound on $J^h(x)$ can be defined and computed in a similar way:

$$\text{upper bound}(h, x_0, F_n) \triangleq \min_{[(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T-1}} \sum_{t=0}^{T-1} r^{l_t} + \sum_{t=0}^{T-1} L_{Q_{T-t}} \Delta((y^{l_{t-1}}, h(t, y^{l_{t-1}})), (x^{l_t}, u^{l_t}))$$

Why are these bounds tight? Because:

$\exists C \in \mathbb{R}^+$: $J^h(x) - \text{lower bound}(h, x, F_n) \le C * \text{sparsity of } F_n(1)$

$\text{upper bound}(h, x, F_n) - J^h(x) \le C * \text{sparsity of } F_n(1)$

Functions lower bound$(h, x_0, F_n)$ and upper bound$(h, x_0, F_n)$ can be implemented in a “smart way” by seeing the problem as a problem of finding the shortest path in a graph. Complexity linear in $T$ and quadratic in $|F_n|$.
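One way to obtain the stated complexity is a Viterbi-style dynamic program over the pieces of $F_n$, sketched below for the lower bound. This is an illustration under assumptions, not the authors' code: states and actions are scalars, the same piece may appear at several time steps of a rebuilt trajectory, and the function name and argument layout are hypothetical.

```python
def lower_bound(batch, h, x0, T, L_f, L_rho, L_h):
    """Maximise sum_t r^{l_t} - sum_t L_{Q_{T-t}} * Delta_t over rebuilt trajectories.

    batch: list of (x, u, r, y) pieces. value[l] holds the best score of a
    partial rebuilt trajectory whose last piece is l. Cost: O(T * n^2).
    """
    a = L_f * (1.0 + L_h)

    def L_Q(N):
        return L_rho * sum(a ** i for i in range(N))

    def delta(y_prev, t, x, u):
        # Jump between the end of the previous piece and the start of the next.
        return abs(y_prev - x) + abs(h(t, y_prev) - u)

    # t = 0: the "previous ending state" is the initial state x0.
    value = [r - L_Q(T) * delta(x0, 0, x, u) for (x, u, r, _) in batch]
    for t in range(1, T):
        coef = L_Q(T - t)
        value = [
            r + max(v_prev - coef * delta(y_prev, t, x, u)
                    for v_prev, (_, _, _, y_prev) in zip(value, batch))
            for (x, u, r, _) in batch
        ]
    return max(value)
```

The upper bound follows the same recursion with `min` in place of `max` and the penalty added instead of subtracted.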

(28)

Remember what was said about RL + FAs:

2. may lead to unsafe policies - poor performance guarantees

Let $H$ be a set of candidate high-performance policies. To obtain a policy with good performance guarantees, we suggest solving the following problem:

$$h \in \arg\max_{h' \in H} \text{lower bound}(h', x_0, F_n)$$

If $H$ is the set of open-loop policies, solving the above optimization problem can be seen as identifying an “optimal” rebuilt trajectory and outputting as open-loop policy the sequence of actions taken along this rebuilt trajectory.
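For open-loop policies, the idea of identifying an “optimal” rebuilt trajectory can be sketched with the same kind of dynamic program, now keeping back-pointers so that the action sequence can be read off. The sketch rests on assumptions stated here and in the comments: the open-loop policy plays the action of each selected piece (so the action term of $\Delta$ vanishes and $L_h = 0$), states are scalars, pieces may be reused, and the function name `best_open_loop` is hypothetical.

```python
def best_open_loop(batch, x0, T, L_f, L_rho):
    """Return (actions, bound): the action sequence along the rebuilt
    trajectory maximising sum_t r^{l_t} - sum_t L_{Q_{T-t}} |y^{l_{t-1}} - x^{l_t}|.

    batch: list of (x, u, r, y) pieces; L_Q is evaluated with L_h = 0.
    """
    def L_Q(N):
        return L_rho * sum(L_f ** i for i in range(N))

    n = len(batch)
    value = [r - L_Q(T) * abs(x0 - x) for (x, _, r, _) in batch]
    back = []                                  # back[t-1][l] = best predecessor of l
    for t in range(1, T):
        coef = L_Q(T - t)
        new_value, pointers = [], []
        for (x, _, r, _) in batch:
            scores = [value[j] - coef * abs(batch[j][3] - x) for j in range(n)]
            j_best = max(range(n), key=scores.__getitem__)
            new_value.append(r + scores[j_best])
            pointers.append(j_best)
        value = new_value
        back.append(pointers)
    last = max(range(n), key=value.__getitem__)
    bound = value[last]
    sequence = [last]
    for pointers in reversed(back):            # walk the back-pointers to t = 0
        last = pointers[last]
        sequence.append(last)
    sequence.reverse()
    return [batch[l][1] for l in sequence], bound
```

If the batch contains an unbroken trajectory starting at `x0`, that trajectory incurs no penalty, so the returned bound is at least its sum of rewards.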

(29)

[Figure: policies obtained on the puddle world with FQI with trees and with $h \in \arg\max_{h' \in H} \text{lower bound}(h', x_0, F_n)$, for a trajectory set covering the puddle and for a trajectory set not covering the puddle.]

(30)

Remember what was said about RL + FAs:

3. may make suboptimal use of near-optimal trajectories

Suppose a deterministic batch mode RL problem and that in $F_n$ you have the elements of the trajectory:

$(x_0^{\text{opt. traj.}}, u_0, r_0, x_1, u_1, r_1, x_2, \ldots, x_{T-2}, u_{T-2}, r_{T-2}, x_{T-1}, u_{T-1}, r_{T-1}, x_T)$

where the $u_t$'s have been selected by an optimal policy.

Let $H$ be the set of open-loop policies. Then, the sequence of actions $h \in \arg\max_{h' \in H} \text{lower bound}(h', x_0^{\text{opt. traj.}}, F_n)$ is an optimal one whatever the other trajectories in the set.

Actually, the sequence of actions $h$ outputted by this algorithm tends to be a concatenation of subsequences of actions belonging to optimal trajectories.

(31)

Remember what was said about RL + FAs:

4. offer few clues about how to generate new experiments in an optimal way

The functions lower bound$(h, x_0, F_n)$ and upper bound$(h, x_0, F_n)$ can be exploited for generating new trajectories.

For example, suppose that you can sample the state-action space several times so as to generate $m$ new elementary pieces of trajectories to enrich your initial set $F_n$. We have proposed a technique to determine $m$ “interesting” sampling locations based on these bounds.

This technique - which is still preliminary - targets sampling locations that lead to the largest bound width decrease for candidate optimal policies.

(32)

Closure

Rebuilding trajectories: interesting concept for solving many problems related to batch mode RL.

Actually, the solution outputted by many RL algorithms (e.g., model-learning with kNN, fitted Q iteration with trees) can be characterized by a set of “rebuilt trajectories”.

⇒ I suspect that this concept of rebuilt trajectories could lead to a general paradigm for analyzing and designing RL algorithms.


Presentation based on the paper:

Batch mode reinforcement learning based on the synthesis of artificial trajectories. R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Annals of Operations Research, Volume 208, Issue 1, September 2013, Pages 383-416.