Batch mode reinforcement learning
based on the synthesis of artificial
trajectories
Damien Ernst
University of Li `ege, BelgiumBatch mode Reinforcement Learning
≃
Learning a high-performance policy for a sequential decision problem where:
• a numerical criterion is used to define the performance of a policy. (An optimal policy is the policy that maximizes this numerical criterion.)
• “the only” (or “most of the”) information available on the sequential decision problem is contained in a set of trajectories.
Batch mode RL stands at the intersection of three worlds: optimization (maximization of the numerical criterion), system/control theory (sequential decision problem) and machine learning (inference from a set of trajectories).
A typical batch mode RL problem
Discrete-time dynamics:
xt+1 = f (xt,ut,wt) t= 0, 1, . . . , T − 1 where xt ∈ X , ut ∈ U
and wt∈ W . wt is drawn at every time step according to Pw(·).
Reward observed after each system transition: rt = ρ(xt,ut,wt) where ρ : X × U × W → R is the reward
function.
Type of policies considered: h: {0, 1, . . . , T − 1} × X → U. Performance criterion: Expected sum of the rewards observed over the T -length horizon
PCh(x ) = Jh(x ) = E w0,...,wT−1
[PT−1
t=0 ρ(xt,h(t, xt), wt)] with x0= x
and xt+1= f (xt,h(t, xt), wt).
Available information: A set ofelementary pieces of trajectoriesFn= {(xl,ul,rl,yl)}nl=1where yl is the state reached after taking action ul in state xl and rl the instantaneous reward associated with the transition. The
Batch mode RL and function approximators
Training function approximators(radial basis functions, neural nets, trees, etc) using the information contained in the set of trajectories is akey elementto most of the resolution schemes for batch mode RL problems with state-action spaces having a large (infinite) number of elements.Two typical uses of FAs for batch mode RL:
• the FAsmodel of the sequential decision problem (in our typical problem f , r and Pw). The model is afterwards
exploited as if it was the real problem to compute a high-performance policy.
• the FAs represent (state-action)value functions which are used in iterative schemes so as to converge to a (state-action) value function from which a
high-performance policy can be computed. Iterative schemes based on the dynamic programming principle (e.g., LSPI, FQI, Q-learning).
Why look beyond function approximators?
FAs based techniques: mature, can successfully solve many real life problemsbut:
1. not well adapted to risk sensitive performance criteria
2. may lead to unsafe policies - poor performance guarantees
3. may make suboptimal use of near-optimal trajectories
4. offer little clues about how to generate new experiments in an optimal way
1.
not well adapted to risk sensitive performance criteria
An example of risk sensitive performance criterion:
PCh(x ) =
(
−∞ if P(PT−1
t=0 ρ(xt,h(t, xt), wt) < b) > c
Jh(x ) otherwise.
FAs with dynamic programming: very problematic because
(state-action) value functions need to become functions that take as values “probability distributions of future rewards” and not “expected rewards”.
FAs with model learning: more likely to succeed; but what
about the challenges of fitting the FAs to model the distribution of future states reached (rewards collected) by policies and not only an average behavior?
2.
may lead to unsafe policies - poor performance
guarantees
•Benchmark: •Trajectory set •Trajectory set not puddle world covering the puddle covering the puddle
•RL algorithms: ⇒ Optimal policy ⇒ Suboptimal
FQI with trees (unsafe) policy
Typical performance guarantee in the deterministic case for FQI = (estimated return by FQI of the policy it outputsminus
constant×(’size’ of the largest area of the state space not covered by the sample)).
3.
may make suboptimal use of near-optimal
trajectories
Suppose a deterministic batch mode RL problem and that in the set of trajectory, there is a trajectory:
(x0opt. traj.,u0,r0,x1,u1,r1,x2, . . . ,xT−2,uT−2,rT−2,xT−1,uT−1,rT−1,xT)
where the uts have been selected by an optimal policy.
Question:Which batch mode RL algorithms will output a policy which is optimal for the initial state x0opt. traj.whatever the other trajectories in the set?Answer:Not that many and certainly not those using parametric FAs.
In my opinion: batch mode RL algorithms can only be successful on large-scale problems if(i) in the set of trajectories, many trajectories have been generated by (near-)optimal policies(ii)the algorithms exploit very well the information contained in those (near-)optimal trajectories.
4.
offer little clues about how to generate new
experiments in an optimal way
Many real-life problems are variants of batch mode RL
problems for which (a limited number of) additional trajectories can be generated (under various constraints) to enrich the initial set of trajectories.
Question:How should these new trajectories be generated? Many approaches based on the analysis of the FAs produced by batch mode RL methods have been proposed; results are mixed.
Rebuilding trajectories
We conjecture that mapping the set of trajectories intoFAs
generally lead to theloss of essential informationfor
addressing these four issues⇒We have developed a new line of research for solving batch mode RL that does not use at all FAs.
Line of research articulated around thethe rebuilding of artificial (likely “broken”) trajectoriesby using the set of trajectories input of the batch mode RL problem; a rebuilt trajectory is defined by the elementary pieces of trajectory it is made of.
The rebuilt trajectories are analysed to compute various things: a high-performance policy, performance guarantees, where to sample, etc.
BLUE ARROW
= elementary piece of trajectory
Set of trajectories
Examples of 5-length
given as input of the
rebuilt trajectories made
Model-Free Monte Carlo Estimator
Building anoraclethat estimates the performance of a policy: important problem in batch mode RL.
Indeed, if an oracle is available, problem of estimating a high-performance policy can bereduced to an optimization
problemover a set of candidate policies.
If a model of sequential decision problem is available, aMonte Carlo estimator(i.e., rollouts) can be used to estimate the performance of a policy.
We detail an approach that estimates the performance of a policy by rebuilding trajectories so as tomimic the behavior of
Context in which the approach is presented
Discrete-time dynamics:
xt+1 = f (xt,ut,wt) t= 0, 1, . . . , T − 1 where xt ∈ X , ut ∈ U
and wt∈ W . wt is drawn at every time step according to Pw(·)
Reward observed after each system transition: rt = ρ(xt,ut,wt) where ρ : X × U × W → R is the reward
function.
Type of policies considered: h: {0, 1, . . . , T − 1} × X → U. Performance criterion: Expected sum of the rewards observed over the T -length horizon
PCh(x ) = Jh(x ) = E w0,...,wT−1
[PT−1
t=0 ρ(xt,h(t, xt), wt)] with x0= x
and xt+1= f (xt,h(t, xt), wt).
Available information: A set ofelementary pieces of
trajectoriesFn= {(xl,ul,rl,yl)}nl=1. f , ρ and Pw are unknown.
Approach aimed at estimating Jh(x ) from F n.
Monte Carlo Estimator
Generate nbTrajT -length trajectories by simulating the system
starting from the initial state x0; for every trajectory compute the
sum of rewards collected; average these sum of rewards over the nbTraj trajectories to get an estimateMCEh(x0)of Jh(x0).
trajectory 2 r1 r2 r0 r4 r3= r (x3,h(3,x3),w3) trajectory 1
sum rew.traj.1=P4i=0ri
MCEh(x 0) = 13
P3
i=1sum rew.traj.i w3∼ Pw(·)x 4= f (x3,h(3,x3),w3) x3 Illustration with and T= 5 nbTraj= 3 x0 trajectory 3 Bias MCEh(x 0)= E
nbTraj∗T rand. var . w ∼Pw(·)
[MCEh(x
0) − Jh(x0)]= 0
Var. MCEh(x
Description of Model-free Monte Carlo Estimator
(MFMC)
Principle: To rebuild nbTraj T -length trajectories using the
elements of the setFnand to average the sum of rewards
collected along the rebuilt trajectories to get an estimate
MFMCEh(F
n,x0) of Jh(x0).
Trajectories rebuilding algorithm: Trajectories are
sequentially rebuilt; an elementary piece of trajectory can only be used once; trajectories are grown in length by selecting at every instant t= 0, 1, . . . , T − 1 the elementary piece of trajectory(x , u, r , y ) that minimizes the
distance∆((x , u), (xend,h(t, xend)))
where xend is the ending state of the already rebuilt part of the
trajectory (xend = x
Remark: When sequentially selecting the pieces of
trajectories, no information on the value of the disturbance w “behind” the new piece of elementary trajectory
(x , u, r = ρ(x , u, w), y = f (x , u, w)) that is going to be selected is given if only(x , u) and the previous elementary pieces of trajectories selected are known. Important for having a meaningful estimator!!! rebuilt trajectory 2 r3 r21 r9 rebuilt trajectory 1 x0 rebuilt trajectory 3 r18 Illustration with T= 5 and (x18 ,r 18 ,u 18 ,y 18)
sum rew.re.traj.1= r3+ r18+ r21+ r7+ r9
MFMCEh(Fn,x0) = 1 3
P3
i=1sum rew.re.traj.i nbTraj= 3, r7 F24= {(xl,r l ,u l ,y l)}24 l=1
Analysis of the MFMC
Random set ˜Fndefined as follows:Made of n elementary pieces of trajectory where the first two components of an element(xl
,ul) are given by the first two element of the lth element ofFnand the last two are generated
by drawing for each l a disturbance signal wl at random from
PW(·) and taking rl = ρ(xl,ul,wl) and yl = f (xl,ul,wl). Fnis arealization of the random set ˜Fn.
Biasandvarianceof MFMCE defined as: Bias MFMCEh( ˜F n,x0) = E w1,...,wn∼P[MFMCEw h( ˜F n,x0) − Jh(x0)] Var . MFMCEh( ˜F n,x0) = E w1,...,wn∼P w [(MFMCEh( ˜F n,x0) − E w1,...,wn∼P w [MFMCEh( ˜F n,x0)])2]
We providebounds of the bias and variance of this estimator.
Assumptions
1] The functions f , ρ and h areLipschitz continuous: ∃Lf,Lρ,Lh∈ R +: ∀(x , x′,u, u′,w) ∈ X2× U2× W ; kf (x , u, w) − f (x′,u′,w)k X ≤ Lf(kx − x′kX+ ku − u′kU) |ρ(x , u, w) − ρ(x′,u′,w)| ≤ L ρ(kx − x ′k X + ku − u′kU) kh(t, x ) − h(t, x′)k U≤ Lhkx − x′k ∀t ∈ {0, 1, . . . , T − 1}.
2] Thedistance∆is chosen such that: ∆((x , u), (x′,u′)) = (kx − x′k
Characterization of the bias and the variance
Theorem. Bias MFMCEh( ˜F n,x0) ≤ C ∗ sparsity of Fn(nbTraj∗ T ) Var . MFMCEh( ˜F n,x0) ≤ (pVar. MCEh(x 0) + 2C ∗ sparsity of Fn(nbTraj∗ T ))2 with C= Lρ PT−1 t=0 PT−t−1i=0 [Lf(1 + Lh)]i and with the
sparsity of Fn(k ) defined as the minimal radius r such that all
balls in X× U of radius r contain at least k state-action pairs (xl,ul) of the set F
Test system
Discrete-time dynamics: xt+1= sin(π2(xt+ ut+ wt)) with
X = [−1, 1], U = [−12,12], W = [−0.12 ,−0.12 ] and Pw(·) a uniform
pdf.
Reward observed after each system transition: rt =2π1e− 1 2(x 2 t+u2t)+ w t
Performance criterion: Expected sum of the rewards observed over a 15-length horizon (T= 15).
We want to evaluate the performance of the policy
h(t, x ) = −x
Simulations for nbTraj= 10 and size of Fn= 100, ..., 10000.
Simulations for nbTraj = 1, . . . , 100 and size of Fn= 10, 000.
Remember what was said about RL + FAs:
1.
not well adapted to risk sensitive performance criteria
Suppose the risk sensitive performance criterion:
PCh(x ) = ( −∞ if P(Jh(x ) < b) > c Jh(x ) otherwise where Jh(x ) = E[PT−1 t=0 ρ(xt,h(t, xt), wt)].
MFMCE adapted to this performance criterion:
Rebuilt nbTrajstarting from x0using the setFnas done with the
MFMCE estimator. Let sum rew traj i be the sum of rewards collected along the ith trajectory. Output as estimation of PCh(x 0) : −∞ if PnbTraj
i=1 I{sum rew traj i<b}
nbTraj >c
PnbTraj
i=1
sum rew traj i
MFMCE in the deterministic case
We consider from now on that: xt+1 = f (xt,ut) and
rt = ρ(xt,ut).
One single trajectory is sufficient to compute exactly Jh(x 0) by
Monte Carlo estimation.
Theorem.Let[(xlt,ult,rlt,ylt)]T−1
t=0 be the trajectory rebuilt by
the MFMCE when using the distance measure ∆((x , u), (x′
,u′)) = kx − x′k + ku − u′k. If f , ρ and h are Lipschitz continuous, we have
|MFMCEh(x0) − Jh(x0)|≤ T−1 X t=0 LQT−t∆((y lt−1,h(t, ylt−1)), (xlt ,ult)) where yl−1 = x 0and LQN = Lρ( PN−1 t=0 [Lf(1 + Lh)]t).
Previous theorem extends to whatever rebuilt trajectory:
Theorem.Let[(xlt,ult,rlt,ylt)]T−1
t=0 be any rebuilt trajectory. If f ,
ρand h are Lipschitz continuous, we have | T−1 X t=0 rlt − J h(x 0)| ≤ T−1 X t=0 LQT−t∆((y lt−1,h(t, ylt−1)), (xlt,ult)) where∆((x , u), (x′,u′)) = kx − x′k + ku − u′k, yl−1 = x 0and LQN = Lρ( PN−1 t=0 [Lf(1 + Lh)]t). r2= r (x2,h(2,x2)) r3
trajectory generated by policy h
∆2= LQ3(kyl1− xl2k + kh(2,yl1) − ul2k) rl4 rl1 xl2,ul2 |P4 t=0rt−P4t=0rlt| ≤ P4 t=0∆t r1 r0 x0 r5 rl0 r l2 rl3 yl1
Computing a lower bound on a policy
From previous theorem, we have for any rebuilt trajectory [(xlt,ult,rlt,ylt)]T−1 t=0 : Jh(x0) ≥ T−1 X t=0 rlt− T−1 X t=0 LQT−t∆((y lt−1,h(t, ylt−1)), (xlt,ult))
This suggests to find the rebuilt trajectory that maximizes the right-hand side of the inequality to compute atightlower bound on h. Let: lower bound(h, x0,Fn), max [(xlt,ult,rlt,ylt)]T−1 t=0 PT−1 t=0 rlt− PT−1 t=0 LQT−t∆((y lt−1,h(t, ylt−1)), (xlt,ult))
Atightupper bound on Jh(x ) can be defined and computed in a similar way: upper bound(h, x0,Fn), min [(xlt,ult,rlt,ylt)]T−1t=0 PT−1 t=0 rlt+ PT−1 t=0 LQT−t∆((y lt−1,h(t, ylt−1)), (xlt,ult))
Why are these bounds tight?Because:
∃C ∈ R+: Jh(x ) − lower bound(h, x , F
n) ≤ C ∗ sparsity of Fn(1)
upper bound(h, x , Fn) − Jh(x ) ≤ C ∗ sparsity of Fn(1)
Functions lower bound(h, x0,Fn) and higher bound(h, x0,Fn)
can be implemented in a “smart way” by seeing the problem as a problem of finding the shortest path in a graph. Complexity linear with T and quadratic with|Fn|.
Remember what was said about RL + FAs:
2.
may lead to unsafe policies - poor performance
guarantees
LetH be a set of candidate high-performance policies. To obtain a policy with good performance guarantees, we suggest to solve the following problem:
h∈ arg max
h∈H
lower bound(h, x0,Fn)
IfH is the set of open-loop policies, solving the above optimization problem can be seen as identifying an “optimal” rebuilt trajectory and outputting as open-loop policy the sequence of actions taken along this rebuilt trajectory.
•Trajectory set covering the puddle: FQI with trees h∈ arg max
h∈H
lower bound(h, x0,Fn)
•Trajectory set not covering the puddle: FQI with trees h∈ arg max
h∈H
Remember what was said about RL + FAs:
3.
may make suboptimal use of near-optimal
trajectories
Suppose a deterministic batch mode RL problem and that in Fnyou have the elements of the trajectory:
(x0opt. traj.,u0,r0,x1,u1,r1,x2, . . . ,xT−2,uT−2,rT−2,xT−1,uT−1,rT−1,xT)
where the uts have been selected by an optimal policy.
LetH be the set of open-loop policies. Then, the sequence of actions h∈ arg max
h∈H
lower bound(h, x0opt. traj.,Fn) is an optimal
one whatever the other trajectories in the set.
Actually, the sequence of action h outputted by this algorithm tends to be anappend of subsequences of actions belonging to optimal trajectories.
Remember what was said about RL + FAs:
4.
offer little clues about how to generate new
experiments in an optimal way
The functions lower bound(h, x0,Fn) and
upper bound(h, x0,Fn) can be exploited for generating new
trajectories.
For example, suppose that you can sample the state-action space several times so as to generate m new elementary pieces of trajectories to enrich your initial setFn. We have
proposed a technique to determine m“interesting” sampling locations based on these bounds.
This technique - which is still preliminary - targets sampling locations that lead to thelargest bound width decreasefor candidate optimal policies.
Closure
Rebuilding trajectories: interesting concept for solving many problems related to batch mode RL.
Actually, the solution outputted by many RL algorithms (e.g., model-learning with k NN, fitted Q iteration with trees) can be characterized by a set of “rebuilt trajectories”.