Supervised learning based sequential decision making
Damien Ernst
Department of Electrical Engineering and Computer Science, University of Liège
ICOPI - October 19-21, 2005
Find slides: http://montefiore.ulg.ac.be/~ernst/
Problem addressed in this talk
How can we design algorithms able to extract good sequential decision-making policies from experience?
⇒ Discuss different algorithms exploiting batch-mode supervised learning.
⇒ Illustrate one of these algorithms, named fitted Q iteration, on an academic power-systems example.
NB.
▶ Many practical problems are concerned (medical applications, robot control, finance, ...)
▶ Most are related to complex and uncertain environments
▶ To simplify derivations, we restrict ourselves to the deterministic case
Batch-mode supervised learning
(Reminder)
▶ From a sample ls = (x, y)^N of N observations (inputs, outputs)
▶ Compute a model ŷ(x) = f_ls(x) (decision tree, MLP, ...)
[Figure: batch-mode supervised learning example — a table of observations with attributes a1, ..., a8 and outputs Y1, Y2, Y3, together with the decision tree induced from it]
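As a concrete companion to this reminder, here is a minimal Python sketch of batch-mode SL; the data, the model class, and the query point are illustrative assumptions, not taken from the talk.

```python
# Minimal sketch of batch-mode supervised learning: from a sample
# ls = (x, y)^N of N observations, fit a model y_hat(x) = f_ls(x).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))          # N = 200 inputs, 2 features
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)   # noisy outputs (assumed)

model = DecisionTreeRegressor(max_depth=4).fit(X, y)   # f_ls
y_hat = model.predict(np.array([[0.3, -0.2]]))         # y_hat(x) at a new x
```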
Learning a static decision policy
(Problem statement and solution)
Given state information x and a set of possible decisions d, learn an approximation of the policy d*(x) maximizing the reward r(x, d).
To learn, let's assume available a sample of N elements of the type (x, d, r).
Case 1: learn from a sample of optimal decisions (x, d*)^N:
⇒ direct application of SL to get d̂*(x).
Case 2: learn from a sample of random decisions (x, d, r)^N:
⇒ apply SL to get r̂(x, d) and compute d̂*(x) = arg max_d r̂(x, d).
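A hedged sketch of Case 2 in Python: the decision set, the reward function, and the choice of scikit-learn's ExtraTreesRegressor are assumptions made for illustration; only the recipe (regress r̂, then arg max over d) comes from the slide.

```python
# Case 2 sketch: learn r_hat(x, d) by regression on random (x, d, r)
# triples, then take d_hat*(x) = argmax_d r_hat(x, d).
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(1)
decisions = np.array([0.0, 0.5, 1.0])        # assumed finite decision set

X = rng.uniform(0.0, 1.0, size=(500, 1))     # states
D = rng.choice(decisions, size=(500, 1))     # random decisions
R = -np.abs(X - D).ravel()                   # assumed reward r(x, d)

r_hat = ExtraTreesRegressor(n_estimators=50, random_state=0)
r_hat.fit(np.hstack([X, D]), R)

def d_star_hat(x):
    """Approximate optimal decision: argmax over the finite decision set."""
    inputs = np.column_stack([np.full(len(decisions), x), decisions])
    return decisions[np.argmax(r_hat.predict(inputs))]

print(d_star_hat(0.42))   # should be close to 0.5 for this toy reward
```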
Learning a sequential decision policy
(Problem statement)
Given initial state information x_0, state dynamics x_{t+1} = f(x_t, d_t), and instantaneous reward r(x, d), find (d*_0, ..., d*_{h−1}) maximizing the cumulated discounted reward over h stages:
R(x_0, d_0, ..., d_{h−1}) = ∑_{t=0}^{h−1} γ^t r(x_t, d_t).
Different kinds of optimal policies:
Open loop: d*_t = d*(x_0, t) (OK in the deterministic case)
Closed loop: d*_t = d*(x_t, t) (Also OK in the stochastic case)
Stationary: d*_t = d*(x_t) (OK in the infinite-horizon case)
To learn, let's assume available a sample of N h-stage trajectories (x_0, d_0, r_0, x_1, d_1, r_1, ..., x_{h−1}, d_{h−1}, r_{h−1}, x_h)^N.
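For concreteness, a small sketch of how the cumulated discounted reward of one trajectory could be computed; the dynamics f and reward r used in the toy call are placeholders, not the talk's.

```python
# Sketch of the h-stage discounted return R(x0, d0, ..., d_{h-1}),
# rolled out with assumed dynamics f and reward r.
def cumulated_discounted_reward(x0, decisions, f, r, gamma):
    """Return sum_{t=0}^{h-1} gamma^t r(x_t, d_t), with x_{t+1} = f(x_t, d_t)."""
    x, total = x0, 0.0
    for t, d in enumerate(decisions):
        total += (gamma ** t) * r(x, d)
        x = f(x, d)
    return total

# Toy example (illustrative f and r, not from the talk):
f = lambda x, d: x + d
r = lambda x, d: -abs(x)
print(cumulated_discounted_reward(1.0, [-0.5, -0.5, 0.0], f, r, 0.98))
```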
Learning a sequential decision policy
(Case 1, Direct)
We first assume that the decisions shown in the sample are the optimal ones:
Learning an open-loop policy: can use the open-loop parts of the sample (x_0, d*_0, ..., d*_{h−1})^N
⇒ apply SL h times, ∀t = 0, ..., h−1, to construct d̂*(x_0, t) from (x_0, d*_t)^N.
Learning a closed-loop policy: should use the closed-loop parts of the sample (x_0, x_1, ..., x_{h−1}, d*_0, ..., d*_{h−1})^N
⇒ apply SL h times, ∀t = 0, ..., h−1, to construct d̂*(x_t, t) from (x_t, d*_t)^N.
Possible discussion: for the stochastic case, the closed-loop approach still holds.
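A possible rendering of the closed-loop variant of Case 1 in Python: one classifier per time step, trained on (x_t, d*_t) pairs. The synthetic trajectories and the rule generating the "optimal" decisions are invented for the example.

```python
# Direct (Case 1) closed-loop sketch: one supervised problem per time
# step t, mapping x_t to the optimal decision d_t*.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
N, h = 300, 5
X = rng.uniform(-1.0, 1.0, size=(N, h, 1))    # x_t^i for each trajectory i
D_star = (X[:, :, 0] > 0).astype(int)         # assumed optimal decisions d_t*

# Apply SL h times: one classifier per stage t.
policies = [DecisionTreeClassifier().fit(X[:, t, :], D_star[:, t])
            for t in range(h)]

def d_star_hat(x, t):
    """Approximate closed-loop policy d_hat*(x_t, t)."""
    return policies[t].predict(np.array([[x]]))[0]
```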
Learning a sequential decision policy
(Case 2, Naive approach)
Problem: how to generalize this if the decisions shown in the sample are random (i.e. not necessarily the optimal ones)?
Brute-force approach: one could use the open-loop parts of the sample (x_0, d_0, ..., d_{h−1}, R)^N
⇒ apply SL to construct R̂(x_0, d_0, ..., d_{h−1})
⇒ compute (d_0, ..., d_{h−1})*(x_0) = arg max_{d_0,...,d_{h−1}} R̂(x_0, d_0, ..., d_{h−1}).
Possible discussion: computational complexity of the arg max; sample complexity; what if the system is stochastic...
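A sketch of this brute-force approach, mainly to make the computational-complexity objection concrete: the arg max enumerates all |D|^h decision sequences. The data and returns below are placeholders.

```python
# Brute-force sketch: regress R_hat on (x0, d0, ..., d_{h-1}) and
# argmax by enumerating every decision sequence (|D|^h of them).
import itertools
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(3)
h, decisions = 3, [0, 1]                      # assumed binary decisions
X0 = rng.uniform(-1, 1, size=(500, 1))
D = rng.choice(decisions, size=(500, h))
R = -np.abs(X0.ravel() - D.mean(axis=1))      # placeholder returns

R_hat = ExtraTreesRegressor(n_estimators=50, random_state=0)
R_hat.fit(np.hstack([X0, D]), R)

def best_sequence(x0):
    """Exhaustive argmax over decision sequences: exponential in h."""
    seqs = np.array(list(itertools.product(decisions, repeat=h)))
    inputs = np.hstack([np.full((len(seqs), 1), x0), seqs])
    return seqs[np.argmax(R_hat.predict(inputs))]
```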
Learning a sequential decision policy
(Case 2, Classical approach)
Model-based approach:
▶ Exploit the sample of state transitions (x_t^i, d_t^i, x_{t+1}^i)^{N×h}
⇒ use SL to build a model of the dynamics f̂(x, d).
▶ Exploit the sample of instantaneous rewards (x_t^i, d_t^i, r_t^i)^{N×h}
⇒ use SL to build a model of the reward function r̂(x, d).
▶ Use dynamic programming algorithms (e.g. value iteration) to compute from f̂(x, d) and r̂(x, d) the optimal policy d̂*(x, t).
Possible discussion: can be modified to work in the stochastic case; makes good use of the sample information; exploits the stationary problem structure; has the computational complexity of DP and is therefore problematic in continuous or high-dimensional state spaces...
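One way the model-based approach might look in code, under strong simplifying assumptions (1-D state, finite decisions, value iteration on a discretized grid with a coarse lookup); f̂ and r̂ are fitted with extra-trees as elsewhere in the talk, but everything else is illustrative.

```python
# Model-based sketch: fit f_hat and r_hat from the transition sample,
# then run value iteration on a discretized state grid.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(4)
decisions = np.array([0.0, 1.0])
X = rng.uniform(0, 1, size=(1000, 1))
D = rng.choice(decisions, size=(1000, 1))
X_next = np.clip(X + 0.1 * (D - 0.5), 0, 1)   # placeholder dynamics
R = -np.abs(X - 0.5).ravel()                  # placeholder rewards

XD = np.hstack([X, D])
f_hat = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(XD, X_next.ravel())
r_hat = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(XD, R)

# Value iteration on a grid (coarse lookup of the successor state).
grid, gamma = np.linspace(0, 1, 101), 0.98
V = np.zeros(len(grid))
for _ in range(200):
    Q = np.empty((len(grid), len(decisions)))
    for j, d in enumerate(decisions):
        xd = np.column_stack([grid, np.full(len(grid), d)])
        idx = np.clip(np.searchsorted(grid, f_hat.predict(xd)), 0, len(grid) - 1)
        Q[:, j] = r_hat.predict(xd) + gamma * V[idx]
    V = Q.max(axis=1)
d_star_hat = decisions[Q.argmax(axis=1)]      # stationary policy on the grid
```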
Learning a sequential decision policy
(Case 2, Proposed approach)
Interlacing batch-mode SL and backward value iteration:
▶ Assume we are in a certain state x at time h−1 (last stage):
⇒ define Q_1(x, d) ≡ r(x, d)
⇒ optimal decision: d*(x, h−1) = arg max_d Q_1(x, d)
⇒ SL on (x_t^i, d_t^i, r_t^i)^{N×h} to compute Q̂_1(x, d), then arg max to get d̂*(x, h−1)
▶ Assume we are in state x at time h−2:
⇒ define Q_2(x, d) = r(x, d) + γ max_{d′} Q_1(f(x, d), d′)
⇒ optimal decision: d*(x, h−2) = arg max_d Q_2(x, d)
⇒ SL on (x_t^i, d_t^i, r_t^i + γ max_{d′} Q̂_1(x_{t+1}^i, d′))^{N×h} to get Q̂_2(x, d), then arg max to get d̂*(x, h−2)
▶ Continue h−2 further times to yield the sequence of Q̂_i-functions and policy approximations d̂*(x, h−i), ∀i = 1, ..., h.
▶ This algorithm is called "fitted Q iteration".
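A minimal sketch of fitted Q iteration on a set of four-tuples, assuming a finite decision set and using an ensemble of extremely randomized trees (ExtraTreesRegressor) as the supervised learner; the four-tuples themselves are synthetic, and the iteration count is fixed in the spirit of the infinite-horizon remark on the next slide.

```python
# Fitted Q iteration sketch on four-tuples (x_t, d_t, r_t, x_{t+1}).
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(5)
decisions = np.array([0.0, 1.0])              # assumed finite decision set
gamma, n_iterations = 0.98, 50

# Synthetic four-tuples (x, d, r, x_next) in a 1-D toy system.
X = rng.uniform(-1, 1, size=(2000, 1))
D = rng.choice(decisions, size=(2000, 1))
X_next = np.clip(X + 0.1 * (2 * D - 1), -1, 1)
R = -np.abs(X_next).ravel()

XD = np.hstack([X, D])
Q_hat = None
for _ in range(n_iterations):
    if Q_hat is None:
        targets = R                            # Q_1(x, d) = r(x, d)
    else:
        # Q_i(x, d) = r(x, d) + gamma * max_{d'} Q_hat_{i-1}(x', d')
        q_next = np.column_stack([
            Q_hat.predict(np.hstack([X_next, np.full_like(D, d)]))
            for d in decisions])
        targets = R + gamma * q_next.max(axis=1)
    Q_hat = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(XD, targets)

def d_star_hat(x):
    """Greedy policy from the last fitted Q-function."""
    q = [Q_hat.predict(np.array([[x, d]]))[0] for d in decisions]
    return decisions[int(np.argmax(q))]
```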
Fitted Q iteration: discussion
▶ All the algorithm actually needs in order to work is a sample of four-tuples (x_t^i, d_t^i, r_t^i, x_{t+1}^i)^N and a good supervised learning algorithm (for least-squares regression).
▶ It has been shown to give excellent results when using ensembles of regression trees as the supervised learning algorithm.
▶ It can cope (without modification) with stochastic problems.
▶ It can exploit very large samples efficiently.
▶ It has already proved to work well on several complex continuous-state-space problems.
▶ Iterated sufficiently many times, it yields an approximation of the optimal (closed-loop, stationary) infinite-horizon decision policy.
Illustration: power system control
[Figure: one-machine infinite-bus system — a generator (E∠δ) connected to an infinite bus (V∠0) through the system reactance jX_system and a variable reactance ju, with electrical power P_e; plots of ω vs δ and of P_m, P_e vs t(s)]
The sequential decision problem
▶ Two state variables: δ and ω.
▶ Time discretization: the time between t and t+1 is equal to 50 ms.
▶ Two possible decisions d: capacitance set to zero or capacitance set to its maximal value.
▶ r(x_t, d_t) = −|P_e^{t+1} − P_m| if x_{t+1} ∈ the stability domain, and −100 otherwise.
▶ γ = 0.98
▶ Sequential decision problem such that the optimal stationary ...
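A hedged sketch of this reward function; the stability-domain test and its numeric bounds (taken loosely from the plot axes) are hypothetical placeholders.

```python
# Sketch of the illustration's reward: penalize the deviation of the
# electrical power from the mechanical power, heavily penalize
# leaving the stability domain.
P_m = 1.0          # assumed mechanical power (p.u.)

def in_stability_domain(delta, omega):
    """Hypothetical stability check on the state x = (delta, omega)."""
    return -1.0 < delta < 2.0 and -10.0 < omega < 10.0

def reward(x_next, p_e_next):
    """r = -|Pe_{t+1} - Pm| inside the stability domain, -100 otherwise."""
    delta, omega = x_next
    if in_stability_domain(delta, omega):
        return -abs(p_e_next - P_m)
    return -100.0
```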
Representation of d̂*(x, t) (h = 200), 10,000 four-tuples
[Figure: d̂*(x, t) in the (δ, ω) plane for t = 200−1, 200−5, 200−10, 200−20, 200−50, and 0; trajectory (δ, ω and P_e vs t(s)) when d̂*(x) = d̂*(x, 0) is used to control the system (h = 200, 10,000 four-tuples); d̂*(x, 0) computed from 100,000, 10,000, and 1,000 samples]
Finish
▶ The fitted Q iteration algorithm combined with ensembles of regression trees has been evaluated on several problems and has consistently given second-to-none performance.
▶ Why was this algorithm not proposed before?