Supervised learning based sequential decision making
Damien Ernst
Department of Electrical Engineering and Computer Science, University of Liège
ICOPI - October 19-21, 2005
Find slides: http://montefiore.ulg.ac.be/~ernst/
Problem addressed in this talk
How can we design algorithms able to extract good sequential decision-making policies from experience?
⇒ Discuss different algorithms exploiting batch-mode supervised learning.
⇒ Illustrate one of these algorithms, named fitted Q iteration, on an academic power-systems example.
NB.
▶ Many practical problems are concerned (medical applications, robot control, finance, ...)
▶ Most are related to complex and uncertain environments
▶ To simplify derivations, we restrict ourselves to the deterministic case
Batch-mode supervised learning
(Reminder)
▶ From a sample ls = (x, y)^N of N observations (inputs, outputs)
▶ Compute a model ŷ(x) = f_ls(x) (decision tree, MLP, ...)
[Figure: batch-mode supervised learning example — a table of observations with attributes a1, ..., a8 and outputs Y1, Y2, Y3, together with the decision tree induced from it]
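As a concrete companion to this reminder, here is a minimal Python sketch of batch-mode SL; the data, the model class, and the query point are illustrative assumptions, not taken from the talk.

```python
# Minimal sketch of batch-mode supervised learning: from a sample
# ls = (x, y)^N of N observations, fit a model y_hat(x) = f_ls(x).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))          # N = 200 inputs, 2 features
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)   # noisy outputs (assumed)

model = DecisionTreeRegressor(max_depth=4).fit(X, y)   # f_ls
y_hat = model.predict(np.array([[0.3, -0.2]]))         # y_hat(x) at a new x
```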
Learning a static decision policy
(Problem statement and solution)
Given state information x and a set of possible decisions d, learn an approximation of the policy d*(x) maximizing the reward r(x, d).
To learn, let's assume available a sample of N elements of the type (x, d, r).
Case 1: learn from a sample of optimal decisions (x, d*)^N:
⇒ direct application of SL to get d̂*(x).
Case 2: learn from a sample of random decisions (x, d, r)^N:
⇒ apply SL to get r̂(x, d) and compute d̂*(x) = arg max_d r̂(x, d).
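A hedged sketch of Case 2 in Python: the decision set, the reward function, and the choice of scikit-learn's ExtraTreesRegressor are assumptions made for illustration; only the recipe (regress r̂, then arg max over d) comes from the slide.

```python
# Case 2 sketch: learn r_hat(x, d) by regression on random (x, d, r)
# triples, then take d_hat*(x) = argmax_d r_hat(x, d).
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(1)
decisions = np.array([0.0, 0.5, 1.0])        # assumed finite decision set

X = rng.uniform(0.0, 1.0, size=(500, 1))     # states
D = rng.choice(decisions, size=(500, 1))     # random decisions
R = -np.abs(X - D).ravel()                   # assumed reward r(x, d)

r_hat = ExtraTreesRegressor(n_estimators=50, random_state=0)
r_hat.fit(np.hstack([X, D]), R)

def d_star_hat(x):
    """Approximate optimal decision: argmax over the finite decision set."""
    inputs = np.column_stack([np.full(len(decisions), x), decisions])
    return decisions[np.argmax(r_hat.predict(inputs))]

print(d_star_hat(0.42))   # should be close to 0.5 for this toy reward
```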
Learning a sequential decision policy
(Problem statement)
Given initial state information x_0, state dynamics x_{t+1} = f(x_t, d_t), and instantaneous reward r(x, d), find (d*_0, ..., d*_{h−1}) maximizing the cumulated discounted reward over h stages:
R(x_0, d_0, ..., d_{h−1}) = ∑_{t=0}^{h−1} γ^t r(x_t, d_t).
Different kinds of optimal policies:
Open loop: d*_t = d*(x_0, t) (OK in the deterministic case)
Closed loop: d*_t = d*(x_t, t) (Also OK in the stochastic case)
Stationary: d*_t = d*(x_t) (OK in the infinite-horizon case)
To learn, let's assume available a sample of N h-stage trajectories (x_0, d_0, r_0, x_1, d_1, r_1, ..., x_{h−1}, d_{h−1}, r_{h−1}, x_h)^N.
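For concreteness, a small sketch of how the cumulated discounted reward of one trajectory could be computed; the dynamics f and reward r used in the toy call are placeholders, not the talk's.

```python
# Sketch of the h-stage discounted return R(x0, d0, ..., d_{h-1}),
# rolled out with assumed dynamics f and reward r.
def cumulated_discounted_reward(x0, decisions, f, r, gamma):
    """Return sum_{t=0}^{h-1} gamma^t r(x_t, d_t), with x_{t+1} = f(x_t, d_t)."""
    x, total = x0, 0.0
    for t, d in enumerate(decisions):
        total += (gamma ** t) * r(x, d)
        x = f(x, d)
    return total

# Toy example (illustrative f and r, not from the talk):
f = lambda x, d: x + d
r = lambda x, d: -abs(x)
print(cumulated_discounted_reward(1.0, [-0.5, -0.5, 0.0], f, r, 0.98))
```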
Learning a sequential decision policy
(Case 1, Direct)
We first assume that the decisions shown in the sample are the optimal ones:
Learning an open-loop policy: can use the open-loop parts of the sample (x_0, d*_0, ..., d*_{h−1})^N
⇒ apply SL h times, ∀t = 0, ..., h−1, to construct d̂*(x_0, t) from (x_0, d*_t)^N.
Learning a closed-loop policy: should use the closed-loop parts of the sample (x_0, x_1, ..., x_{h−1}, d*_0, ..., d*_{h−1})^N
⇒ apply SL h times, ∀t = 0, ..., h−1, to construct d̂*(x_t, t) from (x_t, d*_t)^N.
Possible discussion: for the stochastic case, the closed-loop approach still holds.
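A possible rendering of the closed-loop variant of Case 1 in Python: one classifier per time step, trained on (x_t, d*_t) pairs. The synthetic trajectories and the rule generating the "optimal" decisions are invented for the example.

```python
# Direct (Case 1) closed-loop sketch: one supervised problem per time
# step t, mapping x_t to the optimal decision d_t*.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
N, h = 300, 5
X = rng.uniform(-1.0, 1.0, size=(N, h, 1))    # x_t^i for each trajectory i
D_star = (X[:, :, 0] > 0).astype(int)         # assumed optimal decisions d_t*

# Apply SL h times: one classifier per stage t.
policies = [DecisionTreeClassifier().fit(X[:, t, :], D_star[:, t])
            for t in range(h)]

def d_star_hat(x, t):
    """Approximate closed-loop policy d_hat*(x_t, t)."""
    return policies[t].predict(np.array([[x]]))[0]
```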
Learning a sequential decision policy
(Case 2, Naive approach)
Problem: how to generalize this if the decisions shown in the sample are random (i.e. not necessarily the optimal ones)?
Brute-force approach: one could use the open-loop parts of the sample (x_0, d_0, ..., d_{h−1}, R)^N
⇒ apply SL to construct R̂(x_0, d_0, ..., d_{h−1})
⇒ compute (d_0, ..., d_{h−1})*(x_0) = arg max_{d_0,...,d_{h−1}} R̂(x_0, d_0, ..., d_{h−1}).
Possible discussion: computational complexity of the arg max; sample complexity; what if the system is stochastic...
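A sketch of this brute-force approach, mainly to make the computational-complexity objection concrete: the arg max enumerates all |D|^h decision sequences. The data and returns below are placeholders.

```python
# Brute-force sketch: regress R_hat on (x0, d0, ..., d_{h-1}) and
# argmax by enumerating every decision sequence (|D|^h of them).
import itertools
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(3)
h, decisions = 3, [0, 1]                      # assumed binary decisions
X0 = rng.uniform(-1, 1, size=(500, 1))
D = rng.choice(decisions, size=(500, h))
R = -np.abs(X0.ravel() - D.mean(axis=1))      # placeholder returns

R_hat = ExtraTreesRegressor(n_estimators=50, random_state=0)
R_hat.fit(np.hstack([X0, D]), R)

def best_sequence(x0):
    """Exhaustive argmax over decision sequences: exponential in h."""
    seqs = np.array(list(itertools.product(decisions, repeat=h)))
    inputs = np.hstack([np.full((len(seqs), 1), x0), seqs])
    return seqs[np.argmax(R_hat.predict(inputs))]
```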
Learning a sequential decision policy
(Case 2, Classical approach)
Model-based approach:
▶ Exploit the sample of state transitions (x_t^i, d_t^i, x_{t+1}^i)^{N×h}
⇒ use SL to build a model of the dynamics f̂(x, d).
▶ Exploit the sample of instantaneous rewards (x_t^i, d_t^i, r_t^i)^{N×h}
⇒ use SL to build a model of the reward function r̂(x, d).
▶ Use dynamic programming algorithms (e.g. value iteration) to compute from f̂(x, d) and r̂(x, d) the optimal policy d̂*(x, t).
Possible discussion: can be modified to work in the stochastic case; makes good use of the sample information; exploits the stationary problem structure; has the computational complexity of DP and is therefore problematic in continuous or high-dimensional state spaces...
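One way the model-based approach might look in code, under strong simplifying assumptions (1-D state, finite decisions, value iteration on a discretized grid with a coarse lookup); f̂ and r̂ are fitted with extra-trees as elsewhere in the talk, but everything else is illustrative.

```python
# Model-based sketch: fit f_hat and r_hat from the transition sample,
# then run value iteration on a discretized state grid.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(4)
decisions = np.array([0.0, 1.0])
X = rng.uniform(0, 1, size=(1000, 1))
D = rng.choice(decisions, size=(1000, 1))
X_next = np.clip(X + 0.1 * (D - 0.5), 0, 1)   # placeholder dynamics
R = -np.abs(X - 0.5).ravel()                  # placeholder rewards

XD = np.hstack([X, D])
f_hat = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(XD, X_next.ravel())
r_hat = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(XD, R)

# Value iteration on a grid (coarse lookup of the successor state).
grid, gamma = np.linspace(0, 1, 101), 0.98
V = np.zeros(len(grid))
for _ in range(200):
    Q = np.empty((len(grid), len(decisions)))
    for j, d in enumerate(decisions):
        xd = np.column_stack([grid, np.full(len(grid), d)])
        idx = np.clip(np.searchsorted(grid, f_hat.predict(xd)), 0, len(grid) - 1)
        Q[:, j] = r_hat.predict(xd) + gamma * V[idx]
    V = Q.max(axis=1)
d_star_hat = decisions[Q.argmax(axis=1)]      # stationary policy on the grid
```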
Learning a sequential decision policy
(Case 2, Proposed approach)
Interlacing batch-mode SL and backward value iteration:
▶ Assume we are in a certain state x at time h−1 (last stage):
⇒ define Q_1(x, d) ≡ r(x, d)
⇒ optimal decision: d*(x, h−1) = arg max_d Q_1(x, d)
⇒ SL on (x_t^i, d_t^i, r_t^i)^{N×h} to compute Q̂_1(x, d), then arg max to get d̂*(x, h−1)
▶ Assume we are in state x at time h−2:
⇒ define Q_2(x, d) = r(x, d) + γ max_{d′} Q_1(f(x, d), d′)
⇒ optimal decision: d*(x, h−2) = arg max_d Q_2(x, d)
⇒ SL on (x_t^i, d_t^i, r_t^i + γ max_{d′} Q̂_1(x_{t+1}^i, d′))^{N×h} to get Q̂_2(x, d), then arg max to get d̂*(x, h−2)
▶ Continue h−2 further times to yield the sequence of Q̂_i-functions and policy approximations d̂*(x, h−i), ∀i = 1, ..., h.
▶ This algorithm is called "fitted Q iteration".
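A minimal sketch of fitted Q iteration on a set of four-tuples, assuming a finite decision set and using an ensemble of extremely randomized trees (ExtraTreesRegressor) as the supervised learner; the four-tuples themselves are synthetic, and the iteration count is fixed in the spirit of the infinite-horizon remark on the next slide.

```python
# Fitted Q iteration sketch on four-tuples (x_t, d_t, r_t, x_{t+1}).
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(5)
decisions = np.array([0.0, 1.0])              # assumed finite decision set
gamma, n_iterations = 0.98, 50

# Synthetic four-tuples (x, d, r, x_next) in a 1-D toy system.
X = rng.uniform(-1, 1, size=(2000, 1))
D = rng.choice(decisions, size=(2000, 1))
X_next = np.clip(X + 0.1 * (2 * D - 1), -1, 1)
R = -np.abs(X_next).ravel()

XD = np.hstack([X, D])
Q_hat = None
for _ in range(n_iterations):
    if Q_hat is None:
        targets = R                            # Q_1(x, d) = r(x, d)
    else:
        # Q_i(x, d) = r(x, d) + gamma * max_{d'} Q_hat_{i-1}(x', d')
        q_next = np.column_stack([
            Q_hat.predict(np.hstack([X_next, np.full_like(D, d)]))
            for d in decisions])
        targets = R + gamma * q_next.max(axis=1)
    Q_hat = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(XD, targets)

def d_star_hat(x):
    """Greedy policy from the last fitted Q-function."""
    q = [Q_hat.predict(np.array([[x, d]]))[0] for d in decisions]
    return decisions[int(np.argmax(q))]
```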
Fitted Q iteration: discussion
▶ All the algorithm actually needs in order to work is a sample of four-tuples (x_t^i, d_t^i, r_t^i, x_{t+1}^i)^N and a good supervised learning algorithm (for least-squares regression).
▶ It has been shown to give excellent results when using ensembles of regression trees as the supervised learning algorithm.
▶ It can cope (without modification) with stochastic problems.
▶ It can exploit very large samples efficiently.
▶ It has already proved to work well on several complex continuous-state-space problems.
▶ Iterated sufficiently many times, it yields an approximation of the optimal (closed-loop, stationary) infinite-horizon decision policy.
Illustration: power system control
[Figure: one-machine infinite-bus system — a generator (E∠δ) connected to an infinite bus (V∠0) through the system reactance jX_system and a variable reactance ju, with electrical power P_e; plots of ω vs δ and of P_m, P_e vs t(s)]
The sequential decision problem
▶ Two state variables: δ and ω.
▶ Time discretization: the time between t and t+1 is equal to 50 ms.
▶ Two possible decisions d: capacitance set to zero or capacitance set to its maximal value.
▶ r(x_t, d_t) = −|P_e^{t+1} − P_m| if x_{t+1} ∈ the stability domain, and −100 otherwise.
▶ γ = 0.98
▶ Sequential decision problem such that the optimal stationary ...
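A hedged sketch of this reward function; the stability-domain test and its numeric bounds (taken loosely from the plot axes) are hypothetical placeholders.

```python
# Sketch of the illustration's reward: penalize the deviation of the
# electrical power from the mechanical power, heavily penalize
# leaving the stability domain.
P_m = 1.0          # assumed mechanical power (p.u.)

def in_stability_domain(delta, omega):
    """Hypothetical stability check on the state x = (delta, omega)."""
    return -1.0 < delta < 2.0 and -10.0 < omega < 10.0

def reward(x_next, p_e_next):
    """r = -|Pe_{t+1} - Pm| inside the stability domain, -100 otherwise."""
    delta, omega = x_next
    if in_stability_domain(delta, omega):
        return -abs(p_e_next - P_m)
    return -100.0
```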
Representation of d̂*(x, t) (h = 200), 10,000 four-tuples
[Figure: d̂*(x, t) in the (δ, ω) plane for t = 200−1, 200−5, 200−10, 200−20, 200−50, and 0; trajectory (δ, ω and P_e vs t(s)) when d̂*(x) = d̂*(x, 0) is used to control the system (h = 200, 10,000 four-tuples); d̂*(x, 0) computed from 100,000, 10,000, and 1,000 samples]
Finish
▶ The fitted Q iteration algorithm combined with ensembles of regression trees has been evaluated on several problems and has consistently given second-to-none performance.
▶ Why was this algorithm not proposed before?