
(1)

Supervised learning based sequential decision making

Damien Ernst

Department of Electrical Engineering and Computer Science, University of Liège

ICOPI - October 19-21, 2005

Find slides: http://montefiore.ulg.ac.be/~ernst/

(2)

Problem addressed in this talk

How can we design algorithms able to extract good sequential decision-making policies from experience?

⇒ Discuss different algorithms exploiting batch-mode supervised learning.

⇒ Illustrate one of these algorithms, named fitted Q iteration, on an academic power systems example.

NB.
• Many practical problems are concerned (medical applications, robot control, finance, ...)
• Most are related to complex and uncertain environments
• To simplify derivations, we restrict to the deterministic case

(3)

Batch-mode supervised learning

(Reminder)

• From a sample ls = (x, y)^N of N observations (inputs, outputs)

• Compute a model ŷ(x) = f_ls(x) (decision tree, MLP, ...)

[Figure: batch-mode supervised learning illustrated by a decision tree built from a table of observations (attributes a1, ..., a8; outputs Y1, Y2, Y3).]
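As a concrete illustration (not part of the original slides), here is a minimal batch-mode supervised learning sketch in Python; it assumes scikit-learn is available and uses an ensemble of regression trees as the learner. The sample, the target function and all names are made up for the example.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Hypothetical batch-mode sample ls = (x, y)^N: N observations of a noisy 2-D function.
rng = np.random.default_rng(0)
N = 1000
X = rng.uniform(-1.0, 1.0, size=(N, 2))                          # inputs x
y = np.sin(3 * X[:, 0]) * X[:, 1] + 0.05 * rng.normal(size=N)    # outputs y

# Batch-mode SL: compute a model y_hat(x) = f_ls(x) from the whole sample at once.
f_ls = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, y)

# The learned model can then be queried on unseen inputs.
x_new = np.array([[0.2, -0.4]])
print("y_hat(x_new) =", f_ls.predict(x_new)[0])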


(4-5)

Learning a static decision policy

(Problem statement and solution)

Given state information x and a set of possible decisions d, learn an approximation of the policy d∗(x) maximizing the reward r(x, d).

To learn, let's assume available a sample of N elements of the type (x, d, r).

Case 1: learn from a sample of optimal decisions (x, d∗)^N:
⇒ direct application of SL to get d̂∗(x).

Case 2: learn from a sample of random decisions (x, d, r)^N:
⇒ apply SL to get r̂(x, d) and compute d̂∗(x) = arg max_d r̂(x, d).
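A minimal sketch of Case 2 (function names and the finite decision set are assumptions, not from the slides): learn r̂(x, d) by batch-mode regression on the (x, d, r) sample, then take the arg max over the candidate decisions.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def learn_static_policy(X, D, R, decision_set):
    # Fit r_hat(x, d) from the sample (x, d, r)^N, with decisions assumed scalar.
    r_hat = ExtraTreesRegressor(n_estimators=100, random_state=0)
    r_hat.fit(np.column_stack([X, D]), R)

    def d_hat_star(x):
        # d_hat*(x) = arg max_d r_hat(x, d) over the finite decision set.
        candidates = np.array([np.concatenate([x, [d]]) for d in decision_set])
        return decision_set[int(np.argmax(r_hat.predict(candidates)))]

    return d_hat_star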


(6-7)

Learning a sequential decision policy

(Problem statement)

Given initial state information x_0, state dynamics x_{t+1} = f(x_t, d_t), and instantaneous reward r(x, d), find (d∗_0, ..., d∗_{h−1}) maximizing the cumulated discounted reward over h stages

R(x_0, d_0, ..., d_{h−1}) = Σ_{t=0}^{h−1} γ^t r(x_t, d_t).

Different kinds of optimal policies:

Open loop: d∗_t = d∗(x_0, t) (OK in the deterministic case)

Closed loop: d∗_t = d∗(x_t, t) (also OK in the stochastic case)

Stationary: d∗_t = d∗(x_t) (OK in the infinite-horizon case)

To learn, let's assume available a sample of N h-stage trajectories (x_0, d_0, r_0, x_1, d_1, r_1, ..., x_{h−1}, d_{h−1}, r_{h−1}, x_h)^N.
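To make the objective concrete, here is a tiny helper (purely illustrative, not from the slides) that evaluates the h-stage discounted return of a decision sequence under known deterministic dynamics f and reward r:

def discounted_return(x0, decisions, f, r, gamma):
    # R(x0, d_0, ..., d_{h-1}) = sum_{t=0}^{h-1} gamma^t * r(x_t, d_t), deterministic dynamics.
    x, total = x0, 0.0
    for t, d in enumerate(decisions):
        total += (gamma ** t) * r(x, d)
        x = f(x, d)  # x_{t+1} = f(x_t, d_t)
    return total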


(8-10)

Learning a sequential decision policy

(Case 1, Direct)

We first assume that the decisions shown in the sample are the optimal ones:

Learning an open-loop policy: can use the open-loop parts of the sample (x_0, d∗_0, ..., d∗_{h−1})^N
⇒ apply SL h times, ∀t = 0, ..., h−1, to construct d̂∗(x_0, t) from (x_0, d∗_t)^N.

Learning a closed-loop policy: should use the closed-loop parts of the sample (x_0, x_1, ..., x_{h−1}, d∗_0, ..., d∗_{h−1})^N
⇒ apply SL h times, ∀t = 0, ..., h−1, to construct d̂∗(x_t, t) from (x_t, d∗_t)^N.

Possible discussion: in the stochastic case, the closed-loop approach remains valid.
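One possible rendering of the closed-loop variant of Case 1 (all names are assumptions, not from the slides): fit one classifier per stage t, mapping the observed states x_t to the optimal decisions d∗_t.

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def learn_closed_loop_policy(states, optimal_decisions, h):
    # states[i][t] is x_t of trajectory i; optimal_decisions[i][t] is d*_t of trajectory i.
    # Returns a list of h models; models[t].predict([x]) approximates d*(x, t).
    models = []
    for t in range(h):
        X_t = np.array([traj[t] for traj in states])
        D_t = np.array([traj[t] for traj in optimal_decisions])
        models.append(ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_t, D_t))
    return models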


(11-12)

Learning a sequential decision policy

(Case 2, Naive approach)

Problem: how can we generalize this if the decisions shown in the sample are random (i.e., not necessarily the optimal ones)?

Brute-force approach: one could use the open-loop parts of the sample (x_0, d_0, ..., d_{h−1}, R)^N
⇒ apply SL to construct R̂(x_0, d_0, ..., d_{h−1})
⇒ compute (d_0, ..., d_{h−1})∗(x_0) = arg max_{d_0, ..., d_{h−1}} R̂(x_0, d_0, ..., d_{h−1}).

Possible discussion: computational complexity of the arg max; sample complexity; what if the system is stochastic...
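A sketch of the brute-force approach under the assumption of a small finite decision set (names are illustrative); the exhaustive arg max enumerates |D|^h decision sequences, which is exactly the computational-complexity issue raised above.

import itertools
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def brute_force_policy(X0, D_sequences, R_returns, decision_set, h):
    # Fit R_hat(x0, d_0, ..., d_{h-1}) on rows (x0, decision sequence) -> observed return.
    R_hat = ExtraTreesRegressor(n_estimators=100, random_state=0)
    R_hat.fit(np.column_stack([X0, D_sequences]), R_returns)

    def best_sequence(x0):
        best, best_value = None, -np.inf
        for seq in itertools.product(decision_set, repeat=h):  # |D|^h candidate sequences
            value = R_hat.predict(np.concatenate([x0, seq]).reshape(1, -1))[0]
            if value > best_value:
                best, best_value = seq, value
        return best

    return best_sequence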


(13-14)

Learning a sequential decision policy

(Case 2, Classical approach)

Model-based approach:

• Exploit the sample of state transitions (x_t^i, d_t^i, x_{t+1}^i)^{N×h}
⇒ use SL to build a model of the dynamics f̂(x, d).

• Exploit the sample of instantaneous rewards (x_t^i, d_t^i, r_t^i)^{N×h}
⇒ use SL to build a model of the reward function r̂(x, d).

• Use dynamic programming algorithms (e.g. value iteration) to compute from f̂(x, d) and r̂(x, d) the optimal policy d̂∗(x, t).

Possible discussion: can be modified to work in the stochastic case; makes good use of the sample information; exploits the stationary structure of the problem; has the computational complexity of DP, and is therefore problematic in continuous or high-dimensional state spaces...
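A crude sketch of this model-based route (assuming scalar decisions, vector states and a user-supplied grid of representative states; none of these names come from the slides): learn f̂ and r̂ by regression, then run finite-horizon value iteration on the grid. The loop over grid states and decisions hints at why the DP step becomes problematic in continuous or high-dimensional state spaces.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def model_based_dp(X, D, R, X_next, state_grid, decision_set, h, gamma):
    XD = np.column_stack([X, D])
    f_hat = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(XD, X_next)  # dynamics model
    r_hat = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(XD, R)       # reward model

    def nearest(x):
        # Map a predicted next state back onto the finite grid (crude discretization).
        return int(np.argmin(np.linalg.norm(state_grid - x, axis=1)))

    V = np.zeros(len(state_grid))   # value-to-go with 0 stages left
    policy = []                     # policy[t][s]: index in decision_set of d_hat*(state_grid[s], t)
    for _ in range(h):              # backward value iteration on the learned models
        Q = np.empty((len(state_grid), len(decision_set)))
        for j, d in enumerate(decision_set):
            XD_grid = np.column_stack([state_grid, np.full(len(state_grid), d)])
            Q[:, j] = r_hat.predict(XD_grid) + gamma * V[[nearest(x) for x in f_hat.predict(XD_grid)]]
        V = Q.max(axis=1)
        policy.insert(0, Q.argmax(axis=1))
    return policy, V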


(15-18)

Learning a sequential decision policy

(Case 2, Proposed approach)

Interlacing batch-mode SL and backward value iteration:

• Assume we are in a certain state x at time h − 1 (the last stage):
⇒ define Q_1(x, d) ≡ r(x, d)
⇒ optimal decision: d∗(x, h − 1) = arg max_d Q_1(x, d)
⇒ SL on (x_t^i, d_t^i, r_t^i)^{N×h} to compute Q̂_1(x, d), + arg max...

• Assume we are in state x at time h − 2:
⇒ define Q_2(x, d) = r(x, d) + γ max_{d′} Q_1(f(x, d), d′)
⇒ optimal decision to take: d∗(x, h − 2) = arg max_d Q_2(x, d)
⇒ SL on (x_t^i, d_t^i, r_t^i + γ max_{d′} Q̂_1(x_{t+1}^i, d′))^{N×h} ⇒ Q̂_2(x, d), + arg max ⇒ d̂∗(x, h − 2)

• Continue h − 2 further times to yield the sequence of Q̂_i-functions and the policy approximations d̂∗(x, h − i), ∀i = 1, ..., h.

• This algorithm is called "Fitted Q iteration".
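The following sketch (scikit-learn assumed; variable names and the finite decision set are my own, not from the slides) implements this fitted Q iteration loop on a sample of four-tuples (x_t, d_t, r_t, x_{t+1}), using an ensemble of extremely randomized trees as the supervised learner.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(X, D, R, X_next, decision_set, n_iterations, gamma):
    # X, D, R, X_next hold the four-tuples (x_t, d_t, r_t, x_{t+1}); decisions are assumed scalar.
    XD = np.column_stack([X, D])
    targets = R.copy()                                   # iteration 1: Q_1(x, d) = r(x, d)
    Q_hat = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(XD, targets)

    for _ in range(1, n_iterations):
        # Q_{i+1} regression targets: r_t + gamma * max_{d'} Q_hat_i(x_{t+1}, d')
        next_values = np.column_stack([
            Q_hat.predict(np.column_stack([X_next, np.full(len(X_next), d)]))
            for d in decision_set
        ]).max(axis=1)
        targets = R + gamma * next_values
        Q_hat = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(XD, targets)
    return Q_hat

def greedy_decision(Q_hat, x, decision_set):
    # d_hat*(x) = arg max over the finite decision set of Q_hat(x, d).
    q_values = [Q_hat.predict(np.concatenate([x, [d]]).reshape(1, -1))[0] for d in decision_set]
    return decision_set[int(np.argmax(q_values))]

Run with a large enough n_iterations, Q̂ approaches the infinite-horizon Q-function and greedy_decision then gives the stationary policy discussed on the next slide.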

(19)

Fitted Q iteration: discussion

• All the algorithm actually needs in order to work is a sample of four-tuples (x_t^i, d_t^i, r_t^i, x_{t+1}^i)^N and a good supervised learning algorithm (for least-squares regression).

• It has been shown to give excellent results when using ensembles of regression trees as the supervised learning algorithm.

• It can cope (without modification) with stochastic problems.

• It can exploit very large samples efficiently.

• It has already proved to work well on several complex continuous-state-space problems.

• Iterated sufficiently many times, it yields an approximation of the optimal (closed-loop, stationary) infinite-horizon decision policy.

(20)

Illustration: power system control

[Figure: generator connected to an infinite bus (internal voltage E∠δ, bus voltage V∠0) through the system reactance jX_system and a variable reactance ju, with the electrical power Pe indicated; accompanying plots show ω versus δ and Pm, Pe versus t(s).]

(21)

The sequential decision problem

• Two state variables: δ and ω.

• Time discretization: the time between t and t + 1 is equal to 50 ms.

• Two possible decisions d: capacitance set to zero or capacitance set to its maximal value.

• r(x_t, d_t) = −|Pe_{t+1} − Pm| if x_{t+1} ∈ stability domain, and −100 otherwise (transcribed in the sketch after this list).

• γ = 0.98

• Sequential decision problem such that the optimal stationary
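The reward definition above can be transcribed directly (the stability-domain test is left abstract, since the slides do not specify it; all names are illustrative):

def reward(x_next, pe_next, pm, in_stability_domain):
    # r(x_t, d_t) = -|Pe_{t+1} - Pm| if x_{t+1} lies in the stability domain, -100 otherwise.
    return -abs(pe_next - pm) if in_stability_domain(x_next) else -100.0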

(22)

Representation of d̂∗(x, t) (h = 200), 10,000 four-tuples

[Figure: six ω–δ maps showing d̂∗(x, 200 − 1), d̂∗(x, 200 − 5), d̂∗(x, 200 − 10), d̂∗(x, 200 − 20), d̂∗(x, 200 − 50) and d̂∗(x, 0).]

(23)

[Figure: d̂∗(x) = d̂(x, 0) used to control the system (h = 200, 10,000 samples): trajectories of δ, ω and Pe versus t(s), plus ω–δ maps of d̂∗(x, 0) learned from 100,000, 10,000 and 1,000 samples.]

(24)

Finish

• The fitted Q iteration algorithm combined with ensembles of regression trees has been evaluated on several problems and consistently gave second-to-none performances.

• Why has this algorithm not been proposed before?
