“Actor-Critic” Reinforcement learning

GDR Robotique & Neurosciences

Alain Dutech

Equipe MAIA - LORIA - INRIA Nancy, France

Web: http://maia.loria.fr

Mail: [email protected]

Paris - 06/09/2012


Learn to do


Outline

- Context: RL and direct policy search

- Direct Gradient: No Critic

- eNAC: Compatible Critic

- EM: MC Critic

- Summary, discussion


Actor-Critic architecture

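[Figure: block diagram linking the Environment, the Actor and the Critic]

Sequence of states × actions:

$$x_0 \xrightarrow{\pi} u_0 \Rightarrow r_1, x_1 \xrightarrow{\pi} u_1 \Rightarrow \cdots\, u_{T-1} \Rightarrow r_T, x_T$$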


Principles of “Direct Policy Search”

- Policy $\pi$ is parametrized (by $w$)

- Every policy can, theoretically, be valued ($V$ or $J$ or ...)

- Learn/Adapt: modify the parameters to increase the value, but

- How to modify the parameters? (Gradient or EM)

- When is a critic needed?

- Beware of large sensorimotor spaces, local minima, ...


Common language

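Value function.

$$V^T_\gamma(x; w) = E_{(r_t)}\left[\sum_{t=1}^{T} \gamma^t r_t \,\middle|\, x_0 = x, \pi\right] \tag{1}$$

Objective function.

$$J^T_\gamma(w) = E_x\left[V^T_\gamma(x)\right] \tag{2}$$

$$\bar{J}(w) = \lim_{T\to\infty} E\left[\frac{1}{T}\sum_{t=1}^{T} r_t\right] \tag{3}$$

Linked:

$$\bar{J}(w) = \mu^\top (I - \gamma P)\, J_\gamma \tag{4}$$

with $\mu$ the stationary distribution and $P$ the transition probabilities.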
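The link (4) between the average and the discounted objective can be checked numerically. The sketch below uses a hypothetical 3-state Markov chain (the numbers are illustrative, not from the talk) and assumes the convention $J_\gamma = r + \gamma P J_\gamma$, i.e. the first reward is not discounted.

```python
import numpy as np

# Hypothetical 3-state Markov chain induced by a fixed policy (illustrative numbers).
P = np.array([[0.1, 0.9, 0.0],
              [0.0, 0.6, 0.4],
              [0.5, 0.0, 0.5]])
r = np.array([0.0, 1.0, 5.0])   # reward attached to each state (assumption)
gamma = 0.9

# Discounted value vector, solution of J_gamma = r + gamma * P @ J_gamma.
J_gamma = np.linalg.solve(np.eye(3) - gamma * P, r)

# Stationary distribution mu: left eigenvector of P for eigenvalue 1, normalised.
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
mu = mu / mu.sum()

J_bar = mu @ r                                    # average reward
linked = mu @ (np.eye(3) - gamma * P) @ J_gamma   # right-hand side of (4)
print(J_bar, linked)
```

Both printed values should agree up to rounding, since $(I - \gamma P) J_\gamma = r$ under this convention.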


Direct policy gradient

[Kimura et al., 1997], [Baxter and Bartlett, 2001]

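Gradient of the objective function (also with $J^T_\gamma$).

$$\nabla_w \bar{J} = r(x)\, \frac{\nabla_w \mu(x; \pi, w)}{\mu(x; \pi, w)} \tag{5}$$

Unbiased estimation if regenerative process/episode

- Each step: $z_{t+1} = z_t + \dfrac{\nabla_w \pi(x_t, u_t, w)}{\pi(x_t, u_t, w)}$, $v_{t+1} = v_t + r_t$

- End of episode: $\Delta_{j+1} = \Delta_j + v_t z_t / \text{length}$

- return $\Delta_N / N$

Biased on-line estimate

- Each step: $z_{t+1} = \beta z_t + \dfrac{\nabla_w \pi(x_t, u_t, w)}{\pi(x_t, u_t, w)}$, $w_{t+1} = w_t + \alpha_t\, r_t\, z_t$

- return $w_T$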
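The two estimators above can be sketched as follows. This is a minimal illustration, assuming a tabular softmax policy and a hypothetical two-state toy environment (`step`) that is not part of the talk; the step size `alpha` is kept constant for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy(w, x):
    """Softmax policy over 2 actions; w has shape (n_states, 2)."""
    e = np.exp(w[x] - w[x].max())
    return e / e.sum()

def score(w, x, u):
    """grad_w pi / pi (= grad_w log pi) for the softmax policy."""
    g = np.zeros_like(w)
    g[x] = -policy(w, x)
    g[x, u] += 1.0
    return g

def step(x, u):
    """Hypothetical dynamics: action 1 usually leads to state 1, which pays 1."""
    x_next = int(rng.random() < (0.9 if u == 1 else 0.1))
    return x_next, float(x_next == 1)

def episodic_gradient(w, n_episodes=200, length=10):
    """Unbiased variant: accumulate z and v per episode, average over episodes."""
    delta = np.zeros_like(w)
    for _ in range(n_episodes):
        x, z, v = 0, np.zeros_like(w), 0.0
        for _ in range(length):
            u = rng.choice(2, p=policy(w, x))
            z += score(w, x, u)
            x, r = step(x, u)
            v += r
        delta += v * z / length
    return delta / n_episodes

def online_update(w, beta=0.9, alpha=0.05, n_steps=3000):
    """Biased on-line variant: discounted trace z, parameters updated every step."""
    w, x, z = w.copy(), 0, np.zeros_like(w)
    for _ in range(n_steps):
        u = rng.choice(2, p=policy(w, x))
        z = beta * z + score(w, x, u)
        x, r = step(x, u)
        w += alpha * r * z
    return w

w0 = np.zeros((2, 2))
grad = episodic_gradient(w0)
w_final = online_update(w0)
print(grad, policy(w_final, 0), policy(w_final, 1))
```

In this toy problem the on-line rule should end up preferring action 1 in both states; the episodic estimate is noisy because no baseline is subtracted.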


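Why a biased estimate? ($\mu^\top \nabla_w (J_\beta) \neq \nabla_w \mu^\top J_\beta$)

$$\nabla_w \bar{J} = (1-\beta)\, \nabla_w(\mu^\top)\, J_\beta + \beta\, \mu^\top \nabla_w(P)\, J_\beta \tag{6}$$

Left term vanishes

$$\lim_{\beta\to 1} (1-\beta)\, \nabla_w(\mu^\top)\, J_\beta = 0 \tag{7}$$

but right term can have large variance when $\beta \to 1$.

Convergence of estimate to

$$(1-\beta)\, \mu^\top \nabla_w(J_\beta) = \beta\, \mu^\top \nabla_w(P)\, J_\beta \tag{8}$$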
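Identities (6) and (8) can be verified by finite differences on a small example. The sketch below assumes a hypothetical 2-state chain whose transition matrix depends on a single scalar parameter $w$, a reward vector that does not depend on $w$, and the convention $J_\beta = r + \beta P J_\beta$.

```python
import numpy as np

beta, w, eps = 0.9, 0.3, 1e-6
r = np.array([0.0, 1.0])          # reward per state, independent of w (assumption)

def chain(w):
    """Hypothetical parametrized chain: p = sigmoid(w) drives both rows."""
    p = 1.0 / (1.0 + np.exp(-w))
    return np.array([[1.0 - p, p],
                     [p / 2.0, 1.0 - p / 2.0]])

def J_beta(w):
    """Discounted values, solution of J = r + beta * P(w) @ J."""
    return np.linalg.solve(np.eye(2) - beta * chain(w), r)

def stationary(w):
    """Stationary distribution of P(w)."""
    vals, vecs = np.linalg.eig(chain(w).T)
    mu = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return mu / mu.sum()

def central_diff(f):
    """Central finite difference of f at w."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

mu = stationary(w)
dP, dJ, dmu = central_diff(chain), central_diff(J_beta), central_diff(stationary)

lhs8 = (1 - beta) * mu @ dJ                     # left-hand side of (8)
rhs8 = beta * mu @ dP @ J_beta(w)               # right-hand side of (8)
grad_avg = central_diff(lambda v: stationary(v) @ r)   # gradient of the average reward
rhs6 = (1 - beta) * dmu @ J_beta(w) + rhs8      # right-hand side of (6)
print(lhs8, rhs8, grad_avg, rhs6)
```

The gap between `grad_avg` and `rhs8` is the bias term $(1-\beta)\nabla_w(\mu^\top) J_\beta$ of the on-line estimate.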


No critic

- direct estimation of the gradient

- simple algorithms

- application to POMDPs, NN versions

- needs lots of samples

- prone to local optima


eNAC [Peters and Schaal, 2008]

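Using a baseline function for the gradient

$$\nabla_w \bar{J}(w) = E_{x\sim\mu,\, u\sim\pi}\left[\nabla_w \log \pi(x,u;w)\, \left[Q^\pi(x,u) - b(x)\right]\right] \tag{9}$$

Would like to use the Natural Gradient $\tilde{\nabla} \bar{J}$

- guaranteed convergence to a local optimum

- faster convergence, and can sometimes prevent premature convergence

- independent of the parametrization of the policy

- fewer samples needed for a correct evaluation

$$\tilde{\nabla} \bar{J} = \Big(\underbrace{E_{x\sim\mu,\, u\sim\pi}\left[\nabla_w \log \pi(x,u)\, \nabla_w \log \pi(x,u)^\top\right]}_{\text{Fisher's Information Matrix}}\Big)^{-1} \nabla_w \bar{J} \tag{10}$$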
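As an illustration of (9) and (10), the sketch below computes the vanilla gradient, the Fisher information matrix and the natural gradient exactly for a hypothetical single-state problem with three actions. The features `phi` and the action values `q` are made-up numbers, and the baseline is chosen as $b(x) = V^\pi(x)$.

```python
import numpy as np

# Hypothetical single-state problem: 3 actions, 2-parameter softmax policy.
phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])        # feature vector of each action (illustrative)
q = np.array([1.0, 2.0, 0.5])       # assumed known Q(x, u) for each action
w = np.array([0.2, -0.1])

def pi(w):
    """Softmax policy pi(u; w) proportional to exp(w^T phi(u))."""
    logits = phi @ w
    e = np.exp(logits - logits.max())
    return e / e.sum()

def scores(w):
    """Rows are grad_w log pi(u; w) = phi(u) - E_pi[phi]."""
    return phi - pi(w) @ phi

p, psi = pi(w), scores(w)
baseline = p @ q                              # b(x) = V(x)
vanilla = (p * (q - baseline)) @ psi          # equation (9), exact expectation
fisher = (psi * p[:, None]).T @ psi           # E[score score^T]
natural = np.linalg.solve(fisher, vanilla)    # equation (10)
print(vanilla, natural)
```

Because the expected score is zero, replacing `q - baseline` by `q` leaves `vanilla` unchanged; the baseline only matters for the variance of sampled estimates.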


“Easy” computation of natural gradient

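Compatible Q-value approximation: $\nabla_\theta \hat{Q}^\pi_\theta(x,u) = \nabla_w \log \pi(x,u)$ (for example $\hat{Q}^\pi_\theta(x,u) = \theta^\top \nabla_w \log \pi(x,u)$).

Then, the best solution $\theta^*$ for the advantage function

$$\theta^* = \arg\min_\theta\; E_{x\sim\mu,\, u\sim\pi}\left[\left(Q^\pi(x,u) - V^\pi(x) - \hat{Q}^\pi_\theta(x,u)\right)^2\right] \tag{11}$$

is the natural gradient: $\tilde{\nabla}_w \bar{J}(w) = \theta^*$

(independent of $b(\cdot)$; $V^\pi(x)$ minimizes the variance once $\theta^*$ is found)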
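The claim that the regression solution of (11) coincides with the natural gradient can be checked on a small case. The sketch below assumes a hypothetical single-state problem with a 2-parameter softmax policy (made-up features and action values) and solves the weighted least-squares problem exactly.

```python
import numpy as np

# Hypothetical single-state problem: 3 actions, 2-parameter softmax policy.
phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])        # action features (illustrative)
q = np.array([1.0, 2.0, 0.5])       # assumed known Q(x, u)
w = np.array([0.2, -0.1])

logits = phi @ w
p = np.exp(logits - logits.max())
p /= p.sum()                        # pi(u; w)
psi = phi - p @ phi                 # compatible features grad_w log pi(u)
adv = q - p @ q                     # advantage Q - V

# Equation (11): min_theta sum_u pi(u) (adv(u) - theta^T psi(u))^2,
# written as an ordinary least-squares problem with sqrt(pi) row weights.
sw = np.sqrt(p)
theta, *_ = np.linalg.lstsq(psi * sw[:, None], adv * sw, rcond=None)

# Natural gradient computed directly: Fisher^{-1} times the vanilla gradient.
fisher = (psi * p[:, None]).T @ psi
vanilla = (p * adv) @ psi
natural = np.linalg.solve(fisher, vanilla)
print(theta, natural)
```

The normal equations of the weighted regression read `fisher @ theta = vanilla`, which is why both vectors match.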


Episodic estimation of the advantage function

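Advantage function: $A^\pi(x,u) + V^\pi(x) = r(x,u) + \gamma\, E\left[V^\pi(x')\right]$

Then, along one trajectory

$$\sum_{t=0}^{N-1} \gamma^t A^\pi(x_t,u_t) = -V^\pi(x_0) + \sum_{t=0}^{N-1} \gamma^t r(x_t,u_t) + \gamma^N V^\pi(x_N)$$

So, if there are enough trajectories (compared to the size of $\theta$),

$$\sum_{t=0}^{N-1} \gamma^t\, \nabla_w \log \pi(x_t,u_t)^\top \theta + V_0 = \sum_{t=0}^{N-1} \gamma^t r(x_t,u_t)$$

is a “simple” regression problem. (See Matthieu G.)

$$w_{n+1} = w_n + \alpha\, \theta \tag{12}$$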
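The episodic regression can be sketched as an ordinary least-squares problem with one row per trajectory. In the sketch below the helper `enac_regression` and the synthetic trajectories are hypothetical: rewards are generated so that the regression model holds exactly, which only checks the algebra, not learning on a real task.

```python
import numpy as np

def enac_regression(trajectories, gamma):
    """Least-squares solution of
    sum_t gamma^t score_t^T theta + V0 = sum_t gamma^t r_t.

    Each trajectory is a list of (score, reward) pairs, where score is
    grad_w log pi(x_t, u_t) as a 1-D array.
    """
    rows, targets = [], []
    for traj in trajectories:
        disc = gamma ** np.arange(len(traj))
        feat = sum(d * s for d, (s, _) in zip(disc, traj))
        ret = sum(d * r for d, (_, r) in zip(disc, traj))
        rows.append(np.append(feat, 1.0))   # last column multiplies V0
        targets.append(ret)
    sol, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return sol[:-1], sol[-1]                # theta, V0

# Synthetic data (assumption): rewards built from a known theta and V0.
rng = np.random.default_rng(1)
theta_true, v0_true, gamma = np.array([0.5, -1.0, 2.0]), 3.0, 0.95
trajectories = []
for _ in range(20):
    score_seq = rng.normal(size=(5, 3))
    rewards = score_seq @ theta_true
    rewards[0] += v0_true                   # V0 enters once, at t = 0
    trajectories.append(list(zip(score_seq, rewards)))

theta, v0 = enac_regression(trajectories, gamma)
w = np.zeros(3)
w_next = w + 0.1 * theta                    # equation (12) with alpha = 0.1
print(theta, v0, w_next)
```

With 20 trajectories and 4 unknowns the system is over-determined, which matches the "enough trajectories compared to the size of θ" condition.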


eNAC : Compatible Critic

- must use a special critic

- estimation of $V$: LSTD($\lambda$), ...

- successes in robotic movements

- needs a good choice of learning parameters and basis functions

- ((we had difficulties getting it to work))


The EM approach

[Kober and Peters, 2008], [Kormushev et al., 2010]

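Estimation

$$\log \bar{J}(w') = \log \int_\tau \frac{\mu(\tau;w)}{\mu(\tau;w)}\, \mu(\tau;w')\, R(\tau)\, d\tau \tag{13}$$

$$\geq \int_\tau \mu(\tau;w)\, R(\tau)\, \log \frac{\mu(\tau;w')}{\mu(\tau;w)}\, d\tau + K \tag{14}$$

$$\propto -D\big(\mu(\tau;w) R(\tau)\, \|\, \mu(\tau;w')\big) = l(w,w') \tag{15}$$

$$\nabla_{w'}\, l(w,w') = E_w\left[\sum_{t=1}^{T} \nabla_{w'} \log \pi(x,u;w')\, Q^\pi(x,u)\right]$$

Maximization

- Analytical solution to $\arg\max_{w'} l(w,w')$, obtained by setting $\nabla_{w'}\, l(w,w') = 0$

- Policy with an exponential-family distribution function

- i.e. $\pi(x,u;w) = \mathcal{N}\big(w^\top \phi(x,u);\ \Sigma(x) = [\;]_{ij}\big)$

$$w_{n+1} = w_n + E\left[\sum_{t=1}^{T} \varepsilon_t\, Q^\pi(x,u,t)\right] \Big/ E\left[\sum_{t=1}^{T} Q^\pi(x,u,t)\right]$$

where $\varepsilon_t$ denotes the exploration term, as in [Kober and Peters, 2008].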
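The flavour of this update can be shown with a deliberately simplified, episodic version: one parameter perturbation per rollout, weighted by the whole return instead of the per-step $Q^\pi(x,u,t)$. The return function and the target below are hypothetical stand-ins, not the robotic tasks of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -0.5])     # hypothetical optimum, unknown to the learner

def rollout_return(w):
    """Stand-in for the return R(tau) of one rollout; strictly positive."""
    return np.exp(-np.sum((w - target) ** 2))

def em_update(w, sigma=0.5, n_rollouts=200):
    """Monte-Carlo estimate of w' = w + E[eps * R] / E[R]."""
    eps = rng.normal(scale=sigma, size=(n_rollouts, w.size))
    returns = np.array([rollout_return(w + e) for e in eps])
    return w + returns @ eps / returns.sum()

w = np.zeros(2)
start_dist = np.linalg.norm(w - target)
for _ in range(20):
    w = em_update(w)
final_dist = np.linalg.norm(w - target)
print(start_dist, final_dist)
```

Note that no learning rate appears: the step is a return-weighted average of the explored perturbations, which requires returns to be non-negative.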


Illustration


EM

- no learning coefficient

- should be able to avoid some local maxima

- should use fewer samples

- not tried

- success also due to “Dynamic Movement Primitives”? [Schaal et al., 2007]


Conclusion

- Nothing original, just a review of existing work

- Quite technical (sorry)

- kinesthetic teaching

- parameter adaptation around the shown example

- EM

- ... or Dynamic Movement Primitives [Schaal et al., 2007]?

- ... or Importance sampling?


Your turn :o)

References

Baxter, J. and Bartlett, P. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350.

Groupe PDMIA (2008). Processus Décisionnels de Markov en Intelligence Artificielle (edited by Olivier Buffet and Olivier Sigaud), volumes 1 & 2. Lavoisier - Hermes Science Publications.

Kimura, H., Miyazaki, K., and Kobayashi, K. (1997). Reinforcement learning in POMDPs with function approximation. In Proc. of the Fourteenth Int. Conf. on Machine Learning (ICML'97), pages 152–160.

Kober, J. and Peters, J. (2008). Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems, NIPS'08.

Kormushev, P., Calinon, S., and Caldwell, D. G. (2010). Robot motor skill coordination with EM-based reinforcement learning. In Proc. IEEE/RSJ Intl Conf. on Intelligent Robots and Systems (IROS), pages 3232–3237.

Peters, J. and Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71(7-9):1180–1190.

Puterman, M. (1994). Markov Decision Processes: discrete stochastic dynamic programming. John Wiley & Sons, Inc., New York, NY.

Schaal, S., Mohajerian, P., and Ijspeert, A. (2007). Dynamics systems vs. optimal control - a unifying view. Progress in Brain Research, 165:425–445.

Sutton, R. and Barto, A. (1998). Reinforcement Learning. Bradford Book, MIT Press, Cambridge, MA.

Reinforcement Learning

[Figure: example MDP with actions a and b, transition probabilities 0.1, 0.9, 0.6, 0.4 and a reward of +5]

- States $X$, actions $U$, probabilistic transitions.

- $E_{x,u\sim\pi}\left[\sum_{t=1}^{\infty} \gamma^t r_t \mid x_0 = x, u_0 = u\right]$

- Find the optimal policy $\pi : X \longrightarrow U$.

- Action 'a' or 'b' in $S_0$?

$$Q(x,u) = r(x,u) + \gamma \sum_{x' \in X} p(x' \mid u, x)\, \max_{u' \in U}\left[Q(x',u')\right]$$

[Puterman, 1994], [Sutton and Barto, 1998], [Groupe PDMIA, 2008], ...


Compute directly the optimal value function, as a solution to:

$$Q(x,u) = r(x,u) + \gamma \sum_{x' \in X} p(x' \mid u, x)\, \max_{u' \in U}\left[Q(x',u')\right]$$
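One standard way to solve this fixed-point equation is to iterate it until convergence (Q-value iteration). The sketch below uses a hypothetical 2-state, 2-action MDP; its structure and numbers are assumptions made for illustration and are not read off the slide's figure.

```python
import numpy as np

# Hypothetical MDP: 2 states, 2 actions ('a' = 0, 'b' = 1).
# p[u, x, y] = probability of moving from x to y under action u.
p = np.array([[[0.1, 0.9], [0.0, 1.0]],    # action 'a'
              [[0.6, 0.4], [0.0, 1.0]]])   # action 'b'
# r[x, u] = immediate reward; state 1 is absorbing and pays +5 (assumption).
r = np.array([[0.0, 0.0],
              [5.0, 5.0]])
gamma = 0.9

def q_iteration(p, r, gamma, n_iter=500):
    """Repeatedly apply Q <- r + gamma * sum_y p(y | u, x) max_u' Q(y, u')."""
    q = np.zeros_like(r)
    for _ in range(n_iter):
        v = q.max(axis=1)                           # max over next actions
        q = r + gamma * np.einsum("uxy,y->xu", p, v)
    return q

q = q_iteration(p, r, gamma)
greedy = q.argmax(axis=1)    # greedy action in each state
print(q, greedy)
```

This requires the model `p` and `r` to be known, which is the setting that direct policy search is meant to avoid.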

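Neurocontroller with Gaussian Noise

No critic, with continuous state and action ($\neq$ Kimura or Baxter).

- $u = w^\top \Phi(x) + \mathcal{N}(0, \Sigma)$

- $\Phi$ of dimension 512

- $z_{k+1} = (1-\beta)\, z_k + \beta\, \dfrac{\nabla_w \pi(x_k, u_k; w_k)}{\pi(x_k, u_k; w_k)}$

- $w_{k+1} = w_k + \alpha\, r_k\, z_k$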
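For a Gaussian policy the ratio $\nabla_w \pi / \pi$ has a closed form, which makes the update easy to sketch. The code below assumes a scalar action, a small hypothetical feature map (4 features instead of 512), and made-up dynamics and cost; only the trace and parameter updates follow the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3    # exploration noise standard deviation (assumption)

def features(x):
    """Hypothetical feature map Phi(x); the talk uses 512 features."""
    return np.array([1.0, x, x ** 2, np.sin(x)])

def log_pi(w, x, u):
    """Log-density of u ~ N(w^T Phi(x), sigma^2) for a scalar action."""
    m = w @ features(x)
    return -0.5 * ((u - m) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def score(w, x, u):
    """grad_w pi / pi = grad_w log pi = (u - w^T Phi(x)) Phi(x) / sigma^2."""
    return (u - w @ features(x)) * features(x) / sigma ** 2

def run(w, n_steps=200, alpha=1e-3, beta=0.9):
    """On-line update without critic, with the trace written as on the slide."""
    w, z, x = w.copy(), np.zeros_like(w), 0.5
    for _ in range(n_steps):
        u = w @ features(x) + sigma * rng.normal()
        z = (1 - beta) * z + beta * score(w, x, u)
        r = -(x + u) ** 2                            # hypothetical cost signal
        w = w + alpha * r * z
        x = float(np.clip(x + 0.1 * u, -1.0, 1.0))   # hypothetical dynamics
    return w

w_out = run(np.zeros(4))
print(w_out)
```

The closed-form `score` can be compared with a finite-difference gradient of `log_pi` to catch sign or scaling mistakes before running on a real system.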

Cumulative cost

Motor response
