“Actor-Critic” Reinforcement learning GDR Robotique & Neurosciences Alain Dutech

(1)

“Actor-Critic” Reinforcement learning

GDR Robotique & Neurosciences

Alain Dutech

Equipe MAIA - LORIA - INRIA Nancy, France

Web :http://maia.loria.fr

Mail :[email protected] Paris - 06/09/2012

(2)

2

Learn to do

(3)

3

Outline

I Context : RL and direct policy search

I Direct Gradient : No Critic

I eNAC : Compatible Critic

I EM : MC Critic

I Summary, discussion

(4)

4

Actor-Critic architecture

Environment

Actor

Critic Sequence of state×actions :

x₀−→^π u₀=⇒r₁,x₁−→^π u₁=⇒ · · ·u_T−1=⇒r_T,x_T

(5)

5

Principles of “Direct Policy Search”

I Policy πisparametrized(byw)

I Every policy can, theoretically, be valued (V orJ or ...)

I Learn/Adapt :modify the parameters to increase value but

I How to modify the parameters ? (Gradientor EM)

I When is acriticneeded ?

I Beware of large sensorimotor space, local minima, ...

(6)

6

Common language

Value function.

V_γ^T(x;w) =E(r_t)

" _T X

t=1

γ^tr_t|x₀=x, π

#

(1) Objective function

J_γ^T(w) =Ex V_γ^T(x)

(2)

¯J(w) = lim

T→∞E

"

1 T

T

X

t=1

rt

#

(3)

Linked :

J(w) =¯ µ^>(I−γP)J_γ^∞ (4)

withµstatic distribution andPtransition probabilities.

(7)

7

Outline

I Direct gradient : No Critic

I EM : MC Critic

(8)

8

Direct policy gradient

[Kimura et al., 1997], [Baxter and Bartlett, 2001]

Gradient of the objective function (also withJ_γ^T).

∇wJ¯=r(x)∇wµ(x;π,w)

µ(x;π,w) (5)

Unbiasedestimation if regenerative process/episode

I Each step :zt+1=zt+^∇_π(x^w^π(x^t^,u^t^,w)

t,u_t,w) ,vt+1=vt+rt I End of episode : ∆j+1= ∆j+vtzt/length

I return∆_N/N

Biasedon-line estimate

I Each step :zt+1=βzt+^∇_π(x^w^π(x^t^,u^t^,w)

t,ut,w) ,wt+1=wt+αt.rt.zt I returnwT

(9)

9

Why a biased estimate ? (µ

^>

∇

_w

(J

_β^∞

) 6= ∇

_w

(µ

^>

J

_β^∞

)

∇w¯J= (1−β)∇w(µ^>)J_β^∞+βµ^>∇w(P)J_β^∞ (6) Left term vanishes

β→1lim(1−β)∇w(µ^>)J_β^∞= 0 (7) but right term can have large variance whenβ→1.

Convergence of estimate to

(1−β)µ^>∇w(J_β^∞) =βµ^>∇w(P)J_β^∞ (8)

(10)

10

No critic

I direct estimation of gradient

I simple algorithms

I application to POMDPs, NN versions.

I lots of samples

I prone to local optimum

(11)

11

Outline

I EM : MC Critic

(12)

12

eNAC [Peters and Schaal, 2008]

Using baseline function for the gradient

∇wJ¯(w) =Ex∼µ,u∼π[∇wlogπ(x,u;w)[Q^π(x,u)−b(x)]] (9) Would like to useNatural Gradient ∇˜J¯

I sure convergence to local optimum

I fastest convergence and can sometimes prevent premature convergence

I independant of parametrization of the policy

I less samples to be correctly evaluated

∇˜¯J= (Ex∼µ,u∼π

h∇wlogπ(x,u)∇wlogπ(x,u)^>i

| {z }

Fisher⁰s Information Matrix

)⁻¹∇wJ¯ (10)

(13)

13

“Easy” computation of natural gradient

CompatibleQ-value approximation :∇θQˆ_θ^π(x,u) =∇wlogπ(x,u) (for example ˆQ_θ^π(x,u) =θ^>∇wlogπ(x,u)).

Then, the best solutionθ^∗ for the advantage function θ^∗=argmin

θ

Ex∼µ,u∼π

h

Q^π(x,u)−V^π(x)−Qˆ_θ^π(x,u)i²

(11) is the natural gradient :∇_wJ¯(w) =θ^∗

(θ^∗independant ofb(.),V^π(x) minimizes variance onceθ∗found)

(14)

14

Episodic estimation of the advantage function

Advantage functionA^π(x,u) +V^π(x) =r(x,u) +γE[V^π(x)]

Then, along one trajectory

N−1

X

t=0

γ^tA^π(x_t,u_t) =−V^π(x₀) +

N−1

X

t=0

γ^tr(x_t,u_t) +γ_NV^π(x_N) So, if enough trajectories (compared to size ofθ)

N−1

X

t=0

γt∇wlogπ(xt,ut)^>θ+V0=

N−1

X

t=0

γ^tr(xt,ut) is a “simple” regression problem. (See Matthieu G.)

w_n+1 =w_n+α.θ^∗ (12)

(15)

15

eNAC : Compatible Critic

I must use a special critic

I estimation ofV LSTD(λ), ...

I successes in robotic movements

I good choice of learning parameters and basis function

I ((we had difficulties to get it working))

(16)

16

Outline

I EM : MC Critic

(17)

17

The EM approach

[Kober and Peters, 2008], [Kormushev et al., 2010]

Estimation

log ¯J(w) = log Z

τ

µ(τ;w)

µ(τ;w)µ(τ;w⁰)R(τ)dτ (13)

≥ Z

τ

µ(τ;w)R(τ) logµ(τ;w⁰)

µ(τ;w)dτ+K (14)

∝ −D(µ(τ;w)R(τ)||µ(τ;w⁰)) =l(w,w⁰) (15)

∇_w0l(w,w⁰) =Ew

hPT

t=1∇_w0π(x,u;w⁰)Q^π(x,u)i

Maximization

I Analytical solution to argmax_w0∇w⁰l(w,w⁰)

I Policy with exponential distributionfunction

I i.e.π(x,u;w) =N(w^>φ(x,u);Σ(x) = []_ij)

wn+1=wn+E

" _T X

t=1

tQ^π(x,u,t)

# /E

" _T X

t=1

Q^π(x,u,t)

#

(18)

18

Illustration

(19)

19

EM

I no learning coefficient

I should be able to avoid some local maxima

I should use less samples

I no tried

I succes also because of “Movement Dynamic Primitive” ? [Schaal et al., 2007]

(20)

20

Outline

I EM : MC Critic

(21)

21

Conclusion

I Nothing original, just review of works

I Quite technical (sorry)

I kinethetic teaching

I parameters adaptation around shown example

I EM

I ... or Dynamic Movement Primitives [Schaal et al., 2007] ?

I ... or Importance sampling ?

(22)

22

Your turn :o)

(23)

23

R´ ef´ erences I

Baxter, J. and Bartlett, P. (2001).

Infinite-horizon policy-gradient estimation.

Journal of Artificial Intelligence Research, 15 :319–350.

Groupe PDMIA (2008).

Processus D´ecisionnels de Markov en Intelligence Artificielle. (Edit´e par Olivier Buffet et Olivier Sigaud), volume 1 & 2.

Lavoisier - Hermes Science Publications.

Kimura, H., Miyazaki, K., and Kobayashi, K. (1997).

Reinforcement learning in POMDPs with function approximation.

InProc. of the Fourteenth Int. Conf. on Machine Learning (ICML’97), pages 152–160.

(24)

24

R´ ef´ erences II

Kober, J. and Peters, J. (2008).

Policy search for motor primitives in robotics.

InAdvances in Neural Information Processing Systems, NIPS’08.

Kormushev, P., Calinon, S., and Caldwell, D. G. (2010).

Robot motor skill coordination with EM-based reinforcement learning.

InProc. IEEE/RSJ Intl Conf. on Intelligent Robots and Systems (IROS), pages 3232–3237.

Peters, J. and Schaal, S. (2008).

Natural actor-critic.

Neurocomputing, 71(7-9) :1180–1190.

(25)

25

R´ ef´ erences III

Puterman, M. (1994).

Markov Decision Processes : discrete stochastic dynamic programming.

John Wiley & Sons, Inc. New York, NY.

Schaal, S., Mohajerian, P., and Ijspeert, A. (2007).

Dynamics systems vs. optimal control–a unifying view.

Progress in brain research, 165 :425–445.

Sutton, R. and Barto, A. (1998).

Reinforcement Learning.

Bradford Book, MIT Press, Cambridge, MA.

(26)

26

Context No Critic eNAC EM Conclusion R´ef´erences Supp

Reinforcement Learning

a b

0.1 0.9 0.6 0.4

+5

I StatesX, ActionsU, probabilistictransitions.

I E_x,u∼π[P∞

t=1γ^tr_t|x0=x,u₀=u]

I Find the optimalpolicy π.

π:X −→ U

Action ’a’ or ’b’ in S

₀

?

Q^∗(x,u) = r(x,u)+γ X

x⁰∈X

p(x⁰|u,x)max

u⁰∈U[Q^∗(x⁰,u⁰)]

[Puterman, 1994], [Sutton and Barto, 1998], [Groupe PDMIA, 2008], ...

(27)

26

Reinforcement Learning

a b

0.1 0.9 0.6 0.4

+5

0.9

0.1

I E_x,u∼π[P∞

t=1γ^tr_t|x0=x,u₀=u]

π:X −→ U

Action ’a’ or ’b’ in S

₀

?

Q^∗(x,u) = r(x,u)+γ X

x⁰∈X

p(x⁰|u,x)max

u⁰∈U[Q^∗(x⁰,u⁰)]

(28)

26

Reinforcement Learning

a b

0.1 0.9 0.6 0.4

+5

0.9

0.1

I Ex,u∼π[P∞

t=1γ^trt|x0=x,u0=u]

π:X −→ U

Action ’a’ or ’b’ in S

₀

?

Q^∗(x,u) = r(x,u)+γ X

x⁰∈X

p(x⁰|u,x)max

u⁰∈U[Q^∗(x⁰,u⁰)]

(29)

26

Reinforcement Learning

a b

0.1 0.9 0.6 0.4

+5

0.9

0.1

Compute directly theoptimalvalue function (as a solution to) : Q^∗(x,u) = r(x,u)+γ X

x⁰∈X

p(x⁰|u,x)max

u⁰∈U[Q^∗(x⁰,u⁰)]

(30)

27

Neurocontroler with Gaussian Noise

No Critic with continuous state and action6= Kimura or Baxter.

I u=w^>Φ(x) +N(0,Σ)

I Φ of dimension512

I zk = (1−β)zk+ β^∇^w_π(x^π(x^k^,u^k^;w^k⁾

k,u_k;w_k)

I w_k+1=w_k+α.r_k.z_k

(31)

28

Cumulative cost

(32)

29

“Actor-Critic” Reinforcement learning GDR Robotique & Neurosciences Alain Dutech