“Actor-Critic” Reinforcement learning
GDR Robotique & Neurosciences
Alain Dutech
Equipe MAIA - LORIA - INRIA Nancy, France
Web: http://maia.loria.fr
Mail: [email protected]
Paris - 06/09/2012
Learn to do
Outline
- Context: RL and direct policy search
- Direct Gradient: No Critic
- eNAC: Compatible Critic
- EM: MC Critic
- Summary, discussion
Actor-Critic architecture
[Figure: Environment, Actor, Critic]
Sequence of state × actions:
$x_0 \xrightarrow{\pi} u_0 \Rightarrow r_1, x_1 \xrightarrow{\pi} u_1 \Rightarrow \cdots\, u_{T-1} \Rightarrow r_T, x_T$
Principles of “Direct Policy Search”
- Policy $\pi$ is parametrized (by $w$)
- Every policy can, theoretically, be valued ($V$ or $J$ or ...)
- Learn/Adapt: modify the parameters to increase the value, but
- How to modify the parameters? (Gradient or EM)
- When is a critic needed?
- Beware of large sensorimotor spaces, local optima, ...
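Common language
Value function.
$V^T_\gamma(x; w) = E_{(r_t)}\left[\sum_{t=1}^{T} \gamma^t r_t \mid x_0 = x, \pi\right]$ (1)
Objective function.
$J^T_\gamma(w) = E_x\left[V^T_\gamma(x)\right]$ (2)
$\bar{J}(w) = \lim_{T\to\infty} E\left[\frac{1}{T}\sum_{t=1}^{T} r_t\right]$ (3)
Linked:
$\bar{J}(w) = \mu^\top (I - \gamma P)\, J^\infty_\gamma$ (4)
with $\mu$ the stationary distribution and $P$ the transition probabilities.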
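Relation (4) can be checked numerically on a toy chain. The sketch below is only an illustration: the 2-state matrix `P`, the reward vector `r`, and the convention that the discounted value vector solves J = r + γPJ are assumptions, not values taken from the slides.

```python
import numpy as np

# Hypothetical 2-state chain under a fixed policy (illustrative values only).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])   # P[i, j] = probability of moving from state i to j
r = np.array([0.0, 5.0])     # assumed expected reward in each state
gamma = 0.95

# Stationary distribution mu: left eigenvector of P for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
mu = mu / mu.sum()

# Discounted value vector, assuming the convention J = r + gamma * P @ J.
J_gamma = np.linalg.solve(np.eye(2) - gamma * P, r)

avg_reward = mu @ r                              # average reward of the chain
linked = mu @ (np.eye(2) - gamma * P) @ J_gamma  # right-hand side of (4)
print(avg_reward, linked)
```

Under these assumptions both quantities coincide, whatever the value of `gamma`.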
Direct policy gradient
[Kimura et al., 1997], [Baxter and Bartlett, 2001]
Gradient of the objective function (also with $J^T_\gamma$).
$\nabla_w \bar{J} = r(x)\,\frac{\nabla_w \mu(x;\pi,w)}{\mu(x;\pi,w)}$ (5)
Unbiased estimation if regenerative process/episode
- Each step: $z_{t+1} = z_t + \frac{\nabla_w \pi(x_t,u_t,w)}{\pi(x_t,u_t,w)}$, $v_{t+1} = v_t + r_t$
- End of episode: $\Delta_{j+1} = \Delta_j + v_t z_t / \text{length}$
- return $\Delta_N / N$
Biased on-line estimate
- Each step: $z_{t+1} = \beta z_t + \frac{\nabla_w \pi(x_t,u_t,w)}{\pi(x_t,u_t,w)}$, $w_{t+1} = w_t + \alpha_t\, r_t\, z_t$
- return $w_T$
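The episodic (unbiased) scheme above can be sketched on a toy problem. This is a minimal illustration, assuming a stateless two-armed bandit with a softmax policy; the function names, the arm means, and the noise level are hypothetical choices, not part of the cited algorithms.

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

def episodic_gradient(w, arm_means, n_episodes, horizon, rng):
    """Average of v * z / length over episodes, following the episodic scheme."""
    delta = np.zeros_like(w)
    for _ in range(n_episodes):
        z = np.zeros_like(w)   # accumulated likelihood ratios grad(pi) / pi
        v = 0.0                # accumulated reward of the episode
        for _ in range(horizon):
            p = softmax(w)
            u = rng.choice(len(w), p=p)
            # For a softmax policy, grad_w pi / pi = grad_w log pi = e_u - p.
            score = -p
            score[u] += 1.0
            z += score
            v += arm_means[u] + rng.normal(0.0, 0.1)  # assumed noisy reward
        delta += v * z / horizon
    return delta / n_episodes

rng = np.random.default_rng(0)
w = np.zeros(2)
grad = episodic_gradient(w, np.array([0.0, 1.0]), 2000, 5, rng)
print(grad)  # second component should be positive: arm 1 pays more
```

In this toy setting the exact gradient of the expected reward at `w = 0` is (-0.25, 0.25), so the estimate should point towards the better arm.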
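Why a biased estimate? ($\mu^\top \nabla_w(J^\infty_\beta) \neq \nabla_w(\mu^\top J^\infty_\beta)$)
$\nabla_w \bar{J} = (1-\beta)\,\nabla_w(\mu^\top)\, J^\infty_\beta + \beta\, \mu^\top \nabla_w(P)\, J^\infty_\beta$ (6)
Left term vanishes
$\lim_{\beta\to 1} (1-\beta)\,\nabla_w(\mu^\top)\, J^\infty_\beta = 0$ (7)
but right term can have large variance when $\beta \to 1$.
Convergence of estimate to
$(1-\beta)\,\mu^\top \nabla_w(J^\infty_\beta) = \beta\, \mu^\top \nabla_w(P)\, J^\infty_\beta$ (8)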
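One way to recover (6) from (4) is sketched below. It assumes that the reward vector $r$ (a symbol introduced here for the derivation) does not depend on $w$, that $\bar{J} = \mu^\top r$, and that $J^\infty_\beta$ satisfies $r = (I - \beta P) J^\infty_\beta$, with $\beta$ playing the role of $\gamma$ in (4).

```latex
\begin{align*}
\nabla_w \bar{J}
  &= \nabla_w(\mu^\top)\, r
   = \nabla_w(\mu^\top)\,(I - \beta P)\, J^\infty_\beta \\
\mu^\top = \mu^\top P
  &\;\Rightarrow\;
  \nabla_w(\mu^\top)\, P = \nabla_w(\mu^\top) - \mu^\top \nabla_w(P) \\
\nabla_w \bar{J}
  &= \nabla_w(\mu^\top)\, J^\infty_\beta
   - \beta \left[ \nabla_w(\mu^\top) - \mu^\top \nabla_w(P) \right] J^\infty_\beta \\
  &= (1-\beta)\, \nabla_w(\mu^\top)\, J^\infty_\beta
   + \beta\, \mu^\top \nabla_w(P)\, J^\infty_\beta
\end{align*}
```

The second line differentiates the stationarity condition of $\mu$; substituting it into the first line gives the two terms of (6).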
No critic
- direct estimation of the gradient
- simple algorithms
- application to POMDPs, NN versions
- needs lots of samples
- prone to local optima
eNAC [Peters and Schaal, 2008]
Using a baseline function for the gradient
$\nabla_w \bar{J}(w) = E_{x\sim\mu,\,u\sim\pi}\left[\nabla_w \log\pi(x,u;w)\,\left[Q^\pi(x,u) - b(x)\right]\right]$ (9)
Would like to use the Natural Gradient $\tilde{\nabla}\bar{J}$
- sure convergence to a local optimum
- fastest convergence, and can sometimes prevent premature convergence
- independent of the parametrization of the policy
- fewer samples needed for a correct estimate
$\tilde{\nabla}\bar{J} = \Big(\underbrace{E_{x\sim\mu,\,u\sim\pi}\left[\nabla_w \log\pi(x,u)\,\nabla_w \log\pi(x,u)^\top\right]}_{\text{Fisher's Information Matrix}}\Big)^{-1} \nabla_w \bar{J}$ (10)
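Equation (10) amounts to solving a linear system with an empirical Fisher matrix. The following sketch assumes the score vectors $\nabla_w \log\pi$ have already been sampled; the function name `natural_gradient` and the `ridge` regulariser are hypothetical additions for the illustration.

```python
import numpy as np

def natural_gradient(scores, grad, ridge=1e-8):
    """scores: (n_samples, n_params) samples of grad_w log pi(x, u).
    grad: (n_params,) estimate of the ordinary gradient.
    ridge: small assumed regulariser keeping the Fisher estimate invertible."""
    fisher = scores.T @ scores / scores.shape[0]   # empirical E[score score^T]
    fisher = fisher + ridge * np.eye(scores.shape[1])
    return np.linalg.solve(fisher, grad)           # F^{-1} grad without explicit inverse

# Toy data: the first parameter has large scores, the second small ones.
scores = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 0.5], [0.0, -0.5]])
grad = np.array([1.0, 1.0])
nat = natural_gradient(scores, grad, ridge=0.0)
print(nat)
```

Here the empirical Fisher matrix is diag(2, 0.125), so the natural gradient rescales each coordinate by the inverse of its score variance, which illustrates the independence from the parametrization.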
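“Easy” computation of natural gradient
Compatible Q-value approximation: $\nabla_\theta \hat{Q}^\pi_\theta(x,u) = \nabla_w \log\pi(x,u)$ (for example $\hat{Q}^\pi_\theta(x,u) = \theta^\top \nabla_w \log\pi(x,u)$).
Then, the best solution $\theta^*$ for the advantage function
$\theta^* = \arg\min_\theta E_{x\sim\mu,\,u\sim\pi}\left[\left(Q^\pi(x,u) - V^\pi(x) - \hat{Q}^\pi_\theta(x,u)\right)^2\right]$ (11)
is the natural gradient: $\tilde{\nabla}_w \bar{J}(w) = \theta^*$
($\theta^*$ is independent of $b(\cdot)$; $V^\pi(x)$ minimizes the variance once $\theta^*$ is found)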
Episodic estimation of the advantage function
Advantage function: $A^\pi(x,u) + V^\pi(x) = r(x,u) + \gamma E[V^\pi(x')]$
Then, along one trajectory
$\sum_{t=0}^{N-1} \gamma^t A^\pi(x_t,u_t) = -V^\pi(x_0) + \sum_{t=0}^{N-1} \gamma^t r(x_t,u_t) + \gamma^N V^\pi(x_N)$
So, if enough trajectories (compared to the size of $\theta$)
$\sum_{t=0}^{N-1} \gamma^t \nabla_w \log\pi(x_t,u_t)^\top \theta + V_0 = \sum_{t=0}^{N-1} \gamma^t r(x_t,u_t)$
is a “simple” regression problem. (See Matthieu G.)
$w_{n+1} = w_n + \alpha\,\theta^*$ (12)
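The regression above can be sketched with ordinary least squares, one row per trajectory. This is a minimal illustration on synthetic data: the helper name `enac_regression`, the random features, and the "true" values used to generate the targets are all assumptions, and the $\gamma^N V^\pi(x_N)$ term is taken to be negligible, as in the regression written above.

```python
import numpy as np

def enac_regression(score_sums, returns):
    """score_sums: (n_traj, n_params), row i = sum_t gamma^t grad_w log pi(x_t, u_t).
    returns: (n_traj,), entry i = sum_t gamma^t r(x_t, u_t).
    Returns (theta, V0) from an ordinary least-squares fit."""
    n_traj = score_sums.shape[0]
    design = np.hstack([score_sums, np.ones((n_traj, 1))])  # last column fits V0
    solution, *_ = np.linalg.lstsq(design, returns, rcond=None)
    return solution[:-1], solution[-1]

# Synthetic, noise-free data generated from assumed values of theta and V0.
rng = np.random.default_rng(1)
true_theta, true_v0 = np.array([0.5, -1.0]), 2.0
score_sums = rng.normal(size=(20, 2))
returns = score_sums @ true_theta + true_v0

theta, v0 = enac_regression(score_sums, returns)
w = np.zeros(2)
alpha = 0.1
w_next = w + alpha * theta   # update (12)
print(theta, v0, w_next)
```

With more trajectories than unknowns and no noise, the fit recovers the generating values exactly; with real rollouts the estimate is only as good as the sampled trajectories.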
eNAC: Compatible Critic
- must use a special critic
- estimation of $V$: LSTD($\lambda$), ...
- successes in robotic movements
- needs a good choice of learning parameters and basis functions
- (we had difficulty getting it to work)
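The EM approach
[Kober and Peters, 2008], [Kormushev et al., 2010]
Estimation
$\log \bar{J}(w') = \log \int_\tau \frac{\mu(\tau;w)}{\mu(\tau;w)}\, \mu(\tau;w')\, R(\tau)\, d\tau$ (13)
$\geq \int_\tau \mu(\tau;w)\, R(\tau) \log\frac{\mu(\tau;w')}{\mu(\tau;w)}\, d\tau + K$ (14)
$\propto -D\left(\mu(\tau;w) R(\tau)\,\|\,\mu(\tau;w')\right) = l(w,w')$ (15)
$\nabla_{w'} l(w,w') = E_w\left[\sum_{t=1}^{T} \nabla_{w'} \log\pi(x,u;w')\, Q^\pi(x,u)\right]$
Maximization
- Analytical solution to $\arg\max_{w'} l(w,w')$ (set $\nabla_{w'} l(w,w') = 0$)
- Policy with exponential distribution function
- i.e. $\pi(x,u;w) = \mathcal{N}\left(w^\top \phi(x,u);\ \Sigma(x) = [\ ]_{ij}\right)$
$w_{n+1} = w_n + E\left[\sum_{t=1}^{T} \epsilon_t\, Q^\pi(x,u,t)\right] \Big/ E\left[\sum_{t=1}^{T} Q^\pi(x,u,t)\right]$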
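The closed-form update can be sketched as a reward-weighted average over rollouts. The code below is an illustration only: it assumes the weights multiplying $Q^\pi$ are exploration offsets $\epsilon_t$ added to the parameters, as in [Kober and Peters, 2008], and that the $Q$ estimates are non-negative; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def em_update(w, eps, q):
    """eps: (n_rollouts, T, n_params) exploration offsets applied to w.
    q: (n_rollouts, T) non-negative estimates of Q(x, u, t).
    Returns w + E[sum_t eps_t Q_t] / E[sum_t Q_t], expectations taken over rollouts."""
    numerator = (eps * q[..., None]).sum(axis=1).mean(axis=0)
    denominator = q.sum(axis=1).mean()
    return w + numerator / denominator

# Two rollouts of length 1: the first explored along parameter 0 and did better.
w = np.zeros(2)
eps = np.array([[[1.0, 0.0]],
                [[0.0, 1.0]]])
q = np.array([[3.0],
              [1.0]])
w_new = em_update(w, eps, q)
print(w_new)
```

The new parameters move mostly in the direction explored by the better rollout, and no step size has to be chosen.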
Illustration
EM
- no learning coefficient
- should be able to avoid some local maxima
- should use fewer samples
- not tried
- success also because of “Dynamic Movement Primitives”? [Schaal et al., 2007]
Conclusion
- Nothing original, just a review of existing work
- Quite technical (sorry)
- kinesthetic teaching
- parameter adaptation around the shown example
- EM
- ... or Dynamic Movement Primitives [Schaal et al., 2007]?
- ... or Importance sampling?
Your turn :o)
References I
Baxter, J. and Bartlett, P. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350.
Groupe PDMIA (2008). Processus Décisionnels de Markov en Intelligence Artificielle. (Edited by Olivier Buffet and Olivier Sigaud), volumes 1 & 2. Lavoisier - Hermes Science Publications.
Kimura, H., Miyazaki, K., and Kobayashi, K. (1997). Reinforcement learning in POMDPs with function approximation. In Proc. of the Fourteenth Int. Conf. on Machine Learning (ICML’97), pages 152–160.
References II
Kober, J. and Peters, J. (2008). Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems, NIPS’08.
Kormushev, P., Calinon, S., and Caldwell, D. G. (2010). Robot motor skill coordination with EM-based reinforcement learning. In Proc. IEEE/RSJ Intl Conf. on Intelligent Robots and Systems (IROS), pages 3232–3237.
Peters, J. and Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71(7-9):1180–1190.
References III
Puterman, M. (1994). Markov Decision Processes: discrete stochastic dynamic programming. John Wiley & Sons, Inc. New York, NY.
Schaal, S., Mohajerian, P., and Ijspeert, A. (2007). Dynamics systems vs. optimal control–a unifying view. Progress in brain research, 165:425–445.
Sutton, R. and Barto, A. (1998). Reinforcement Learning. Bradford Book, MIT Press, Cambridge, MA.
Reinforcement Learning
[Figure: example MDP with actions a and b; transition probabilities 0.1, 0.9, 0.6, 0.4, 0.9, 0.1; reward +5]
- States $X$, actions $U$, probabilistic transitions.
- $E_{x,u\sim\pi}\left[\sum_{t=1}^{\infty} \gamma^t r_t \mid x_0 = x, u_0 = u\right]$
- Find the optimal policy $\pi$.
$\pi : X \longrightarrow U$
Action 'a' or 'b' in $S_0$?
$Q^*(x,u) = r(x,u) + \gamma \sum_{x'\in X} p(x'|u,x) \max_{u'\in U}\left[Q^*(x',u')\right]$
[Puterman, 1994], [Sutton and Barto, 1998], [Groupe PDMIA, 2008], ...
Reinforcement Learning
[Figure: example MDP with actions a and b; transition probabilities 0.1, 0.9, 0.6, 0.4, 0.9, 0.1; reward +5]
Compute directly the optimal value function (as a solution to):
$Q^*(x,u) = r(x,u) + \gamma \sum_{x'\in X} p(x'|u,x) \max_{u'\in U}\left[Q^*(x',u')\right]$
[Puterman, 1994], [Sutton and Barto, 1998], [Groupe PDMIA, 2008], ...
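The fixed-point equation for $Q^*$ can be solved by repeated application of its right-hand side. The sketch below uses a hypothetical 2-state, 2-action MDP (not the one in the figure, whose structure is not recoverable here); the array layout and the function name `q_value_iteration` are assumptions for the illustration.

```python
import numpy as np

def q_value_iteration(P, R, gamma, n_iter=500):
    """P: (n_states, n_actions, n_states) transition probabilities p(x'|u,x).
    R: (n_states, n_actions) rewards r(x,u).
    Iterates Q <- R + gamma * sum_x' p(x'|u,x) max_u' Q(x',u')."""
    Q = np.zeros_like(R, dtype=float)
    for _ in range(n_iter):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

# Hypothetical MDP: in state 0, action a (index 0) stays put with reward 0.5,
# action b (index 1) moves to state 1 with reward 0; state 1 is absorbing
# and pays 2 at every step.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, :, 1] = 1.0
R = np.array([[0.5, 0.0],
              [2.0, 2.0]])

Q = q_value_iteration(P, R, gamma=0.5)
greedy = Q.argmax(axis=1)
print(Q, greedy)
```

In this toy case the greedy policy gives up the immediate reward of action a in state 0 to reach the better-paying state 1.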
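Neurocontroller with Gaussian Noise
No Critic with continuous state and action $\neq$ Kimura or Baxter.
- $u = w^\top \Phi(x) + \mathcal{N}(0,\Sigma)$
- $\Phi$ of dimension 512
- $z_{k+1} = (1-\beta)\, z_k + \beta\, \frac{\nabla_w \pi(x_k,u_k;w_k)}{\pi(x_k,u_k;w_k)}$
- $w_{k+1} = w_k + \alpha\, r_k\, z_k$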
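For a Gaussian policy the ratio $\nabla_w \pi / \pi$ has a closed form. The sketch below is a simplified illustration assuming a scalar action with fixed standard deviation `sigma` (the slide uses a full covariance $\Sigma$ and 512 features); the function names and the numbers are hypothetical, and the reward weights the freshly updated trace.

```python
import numpy as np

def gaussian_score(w, phi, u, sigma):
    """grad_w log pi = grad_w pi / pi for a scalar action u ~ N(w . phi, sigma^2)."""
    return (u - w @ phi) * phi / sigma**2

def online_step(w, z, phi, u, r, sigma, alpha, beta):
    """One step of the trace and weight updates, using the new trace in the weight update."""
    z = (1.0 - beta) * z + beta * gaussian_score(w, phi, u, sigma)
    w = w + alpha * r * z
    return w, z

w = np.array([1.0, 2.0])
phi = np.array([0.5, -1.0])   # assumed feature vector Phi(x)
score = gaussian_score(w, phi, u=0.0, sigma=2.0)
w_next, z_next = online_step(w, np.zeros(2), phi, u=0.0, r=2.0,
                             sigma=2.0, alpha=0.1, beta=0.5)
print(score, w_next, z_next)
```

The score is proportional to the exploration noise `u - w @ phi`, so actions that deviated from the mean and were rewarded pull the mean in their direction.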
Cumulative cost