Covariance Matrix Adaptation for Direct Reinforcement Learning
Freek Stulp
Cognitive Robotics Group – ENSTA-ParisTech FLOWERS Team – INRIA/Bordeaux
Olivier Sigaud
Institut des Systmes Intelligents et de Robotique – UPMC CNRS UMR 7222
JFPDA 23.05.2012
Motivation
PI 2 powerful algorithm for direct RL on robots
Tuning exploration parameter in PI
2Grasping under Uncertainty
IROS’11, ICRA’11
Pick-and-Place Tasks
Humanoids’11
Motivation
PI 2 powerful algorithm for direct RL on robots
Tuning exploration parameter in PI
2Motivation
PI 2 powerful algorithm for direct RL on robots Tuning exploration parameter in PI
2Update rules for PI 2 and CEM / CMA-ES almost identical
But CEM and CMA-ES additionaly tune exploration parameter automatically
Goals
Analysis and comparison of PI
2/ CEM / CMA-ES Novel algorithm PI
2-CMA
Essentially PI
2with automatic exploration parameter tuning
Motivation
PI 2 powerful algorithm for direct RL on robots Tuning exploration parameter in PI
2Update rules for PI 2 and CEM / CMA-ES almost identical
But CEM and CMA-ES additionaly tune exploration parameter automatically
Goals
Analysis and comparison of PI
2/ CEM / CMA-ES Novel algorithm PI
2-CMA
Essentially PI
2with automatic exploration parameter tuning
Outline
PI 2
CEM CMA-ES
Reward-Weighted Averaging
Comparison
Evaluation Task
PI 2 -CMA
Evaluation
Adaptive Exploration
Outline
PI 2 CEM
CMA-ES
Reward-Weighted Averaging
Comparison
Evaluation Task
PI 2 -CMA
Evaluation
Adaptive Exploration
Outline
PI 2 CEM CMA-ES
Reward-Weighted Averaging
Comparison
Evaluation Task
PI 2 -CMA
Evaluation
Adaptive Exploration
Outline
PI 2 CEM CMA-ES
Reward-Weighted Averaging
Comparison
Evaluation Task
PI 2 -CMA
Evaluation
Adaptive Exploration
Outline
PI 2 CEM CMA-ES
Reward-Weighted Averaging
Comparison
Evaluation Task
PI 2 -CMA
Evaluation
Adaptive Exploration
Outline
PI 2 CEM CMA-ES
Reward-Weighted Averaging
Comparison
Evaluation Task
PI 2 -CMA
Evaluation
Adaptive Exploration
Outline
PI 2 CEM CMA-ES
Reward-Weighted Averaging
Comparison
Evaluation Task
PI 2 -CMA
Evaluation
Adaptive Exploration
Outline
PI 2 CEM CMA-ES
Reward-Weighted Averaging
Comparison
Evaluation Task
PI 2 -CMA Evaluation
Adaptive Exploration
Outline
PI 2 CEM CMA-ES
Reward-Weighted Averaging
Comparison
Evaluation Task
PI 2 -CMA
Evaluation
Adaptive Exploration
PI 2 - Derivation Outline
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
From first principles of Stochastic Optimal Control Start with Hamilton Jacobi Bellman equations
1
Log transformation + benign assumption
⇒ Linear!
2
Transform PDE to path integral with Feynman-Kac theorem
⇒ Roll-outs!
3
Apply to parameterized policies
⇒ Model free!
⇒ Iterative update rule for θ
PI 2 - Algorithm
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
Words
Images
Formulae
Input:
DMP with initial parameters θ, cost function J
While (cost not converged) Explore
sample exploration vectors execute DMP
determine cost
Update compute prob. from cost prob.-weighted averaging parameter update
1 τ
¨xt=α(β(g−xt)−xt) +˙ gT t
(
θ
+t,k)
J(τi)=φtN+ ZtN
ti (qt+1 2
θT tRθt)dt
t,k∼ N(0,Σ) θk=θ+k
θ←θ+δθ
P τi,k
= e−1
λJ(τi,k)
PK k=1[e−1
λJ(τi,k) ]
δθti= K X
k=1 h
P
τi,k
Mti,kti,k i
Input: DMP with initial parameters θ
, cost function J While (cost not converged)
Explore
sample exploration vectors execute DMP
determine cost
Update compute prob. from cost prob.-weighted averaging parameter update
1 τ
¨xt=α(β(g−xt)−xt) +˙ gT t
(
θ
+t,k)
J(τi)=φtN+ ZtN
ti (qt+1 2θT
tRθt)dt
t,k∼ N(0,Σ) θk=θ+k
θ←θ+δθ
P τi,k
= e−1
λJ(τi,k)
PK k=1[e−1
λJ(τi,k) ]
δθti= K X
k=1 h
P
τi,k
Mti,kti,k i
Input: DMP with initial parameters θ, cost function J
While (cost not converged) Explore
sample exploration vectors execute DMP
determine cost
Update compute prob. from cost prob.-weighted averaging parameter update
1 τ
¨xt=α(β(g−xt)−xt) +˙ gT t
(
θ
+t,k)
J(τi)=φtN+ ZtN
ti (qt+1 2θT
tRθt)dt
t,k∼ N(0,Σ) θk=θ+k
θ←θ+δθ
P
τi,k
= e−1
λJ(τi,k)
PK k=1[e−1
λJ(τi,k) ]
δθti= K X
k=1 h
P
τi,k
Mti,kti,k i
Input: DMP with initial parameters θ, cost function J
While (cost not converged) Explore
sample exploration vectors execute DMP
determine cost
Update compute prob. from cost prob.-weighted averaging parameter update
1 τ
¨xt=α(β(g−xt)−xt) +˙ gT t
(
θ
+t,k)
J(τi)=φtN+ ZtN
ti (qt+1 2θT
tRθt)dt
t,k∼ N(0,Σ) θk=θ+k
θ←θ+δθ
P
τi,k
= e−1
λJ(τi,k)
PK k=1[e−1
λJ(τi,k) ]
δθti= K X
k=1 h
P
τi,k
Mti,kti,k i
Input: DMP with initial parameters θ, cost function J While (cost not converged)
Explore
sample exploration vectors execute DMP
determine cost
Update
compute prob. from cost prob.-weighted averaging parameter update
1 τ
¨xt=α(β(g−xt)−xt) +˙ gT t
(
θ
+t,k)
J(τi)=φtN+ ZtN
ti (qt+1 2θT
tRθt)dt
t,k∼ N(0,Σ) θk=θ+k
θ←θ+δθ
P
τi,k
= e−1
λJ(τi,k)
PK k=1[e−1
λJ(τi,k) ]
δθti= K X
k=1 h
P
τi,k
Mti,kti,k i
Input: DMP with initial parameters θ, cost function J While (cost not converged)
Explore
sample exploration vectors
execute DMP determine cost
Update
compute prob. from cost prob.-weighted averaging parameter update
1 τ
¨xt=α(β(g−xt)−xt) +˙ gT t(θ+t,k)
J(τi)=φtN+ ZtN
ti (qt+1 2
θT tRθt)dt
t,k∼ N(0,Σ) θk=θ+k
θ←θ+δθ
P τi,k
= e−1
λJ(τi,k)
PK k=1[e−1
λJ(τi,k) ]
δθti= K X
k=1 h
P
τi,k
Mti,kti,k i
Input: DMP with initial parameters θ, cost function J While (cost not converged)
Explore
sample exploration vectors execute DMP
determine cost
Update
compute prob. from cost prob.-weighted averaging parameter update
1 τ
¨xt=α(β(g−xt)−xt) +˙ gT t(θ+t,k)
J(τi)=φtN+ ZtN
ti (qt+1 2
θT tRθt)dt
t,k∼ N(0,Σ) θk=θ+k
θ←θ+δθ
P τi,k
= e−1
λJ(τi,k)
PK k=1[e−1
λJ(τi,k) ]
δθti= K X
k=1 h
P
τi,k
Mti,kti,k i
Input: DMP with initial parameters θ, cost function J While (cost not converged)
Explore
sample exploration vectors execute DMP
determine cost
Update
compute prob. from cost prob.-weighted averaging parameter update
1 τ
¨xt=α(β(g−xt)−xt) +˙ gT t(θ+t,k)
J(τi)=φtN+ ZtN
ti (qt+1 2
θT tRθt)dt
t,k∼ N(0,Σ) θk=θ+k
θ←θ+δθ
P τi,k
= e−1
λJ(τi,k)
PK k=1[e−1
λJ(τi,k) ]
δθti= K X
k=1 h
P
τi,k
Mti,kti,k i
Input: DMP with initial parameters θ, cost function J While (cost not converged)
Explore
sample exploration vectors execute DMP
determine cost
Update
compute prob. from cost prob.-weighted averaging parameter update
1 τ
¨xt=α(β(g−xt)−xt) +˙ gT t(θ+t,k)
J(τi)=φtN+ ZtN
ti (qt+1 2
θT tRθt)dt
t,k∼ N(0,Σ) θk=θ+k
θ←θ+δθ
P τi,k
= e−1
λJ(τi,k)
PK k=1[e−1
λJ(τi,k) ]
δθti= K X
k=1 h
P
τi,k
Mti,kti,k i
Input: DMP with initial parameters θ, cost function J While (cost not converged)
Explore
sample exploration vectors execute DMP
determine cost
Update compute prob. from cost
prob.-weighted averaging parameter update
1 τ
¨xt=α(β(g−xt)−xt) +˙ gT t(θ+t,k)
J(τi)=φtN+ ZtN
ti (qt+1 2
θT tRθt)dt
t,k∼ N(0,Σ) θk=θ+k
θ←θ+δθ
P τi,k
= e−1
λJ(τi,k)
PK k=1[e−1
λJ(τi,k) ]
δθti= K X
k=1 h
P
τi,k
Mti,kti,k i
Input: DMP with initial parameters θ, cost function J While (cost not converged)
Explore
sample exploration vectors execute DMP
determine cost
Update compute prob. from cost prob.-weighted averaging
parameter update
1 τ
¨xt=α(β(g−xt)−xt) +˙ gT t(θ+t,k)
J(τi)=φtN+ ZtN
ti (qt+1 2
θT tRθt)dt
t,k∼ N(0,Σ) θk=θ+k
θ←θ+δθ
P τi,k
= e−1
λJ(τi,k)
PK k=1[e−1
λJ(τi,k) ]
δθti= K X
k=1 h
P
τi,k
Mti,kti,k i
Input: DMP with initial parameters θ, cost function J While (cost not converged)
Explore
sample exploration vectors execute DMP
determine cost
Update compute prob. from cost prob.-weighted averaging parameter update
1 τ
¨xt=α(β(g−xt)−xt) +˙ gT t(θ+t,k)
J(τi)=φtN+ ZtN
ti (qt+1 2
θT tRθt)dt
t,k∼ N(0,Σ) θk=θ+k
θ←θ+δθ
P τi,k
= e−1
λJ(τi,k)
PK k=1[e−1
λJ(τi,k) ]
δθti= K X
k=1 h
P
τi,k
Mti,kti,k i
Input: DMP with initial parameters θ, cost function J While (cost not converged)
Explore
sample exploration vectors execute DMP
determine cost
Update compute prob. from cost prob.-weighted averaging parameter update
1 τ
¨xt=α(β(g−xt)−xt) +˙ gT t(θ+t,k)
J(τi)=φtN+ ZtN
ti (qt+1 2
θT tRθt)dt
t,k∼ N(0,Σ) θk=θ+k
θ←θ+δθ
P τi,k
= e−1
λJ(τi,k)
PK k=1[e−1
λJ(τi,k) ]
δθti= K X
k=1 h
P
τi,k
Mti,kti,k i
PI 2 - Algorithm
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
Advantages
No gradient ⇒ Deals with discontinous noisy cost functions Update δ within convex hull of
k=1...K⇒ Safe update rule Arbitrary cost functions
Model-free Fast convergence
Only one open parameter: magnitude of exploration (
t,k∼ N(0, Σ)) Disadvantage
No global convergence guarantees...
Robotics: This is where imitation comes in! π(θ
imit) ≈ π(θ
∗)
Applied to very high-dimensional, complex tasks...
Cross-Entropy Method ( CEM )
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
θk=1...K∼
N(θ,
Σ)
∀k Jk
=
J(θk)
θk=1...K←sortθk=1...Kw.r.tJk=1...K θnew
=
Ke
X
k=1
1
KeθkΣ
new=
Ke
X
k=1
1
Ke(θ
k−θ)(θk−θ)|This algorithm can be interpreted as performing reward-weighted averaging
Cross-Entropy Method ( CEM )
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
θk=1...K∼ N(θ,
Σ)
∀k Jk
=
J(θk)
θk=1...K←sortθk=1...Kw.r.tJk=1...K θnew
=
Ke
X
k=1
1
KeθkΣ
new=
Ke
X
k=1
1
Ke(θ
k−θ)(θk−θ)|This algorithm can be interpreted as performing reward-weighted averaging
Cross-Entropy Method ( CEM )
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
θk=1...K∼ N(θ,
Σ)
∀k Jk
=
J(θk)
θk=1...K←sortθk=1...Kw.r.tJk=1...K θnew
=
Ke
X
k=1
1
KeθkΣ
new=
Ke
X
k=1
1
Ke(θ
k−θ)(θk−θ)|This algorithm can be interpreted as performing reward-weighted averaging
Cross-Entropy Method ( CEM )
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
θk=1...K∼ N(θ,
Σ)
∀k Jk
=
J(θk)
θk=1...K←sortθk=1...Kw.r.tJk=1...K
θnew
=
Ke
X
k=1
1
KeθkΣ
new=
Ke
X
k=1
1
Ke(θ
k−θ)(θk−θ)|This algorithm can be interpreted as performing reward-weighted averaging
Cross-Entropy Method ( CEM )
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
θk=1...K∼ N(θ,
Σ)
∀k Jk
=
J(θk)
θk=1...K←sortθk=1...Kw.r.tJk=1...K θnew
=
Ke
X
k=1
1
KeθkΣ
new=
Ke
X
k=1
1
Ke(θ
k−θ)(θk−θ)|This algorithm can be interpreted as performing reward-weighted averaging
Cross-Entropy Method ( CEM )
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
θk=1...K∼ N(θ,
Σ)
∀k Jk
=
J(θk)
θk=1...K←sortθk=1...Kw.r.tJk=1...K θnew
=
Ke
X
k=1
1
KeθkΣ
new=
Ke
X
k=1
1
Ke(θ
k−θ)(θk−θ)|This algorithm can be interpreted as performing reward-weighted averaging
Cross-Entropy Method ( CEM )
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
θk=1...K∼ N(θ,
Σ)
∀k Jk
=
J(θk)
θk=1...K←sortθk=1...Kw.r.tJk=1...K θnew
=
Ke
X
k=1
1
KeθkΣ
new=
Ke
X
k=1
1
Ke(θ
k−θ)(θk−θ)|This algorithm can be interpreted as performing reward-weighted averaging
CMA-ES
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
Covariance Matrix Adaptation - Evolutionary Strategy Like CEM , but
Different mapping from cost to probability
More sophisticated method for updating covariance matrix:
pσ←
(1
−cσ)
pσ+
qcσ
(2
−cσ)µ
PΣ
−1θnew−θσ
(1)
σnew
=
σ×exp
cσdσ
kpσk
EkN
(0,
I)k−1 (2)
pΣ←
(1
−cΣ)
pΣ+
hσq
cΣ
(2
−cΣ)µ
Pθnew−θσ
(3)
Σ
new= (1
−c1−cµ) Σ +
c1(p
ΣpTΣ+
δ(hσ)Σ)
+
cµKe
X
k=1
Pk
(θ
k−θ)(θk−θ)|(4)
PI 2 / CEM / CMA-ES - Similarities
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
PI 2 / CEM / CMA-ES are all based on
Exploration: sample from a Gaussian θ
k∼ N (θ, Σ) Parameter update: Reward-Weighted Averaging θ
new= P
Kk=1
P
kθ
kSimilarity striking, as algorithms derived from very different principles!
CEM is a special case of CMA-ES : proof in paper.
(maybe this was already known?)
PI 2 / CEM / CMA-ES - Similarities
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
PI 2 / CEM / CMA-ES are all based on
Exploration: sample from a Gaussian θ
k∼ N (θ, Σ) Parameter update: Reward-Weighted Averaging θ
new= P
Kk=1
P
kθ
kSimilarity striking, as algorithms derived from very different principles!
CEM is a special case of CMA-ES : proof in paper.
(maybe this was already known?)
PI 2 / CEM / CMA-ES - Similarities
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
PI 2 / CEM / CMA-ES are all based on
Exploration: sample from a Gaussian θ
k∼ N (θ, Σ) Parameter update: Reward-Weighted Averaging θ
new= P
Kk=1
P
kθ
kSimilarity striking, as algorithms derived from very different principles!
CEM is a special case of CMA-ES : proof in paper.
(maybe this was already known?)
PI 2 / CEM / CMA-ES - Differences
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
Also some differences Evaluated on following task:
J(τ
t) = δ(t − 0.3) · ((x
t− 0.5) 2 + (y
t− 0.5) 2 ) +
P
Dd=1
(D + 1 − d)(¨ a
t) 2 P
Dd=1
(D + 1 − d) (5)
PI 2 / CEM / CMA-ES - Differences
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
Also some differences Evaluated on following task:
J(τ
t) = δ(t − 0.3) · ((x
t− 0.5) 2 + (y
t− 0.5) 2 ) +
P
Dd=1
(D + 1 − d)(¨ a
t) 2 P
Dd=1
(D + 1 − d) (5)
PI 2 / CEM / CMA-ES - Differences
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
Exploration Noise
PI
2θ
k,t∼ N (θ, Σ)
CEM
θ
k∼ N (θ, Σ)
CMA-ES
θ
k∼ N (θ, Σ)
PI 2 / CEM / CMA-ES - Differences
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
Exploration Noise
PI
2θ
k,t∼ N (θ, Σ)
CEM
θ
k∼ N (θ, Σ)
CMA-ES
θ
k∼ N (θ, Σ)
PI 2 / CEM / CMA-ES - Differences
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
Eliteness
PI
2CEM CMA-ES
PI 2 / CEM / CMA-ES - Differences
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
Eliteness
PI
2CEM CMA-ES
PI 2 / CEM / CMA-ES - Differences
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
Covariance Matrix Updating
PI
2No
CEM
Yes (simple)
CMA-ES
Yes (sophisticated)
PI 2 -CMA
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
Suggests a new algorithm: PI 2 -CMA Constant exploration noise Eliteness measure from PI
2Covariance matrix updating from CEM / CMA-ES
Basically PI 2 , but with adaptive exploration
In PI
2, exploration magnitude must be tuned by hand
Next evaluation demonstrates advantages of PI
2-CMA
PI 2 -CMA
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
Suggests a new algorithm: PI 2 -CMA Constant exploration noise Eliteness measure from PI
2Covariance matrix updating from CEM / CMA-ES Basically PI 2 , but with adaptive exploration
In PI
2, exploration magnitude must be tuned by hand
Next evaluation demonstrates advantages of PI
2-CMA
PI 2 -CMA- Adaptive Exploration
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
PI 2 -CMA- Adaptive Exploration
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
⇒ Exploration magnitude influences convergence speed and exploitation
PI 2 -CMA- Adaptive Exploration
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
PI 2 -CMA- Adaptive Exploration
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
PI 2 -CMA- Adaptive Exploration
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
More recent results (not in paper)
PI 2 -CMA- Adaptive Exploration
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
Summary
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
PI 2 / CEM / CMA-ES have identical update rules: reward-weighted averaging Apply covariance matrix adaptation (as in CEM / CMA-ES ) to PI 2
Novel algorithm PI 2 -CMA
With adaptive exploration (other algorithmic parameters trivial to tune) Future work
Further analysis and theoretical validation
Evaluation on real robots
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
Thank you for your attention!
Questions?
Bibliography
PI2 CEM CMA-ES
Comparison Evaluation Task
PI2-CMA Evaluation
Stulp, F. and Schaal, S. (2011).
Hierarchical reinforcement learning with motion primitives.
In11th IEEE-RAS International Conference on Humanoid Robots.
Stulp, F., Theodorou, E., Buchli, J., and Schaal, S. (2011a).
Learning to grasp under uncertainty.
InProceedings of the IEEE International Conference on Robotics and Automation (ICRA).
Stulp, F., Theodorou, E., Kalakrishnan, M., Pastor, P., Righetti, L., and Schaal, S. (2011b).
Learning motion primitive goals for robust manipulation.
InInternational Conference on Intelligent Robots and Systems (IROS).