Covariance Matrix Adaptation for Direct Reinforcement Learning

(1)

Covariance Matrix Adaptation for Direct Reinforcement Learning

Freek Stulp

Cognitive Robotics Group – ENSTA-ParisTech FLOWERS Team – INRIA/Bordeaux

Olivier Sigaud

Institut des Systmes Intelligents et de Robotique – UPMC CNRS UMR 7222

JFPDA 23.05.2012

(2)

Motivation

PI ² powerful algorithm for direct RL on robots

Tuning exploration parameter in PI

²

Grasping under Uncertainty

IROS’11, ICRA’11

Pick-and-Place Tasks

Humanoids’11

(3)

Motivation

PI ² powerful algorithm for direct RL on robots

Tuning exploration parameter in PI

²

(4)

Motivation

PI ² powerful algorithm for direct RL on robots Tuning exploration parameter in PI

²

Update rules for PI ² and CEM / CMA-ES almost identical

But CEM and CMA-ES additionaly tune exploration parameter automatically

Goals

Analysis and comparison of PI

²

/ CEM / CMA-ES Novel algorithm PI

²

-CMA

Essentially PI

²

with automatic exploration parameter tuning

(5)

Motivation

PI ² powerful algorithm for direct RL on robots Tuning exploration parameter in PI

²

Update rules for PI ² and CEM / CMA-ES almost identical

But CEM and CMA-ES additionaly tune exploration parameter automatically

Goals

Analysis and comparison of PI

²

/ CEM / CMA-ES Novel algorithm PI

²

-CMA

Essentially PI

²

with automatic exploration parameter tuning

(6)

Outline

PI ²

CEM CMA-ES

Reward-Weighted Averaging

Comparison

Evaluation Task

PI ² -CMA

Evaluation

Adaptive Exploration

(7)

Outline

PI ² CEM

CMA-ES

Reward-Weighted Averaging

Comparison

Evaluation Task

PI ² -CMA

Evaluation

Adaptive Exploration

(8)

Outline

PI ² CEM CMA-ES

Reward-Weighted Averaging

Comparison

Evaluation Task

PI ² -CMA

Evaluation

Adaptive Exploration

(9)

Outline

PI ² CEM CMA-ES

Reward-Weighted Averaging

Comparison

Evaluation Task

PI ² -CMA

Evaluation

Adaptive Exploration

(10)

Outline

PI ² CEM CMA-ES

Reward-Weighted Averaging

Comparison

Evaluation Task

PI ² -CMA

Evaluation

Adaptive Exploration

(11)

Outline

PI ² CEM CMA-ES

Reward-Weighted Averaging

Comparison

Evaluation Task

PI ² -CMA

Evaluation

Adaptive Exploration

(12)

Outline

PI ² CEM CMA-ES

Reward-Weighted Averaging

Comparison

Evaluation Task

PI ² -CMA

Evaluation

Adaptive Exploration

(13)

Outline

PI ² CEM CMA-ES

Reward-Weighted Averaging

Comparison

Evaluation Task

PI ² -CMA Evaluation

Adaptive Exploration

(14)

Outline

PI ² CEM CMA-ES

Reward-Weighted Averaging

Comparison

Evaluation Task

PI ² -CMA

Evaluation

Adaptive Exploration

(15)

PI ² - Derivation Outline

PI² CEM CMA-ES

Comparison Evaluation Task

PI²-CMA Evaluation

From first principles of Stochastic Optimal Control Start with Hamilton Jacobi Bellman equations

1

Log transformation + benign assumption

⇒ Linear!

2

Transform PDE to path integral with Feynman-Kac theorem

⇒ Roll-outs!

3

Apply to parameterized policies

⇒ Model free!

⇒ Iterative update rule for θ

(16)

PI ² - Algorithm

PI² CEM CMA-ES

PI²-CMA Evaluation

Words

Images

Formulae

(17)

Input:

DMP with initial parameters θ, cost function J

While (cost not converged) Explore

sample exploration vectors execute DMP

determine cost

Update compute prob. from cost prob.-weighted averaging parameter update

1 τ

¨xt=α(β(g−xt)−xt) +˙ gT t

(

θ

+t,k)

J(τi)=φtN+ ZtN

ti (qt+1 2

θT tRθt)dt

t,k∼ N(0,Σ) θk=θ+k

θ←θ+δθ

P τi,k

= e−1

λJ(τi,k)

PK k=1[e−1

λJ(τi,k) ]

δθti= K X

k=1 h

P

τi,k

Mti,kti,k i

(18)

Input: DMP with initial parameters θ

, cost function J While (cost not converged)

Explore

sample exploration vectors execute DMP

determine cost

Update compute prob. from cost prob.-weighted averaging parameter update

1 τ

(

θ

+t,k)

J(τi)=φtN+ ZtN

ti (qt+1 2θT

tRθt)dt

θ←θ+δθ

P τi,k

= e−1

λJ(τi,k)

PK k=1[e−1

λJ(τi,k) ]

δθti= K X

k=1 h

P

τi,k

Mti,kti,k i

(19)

Input: DMP with initial parameters θ, cost function J

While (cost not converged) Explore

sample exploration vectors execute DMP

determine cost

Update compute prob. from cost prob.-weighted averaging parameter update

1 τ

(

θ

+t,k)

J(τi)=φtN+ ZtN

ti (qt+1 2θT

tRθt)dt

θ←θ+δθ

P

τi,k

= e−1

λJ(τi,k)

PK k=1[e−1

λJ(τi,k) ]

δθti= K X

k=1 h

P

τi,k

Mti,kti,k i

(20)

Input: DMP with initial parameters θ, cost function J

While (cost not converged) Explore

sample exploration vectors execute DMP

determine cost

Update compute prob. from cost prob.-weighted averaging parameter update

1 τ

(

θ

+t,k)

J(τi)=φtN+ ZtN

ti (qt+1 2θT

tRθt)dt

θ←θ+δθ

P

τi,k

= e−1

λJ(τi,k)

PK k=1[e−1

λJ(τi,k) ]

δθti= K X

k=1 h

P

τi,k

Mti,kti,k i

(21)

Input: DMP with initial parameters θ, cost function J While (cost not converged)

Explore

sample exploration vectors execute DMP

determine cost

Update

compute prob. from cost prob.-weighted averaging parameter update

1 τ

(

θ

+t,k)

J(τi)=φtN+ ZtN

ti (qt+1 2θT

tRθt)dt

θ←θ+δθ

P

τi,k

= e−1

λJ(τi,k)

PK k=1[e−1

λJ(τi,k) ]

δθti= K X

k=1 h

P

τi,k

Mti,kti,k i

(22)

Input: DMP with initial parameters θ, cost function J While (cost not converged)

Explore

sample exploration vectors

execute DMP determine cost

Update

compute prob. from cost prob.-weighted averaging parameter update

1 τ

¨xt=α(β(g−xt)−xt) +˙ gT t(θ+t,k)

J(τi)=φtN+ ZtN

ti (qt+1 2

θT tRθt)dt

θ←θ+δθ

P τi,k

= e−1

λJ(τi,k)

PK k=1[e−1

λJ(τi,k) ]

δθti= K X

k=1 h

P

τi,k

Mti,kti,k i

(23)

Input: DMP with initial parameters θ, cost function J While (cost not converged)

Explore

sample exploration vectors execute DMP

determine cost

Update

compute prob. from cost prob.-weighted averaging parameter update

1 τ

J(τi)=φtN+ ZtN

ti (qt+1 2

θT tRθt)dt

θ←θ+δθ

P τi,k

= e−1

λJ(τi,k)

PK k=1[e−1

λJ(τi,k) ]

δθti= K X

k=1 h

P

τi,k

Mti,kti,k i

(24)

Input: DMP with initial parameters θ, cost function J While (cost not converged)

Explore

sample exploration vectors execute DMP

determine cost

Update

compute prob. from cost prob.-weighted averaging parameter update

1 τ

J(τi)=φtN+ ZtN

ti (qt+1 2

θT tRθt)dt

θ←θ+δθ

P τi,k

= e−1

λJ(τi,k)

PK k=1[e−1

λJ(τi,k) ]

δθti= K X

k=1 h

P

τi,k

Mti,kti,k i

(25)

Input: DMP with initial parameters θ, cost function J While (cost not converged)

Explore

sample exploration vectors execute DMP

determine cost

Update

compute prob. from cost prob.-weighted averaging parameter update

1 τ

J(τi)=φtN+ ZtN

ti (qt+1 2

θT tRθt)dt

θ←θ+δθ

P τi,k

= e−1

λJ(τi,k)

PK k=1[e−1

λJ(τi,k) ]

δθti= K X

k=1 h

P

τi,k

Mti,kti,k i

(26)

Input: DMP with initial parameters θ, cost function J While (cost not converged)

Explore

sample exploration vectors execute DMP

determine cost

Update compute prob. from cost

prob.-weighted averaging parameter update

1 τ

J(τi)=φtN+ ZtN

ti (qt+1 2

θT tRθt)dt

θ←θ+δθ

P τi,k

= e−1

λJ(τi,k)

PK k=1[e−1

λJ(τi,k) ]

δθti= K X

k=1 h

P

τi,k

Mti,kti,k i

(27)

Input: DMP with initial parameters θ, cost function J While (cost not converged)

Explore

sample exploration vectors execute DMP

determine cost

Update compute prob. from cost prob.-weighted averaging

parameter update

1 τ

J(τi)=φtN+ ZtN

ti (qt+1 2

θT tRθt)dt

θ←θ+δθ

P τi,k

= e−1

λJ(τi,k)

PK k=1[e−1

λJ(τi,k) ]

δθti= K X

k=1 h

P

τi,k

Mti,kti,k i

(28)

Input: DMP with initial parameters θ, cost function J While (cost not converged)

Explore

sample exploration vectors execute DMP

determine cost

Update compute prob. from cost prob.-weighted averaging parameter update

1 τ

J(τi)=φtN+ ZtN

ti (qt+1 2

θT tRθt)dt

θ←θ+δθ

P τi,k

= e−1

λJ(τi,k)

PK k=1[e−1

λJ(τi,k) ]

δθti= K X

k=1 h

P

τi,k

Mti,kti,k i

(29)

Input: DMP with initial parameters θ, cost function J While (cost not converged)

Explore

sample exploration vectors execute DMP

determine cost

Update compute prob. from cost prob.-weighted averaging parameter update

1 τ

J(τi)=φtN+ ZtN

ti (qt+1 2

θT tRθt)dt

θ←θ+δθ

P τi,k

= e−1

λJ(τi,k)

PK k=1[e−1

λJ(τi,k) ]

δθti= K X

k=1 h

P

τi,k

Mti,kti,k i

(30)

PI ² - Algorithm

PI² CEM CMA-ES

PI²-CMA Evaluation

Advantages

No gradient ⇒ Deals with discontinous noisy cost functions Update δ within convex hull of

k=1...K

⇒ Safe update rule Arbitrary cost functions

Model-free Fast convergence

Only one open parameter: magnitude of exploration (

t,k

∼ N(0, Σ)) Disadvantage

No global convergence guarantees...

Robotics: This is where imitation comes in! π(θ

^imit

) ≈ π(θ

^∗

)

Applied to very high-dimensional, complex tasks...

(31)

Cross-Entropy Method ( CEM )

PI² CEM CMA-ES

PI²-CMA Evaluation

θ_k=1...K∼

N(θ,

Σ)

∀k J_k

=

J(θ_k

)

θ_k=1...K←sortθ_k=1...Kw.r.tJ_k=1...K θ^new

=

Ke

X

k=1

1

K_eθ_k

Σ

^new

=

Ke

X

k=1

1

Ke

(θ

k−θ)(θk−θ)^|

This algorithm can be interpreted as performing reward-weighted averaging

(32)

Cross-Entropy Method ( CEM )

PI² CEM CMA-ES

PI²-CMA Evaluation

θ_k=1...K∼ N(θ,

Σ)

∀k J_k

=

J(θ_k

)

=

Ke

X

k=1

1

K_eθ_k

Σ

^new

=

Ke

X

k=1

1

Ke

(θ

k−θ)(θk−θ)^|

This algorithm can be interpreted as performing reward-weighted averaging

(33)

Cross-Entropy Method ( CEM )

PI² CEM CMA-ES

PI²-CMA Evaluation

θ_k=1...K∼ N(θ,

Σ)

∀k J_k

=

J(θ_k

)

=

Ke

X

k=1

1

K_eθ_k

Σ

^new

=

Ke

X

k=1

1

Ke

(θ

k−θ)(θk−θ)^|

This algorithm can be interpreted as performing reward-weighted averaging

(34)

Cross-Entropy Method ( CEM )

PI² CEM CMA-ES

PI²-CMA Evaluation

θ_k=1...K∼ N(θ,

Σ)

∀k J_k

=

J(θ_k

)

θ_k=1...K←sortθ_k=1...Kw.r.tJ_k=1...K

θ^new

=

Ke

X

k=1

1

K_eθ_k

Σ

^new

=

Ke

X

k=1

1

Ke

(θ

k−θ)(θk−θ)^|

This algorithm can be interpreted as performing reward-weighted averaging

(35)

Cross-Entropy Method ( CEM )

PI² CEM CMA-ES

PI²-CMA Evaluation

θ_k=1...K∼ N(θ,

Σ)

∀k J_k

=

J(θ_k

)

=

Ke

X

k=1

1

K_eθ_k

Σ

^new

=

Ke

X

k=1

1

Ke

(θ

k−θ)(θk−θ)^|

This algorithm can be interpreted as performing reward-weighted averaging

(36)

Cross-Entropy Method ( CEM )

PI² CEM CMA-ES

PI²-CMA Evaluation

θ_k=1...K∼ N(θ,

Σ)

∀k J_k

=

J(θ_k

)

=

Ke

X

k=1

1

K_eθ_k

Σ

^new

=

Ke

X

k=1

1

Ke

(θ

k−θ)(θk−θ)^|

This algorithm can be interpreted as performing reward-weighted averaging

(37)

Cross-Entropy Method ( CEM )

PI² CEM CMA-ES

PI²-CMA Evaluation

θ_k=1...K∼ N(θ,

Σ)

∀k J_k

=

J(θ_k

)

=

Ke

X

k=1

1

K_eθ_k

Σ

^new

=

Ke

X

k=1

1

Ke

(θ

k−θ)(θk−θ)^|

This algorithm can be interpreted as performing reward-weighted averaging

(38)

CMA-ES

PI² CEM CMA-ES

PI²-CMA Evaluation

Covariance Matrix Adaptation - Evolutionary Strategy Like CEM , but

Different mapping from cost to probability

More sophisticated method for updating covariance matrix:

pσ←

(1

−cσ

)

pσ

+

q

cσ

(2

−cσ

)µ

P

Σ

⁻¹θ^new−θ

σ

(1)

σnew

=

σ×

exp

c_σ

dσ

kpσk

EkN

(0,

I)k−

1 (2)

p_Σ←

(1

−c_Σ

)

p_Σ

+

hσ

q

c_Σ

(2

−c_Σ

)µ

_Pθ^new−θ

σ

(3)

Σ

^new

= (1

−c₁−c_µ

) Σ +

c₁

(p

_Σp^T_Σ

+

δ(h_σ

)Σ)

+

c_µ

Ke

X

k=1

P_k

(θ

_k−θ)(θ_k−θ)^|

(4)

(39)

PI ² / CEM / CMA-ES - Similarities

PI² CEM CMA-ES

PI²-CMA Evaluation

PI ² / CEM / CMA-ES are all based on

Exploration: sample from a Gaussian θ

k

∼ N (θ, Σ) Parameter update: Reward-Weighted Averaging θ

^new

= P

K

k=1

P

k

θ

k

Similarity striking, as algorithms derived from very different principles!

CEM is a special case of CMA-ES : proof in paper.

(maybe this was already known?)

(40)

PI ² / CEM / CMA-ES - Similarities

PI² CEM CMA-ES

PI²-CMA Evaluation

PI ² / CEM / CMA-ES are all based on

Exploration: sample from a Gaussian θ

k

∼ N (θ, Σ) Parameter update: Reward-Weighted Averaging θ

^new

= P

K

k=1

P

k

θ

k

Similarity striking, as algorithms derived from very different principles!

CEM is a special case of CMA-ES : proof in paper.

(maybe this was already known?)

(41)

PI ² / CEM / CMA-ES - Similarities

PI² CEM CMA-ES

PI²-CMA Evaluation

PI ² / CEM / CMA-ES are all based on

Exploration: sample from a Gaussian θ

k

∼ N (θ, Σ) Parameter update: Reward-Weighted Averaging θ

^new

= P

K

k=1

P

k

θ

k

Similarity striking, as algorithms derived from very different principles!

CEM is a special case of CMA-ES : proof in paper.

(maybe this was already known?)

(42)

PI ² / CEM / CMA-ES - Differences

PI² CEM CMA-ES

PI²-CMA Evaluation

Also some differences Evaluated on following task:

J(τ

t

) = δ(t − 0.3) · ((x

t

− 0.5) ² + (y

t

− 0.5) ² ) +

P

D

d=1

(D + 1 − d)(¨ a

t

) ² P

D

d=1

(D + 1 − d) (5)

(43)

PI ² / CEM / CMA-ES - Differences

PI² CEM CMA-ES

PI²-CMA Evaluation

Also some differences Evaluated on following task:

J(τ

t

) = δ(t − 0.3) · ((x

t

− 0.5) ² + (y

t

− 0.5) ² ) +

P

D

d=1

(D + 1 − d)(¨ a

t

) ² P

D

d=1

(D + 1 − d) (5)

(44)

PI ² / CEM / CMA-ES - Differences

PI² CEM CMA-ES

PI²-CMA Evaluation

Exploration Noise

PI

²

θ

k,t

∼ N (θ, Σ)

CEM

θ

k

∼ N (θ, Σ)

CMA-ES

θ

k

∼ N (θ, Σ)

(45)

PI ² / CEM / CMA-ES - Differences

PI² CEM CMA-ES

PI²-CMA Evaluation

Exploration Noise

PI

²

θ

k,t

∼ N (θ, Σ)

CEM

θ

k

∼ N (θ, Σ)

CMA-ES

θ

k

∼ N (θ, Σ)

(46)

PI ² / CEM / CMA-ES - Differences

PI² CEM CMA-ES

PI²-CMA Evaluation

Eliteness

PI

²

CEM CMA-ES

(47)

PI ² / CEM / CMA-ES - Differences

PI² CEM CMA-ES

PI²-CMA Evaluation

Eliteness

PI

²

CEM CMA-ES

(48)

PI ² / CEM / CMA-ES - Differences

PI² CEM CMA-ES

PI²-CMA Evaluation

Covariance Matrix Updating

PI

²

No

CEM

Yes (simple)

CMA-ES

Yes (sophisticated)

(49)

PI ² -CMA

PI² CEM CMA-ES

PI²-CMA Evaluation

Suggests a new algorithm: PI ² -CMA Constant exploration noise Eliteness measure from PI

²

Covariance matrix updating from CEM / CMA-ES

Basically PI ² , but with adaptive exploration

In PI

²

, exploration magnitude must be tuned by hand

Next evaluation demonstrates advantages of PI

²

-CMA

(50)

PI ² -CMA

PI² CEM CMA-ES

PI²-CMA Evaluation

Suggests a new algorithm: PI ² -CMA Constant exploration noise Eliteness measure from PI

²

Covariance matrix updating from CEM / CMA-ES Basically PI ² , but with adaptive exploration

In PI

²

, exploration magnitude must be tuned by hand

Next evaluation demonstrates advantages of PI

²

-CMA

(51)

PI ² -CMA- Adaptive Exploration

PI² CEM CMA-ES

PI²-CMA Evaluation

(52)

PI ² -CMA- Adaptive Exploration

PI² CEM CMA-ES

PI²-CMA Evaluation

⇒ Exploration magnitude influences convergence speed and exploitation

(53)

PI ² -CMA- Adaptive Exploration

PI² CEM CMA-ES

PI²-CMA Evaluation

(54)

PI ² -CMA- Adaptive Exploration

PI² CEM CMA-ES

PI²-CMA Evaluation

(55)

PI ² -CMA- Adaptive Exploration

PI² CEM CMA-ES

PI²-CMA Evaluation

More recent results (not in paper)

(56)

PI ² -CMA- Adaptive Exploration

PI² CEM CMA-ES

PI²-CMA Evaluation

(57)

Summary

PI² CEM CMA-ES

PI²-CMA Evaluation

PI ² / CEM / CMA-ES have identical update rules: reward-weighted averaging Apply covariance matrix adaptation (as in CEM / CMA-ES ) to PI ²

Novel algorithm PI ² -CMA

With adaptive exploration (other algorithmic parameters trivial to tune) Future work

Further analysis and theoretical validation

Evaluation on real robots

(58)

PI² CEM CMA-ES

PI²-CMA Evaluation

Thank you for your attention!

Questions?

(59)

Bibliography

PI² CEM CMA-ES

PI²-CMA Evaluation

Stulp, F. and Schaal, S. (2011).

Hierarchical reinforcement learning with motion primitives.

In11th IEEE-RAS International Conference on Humanoid Robots.

Stulp, F., Theodorou, E., Buchli, J., and Schaal, S. (2011a).

Learning to grasp under uncertainty.

InProceedings of the IEEE International Conference on Robotics and Automation (ICRA).

Stulp, F., Theodorou, E., Kalakrishnan, M., Pastor, P., Righetti, L., and Schaal, S. (2011b).

Learning motion primitive goals for robust manipulation.

InInternational Conference on Intelligent Robots and Systems (IROS).

Covariance Matrix Adaptation for Direct Reinforcement Learning