HAL Id: hal-01959479
https://hal.inria.fr/hal-01959479
Submitted on 18 Dec 2018
CMA-ES and Advanced Adaptation Mechanisms
Youhei Akimoto, Nikolaus Hansen
To cite this version:
Youhei Akimoto, Nikolaus Hansen. CMA-ES and Advanced Adaptation Mechanisms. GECCO '18 Companion: Proceedings of the Genetic and Evolutionary Computation Conference Companion, Jul 2018, Kyoto, Japan. hal-01959479
CMA-ES and Advanced Adaptation Mechanisms
Youhei Akimoto 1 & Nikolaus Hansen 2
1. University of Tsukuba, Japan
2. Inria, Research Centre Saclay, France akimoto@cs.tsukuba.ac.jp
nikolaus.hansen@inria.fr
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
GECCO '18 Companion, July 15–19, 2018, Kyoto, Japan
© 2018 Copyright is held by the owner/author(s).
ACM ISBN 978-1-4503-5764-7/18/07.
https://doi.org/10.1145/3205651.3207854
Tutorial: Evolution Strategies and CMA-ES (Covariance Matrix Adaptation)
Anne Auger & Nikolaus Hansen
Inria
Project team TAO
Research Centre Saclay – Île-de-France
University Paris-Sud, LRI (UMR 8623), Bât. 660, 91405 Orsay Cedex, France
GECCO ’14, Jul 12-16 2014, Vancouver, BC, Canada ACM 978-1-4503-2881-4/14/07.
https://www.lri.fr/~hansen/gecco2014-CMA-ES-tutorial.pdf
We are happy to answer questions at any time.
Topics
1. What makes the problem difficult to solve?
2. How does the CMA-ES work?
• Normal Distribution, Rank-Based Recombination
• Step-Size Adaptation
• Covariance Matrix Adaptation
3. What can/should the users do for the CMA-ES to work effectively on their problem?
• Choice of problem formulation and encoding (not covered)
• Choice of initial solution and initial step-size
• Restarts, Increasing Population Size
• Restricted Covariance Matrix
Problem Statement Black Box Optimization and Its Difficulties
Problem Statement
Continuous Domain Search/Optimization
Task: minimize an objective function (fitness function, loss function) in continuous domain
$f : \mathcal{X} \subseteq \mathbb{R}^n \to \mathbb{R},\ x \mapsto f(x)$

Black Box scenario (direct search scenario):
• gradients are not available or not useful
• problem domain specific knowledge is used only within the black box, e.g. within an appropriate encoding

Search costs: number of function evaluations
Problem Statement Black Box Optimization and Its Difficulties
Problem Statement
Continuous Domain Search/Optimization
Goal
• fast convergence to the global optimum . . . or to a robust solution x
• solution x with small function value f(x) with least search cost

These are two conflicting objectives.
Typical Examples
• shape optimization (e.g. using CFD): curve fitting, airfoils
• model calibration: biological, physical
• parameter calibration: controller, plants, images

Problems
• exhaustive search is infeasible
• naive random search takes too long
• deterministic search is not successful / takes too long
Problem Statement Black Box Optimization and Its Difficulties
What Makes a Function Difficult to Solve?
Why stochastic search?
non-linear, non-quadratic, non-convex
on linear and quadratic functions much better search policies are available
ruggedness
non-smooth, discontinuous, multimodal, and/or noisy function
dimensionality (size of search space)
(considerably) larger than three
non-separability
dependencies between the objective variables
ill-conditioning
[Figures: a rugged 1-D landscape; ill-conditioned 2-D level sets with gradient direction vs. Newton direction; non-smooth level sets]
Problem Statement Black Box Optimization and Its Difficulties
Ruggedness
non-smooth, discontinuous, multimodal, and/or noisy
[Figure: a rugged fitness landscape, function value over one coordinate]
cut from a 5-D example, (easily) solvable with evolution strategies
Problem Statement Non-Separable Problems
Separable Problems
Definition (Separable Problem)

A function $f$ is separable if

$\operatorname*{arg\,min}_{(x_1,\dots,x_n)} f(x_1, \dots, x_n) = \Big( \operatorname*{arg\,min}_{x_1} f(x_1, \dots),\ \dots,\ \operatorname*{arg\,min}_{x_n} f(\dots, x_n) \Big)$

⇒ it follows that $f$ can be optimized in a sequence of $n$ independent 1-D optimization processes

Example: additively decomposable functions

$f(x_1, \dots, x_n) = \sum_{i=1}^{n} f_i(x_i)$
[Figure: level sets of the (separable) Rastrigin function]
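The definition above means a separable function can indeed be minimized by $n$ independent 1-D searches. A minimal sketch, where the 1-D minimizer (a plain grid search) and the use of 1-D Rastrigin components are illustrative choices, not part of the slides:

```python
import numpy as np

# Minimizing a separable f(x) = sum_i f_i(x_i) as n independent 1-D problems.
# The 1-D minimizer here is a plain grid search, just to keep it self-contained.
def minimize_separable(f_components, lo=-5.12, hi=5.12, grid=10001):
    xs = np.linspace(lo, hi, grid)
    return np.array([xs[np.argmin(fi(xs))] for fi in f_components])

# 1-D Rastrigin components: f_i(x) = 10 + x^2 - 10 cos(2 pi x)
n = 4
rastrigin_1d = lambda x: 10 + x ** 2 - 10 * np.cos(2 * np.pi * x)
x_opt = minimize_separable([rastrigin_1d] * n)
# each coordinate lands at (approximately) 0, the global minimizer
```

Note that this shortcut is lost the moment the coordinate system is rotated, which is exactly the point of the next slide.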
Problem Statement Non-Separable Problems
Non-Separable Problems
Building a non-separable problem from a separable one: rotating the coordinate system¹ ²

$f : x \mapsto f(x)$ separable
$f : x \mapsto f(Rx)$ non-separable, where $R$ is a rotation matrix

[Figure: level sets of the separable function and of its rotated, non-separable counterpart]
¹ Hansen, Ostermeier, Gawelczyk (1995). On the adaptation of arbitrary normal mutation distributions in evolution strategies: The generating set adaptation. Sixth ICGA, pp. 57–64, Morgan Kaufmann
Problem Statement Ill-Conditioned Problems
Ill-Conditioned Problems
Curvature of level sets
Consider the convex-quadratic function

$f(x) = \frac{1}{2}(x - x^*)^T H (x - x^*) = \frac{1}{2}\sum_i h_{i,i}\,(x_i - x_i^*)^2 + \frac{1}{2}\sum_{i \neq j} h_{i,j}\,(x_i - x_i^*)(x_j - x_j^*)$

where $H$ is the Hessian matrix of $f$, symmetric and positive definite.

gradient direction: $-f'(x)^T$
Newton direction: $-H^{-1} f'(x)^T$

Ill-conditioning means squeezed level sets (high curvature). The condition number equals nine here. Condition numbers up to $10^{10}$ are not unusual in real-world problems.

If $H \approx I$ (small condition number of $H$), first-order information (e.g. the gradient) is sufficient. Otherwise second-order information (estimation of $H^{-1}$) is necessary.
Non-smooth level sets (sharp ridges): similar difficulty, but worse than ill-conditioning.

[Figure: level sets for the 1-norm, scaled 1-norm, and 1/2-norm]
Problem Statement Ill-Conditioned Problems
What Makes a Function Difficult to Solve?
. . . and what can be done
The Problem and Possible Approaches:

Dimensionality: exploiting the problem structure (separability, locality/neighborhood, encoding)

Ill-conditioning: second-order approach (changes the neighborhood metric)

Ruggedness: non-local policy, large sampling width (step-size), as large as possible while preserving a reasonable convergence speed; population-based, stochastic, non-elitistic methods; recombination operator (serves as repair mechanism); restarts; . . .
Topics
1. What makes the problem difficult to solve?
2. How does the CMA-ES work?
• Normal Distribution, Rank-Based Recombination
• Step-Size Adaptation
• Covariance Matrix Adaptation
3. What can/should the users do for the CMA-ES to work effectively on their problem?
• Choice of problem formulation and encoding (not covered)
• Choice of initial solution and initial step-size
• Restarts, Increasing Population Size
Evolution Strategies (ES) A Search Template
Stochastic Search

A black box search template to minimize $f : \mathbb{R}^n \to \mathbb{R}$

Initialize distribution parameters $\theta$, set population size $\lambda \in \mathbb{N}$
While not terminate:
1. Sample distribution $P(x \mid \theta) \to x_1, \dots, x_\lambda \in \mathbb{R}^n$
2. Evaluate $x_1, \dots, x_\lambda$ on $f$
3. Update parameters $\theta \leftarrow F_\theta(\theta, x_1, \dots, x_\lambda, f(x_1), \dots, f(x_\lambda))$

Everything depends on the definition of $P$ and $F_\theta$; deterministic algorithms are covered as well.

In many Evolutionary Algorithms the distribution $P$ is implicitly defined via operators on a population, in particular selection, recombination and mutation.

Natural template for (incremental) Estimation of Distribution Algorithms.
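The template above can be sketched directly in code. This is a hedged toy instantiation, not an algorithm from the slides: the `sample`/`update` closures and the isotropic Gaussian with mean-moves-to-best update are illustrative choices.

```python
import numpy as np

def stochastic_search(f, theta0, sample, update, budget=1000):
    """Generic black-box search template: sample from P(x|theta),
    evaluate on f, update theta from the evaluated samples."""
    theta = theta0
    evals = 0
    while evals < budget:
        X = sample(theta)            # x_1, ..., x_lambda ~ P(x|theta)
        F = [f(x) for x in X]        # evaluate on f
        evals += len(X)
        theta = update(theta, X, F)  # theta <- F_theta(theta, X, f(X))
    return theta

# Toy instantiation: isotropic Gaussian sampling, mean moves to the best sample.
np.random.seed(0)  # for reproducibility of the toy run
lam = 10
sample = lambda theta: [theta + 0.1 * np.random.randn(2) for _ in range(lam)]
update = lambda theta, X, F: X[int(np.argmin(F))]
sphere = lambda x: float(np.sum(x ** 2))
m = stochastic_search(sphere, np.ones(2), sample, update, budget=2000)
```

Evolution strategies, and CMA-ES in particular, fill in the template with a multivariate normal $P$ and a rank-based $F_\theta$.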
Evolution Strategies (ES) A Search Template
The CMA-ES
Input: $m \in \mathbb{R}^n$, $\sigma \in \mathbb{R}_+$, $\lambda$
Initialize: $C = I$, $p_c = 0$, $p_\sigma = 0$
Set: $c_c \approx 4/n$, $c_\sigma \approx 4/n$, $c_1 \approx 2/n^2$, $c_\mu \approx \mu_w/n^2$, $c_1 + c_\mu \le 1$, $d_\sigma \approx 1 + \sqrt{\mu_w/n}$, and $w_{i=1\dots\lambda}$ such that $\mu_w = \frac{1}{\sum_{i=1}^{\mu} w_i^2} \approx 0.3\,\lambda$

While not terminate:
• $x_i = m + \sigma y_i$, $y_i \sim \mathcal{N}_i(0, C)$, for $i = 1, \dots, \lambda$ (sampling)
• $m \leftarrow \sum_{i=1}^{\mu} w_i\, x_{i:\lambda} = m + \sigma y_w$ where $y_w = \sum_{i=1}^{\mu} w_i\, y_{i:\lambda}$ (update mean)
• $p_c \leftarrow (1 - c_c)\, p_c + \mathbb{1}_{\{\|p_\sigma\| < 1.5\sqrt{n}\}} \sqrt{1 - (1 - c_c)^2}\, \sqrt{\mu_w}\, y_w$ (cumulation for $C$)
• $p_\sigma \leftarrow (1 - c_\sigma)\, p_\sigma + \sqrt{1 - (1 - c_\sigma)^2}\, \sqrt{\mu_w}\, C^{-1/2} y_w$ (cumulation for $\sigma$)
• $C \leftarrow (1 - c_1 - c_\mu)\, C + c_1\, p_c p_c^T + c_\mu \sum_{i=1}^{\mu} w_i\, y_{i:\lambda}\, y_{i:\lambda}^T$ (update $C$)
• $\sigma \leftarrow \sigma \times \exp\left(\frac{c_\sigma}{d_\sigma}\left(\frac{\|p_\sigma\|}{E\|\mathcal{N}(0,I)\|} - 1\right)\right)$ (update $\sigma$)

Not covered on this slide: termination, restarts, useful output, boundaries and encoding
An updated variant:

Input: $m \in \mathbb{R}^n$; $\sigma \in \mathbb{R}_+$; $\lambda \in \mathbb{N}_{\ge 2}$, usually $\lambda \ge 5$, default $4 + \lfloor 3 \log n \rfloor$
Set: $c_m = 1$; $c_1 \approx 2/n^2$; $c_\mu \approx \mu_w/n^2$; $c_c \approx 4/n$; $c_\sigma \approx 1/\sqrt{n}$; $d_\sigma \approx 1$; $w_{i=1\dots\lambda}$ decreasing in $i$ with $\sum_{i=1}^{\mu} w_i = 1$, $w_\mu > 0 \ge w_{\mu+1}$, $\mu_w^{-1} := \sum_{i=1}^{\mu} w_i^2 \approx 3/\lambda$
Initialize: $C = I$, $p_c = 0$, $p_\sigma = 0$

While not terminate:
• $x_i = m + \sigma y_i$, where $y_i \sim \mathcal{N}_i(0, C)$ for $i = 1, \dots, \lambda$ (sampling)
• $m \leftarrow m + c_m \sigma y_w$, where $y_w = \sum_{i=1}^{\mu} w_{\mathrm{rk}(i)}\, y_i$ (update mean)
• $p_\sigma \leftarrow (1 - c_\sigma)\, p_\sigma + \sqrt{1 - (1 - c_\sigma)^2}\, \sqrt{\mu_w}\, C^{-1/2} y_w$ (path for $\sigma$)
• $p_c \leftarrow (1 - c_c)\, p_c + \mathbb{1}_{[0,\,2n]}\big(\|p_\sigma\|^2\big)\, \sqrt{1 - (1 - c_c)^2}\, \sqrt{\mu_w}\, y_w$ (path for $C$)
• $\sigma \leftarrow \sigma \times \exp\left(\frac{c_\sigma}{d_\sigma}\left(\frac{\|p_\sigma\|}{E\|\mathcal{N}(0,I)\|} - 1\right)\right)$ (update $\sigma$)
• $C \leftarrow C + c_\mu \sum_{i=1}^{\lambda} w_{\mathrm{rk}(i)} \left(y_i y_i^T - C\right) + c_1 \left(p_c p_c^T - C\right)$ (update $C$)

Not covered: termination, restarts, useful output, search boundaries and encoding, corrections for: positive definiteness guarantee, $p_c$ variance loss, $c_\sigma$ and $d_\sigma$ for large $\lambda$
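The first pseudocode variant can be turned into a compact implementation. The sketch below is a minimal, unofficial rendering of those equations (no restarts, boundaries, or the numerical safeguards the slide explicitly excludes); the log-decreasing weights and the test function are illustrative choices.

```python
import numpy as np

def cmaes(f, m, sigma, budget=3000, seed=1):
    """Minimal (mu/mu_w, lambda)-CMA-ES sketch following the slide's equations."""
    rng = np.random.default_rng(seed)
    n = len(m)
    lam = 4 + int(3 * np.log(n))
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                              # positive, decreasing, sum to 1
    mu_w = 1.0 / np.sum(w ** 2)
    c_sigma, d_sigma = 4.0 / n, 1.0 + np.sqrt(mu_w / n)
    c_c = 4.0 / n
    c_1, c_mu = 2.0 / n ** 2, min(mu_w / n ** 2, 1 - 2.0 / n ** 2)
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))  # E||N(0,I)||
    C = np.eye(n); p_c = np.zeros(n); p_s = np.zeros(n)
    evals = 0
    while evals < budget:
        # sampling: x_i = m + sigma * y_i, y_i ~ N(0, C)
        d2, B = np.linalg.eigh(C)
        C_half = B @ np.diag(np.sqrt(d2)) @ B.T
        C_half_inv = B @ np.diag(1 / np.sqrt(d2)) @ B.T
        Y = (C_half @ rng.standard_normal((n, lam))).T
        X = m + sigma * Y
        F = np.array([f(x) for x in X]); evals += lam
        idx = np.argsort(F)                   # rank-based selection
        y_w = w @ Y[idx[:mu]]
        m = m + sigma * y_w                   # update mean
        # cumulation paths (indicator stalls p_c while p_sigma is long)
        p_s = (1 - c_sigma) * p_s + np.sqrt(1 - (1 - c_sigma) ** 2) \
              * np.sqrt(mu_w) * (C_half_inv @ y_w)
        h = 1.0 if np.linalg.norm(p_s) < 1.5 * np.sqrt(n) else 0.0
        p_c = (1 - c_c) * p_c + h * np.sqrt(1 - (1 - c_c) ** 2) \
              * np.sqrt(mu_w) * y_w
        # covariance matrix and step-size updates
        C = (1 - c_1 - c_mu) * C + c_1 * np.outer(p_c, p_c) \
            + c_mu * sum(wi * np.outer(y, y) for wi, y in zip(w, Y[idx[:mu]]))
        sigma *= np.exp(c_sigma / d_sigma * (np.linalg.norm(p_s) / chi_n - 1))
    return m

# On a 5-D convex-quadratic ellipsoid, the mean approaches the optimum:
ellipsoid = lambda x: sum(10 ** (6 * i / (len(x) - 1)) * x[i] ** 2
                          for i in range(len(x)))
m_final = cmaes(ellipsoid, np.full(5, 3.0), 2.0, budget=8000)
```

For any serious use, the reference implementations (e.g. the `cma` Python package) include the corrections listed as "not covered" above.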
Evolution Strategies (ES) A Search Template
Evolution Strategies
New search points are sampled normally distributed:

$x_i \sim m + \sigma \mathcal{N}_i(0, C)$ for $i = 1, \dots, \lambda$

as perturbations of $m$, where $x_i, m \in \mathbb{R}^n$, $\sigma \in \mathbb{R}_+$, $C \in \mathbb{R}^{n \times n}$, and
• the mean vector $m \in \mathbb{R}^n$ represents the favorite solution
• the so-called step-size $\sigma \in \mathbb{R}_+$ controls the step length
• the covariance matrix $C \in \mathbb{R}^{n \times n}$ determines the shape of the distribution ellipsoid

Here, all new points are sampled with the same parameters. The question remains how to update $m$, $C$, and $\sigma$.
Evolution Strategies (ES) The Normal Distribution
Why Normal Distributions?
1. widely observed in nature, for example as phenotypic traits
2. only stable distribution with finite variance; stable means that the sum of normal variates is again normal: $\mathcal{N}(x, A) + \mathcal{N}(y, B) \sim \mathcal{N}(x + y, A + B)$ — helpful in design and analysis of algorithms; related to the central limit theorem
3. most convenient way to generate isotropic search points; the isotropic distribution does not favor any direction, rotational invariant
4. maximum entropy distribution with finite variance; the least possible assumptions on $f$ in the distribution shape
Evolution Strategies (ES) The Normal Distribution
Normal Distribution
[Figure: probability density of the 1-D standard normal distribution]

[Figure: probability density of a 2-D normal distribution]
Evolution Strategies (ES) The Normal Distribution
The Multi-Variate (n-Dimensional) Normal Distribution
Any multi-variate normal distribution $\mathcal{N}(m, C)$ is uniquely determined by its mean value $m \in \mathbb{R}^n$ and its symmetric positive definite $n \times n$ covariance matrix $C$.

The mean value $m$
• determines the displacement (translation)
• is the value with the largest density (modal value)
• the distribution is symmetric about the distribution mean

[Figure: density of a 2-D normal distribution]

The covariance matrix $C$ determines the shape. Geometrical interpretation: any covariance matrix can be uniquely identified with the iso-density ellipsoid $\{x \in \mathbb{R}^n \mid (x - m)^T C^{-1} (x - m) = 1\}$
Evolution Strategies (ES) The Normal Distribution
. . . any covariance matrix can be uniquely identified with the iso-density ellipsoid $\{x \in \mathbb{R}^n \mid (x - m)^T C^{-1} (x - m) = 1\}$

Lines of Equal Density

$\mathcal{N}(m, \sigma^2 I) \sim m + \sigma \mathcal{N}(0, I)$: one degree of freedom $\sigma$; components are independent standard normally distributed

$\mathcal{N}(m, \sigma^2 D^2) \sim m + \sigma D\, \mathcal{N}(0, I)$: $n$ degrees of freedom; components are independent, scaled

$\mathcal{N}(m, C) \sim m + C^{1/2}\, \mathcal{N}(0, I)$: $(n^2 + n)/2$ degrees of freedom; components are correlated

where $I$ is the identity matrix (isotropic case), $D$ is a diagonal matrix (reasonable for separable problems), and $A \times \mathcal{N}(0, I) \sim \mathcal{N}(0, A A^T)$ holds for all $A$.
Evolution Strategies (ES) The Normal Distribution
Multivariate Normal Distribution and Eigenvalues
For any positive definite symmetric $C$,

$C = d_1^2\, b_1 b_1^T + \cdots + d_n^2\, b_n b_n^T$

• $d_i$: square root of the $i$-th eigenvalue of $C$
• $b_i$: eigenvector of $C$ corresponding to $d_i$

The multivariate normal distribution $\mathcal{N}(m, C)$:

$\mathcal{N}(m, C) \sim m + \mathcal{N}(0, d_1^2)\, b_1 + \cdots + \mathcal{N}(0, d_n^2)\, b_n$
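The eigendecomposition view gives a direct way to sample from $\mathcal{N}(m, C)$: draw independent 1-D normals along the eigenvectors, scaled by the $d_i$. A small sketch with an arbitrarily chosen $m$ and $C$ (the concrete numbers are illustrative):

```python
import numpy as np

# Sampling N(m, C) via the eigendecomposition C = sum_i d_i^2 b_i b_i^T:
# draw independent 1-D normals along the eigenvectors, scaled by d_i.
rng = np.random.default_rng(0)
m = np.array([1.0, -2.0])
C = np.array([[4.0, 1.5],
              [1.5, 2.0]])          # symmetric positive definite
d2, B = np.linalg.eigh(C)           # eigenvalues d_i^2, eigenvectors b_i (columns)
Z = rng.standard_normal((100_000, 2))
X = m + (Z * np.sqrt(d2)) @ B.T     # m + sum_i N(0, d_i^2) b_i
print(np.cov(X.T))                  # empirical covariance, close to C
```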
Evolution Strategies (ES) The Normal Distribution
The (µ/µ, λ)-ES
Non-elitist selection and intermediate (weighted) recombination

Given the $i$-th solution point $x_i = m + \sigma \underbrace{\mathcal{N}_i(0, C)}_{=:\, y_i} = m + \sigma y_i$

Let $x_{i:\lambda}$ be the $i$-th ranked solution point, such that $f(x_{1:\lambda}) \le \cdots \le f(x_{\lambda:\lambda})$.

The new mean reads

$m \leftarrow \sum_{i=1}^{\mu} w_i\, x_{i:\lambda} = m + \sigma \underbrace{\sum_{i=1}^{\mu} w_i\, y_{i:\lambda}}_{=:\, y_w}$

where $w_1 \ge \cdots \ge w_\mu > 0$, $\sum_{i=1}^{\mu} w_i = 1$, $\frac{1}{\sum_{i=1}^{\mu} w_i^2} =: \mu_w \approx \frac{\lambda}{4}$

The best $\mu$ points are selected from the new solutions (non-elitistic).
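The mean update is a weighted average of the $\mu$ best steps. A minimal sketch; the log-decreasing weights are a common choice satisfying the constraints above, not prescribed by the slide:

```python
import numpy as np

# Rank-based weighted recombination for the new mean.
def recombine(m, sigma, Y, fvals, w):
    """m <- m + sigma * sum_i w_i y_{i:lambda}, ranking by f-value."""
    order = np.argsort(fvals)            # x_{1:lambda}, ..., x_{lambda:lambda}
    mu = len(w)
    y_w = w @ Y[order[:mu]]              # weighted mean of the mu best steps
    return m + sigma * y_w

mu = 3
w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
w /= w.sum()                             # w_1 >= ... >= w_mu > 0, sum = 1
mu_w = 1.0 / np.sum(w ** 2)              # variance effective selection mass
```

Only the ranks of the $f$-values enter, which is what makes the invariance property on the next slide possible.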
Evolution Strategies (ES) Invariance
Invariance Under Monotonically Increasing Functions

Rank-based algorithms: the update of all parameters uses only the ranks

$f(x_{1:\lambda}) \le f(x_{2:\lambda}) \le \cdots \le f(x_{\lambda:\lambda})$
$\Longleftrightarrow$
$g(f(x_{1:\lambda})) \le g(f(x_{2:\lambda})) \le \cdots \le g(f(x_{\lambda:\lambda})) \quad \forall g$

where $g$ is strictly monotonically increasing; $g$ preserves ranks³

³ Whitley 1989. The GENITOR algorithm and selection pressure: Why rank-based allocation of reproductive trials is best, ICGA
Evolution Strategies (ES) Invariance
Basic Invariance in Search Space

Translation invariance holds for most optimization algorithms: $f(x) \leftrightarrow f(x - a)$

Identical behavior on $f$ and $f_a$:
$f : x \mapsto f(x)$, $x^{(t=0)} = x_0$
$f_a : x \mapsto f(x - a)$, $x^{(t=0)} = x_0 + a$

No difference can be observed w.r.t. the argument of $f$.
Evolution Strategies (ES) Summary
Summary

On the 20-D sphere function $f(x) = \sum_{i=1}^{N} [x]_i^2$, an ES without adaptation cannot approach the optimum ⇒ adaptation required
Step-Size Control
Evolution Strategies
Recalling

New search points are sampled normally distributed:

$x_i \sim m + \sigma \mathcal{N}_i(0, C)$ for $i = 1, \dots, \lambda$

as perturbations of $m$, where $x_i, m \in \mathbb{R}^n$, $\sigma \in \mathbb{R}_+$, $C \in \mathbb{R}^{n \times n}$, and
• the mean vector $m \in \mathbb{R}^n$ represents the favorite solution, with $m \leftarrow \sum_{i=1}^{\mu} w_i\, x_{i:\lambda}$
• the so-called step-size $\sigma \in \mathbb{R}_+$ controls the step length
• the covariance matrix $C \in \mathbb{R}^{n \times n}$ determines the shape of the distribution ellipsoid

The remaining question is how to update $\sigma$ and $C$.
Step-Size Control Why Step-Size Control
Methods for Step-Size Control

• 1/5-th success rule^{a,b}, often applied with "+"-selection: increase the step-size if more than 20% of the new solutions are successful, decrease otherwise
• σ-self-adaptation^{c}, applied with ","-selection: mutation is applied to the step-size, and the better one, according to the objective function value, is selected; simplified "global" self-adaptation
• path length control^{d} (Cumulative Step-size Adaptation, CSA)^{e}: self-adaptation derandomized and non-localized

a: Rechenberg 1973, Evolutionsstrategie, Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog
b: Schumer and Steiglitz 1968. Adaptive step size random search. IEEE TAC
c: Schwefel 1981, Numerical Optimization of Computer Models, Wiley
d: Hansen & Ostermeier 2001, Completely Derandomized Self-Adaptation in Evolution Strategies, Evol. Comput. 9(2)
e: Ostermeier et al 1994, Step-size adaptation based on non-local use of selection information, PPSN IV
Step-Size Control Path Length Control (CSA)
Path Length Control (CSA)
The Concept of Cumulative Step-Size Adaptation
x
i= m + y
im m + y
wMeasure the length of the evolution path
the pathway of the mean vector m in the generation sequence
decrease + +
increase loosely speaking steps are
perpendicular under random selection (in expectation)
Step-Size Control Path Length Control (CSA)
Path Length Control (CSA)
The Equations

Initialize $m \in \mathbb{R}^n$, $\sigma \in \mathbb{R}_+$, evolution path $p_\sigma = 0$; set $c_\sigma \approx 4/n$, $d_\sigma \approx 1$.

$m \leftarrow m + \sigma y_w$ where $y_w = \sum_{i=1}^{\mu} w_i\, y_{i:\lambda}$ (update mean)

$p_\sigma \leftarrow (1 - c_\sigma)\, p_\sigma + \underbrace{\sqrt{1 - (1 - c_\sigma)^2}}_{\text{accounts for } 1 - c_\sigma}\ \underbrace{\sqrt{\mu_w}}_{\text{accounts for } w_i}\ y_w$

$\sigma \leftarrow \sigma \times \exp\left(\frac{c_\sigma}{d_\sigma}\left(\frac{\|p_\sigma\|}{E\|\mathcal{N}(0, I)\|} - 1\right)\right)$ (update step-size)

where the exponential factor is $> 1 \iff \|p_\sigma\|$ is greater than its expectation.
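These equations can be exercised in isolation, with the covariance matrix held fixed at $C = I$. A minimal sketch (the equal recombination weights and the constants are illustrative choices within the slide's recommendations):

```python
import numpy as np

# A minimal sketch of cumulative step-size adaptation (CSA) inside a
# (mu/mu_w, lambda)-ES with fixed C = I, following the slide's equations.
def csa_es(f, m, sigma, lam=10, iters=300, seed=2):
    rng = np.random.default_rng(seed)
    n = len(m)
    mu = lam // 2
    w = np.full(mu, 1.0 / mu)                # equal recombination weights
    mu_w = 1.0 / np.sum(w ** 2)
    c_sigma, d_sigma = 4.0 / n, 1.0
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))  # E||N(0,I)||
    p_sigma = np.zeros(n)
    for _ in range(iters):
        Y = rng.standard_normal((lam, n))
        F = [f(m + sigma * y) for y in Y]
        y_w = w @ Y[np.argsort(F)[:mu]]
        m = m + sigma * y_w                  # update mean
        p_sigma = (1 - c_sigma) * p_sigma \
                  + np.sqrt(1 - (1 - c_sigma) ** 2) * np.sqrt(mu_w) * y_w
        # lengthen/shorten sigma depending on the path length
        sigma *= np.exp(c_sigma / d_sigma
                        * (np.linalg.norm(p_sigma) / chi_n - 1))
    return m, sigma

sphere = lambda x: float(np.sum(x ** 2))
m_end, sigma_end = csa_es(sphere, np.full(10, 1.0), 0.5)
```

On the sphere this yields the log-linear convergence shown on the next slide, with $\sigma$ shrinking proportionally to the distance from the optimum.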
Step-Size Control Path Length Control (CSA)
(5/5, 10)-CSA-ES, default parameters

[Figure: $\|m - x^*\|$ over function evaluations for $f(x) = \sum_{i=1}^{n} x_i^2$, initialized in $[0.2, 0.8]^n$, for $n = 30$]
Step-Size Control Alternatives to CSA
Two-Point Step-Size Adaptation (TPA)

With $\|x\|_C := \sqrt{x^T C^{-1} x}$, sample a pair of symmetric points along the previous mean shift:

$x_{1/2} = m^{(g)} \pm \sigma^{(g)}\, \|\mathcal{N}(0, I)\|\ \frac{m^{(g)} - m^{(g-1)}}{\|m^{(g)} - m^{(g-1)}\|_{C^{(g)}}}$

Compare the ranking of $x_1$ and $x_2$ among the current population:

$s^{(g+1)} = (1 - c_s)\, s^{(g)} + c_s \underbrace{\frac{\mathrm{rank}(x_2) - \mathrm{rank}(x_1)}{\lambda - 1}}_{>\,0 \text{ if the previous step still produces a promising solution}}$

Update the step-size:

$\sigma^{(g+1)} = \sigma^{(g)} \exp\left(\frac{s^{(g+1)}}{d_\sigma}\right)$

[Hansen, 2008] Hansen, N. (2008). CMA-ES with two-point step-size adaptation. [Research report] RR-6527, 2008. inria-00276854v5.
[Hansen et al., 2014] Hansen, N., Atamna, A., and Auger, A. (2014). How to assess step-size adaptation mechanisms in randomised search.
Step-Size Control Alternatives to CSA
On Sphere with Low Effective Dimension

[Figure: f(x) over function evaluations for CSA (left) and TPA (right), for (N, M) = (10, 10), (100, 100), and (100, 10)]

On a function with low effective dimension, $f(x) = \sum_{i=1}^{M} [x]_i^2$, $x \in \mathbb{R}^N$, $M \le N$: the remaining $N - M$ variables do not affect the function value.
Step-Size Control Alternatives to CSA
Alternatives: Success-Based Step-Size Control

Generalizations of the 1/5th-success-rule for non-elitist and multi-recombinant ES, comparing the fitness distributions of current and previous iterations:
• Median Success Rule [Ait Elhara et al., 2013]
• Population Success Rule [Loshchilov, 2014]

Both control a success probability.

An advantage over CSA and TPA: cheap computation, as it depends only on $\lambda$; cf. CSA and TPA require a computation of $C^{-1/2} x$ and $C^{-1} x$, respectively.

[Ait Elhara et al., 2013] Ait Elhara, O., Auger, A., and Hansen, N. (2013). A median success rule for non-elitist evolution strategies: Study of feasibility. In Proc. of the GECCO, pages 415–422.
[Loshchilov, 2014] Loshchilov, I. (2014). A computationally efficient limited memory CMA-ES for large scale optimization. In Proc. of the GECCO.
Step-Size Control
Step-Size Control: Summary

Why step-size control? To achieve linear convergence at a near-optimal rate.

Cumulative Step-Size Adaptation (CSA)
• efficient and robust
• inefficient on functions with (many) ineffective axes

Alternative step-size adaptation mechanisms
• Two-Point Step-Size Adaptation
• Median Success Rule, Population Success Rule

The effective adaptation of the overall population diversity seems yet to pose open questions, in particular with recombination or without entire control over the realized distribution.^a

a: Hansen et al. How to Assess Step-Size Adaptation Mechanisms in Randomised Search. PPSN 2014
Step-Size Control Summary
Step-Size Control: Summary

On the 20-D TwoAxes function $f(x) = \sum_{i=1}^{N/2} [Rx]_i^2 + a^2 \sum_{i=N/2+1}^{N} [Rx]_i^2$, with $R$ orthogonal: the convergence speed of the CSA-ES becomes lower as the function becomes ill-conditioned ($a^2$ becomes greater) ⇒ covariance matrix adaptation required
Covariance Matrix Adaptation (CMA)
Evolution Strategies
Recalling

New search points are sampled normally distributed:

$x_i \sim m + \sigma \mathcal{N}_i(0, C)$ for $i = 1, \dots, \lambda$

as perturbations of $m$, where $x_i, m \in \mathbb{R}^n$, $\sigma \in \mathbb{R}_+$, $C \in \mathbb{R}^{n \times n}$, and
• the mean vector $m \in \mathbb{R}^n$ represents the favorite solution
• the so-called step-size $\sigma \in \mathbb{R}_+$ controls the step length
• the covariance matrix $C \in \mathbb{R}^{n \times n}$ determines the shape of the distribution ellipsoid

The remaining question is how to update $C$.
Covariance Matrix Adaptation (CMA) Covariance Matrix Rank-One Update

Covariance Matrix Adaptation
Rank-One Update

$m \leftarrow m + \sigma y_w$, $y_w = \sum_{i=1}^{\mu} w_i\, y_{i:\lambda}$, $y_i \sim \mathcal{N}_i(0, C)$

The slide sequence illustrates the update step by step:
• initial distribution, $C = I$
• $y_w$, movement of the population mean $m$ (disregarding $\sigma$)
• mixture of distribution $C$ and step $y_w$: $C \leftarrow 0.8 \times C + 0.2 \times y_w y_w^T$
• new distribution (disregarding $\sigma$)

The ruling principle: the adaptation increases the likelihood of successful steps, $y_w$, to appear again.

Another viewpoint: the adaptation follows a natural gradient approximation of the expected fitness.
Covariance Matrix Adaptation
Rank-One Update

Initialize $m \in \mathbb{R}^n$ and $C = I$, set $\sigma = 1$, learning rate $c_{\mathrm{cov}} \approx 2/n^2$
While not terminate:
• $x_i = m + \sigma y_i$, $y_i \sim \mathcal{N}_i(0, C)$
• $m \leftarrow m + \sigma y_w$ where $y_w = \sum_{i=1}^{\mu} w_i\, y_{i:\lambda}$
• $C \leftarrow (1 - c_{\mathrm{cov}})\, C + c_{\mathrm{cov}}\, \mu_w \underbrace{y_w y_w^T}_{\text{rank-one}}$

where $\mu_w = \frac{1}{\sum_{i=1}^{\mu} w_i^2} \ge 1$

The rank-one update has been found independently in several domains.⁶ ⁷

⁶ Kjellström & Taxén 1981. Stochastic Optimization in System Design, IEEE TCS
⁷ Hansen & Ostermeier 1996. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation, ICEC
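The effect of the rank-one update is easy to see in a toy run: repeatedly mixing in the outer product of a (here artificially fixed) step elongates the distribution along that step. This is a hedged demonstration, not a full algorithm, and it sets $\mu_w = 1$ for simplicity:

```python
import numpy as np

# Rank-one update demo: repeatedly mixing in the outer product y_w y_w^T
# elongates the distribution along the direction of the (here: fixed) step.
n = 5
c_cov = 2.0 / n ** 2
C = np.eye(n)
y_w = np.zeros(n); y_w[0] = 2.0          # pretend selection keeps favoring e_1
for _ in range(200):
    C = (1 - c_cov) * C + c_cov * np.outer(y_w, y_w)
eigvals = np.linalg.eigvalsh(C)
# variance along e_1 grows toward |y_w|^2 = 4, the other axes shrink toward 0
print(eigvals)
```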
Covariance Matrix Adaptation (CMA) Covariance Matrix Rank-One Update
$C \leftarrow (1 - c_{\mathrm{cov}})\, C + c_{\mathrm{cov}}\, \mu_w\, y_w y_w^T$

The covariance matrix adaptation
• learns all pairwise dependencies between variables: off-diagonal entries in the covariance matrix reflect the dependencies
• conducts a principal component analysis (PCA) of steps $y_w$, sequentially in time and space: eigenvectors of the covariance matrix $C$ are the principal components / the principal axes of the mutation ellipsoid
• learns a new rotated problem representation: components are independent (only) in the new representation
• learns a new (Mahalanobis) metric: variable metric method
• approximates the inverse Hessian on quadratic functions: transformation into the sphere function
• for $\mu = 1$: conducts a natural gradient ascent on the distribution $\mathcal{N}$, entirely independent of the given coordinate system
Covariance Matrix Adaptation (CMA) Covariance Matrix Rank-One Update
Invariance Under Rigid Search Space Transformation
−3 −2 −1 0 1 2 3
−3
−2
−1 0 1 2 3
for example, invariance under search space rotation (separable … non-separable)
f-level sets in dimension 2
f = h Rast f = h
Covariance Matrix Adaptation (CMA) Covariance Matrix Rank-One Update
Invariance Under Rigid Search Space Transformation
−3 −2 −1 0 1 2 3
−3
−2
−1 0 1 2 3
for example, invariance under search space rotation (separable … non-separable)
f-level sets in dimension 2
f = h Rast ¶ R f = h ¶ R
Covariance Matrix Adaptation (CMA) Cumulation—the Evolution Path
Cumulation
The Evolution Path

Evolution Path: conceptually, the evolution path is the search path the strategy takes over a number of generation steps. It can be expressed as a sum of consecutive steps of the mean $m$.

An exponentially weighted sum of steps $y_w$ is used:

$p_c \propto \sum_{i=0}^{g} \underbrace{(1 - c_c)^{g-i}}_{\text{exponentially fading weights}}\ y_w^{(i)}$

The recursive construction of the evolution path (cumulation):

$p_c \leftarrow \underbrace{(1 - c_c)}_{\text{decay factor}}\, p_c + \underbrace{\sqrt{1 - (1 - c_c)^2}\ \sqrt{\mu_w}}_{\text{normalization factor}}\ \underbrace{y_w}_{\text{input}\ =\ \frac{m - m_{\mathrm{old}}}{\sigma}}$

where $\mu_w = \frac{1}{\sum w_i^2}$, $c_c \ll 1$. History information is accumulated in the evolution path.
Covariance Matrix Adaptation (CMA) Cumulation—the Evolution Path
“Cumulation” is a widely used technique and also known as
• exponential smoothing in time series forecasting
• exponentially weighted moving average
• iterate averaging in stochastic approximation
• momentum in the back-propagation algorithm for ANNs
• . . .

“Cumulation” conducts a low-pass filtering, but there is more to it . . . why?
Covariance Matrix Adaptation (CMA) Cumulation—the Evolution Path
Cumulation
Utilizing the Evolution Path

We used $y_w y_w^T$ for updating $C$. Because $y_w y_w^T = (-y_w)(-y_w)^T$, the sign of $y_w$ is lost.

The sign information (signifying correlation between steps) is (re-)introduced by using the evolution path:

$p_c \leftarrow \underbrace{(1 - c_c)}_{\text{decay factor}}\, p_c + \underbrace{\sqrt{1 - (1 - c_c)^2}\ \sqrt{\mu_w}}_{\text{normalization factor}}\ y_w$

$C \leftarrow (1 - c_{\mathrm{cov}})\, C + c_{\mathrm{cov}} \underbrace{p_c\, p_c^T}_{\text{rank-one}}$

where $\mu_w = \frac{1}{\sum w_i^2}$, $c_{\mathrm{cov}} \ll c_c \ll 1$ such that $1/c_c$ is the “backward time horizon”.
Using an evolution path for the rank-one update of the covariance matrix reduces the number of function evaluations to adapt to a straight ridge from about O(n^2) to O(n). [a]

[a] Hansen & Auger 2013. Principled design of continuous stochastic search: From theory to practice.

[Figure: number of f-evaluations divided by dimension, versus dimension, on the cigar function f(x) = x_1^2 + 10^6 \sum_{i=2}^{n} x_i^2, for c_c = 1 (no cumulation), c_c = 1/\sqrt{n}, and c_c = 1/n.]

The overall model complexity is n^2, but important parts of the model can be learned in time of order n.
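For reference, the cigar function from the experiment above is a direct transcription of the formula (the function name is my own):

```python
import numpy as np

def cigar(x):
    """f(x) = x_1^2 + 10^6 * sum_{i=2}^n x_i^2 -- one cheap axis, all others penalized."""
    x = np.asarray(x, dtype=float)
    return x[0] ** 2 + 1e6 * np.sum(x[1:] ** 2)

print(cigar([1.0, 0.0, 0.0]))  # 1.0: moving along x_1 is cheap
print(cigar([0.0, 1.0, 0.0]))  # 1000000.0: the other axes are penalized
```

The "straight ridge" structure is visible here: progress requires learning the single long axis, which the evolution path accelerates.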
Covariance Matrix Adaptation (CMA) — Covariance Matrix Rank-µ Update

Rank-µ Update

x_i = m + \sigma y_i, \quad y_i \sim \mathcal{N}_i(0, C), \qquad m \leftarrow m + \sigma y_w, \quad y_w = \sum_{i=1}^{\mu} w_i \, y_{i:\lambda}

The rank-µ update extends the update rule for large population sizes, using µ > 1 vectors to update C at each generation step. The weighted empirical covariance matrix

C_\mu = \sum_{i=1}^{\mu} w_i \, y_{i:\lambda} \, y_{i:\lambda}^T

computes a weighted mean of the outer products of the best µ steps and has rank min(µ, n) with probability one. (With µ = λ, weights can be negative. [10])

The rank-µ update then reads

C \leftarrow (1 - c_{\text{cov}}) \, C + c_{\text{cov}} \, C_\mu

where c_{\text{cov}} \approx \mu_w / n^2 and c_{\text{cov}} \leq 1.

[10] Jastrebski and Arnold (2006). Improving evolution strategies through active covariance matrix adaptation. CEC.
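A sketch of the rank-µ update (illustrative code with uniform positive weights; the names and values are my own, not reference settings):

```python
import numpy as np

def rank_mu_update(C, y_best, weights, c_cov):
    """C <- (1 - c_cov) C + c_cov * sum_i w_i y_{i:lambda} y_{i:lambda}^T."""
    C_mu = sum(w * np.outer(y, y) for w, y in zip(weights, y_best))
    return (1 - c_cov) * C + c_cov * C_mu

rng = np.random.default_rng(1)
n, mu = 3, 10
# suppose the mu best steps all point roughly along the first axis
y_best = [np.array([1.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(n) for _ in range(mu)]
w = np.full(mu, 1.0 / mu)  # uniform weights summing to one
C = rank_mu_update(np.eye(n), y_best, w, c_cov=0.5)
# the variance along the first axis grows relative to the others
```

With µ vectors per generation, each update carries much more information than a single rank-one term, which is what permits the larger learning rate below.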
[Figure: one generation of the rank-µ update with c_cov = 1. Left: sampling of λ = 150 solutions, x_i = m + y_i, y_i ∼ N(0, C), where C = I and σ = 1. Middle: calculating C_µ = (1/µ) Σ y_{i:λ} y_{i:λ}^T, i.e. C ← (1 − 1)·C + 1·C_µ, where µ = 50 and w_1 = … = w_µ = 1/µ. Right: the new distribution, with m_new ← m + (1/µ) Σ y_{i:λ}.]
Rank-µ CMA versus Estimation of Multivariate Normal Algorithm EMNA_global [11]

[Figure: both algorithms sample λ = 150 solutions (dots) as x_i = m_old + y_i, y_i ∼ N(0, C), calculate C from the µ = 50 best solutions, and set m_new = m_old + (1/µ) Σ y_{i:λ}. They estimate C differently:

rank-µ CMA: C \leftarrow \frac{1}{\mu} \sum (x_{i:\lambda} - m_{\text{old}})(x_{i:\lambda} - m_{\text{old}})^T — a PCA of steps

EMNA_global: C \leftarrow \frac{1}{\mu} \sum (x_{i:\lambda} - m_{\text{new}})(x_{i:\lambda} - m_{\text{new}})^T — a PCA of points]

m_new is the minimizer for the variances when calculating C.

[11] Hansen, N. (2006). The CMA Evolution Strategy: A Comparing Review. In J.A. Lozano, P. Larrañaga, I. Inza and E. Bengoetxea (Eds.), Towards a New Evolutionary Computation: Advances in Estimation of Distribution Algorithms, pp. 75–102.
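The difference between the two estimators can be made concrete with a toy sketch (data and names are illustrative, not from the tutorial): after selection on a slope, the selected points are biased away from m_old.

```python
import numpy as np

rng = np.random.default_rng(2)
m_old = np.zeros(2)
# mu selected solutions, shifted away from m_old (as after selection on a slope)
x_sel = np.array([2.0, 0.0]) + rng.standard_normal((50, 2))
m_new = x_sel.mean(axis=0)

# rank-mu CMA: outer products of *steps* from the old mean
C_steps = (x_sel - m_old).T @ (x_sel - m_old) / len(x_sel)
# EMNA_global: outer products of *points* around the new mean
C_points = (x_sel - m_new).T @ (x_sel - m_new) / len(x_sel)

# The steps-based estimate keeps variance along the direction of progress;
# the points-based estimate shrinks it, since m_new minimizes these variances.
print(C_steps[0, 0] > C_points[0, 0])  # True
```

Indeed C_steps = C_points + (m_new − m_old)(m_new − m_old)^T, so the steps-based estimate is always at least as large along the direction of the mean shift, which prevents premature variance loss along the gradient direction.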
The rank-µ update
increases the possible learning rate in large populations, roughly from 2/n^2 to \mu_w/n^2, and
can reduce the number of necessary generations roughly from O(n^2) to O(n) [12], given \mu_w \propto \lambda \propto n.
Therefore the rank-µ update is the primary mechanism whenever a large population size is used, say \lambda \geq 3n + 10.

The rank-one update
uses the evolution path and reduces the number of necessary function evaluations to learn straight ridges from O(n^2) to O(n).

Rank-one update and rank-µ update can be combined.

[12] Hansen, Müller, and Koumoutsakos (2003). Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1), pp. 1–18.
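The combination can be sketched as a single update with two learning rates (a standard formulation in modern notation; the names c1 and c_mu and the values below are my own choices, not reference settings):

```python
import numpy as np

def combined_update(C, p_c, y_best, weights, c1=0.02, c_mu=0.1):
    """C <- (1 - c1 - c_mu) C + c1 p_c p_c^T + c_mu sum_i w_i y_{i:lambda} y_{i:lambda}^T."""
    C_mu = sum(w * np.outer(y, y) for w, y in zip(weights, y_best))
    return (1 - c1 - c_mu) * C + c1 * np.outer(p_c, p_c) + c_mu * C_mu

rng = np.random.default_rng(3)
n, mu = 3, 5
p_c = np.array([3.0, 0.0, 0.0])                 # an accumulated evolution path
y_best = [rng.standard_normal(n) for _ in range(mu)]
w = np.full(mu, 1.0 / mu)
C = combined_update(np.eye(n), p_c, y_best, w)
```

The rank-one term exploits correlations between generations via p_c, while the rank-µ term exploits the µ > 1 samples within one generation.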
[Figure: three example runs: rank-one update, rank-µ update, and hybrid (combined) update. Each run shows best f-value, mean coordinates, step-size, and principal axis lengths, plotted over the course of the run (x-axis 0 to 15000).]
The test function is f_{\text{TwoAxes}}(x) = \sum_{i=1}^{5} x_i^2 + 10^6 \sum_{i=6}^{10} x_i^2