

From the document EVOLUTIONARY OPTIMIZATION ALGORITHMS (Pages 152-156)


6.1 THE (1+1) EVOLUTION STRATEGY

Suppose that f(x) is a function of a real random vector x, and that we want to maximize the fitness f(x). The original ES algorithm operated by initializing a single candidate solution and evaluating its fitness. The candidate solution was then mutated, and the mutated individual's fitness was evaluated. The better of the two candidate solutions (parent and child) formed the starting point for the next generation. The original ES was designed for discrete problems, used small mutations in a discrete search space, and thus tended to get trapped in a local optimum. The original ES was therefore modified to use continuous mutations in continuous search spaces [Beyer and Schwefel, 2002]. This algorithm is summarized in Figure 6.1.

Figure 6.1 is called a (1+1)-ES because each generation consists of 1 parent and 1 child, and the best individual is chosen from the parent and child as the individual in the next generation. The (1+1)-ES, also called the two-membered ES, is very similar to the hill climbing strategies of Section 2.6. It is also the same as an EP with a population size of 1 (see Section 5.1). The following theorem guarantees that the (1+1)-ES eventually finds the global maximum of f(x).

Theorem 6.1 If f(x) is a continuous function defined on a closed domain with a global optimum f*(x), then

lim_{t→∞} f(x) = f*(x)   (6.1)

where t is the generation number.


Initialize the non-negative mutation variance σ²
x₀ ← randomly generated individual
While not (termination criterion)
    Generate a random vector r with rᵢ ~ N(0, σ²) for i ∈ [1, n]
    x₁ ← x₀ + r
    If x₁ is better than x₀ then
        x₀ ← x₁
    End if
Next generation

Figure 6.1 The above pseudocode outlines the (1+1) evolution strategy, where n is the problem dimension, and x₁ is equal to x₀ but with each element mutated.
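The loop in Figure 6.1 can be sketched in a few lines of Python. This is a minimal illustration, not the book's reference implementation; the function name and the choice of a standard normal initial individual are assumptions for the example.

```python
import numpy as np

def one_plus_one_es(f, n, sigma=0.5, max_gens=1000, rng=None):
    """Minimal (1+1)-ES sketch: maximize f over R^n using isotropic
    Gaussian mutations with fixed variance sigma**2 (Figure 6.1)."""
    rng = np.random.default_rng(rng)
    x0 = rng.standard_normal(n)          # randomly generated individual
    for _ in range(max_gens):
        r = rng.normal(0.0, sigma, n)    # r_i ~ N(0, sigma^2) for i in [1, n]
        x1 = x0 + r                      # mutated child
        if f(x1) > f(x0):                # keep the better of parent and child
            x0 = x1
    return x0

# Example: maximize f(x) = -||x||^2, whose global maximum is at the origin
best = one_plus_one_es(lambda x: -np.sum(x**2), n=3, rng=0)
```

With a fixed σ the search stalls at a resolution set by the mutation size, which motivates the σ adaptation discussed next.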

Theorem 6.1 is proven in [Devroye, 1978], [Rudolph, 1992], [Back, 1996, Theorem 7], and [Michalewicz, 1996]. It also agrees with our intuition. Since we use random mutations to explore the search space, given enough time, we will eventually explore the entire search space (to within computer precision) and find the global optimum.

The σ² variance in the (1+1)-ES of Figure 6.1 is a tuning parameter, and the value of σ involves a tradeoff.

• σ should be large enough so that mutations can reach all areas of the search space in a reasonable period of time.

• σ should be small enough so that the search can find the optimal solution within the user's desired resolution.

It may be appropriate to decrease σ as the ES progresses. At the beginning of the ES, large values of σ will allow the ES to conduct a coarse-grained search and get close to the optimal solution. Toward the end of the ES, smaller values of σ will allow the ES to fine-tune its candidate solution and converge to the optimal solution with better resolution.
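The decrease-σ-over-time idea above can be sketched with a simple geometric schedule. The exponential form and the decay rate are assumptions for illustration; the text prescribes no specific formula.

```python
def sigma_schedule(sigma0, decay, t):
    """One simple way to shrink the mutation standard deviation over
    generations: exponential decay from a coarse initial sigma0.
    The schedule itself is an assumption; the text gives no formula."""
    return sigma0 * decay**t

# Coarse-grained search early, fine-grained search late:
early = sigma_schedule(1.0, 0.99, 0)     # sigma at generation 0
late = sigma_schedule(1.0, 0.99, 500)    # sigma at generation 500
```

A fixed schedule like this is open-loop; the 1/5 rule below adapts σ from feedback instead.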

The mutation in Figure 6.1 is called isotropic because the mutation of each element of x₀ has the same variance. In practice we might want to implement non-isotropic mutations as follows:

x₁ ← x₀ + N(0, Σ)   (6.2)

where Σ is an n × n diagonal matrix with diagonal elements σᵢ² for i ∈ [1, n]. This means that each element of x₀ is mutated with a different variance. We would assign each σᵢ independently, depending on the domain of the i-th element of x and the shape of the objective function in that dimension.
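Because Σ is diagonal, the non-isotropic mutation of Equation (6.2) reduces to drawing each component with its own standard deviation; no matrix operations are needed. A small sketch (the function name is assumed):

```python
import numpy as np

def nonisotropic_mutate(x0, sigmas, rng=None):
    """Non-isotropic mutation per Equation (6.2): x1 = x0 + N(0, Sigma),
    where Sigma = diag(sigma_1^2, ..., sigma_n^2). Each element of x0
    is perturbed with its own standard deviation sigma_i."""
    rng = np.random.default_rng(rng)
    sigmas = np.asarray(sigmas, dtype=float)
    return x0 + rng.normal(0.0, sigmas)  # elementwise scale = sigma_i

# Example: a wide search in the first dimension, a fine search in the second
x1 = nonisotropic_mutate(np.zeros(2), sigmas=[10.0, 0.01], rng=0)
```

This is useful when the components of x have very different domains, e.g. one variable ranging over hundreds and another over hundredths.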

Rechenberg analyzed the (1+1)-ES for some simple optimization problems and concluded that 20% of mutations should result in improvements in the fitness function f(x) [Rechenberg, 1973], [Back, 1996, Section 2.1.7]. We reproduce some of his analysis in Section 6.2. If the mutation success rate is higher than 20%, then the mutations are too small, which leads to small improvements, which results in long convergence times. If the mutation success rate is lower than 20%, then the mutations are too large, which leads to large but infrequent improvements, and this also results in long convergence times. Rechenberg's work led to the 1/5 rule:

In the (1+1)-ES, if the ratio of successful mutations to total mutations is less than 1/5, then the standard deviation σ should be decreased. If the ratio is more than 1/5, then the standard deviation should be increased.

This rule really applies only to a couple of specific objective functions, as we will see in Section 6.2, but it has proven to be a useful guideline for general ES implementations. The 1/5 rule raises the question: by how much should the standard deviation be decreased or increased? Schwefel theoretically derived the factor by which to decrease or increase σ:

standard deviation decrease: σ ← cσ
standard deviation increase: σ ← σ/c   (6.3)

where c = 0.817. These results lead to the adaptive (1+1)-ES shown in Figure 6.2. The adaptive (1+1)-ES requires that we define a moving window length G. We want G to be large enough to get a good idea of the success rate of the ES mutations, but not so large that the adaptation of σ is sluggish. The recommendation in [Beyer and Schwefel, 2002] is

G = min(n, 30)   (6.4)

where n is the problem dimension.

Initialize the non-negative mutation variance σ²
x₀ ← randomly generated individual
While not (termination criterion)
    Generate a random vector r with rᵢ ~ N(0, σ²) for i ∈ [1, n]
    x₁ ← x₀ + r
    If x₁ is better than x₀ then
        x₀ ← x₁
    End if
    φ ← proportion of successful mutations during the past G generations
    If φ < 1/5
        σ ← c²σ
    Else if φ > 1/5
        σ ← σ/c²
    End if
Next generation

Figure 6.2 The above pseudocode outlines the adaptive (1+1) evolution strategy, where n is the problem dimension. x₁ is equal to x₀ but with each element mutated, and φ is the proportion of mutations during the past G generations that resulted in x₁ being better than x₀. The mutation variance is automatically adjusted to increase the rate of convergence. The nominal value of c is 0.817.
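The adaptive loop of Figure 6.2 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is made up, the moving window is implemented as a bounded queue, and the initial individual is drawn from a standard normal.

```python
import numpy as np
from collections import deque

def adaptive_one_plus_one_es(f, n, sigma=1.0, c=0.817, max_gens=1000, rng=None):
    """Adaptive (1+1)-ES sketch (Figure 6.2): the 1/5 rule adjusts sigma
    using the success rate over a moving window of G = min(n, 30)
    generations, per Equation (6.4)."""
    rng = np.random.default_rng(rng)
    G = min(n, 30)
    x0 = rng.standard_normal(n)
    window = deque(maxlen=G)             # 1 = successful mutation, 0 = not
    for _ in range(max_gens):
        x1 = x0 + rng.normal(0.0, sigma, n)
        success = f(x1) > f(x0)
        if success:
            x0 = x1
        window.append(success)
        phi = sum(window) / len(window)  # success rate over past <= G gens
        if phi < 1/5:
            sigma *= c**2                # mutations too large: shrink sigma
        elif phi > 1/5:
            sigma /= c**2                # mutations too small: grow sigma
    return x0

# Example: maximize f(x) = -||x||^2 in 5 dimensions
best = adaptive_one_plus_one_es(lambda x: -np.sum(x**2), n=5, rng=1)
```

Note how the feedback loop makes σ self-regulating: a stalled search (low φ) shrinks σ to fine-tune, while an easy descent (high φ) grows σ to take bigger steps.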


EXAMPLE 6.1

In this example, we use the (1+1)-ES to optimize the 20-dimensional Ackley benchmark function (see Appendix C.1.2). We compare the standard (1+1)-ES shown in Figure 6.1 with the adaptive (1+1)-ES shown in Figure 6.2. For the adaptive ES, we keep track of the number of successful mutations and the total number of mutations. The total number of mutations is equal to the number of generations. Every 20 generations, we examine the mutation success rate and adjust the standard deviations as shown in Equation (6.3).

Figure 6.3 compares the average convergence rate of the standard (1+1)-ES and the adaptive (1+1)-ES. We see that the adaptive ES converges much faster than the standard ES. Figure 6.4 shows a typical profile of the mutation success rate and the mutation standard deviation. We see that when the success rate has been greater than 20% over the previous 20-generation time span, the mutation standard deviation is automatically increased; when the success rate has been less than 20%, the mutation standard deviation is automatically decreased.
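For readers who want to reproduce an experiment in this spirit, the Ackley function can be coded as below. The parameter values a = 20, b = 0.2, c = 2π are the commonly used ones; the exact form in Appendix C.1.2 should be checked against this sketch. Since Ackley is a minimization benchmark and the ES maximizes fitness, we negate it.

```python
import numpy as np

def ackley(x, a=20.0, b=0.2, c=2 * np.pi):
    """Ackley benchmark function (minimization): global minimum of 0 at
    x = 0. Uses the common parameters a=20, b=0.2, c=2*pi (assumed)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return (-a * np.exp(-b * np.sqrt(np.sum(x**2) / n))
            - np.exp(np.sum(np.cos(c * x)) / n) + a + np.e)

# The ES maximizes fitness, so use the negated benchmark when comparing
# the standard and adaptive (1+1)-ES on this problem:
fitness = lambda x: -ackley(x)
```

Feeding `fitness` to either (1+1)-ES variant, with n = 20, reproduces the kind of comparison described in this example.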


Figure 6.3 Example 6.1: The convergence of the (1+1)-ES algorithms, averaged over 100 simulations. The adaptive ES, which automatically adjusts the mutation standard deviations, converges much faster than the standard ES.


Figure 6.4 Example 6.1: The mutation success rate and mutation standard deviation of the adaptive (1+1)-ES. The adaptive ES automatically increases the mutation standard deviation when the success rate is greater than 20%, and decreases the mutation standard deviation when the success rate is less than 20%. This illustrates the 1/5 rule.

□
