
4.4 MARKOV MODELS OF GENETIC ALGORITHMS

Markov models were first used to model GAs in [Nix and Vose, 1992], [Davis and Principe, 1991], and [Davis and Principe, 1993], and are further explained in [Reeves and Rowe, 2003] and [Vose, 1999]. As we saw in Chapter 3, a GA consists of selection, crossover, and mutation. For the purposes of Markov modeling, we will switch the order of crossover and mutation, so we will consider a GA which consists of selection, mutation, and crossover, in that order.

4.4.1 Selection

First we consider fitness-proportional (that is, roulette-wheel) selection. The probability of selecting an x_i individual with one spin of the roulette wheel is proportional to the fitness of the x_i individual, multiplied by the number of x_i individuals in the population. This probability is normalized so that all probabilities sum to 1. As defined in the previous section, v_i is the number of x_i individuals in the population.

Therefore, the probability of selecting an x_i individual with one spin of the roulette wheel is

P_s(x_i | v) = \frac{v_i f_i}{\sum_{j=1}^{n} v_j f_j} \qquad (4.36)

for i ∈ [1, n], where n is the cardinality of the search space, and f_i is the fitness of x_i. We use the notation P_s(x_i | v) to show that the probability of selecting an x_i individual depends on the population vector v. Given a population of N individuals, suppose that we spin the roulette wheel N times to select N parents. Each spin of the roulette wheel has n possible outcomes {x_1, ..., x_n}. The probability of obtaining outcome x_i at each spin is equal to P_s(x_i | v). Let U = [U_1 ... U_n] be a vector of random variables, where U_i denotes the total number of times that x_i occurs in N spins of the roulette wheel, and let u = [u_1 ... u_n] be a realization of U. Multinomial distribution theory [Evans et al., 2000] tells us that

Pvs{u\v) = N\f[[P'{Xil")]Ui. (4.37)

A -*■ 7/..· Î


This gives us the probability of obtaining the population vector u after N roulette-wheel spins if we start with the population vector v. We use the subscript s on Pr_s(u | v) to denote that we consider only selection (not mutation or crossover).

Now recall that a Markov transition matrix contains all of the probabilities of transitioning from one state to another. Equation (4.37) gives us the probability of transitioning from one population vector v to another population vector u. There are T possible population vectors, as discussed in the previous section. Therefore, if we calculate Equation (4.37) for each possible u and each possible v, we will obtain a T x T Markov transition matrix which gives an exact probabilistic model of a selection-only GA. Each entry of the transition matrix contains the probability of transitioning from some particular population vector to some other population vector.
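To make this construction concrete, here is a minimal Python sketch (my own illustration, not code from the book; all function and variable names are hypothetical) that enumerates the population vectors for a tiny problem and assembles the selection-only transition matrix from Equations (4.36) and (4.37):

```python
import math
from itertools import combinations_with_replacement
from collections import Counter

def population_vectors(n, N):
    """Enumerate all T = C(n+N-1, N) population vectors v with sum(v) = N."""
    vecs = []
    for combo in combinations_with_replacement(range(n), N):
        c = Counter(combo)
        vecs.append(tuple(c.get(i, 0) for i in range(n)))
    return vecs

def selection_probs(v, f):
    """Equation (4.36): roulette-wheel probability of selecting each x_i."""
    total = sum(vi * fi for vi, fi in zip(v, f))
    return [vi * fi / total for vi, fi in zip(v, f)]

def multinomial_prob(u, p):
    """Equation (4.37): probability of drawing counts u in N independent spins."""
    N = sum(u)
    prob = math.factorial(N)
    for ui, pi in zip(u, p):
        prob *= pi ** ui / math.factorial(ui)
    return prob

# Selection-only Markov model for a toy problem: n = 2 candidates, N = 3.
f = [1.0, 2.0]                      # fitness of x_1 and x_2
states = population_vectors(2, 3)   # T = C(4, 3) = 4 states
P = [[multinomial_prob(u, selection_probs(v, f)) for u in states] for v in states]
for v, row in zip(states, P):
    print(v, [round(x, 3) for x in row])
```

Each row of P sums to 1, and for this toy problem the matrix is only 4 x 4, so it can be inspected directly.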

4.4.2 Mutation

Now suppose that after selection, we implement mutation on the selected individuals. Define M_{ji} as the probability that x_j mutates to x_i. Then the probability of obtaining an x_i individual after a single spin of the roulette wheel, followed by a single chance of mutation, is

P_{sm}(x_i | v) = \sum_{j=1}^{n} M_{ji} P_s(x_j | v) \qquad (4.38)

for i ∈ [1, n]. This means that we can write the n-element vector whose i-th element is equal to P_{sm}(x_i | v) as follows:

P_{sm}(x | v) = M^T P_s(x | v) \qquad (4.39)

where M is the matrix containing M_{ji} in the j-th row and i-th column, and P_s(x | v) is the n-element vector whose j-th element is P_s(x_j | v). Now we use multinomial distribution theory again to find that

Pr_{sm}(u | v) = N! \prod_{i=1}^{n} \frac{[P_{sm}(x_i | v)]^{u_i}}{u_i!} \qquad (4.40)

This gives us the probability of obtaining the population vector u if we start with the population vector v, after both selection and mutation take place. If we calculate Equation (4.40) for each of the T possible u and v population vectors, we will have a T x T Markov transition matrix which gives an exact probabilistic model of a GA which consists of both selection and mutation.
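Extending the earlier sketch to selection plus mutation only requires replacing the per-spin probabilities with P_sm = M^T P_s before applying the multinomial formula. A minimal continuation (again my own illustration), assuming the functions and variables from the previous listing are in scope:

```python
# M[j][i] = probability that x_j mutates into x_i (hypothetical 2x2 example,
# matching the two-candidate toy problem above).
M = [[0.9, 0.1],
     [0.1, 0.9]]

def selection_mutation_probs(v, f, M):
    """Equations (4.38)-(4.39): P_sm(x_i|v) = sum_j M_ji * P_s(x_j|v)."""
    n = len(v)
    ps = selection_probs(v, f)
    return [sum(M[j][i] * ps[j] for j in range(n)) for i in range(n)]

# Equation (4.40): the same multinomial step, with P_sm in place of P_s.
P_sm = [[multinomial_prob(u, selection_mutation_probs(v, f, M)) for u in states]
        for v in states]
```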

If mutation is defined so that M_{ji} > 0 for all i and j, then Pr_{sm}(u | v) > 0 for all u and v. This means that the Markov transition matrix will contain all positive entries, which means that the transition matrix will be regular. Theorem 4.2 tells us that there will be a unique nonzero probability for obtaining each possible population distribution. This means that in the long run, each possible population distribution will occur for a nonzero percentage of time. These percentages can be calculated using Theorem 4.2 and the transition matrix obtained from Equation (4.40). The GA will not converge to any specific population, but will endlessly wander throughout the search space, hitting each possible population for the percentage of time given by Theorem 4.2.

EXAMPLE 4.7

Suppose we have a four-element search space with individuals {x_1, x_2, x_3, x_4} = {00, 01, 10, 11}.

Suppose that each bit in each individual has a 10% chance of mutation. The probability that 00 remains equal to 00 after a mutation chance is equal to the probability that the first 0 bit remains unchanged (90%), multiplied by the probability that the second 0 bit remains unchanged (90%), which gives a probability of 0.81. This gives M_{11}, which is the probability that x_1 remains unchanged after a mutation chance. The probability that 00 will change to 01 is equal to the probability that the first 0 bit remains unchanged (90%), multiplied by the probability that the second 0 bit changes to a 1 (10%), which gives a probability of M_{12} = 0.09. Continuing along these lines, we find that

" 0.81 0.09 0.09 0.01 0.09 0.81 0.01 0.09 0.09 0.01 0.81 0.09 0.01 0.09 0.09 0.81

M (4.41)

Note that M is symmetric (that is, M is equal to its transpose M^T). This is typically (but not always) the case, which means that it is equally likely for x_i to mutate to form x_j as it is for x_j to mutate to form x_i.
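The pattern in Equation (4.41) generalizes: for q-bit individuals with an independent per-bit mutation probability p, the entry M_{ji} equals p^d (1 - p)^{q-d}, where d is the Hamming distance between x_j and x_i. A short sketch (my own, with hypothetical names) that reproduces the matrix above:

```python
def mutation_matrix(q, p):
    """Mutation matrix for q-bit strings with per-bit mutation probability p.
    M[j][i] = p**d * (1 - p)**(q - d), where d is the Hamming distance
    between the bit strings whose integer values are j and i."""
    n = 2 ** q
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            d = bin(j ^ i).count("1")   # Hamming distance between x_j and x_i
            M[j][i] = p ** d * (1 - p) ** (q - d)
    return M

print(mutation_matrix(2, 0.1)[0])  # [0.81, 0.09, 0.09, 0.01]: row 1 of (4.41)
```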

4.4.3 Crossover

Now suppose that after selection and mutation, we implement crossover. We let r_{jki} denote the probability that x_j and x_k cross to form x_i. Then the probability of obtaining an x_i individual after two spins of the roulette wheel, followed by a single chance of mutation for each selected individual, followed by crossover, is

P_{smc}(x_i | v) = \sum_{j=1}^{n} \sum_{k=1}^{n} r_{jki} P_{sm}(x_j | v) P_{sm}(x_k | v) \qquad (4.42)

Now we use multinomial distribution theory again to find that

Pr_{smc}(u | v) = N! \prod_{i=1}^{n} \frac{[P_{smc}(x_i | v)]^{u_i}}{u_i!} \qquad (4.43)

This gives us the probability of obtaining the population vector u if we start with the population vector v, after selection, mutation, and crossover take place.
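As a sketch of how Equation (4.42) can be evaluated (my own illustration using NumPy; the array names are hypothetical), the double sum is a quadratic form in the vector P_sm:

```python
import numpy as np

def smc_probs(p_sm, r):
    """Equation (4.42): P_smc(x_i|v) = sum_j sum_k r[j][k][i] * p_sm[j] * p_sm[k].
    p_sm is a length-n probability vector, r is an n x n x n array."""
    p = np.asarray(p_sm, dtype=float)
    return np.einsum("jki,j,k->i", np.asarray(r, dtype=float), p, p)

# The result feeds the same multinomial step as before to give Equation (4.43).
```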

EXAMPLE 4.8

Suppose we have a four-element search space with individuals {x_1, x_2, x_3, x_4} = {00, 01, 10, 11}. Suppose that we implement crossover by randomly setting b = 1 or b = 2 with equal probability, and then concatenating bits 1 → b from the first parent with bits (b + 1) → 2 from the second parent. Some of the crossover possibilities can be written as follows:

r_{221} = Pr(01 × 01 → 00) = 0
r_{222} = Pr(01 × 01 → 01) = 1
r_{231} = Pr(01 × 10 → 00) = 0.5
r_{232} = Pr(01 × 10 → 01) = 0.5
r_{233} = Pr(01 × 10 → 10) = 0
r_{234} = Pr(01 × 10 → 11) = 0 \qquad (4.44)

The other r_{jki} values can be calculated similarly.
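A small sketch (my own illustration) that computes the full tensor of r_{jki} values for this two-bit, single-point crossover; indices here are 0-based, so x_1 corresponds to index 0:

```python
def crossover_tensor_2bit():
    """r[j][k][i] = Pr(x_j x x_k -> x_i) for 2-bit parents, with the cut
    point b chosen uniformly from {1, 2} (b = 2 copies parent 1 intact)."""
    n = 4
    r = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for j in range(n):      # parent 1, as the integer value of its bits
        for k in range(n):  # parent 2
            # b = 1: bit 1 from parent 1, bit 2 from parent 2.
            child_b1 = (j & 0b10) | (k & 0b01)
            # b = 2: both bits from parent 1.
            child_b2 = j
            r[j][k][child_b1] += 0.5
            r[j][k][child_b2] += 0.5
    return r

r = crossover_tensor_2bit()
print(r[1][2])  # x_2 = 01 crossed with x_3 = 10: [0.5, 0.5, 0.0, 0.0]
```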


EXAMPLE 4.9

In this example we consider the three-bit one-max problem. Each individual's fitness value is one plus the number of ones in the individual:

f(000) = 1,  f(001) = 2,  f(010) = 2,  f(011) = 3,
f(100) = 2,  f(101) = 3,  f(110) = 3,  f(111) = 4. \qquad (4.46)

Suppose each bit has a 10% probability of mutation, which gives the mutation matrix derived in Example 4.7. After selection and mutation, we perform crossover with a probability of 90%. If crossover is selected, then crossover is performed by selecting a random bit position b ∈ [1, q - 1], where q is the number of bits in each individual. We then concatenate bits 1 → b from the first parent with bits (b + 1) → q from the second parent.

Let's use a population size N = 3. There are (n + N - 1)-choose-N = 10-choose-3 = 120 possible population distributions. We can use Equation (4.43) to calculate the probability of transitioning between each of the 120 population distributions, which gives us a 120 x 120 transition matrix P. We can then calculate the probability of each possible population distribution in three different ways:

1. We can use the Davis-Principe result of Equation (4.15);

2. From Theorem 4.2, we can numerically raise P to ever-higher powers until it converges, and then use any of the rows of P^∞ to observe the probability of each possible population;

3. We can calculate the eigendata of P^T and find the eigenvector corresponding to the eigenvalue 1.

Each of these approaches gives us the same set of 120 probabilities for the 120 population distributions. We find that the probability that the population contains all optimal individuals, that is, each individual is equal to the bit string 111, is 6.1%. The probability that the population contains no optimal individuals is 51.1%. Figure 4.2 shows the results of a simulation of 20,000 generations, and shows that the simulation results closely match the Markov results. The simulation results are approximate, will vary from one run to the next, and will equal the Markov results only as the number of generations approaches infinity.
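As a sketch of methods 2 and 3 above (my own illustration, demonstrated on a hypothetical 3 x 3 chain rather than the full 120 x 120 matrix):

```python
import numpy as np

def stationary_distribution(P):
    """Limiting distribution of a regular Markov chain (rows of P sum to 1),
    found as the eigenvector of P^T at eigenvalue 1 (method 3 above)."""
    vals, vecs = np.linalg.eig(P.T)
    k = np.argmin(np.abs(vals - 1.0))   # locate the eigenvalue closest to 1
    pi = np.real(vecs[:, k])
    return pi / pi.sum()                # normalize to a probability vector

# Hypothetical 3-state chain, just to exercise the function:
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])
print(stationary_distribution(P))
# Method 2 gives the same answer: every row of
# np.linalg.matrix_power(P, 1000) approaches this vector.
```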

^"50 o

§ 4 0 |

ro 20

j*V "

!

- no optimal

-all optima |

-0.5 1 1.5 generation number

Figure 4.2 Example 4.9: Three-bit one-max simulation results. Markov theory predicts that the percentage of no optima is 51.1% and the percentage of all optima is 6.1%.

EXAMPLE 4.10

Here we repeat Example 4.9 except we use the following fitness values:

f(000) = 5,  f(001) = 2,  f(010) = 2,  f(011) = 3,
f(100) = 2,  f(101) = 3,  f(110) = 3,  f(111) = 4. \qquad (4.47)

These fitness values are the same as those in Equation (4.46), except that we made the 000 bit string the most fit individual. This is called a deceptive problem because adding a 1 bit to one of the above individuals usually increases its fitness; the exception is that 111 is not the most fit individual, but rather 000 is.

As in Example 4.9, we calculate a set of 120 probabilities for the 120 population distributions. We find that the probability that the population contains all optimal individuals, that is, each individual is equal to the bit string 000, is 5.9%. This is smaller than the probability of all optima in Example 4.9, which was 6.1%. The probability that the population contains no optimal individuals is 65.2%. This is larger than the probability of no optima in Example 4.9, which was 51.1%. This example illustrates that deceptive problems are more difficult to solve than problems with a more regular structure. Figure 4.3 shows the results of a simulation of 20,000 generations, and shows that the simulation results closely match the Markov results.


Figure 4.3 Example 4.10: Three-bit deceptive problem simulation results. Markov theory predicts that the percentage of no optima is 65.2% and the percentage of all optima is 5.9%.

The Curse of Dimensionality: The curse of dimensionality is a phrase that was originally used in the context of dynamic programming [Bellman, 1961]. However, it applies even more appropriately to Markov models of GAs. The size of the transition matrix of a Markov model of an EA is T x T, where T = (N + n - 1)-choose-N. The transition matrix dimensions for some combinations of population size N and search space cardinality n, which is equal to 2^q for q-bit search spaces, are shown in Table 4.1. We see that the transition matrix dimension grows ridiculously large for problems of even modest dimension. This seems to indicate that Markov modeling is interesting only from a theoretical viewpoint, and does not have any practical applications. However, there are a couple of reasons that such a conclusion may be premature.

Table 4.1 Markov transition matrix dimensions for various search space cardinalities n and population sizes N. Adapted from [Reeves and Rowe, 2003, page 131].
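The dimensions in Table 4.1 follow directly from the formula T = (n + N - 1)-choose-N; a short computation (my own sketch, with illustrative q and N values that are not necessarily the table's entries):

```python
from math import comb

def markov_dimension(q, N):
    """T = (n + N - 1)-choose-N, with n = 2**q individuals for q-bit strings."""
    n = 2 ** q
    return comb(n + N - 1, N)

# Illustrative values (not necessarily the entries of Table 4.1):
for q in (2, 3, 4):
    for N in (5, 10, 20):
        print(f"q={q} (n={2**q}), N={N}: T = {markov_dimension(q, N):,}")
```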

First, although we cannot apply Markov models to realistically sized problems, Markov models still give us exact probabilities for small problems. This allows us to look at the advantages and disadvantages of different EAs for small problems, assuming that we have Markov models for EAs other than GAs. This is exactly what we do in [Simon et al., 2011b], where we compare GAs with BBO. A lot of research in EAs today is focused on simulations. The problem with simulations is that their outcomes depend strongly on implementation details and on the specific random number generator that is used. Also, if some event has a very small probability of occurring, then it would take many simulations to discover that probability. Simulation results are useful and necessary, but they must always be taken with a dash of skepticism and a grain of salt.

Second, the dimension of the Markov transition matrices can be reduced. Our Markov models include T states, but many of these states are very similar to each other. For example, consider a GA with a search space cardinality of 10 and a population size of 10. The Markov model then has T = 19-choose-10 = 92,378 states, but these include the states

v(1) = {5, 5, 0, 0, 0, 0, 0, 0, 0, 0}
v(2) = {4, 6, 0, 0, 0, 0, 0, 0, 0, 0}
v(3) = {6, 4, 0, 0, 0, 0, 0, 0, 0, 0}. \qquad (4.48)

These three states are so similar that it makes sense to group them together and consider them as a single state. We can do this with many other states to get a new Markov model with a reduced state space. Each state in the reduced-order model consists of a group of the original states. The transition matrix then specifies the probability of transitioning from one group of original states to another group of original states. This idea was proposed in [Spears and De Jong, 1997] and is further discussed in [Reeves and Rowe, 2003]. It is hard to imagine how to group states to reduce a 92,378 x 92,378 matrix to a manageable size, but at least this idea allows us to handle larger problems than we would otherwise be able to.
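One simple way to realize this grouping idea in code (my own sketch, not the specific scheme of [Spears and De Jong, 1997]) is to partition the states and average transition probabilities within each group; the result is exact only when the chain is lumpable with respect to the chosen partition:

```python
import numpy as np

def lump(P, groups):
    """Reduced transition matrix over groups of states: for each pair of
    groups (A, B), sum P over destinations in B and average over sources in A.
    Exact only if the chain is lumpable with respect to this partition."""
    Q = np.zeros((len(groups), len(groups)))
    for a, A in enumerate(groups):
        for b, B in enumerate(groups):
            Q[a, b] = P[np.ix_(A, B)].sum(axis=1).mean()
    return Q

# Hypothetical 4-state chain lumped into 2 groups of similar states:
P = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.7, 0.1, 0.1],
              [0.1, 0.1, 0.7, 0.1],
              [0.1, 0.1, 0.1, 0.7]])
print(lump(P, [[0, 1], [2, 3]]))  # [[0.8, 0.2], [0.2, 0.8]]
```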
