


Assumptions

BGP assumes complete data, meaning that instances should not contain missing or unknown values. Also, each instance must have a target classification, making BGP a supervised learner. BGP assumes attributes to be one of four data types:

• Numerical and discrete (which implies an ordered attribute), for example the age of a patient.

• Numerical and continuous, for example length.

• Nominal (not ordered), for example the attribute colour.

• Boolean, which allows an attribute to have a value of either true or false.

Elements of BGP

The proposed knowledge discovery tool is based on the concept of a building block. A building block represents one condition, or node, in the tree. Each building block consists of three parts: <attribute> <relational operator> <threshold>. An <attribute> can be any of the attributes of the database. The <relational operator> can be any of the set {=, ≠, <, ≤, >, ≥} for numerical attributes, or {=, ≠} for nominal and boolean attributes. The <threshold> can be a value or another attribute.

Allowing the threshold to be an attribute makes it possible to extract rules such as ‘IF Income > Expenditure THEN outcome’. It is, however, possible that the decision tree contains nodes in which incompatible attributes are compared, for example ‘Sex > Expenditure’. This problem can be solved easily by including semantic rules to prevent such comparisons, either by penalizing such trees or by preventing any operation from creating such comparisons.
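As a concrete illustration, the sketch below shows one way such a building block could be represented in Python. It is a minimal sketch, not the chapter's implementation; the names BuildingBlock, is_semantically_valid and allowed_operators are hypothetical.

from dataclasses import dataclass
from typing import Union

# Hypothetical representation of a BGP building block: one condition of the
# form <attribute> <relational operator> <threshold>.
NUMERIC_OPS = ("==", "!=", "<", "<=", ">", ">=")   # numerical attributes
NOMINAL_OPS = ("==", "!=")                          # nominal / boolean attributes

@dataclass
class BuildingBlock:
    attribute: str                   # left-hand side: an attribute of the database
    operator: str                    # one of NUMERIC_OPS or NOMINAL_OPS
    threshold: Union[str, float]     # a value, or the name of another attribute

def allowed_operators(attribute_type):
    """Numerical attributes admit all six operators, nominal/boolean only = and !=."""
    return NUMERIC_OPS if attribute_type == "numeric" else NOMINAL_OPS

def is_semantically_valid(block, attribute_types):
    """Reject comparisons between incompatible attributes, e.g. 'Sex > Expenditure'."""
    lhs_type = attribute_types[block.attribute]
    # If the threshold is another attribute, its type must match the left-hand side.
    if isinstance(block.threshold, str) and block.threshold in attribute_types:
        return attribute_types[block.threshold] == lhs_type
    return True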

The initial population is constructed such that each individual consists of only one condition node and two leaf nodes, corresponding to the two outcomes of the condition.

The class of a leaf node depends on two distributions: the distribution of the training instances over the classes in the whole training set, and the distribution over the classes of the training instances propagated to the leaf node. To illustrate this, consider a training set consisting of 100 instances which are classified into four classes A, B, C and D. Of these instances, 10 belong to class A, 20 to class B, 30 to class C and 40 to class D. Thus, the distribution of the classes in this example is skewed: [10, 20, 30, 40]. If the classes were evenly distributed, there would be 25 instances belonging to class A, 25 belonging to class B, and so on. Now suppose that 10 of the 100 instances are propagated to the leaf node for which we want to determine the classification, and that these 10 instances are distributed as [1, 2, 3, 4]. Which class should be put into the leaf node when the distribution of the training instances over the classes in the training set is the same as the distribution in the leaf node? In this case we choose the class with the highest number of instances, which is class D.

What happens if the two distributions are dissimilar? Say the overall class distribution is again [10, 20, 30, 40] and the distribution of the instances propagated to the leaf node is now [2, 2, 2, 2]. First, a correction factor that accounts for the overall distribution is determined for each class. The correction factors are 25/10, 25/20, 25/30 and 25/40 for classes A, B, C and D respectively, where 25 is the number of instances per class in the case of an equal distribution. The correction factors are then combined with the distribution of the instances in the leaf node, and the class corresponding to the highest corrected count is chosen: [(25/10)*2, (25/20)*2, (25/30)*2, (25/40)*2] equals [5, 2.5, 1.67, 1.25], which means class A will be chosen.
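The correction-factor computation from this example can be written down directly. The following Python sketch reproduces the arithmetic above; the helper name leaf_class is hypothetical.

# Sketch of the leaf-labelling rule from the example above. Distributions are
# counts per class, e.g. [10, 20, 30, 40] for the whole training set.
def leaf_class(overall_distribution, leaf_distribution, classes=("A", "B", "C", "D")):
    n_classes = len(overall_distribution)
    even = sum(overall_distribution) / n_classes      # 25 for 100 instances and 4 classes
    # Correction factor per class compensates for the skew of the training set.
    corrected = [
        (even / overall) * in_leaf
        for overall, in_leaf in zip(overall_distribution, leaf_distribution)
    ]
    # The class with the highest corrected count labels the leaf.
    return classes[corrected.index(max(corrected))]

# [2, 2, 2, 2] in the leaf with overall skew [10, 20, 30, 40] gives corrected
# counts [5.0, 2.5, 1.67, 1.25], so the leaf is labelled with class A.
print(leaf_class([10, 20, 30, 40], [2, 2, 2, 2]))     # -> "A"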

A first choice for the fitness function could be the classification accuracy of the decision tree represented by an individual. However, when the distribution of the classes among the instances is skewed, taking the classification accuracy of the complete rule set as a measure of fitness does not account for the significance of rules that predict a poorly represented class. Instead of using the accuracy of the complete set of rules, BGP evaluates the accuracy of each rule independently and determines which rule has the lowest accuracy on the instances of the training set that are covered by this rule. In other words, the fitness of the complete set of rules is determined by the weakest element in the set.

In the equation below, the function C(i) returns 1 if the instance i is correctly classified by rule R and 0 if not. If rule R covers P instances of the training set, then

Accuracy(R) = ( Σ_{i=1..P} C(i) ) / P

In short, the function above calculates the accuracy of a rule over the instances that are covered by this rule. Let S be a set of rules. Then, the fitness of an individual is expressed as

Fitness(S) = MIN(Accuracy(R)) for all R ∈ S
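A minimal sketch of this fitness computation is given below, assuming a hypothetical rule object with a classifies_correctly predicate and a mapping from each rule to the training instances it covers; none of these names come from the chapter.

# Accuracy of rule R over the P training instances it covers: sum of C(i) / P.
def rule_accuracy(rule, covered_instances):
    if not covered_instances:
        return 0.0
    correct = sum(1 for instance in covered_instances
                  if rule.classifies_correctly(instance))
    return correct / len(covered_instances)

# Fitness(S) = MIN(Accuracy(R)) over all rules R in the set S.
def fitness(rule_set, coverage):
    return min(rule_accuracy(rule, coverage[rule]) for rule in rule_set)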

Tournament selection is used to select parents for crossover. Before crossover is applied, the crossover probability PC and a random number r between 0 and 1 determine whether the crossover operation will be applied: if r < PC then crossover is applied, otherwise not. Crossover of two parent trees is achieved by creating two copies of the trees that form two intermediate offspring. Then one crossover point is selected randomly in each copy. The final offspring is obtained by exchanging sub-trees under the selected crossover points.
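A rough sketch of this crossover step follows, assuming a hypothetical tree representation of nested dictionaries with "left" and "right" children and leaves marked by a "class" key; it illustrates the mechanism rather than the chapter's implementation.

import copy
import random

def collect_edges(tree, edges=None):
    """Return all (parent, side) pairs; each pair identifies one candidate crossover point."""
    if edges is None:
        edges = []
    for side in ("left", "right"):
        child = tree.get(side)
        if isinstance(child, dict):
            edges.append((tree, side))
            collect_edges(child, edges)
    return edges

def maybe_crossover(parent1, parent2, pc):
    """With probability PC, swap randomly chosen subtrees of copies of the parents."""
    if random.random() >= pc:
        return parent1, parent2                     # crossover not applied
    child1, child2 = copy.deepcopy(parent1), copy.deepcopy(parent2)
    p1, s1 = random.choice(collect_edges(child1))   # crossover point in first copy
    p2, s2 = random.choice(collect_edges(child2))   # crossover point in second copy
    p1[s1], p2[s2] = p2[s2], p1[s1]                 # exchange the two subtrees
    return child1, child2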

The current BGP implementation uses three types of mutation, namely mutation on the relational operator, mutation on the threshold and prune mutation. Each of these mutations occurs at a user-specified probability. Given the probability on a relational operator, MRO, between 0 and 1, a random number r between 0 and 1 is generated for every condition in the tree. If r < MRO, then the relational operator in the particular condition will be changed into a new relational operator. If the attribute on the left-hand side of the condition is numerical, then the new relational operator will be one of the set {=, ≠, <, ≤, >, ≥}. If the attribute on the left-hand side is not numerical, the new relational operator will be either = or ≠.

A parameter MRHS determines the chance that the threshold of a condition in the decision tree will be mutated. As with the previous mutation operator, a random number r between 0 and 1 is generated, and if r < MRHS then the right-hand side of the particular condition will be changed into a new right-hand side. This new right-hand side can be either a value of the attribute on the left-hand side, or a different attribute. The probability that the right-hand side is an attribute is determined by yet another parameter PA. The new value of the right-hand side is determined randomly.

A parameter PP is used to determine whether the selected tree should be pruned, in the same way as the parameter PC does for the crossover operator. If the pruning operation is allowed, a random internal node of the decision tree is chosen and its sub-tree is deleted. The selected internal node is replaced by two leaf nodes, which classify the instances that are propagated to them.
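The three mutation operators could be sketched as follows, again assuming the hypothetical dictionary-based tree representation used above; the helper names and the leaf_class_for callback are illustrative only.

import random

NUMERIC_OPS = ["==", "!=", "<", "<=", ">", ">="]
NOMINAL_OPS = ["==", "!="]

def internal_nodes(tree, nodes=None):
    """Collect every condition node of the tree (leaves carry only a 'class' key)."""
    if nodes is None:
        nodes = []
    if "left" in tree:
        nodes.append(tree)
        internal_nodes(tree["left"], nodes)
        internal_nodes(tree["right"], nodes)
    return nodes

def mutate_operators(tree, mro, attribute_types):
    """With probability MRO per condition, replace its relational operator."""
    for node in internal_nodes(tree):
        if random.random() < mro:
            numeric = attribute_types[node["attribute"]] == "numeric"
            node["operator"] = random.choice(NUMERIC_OPS if numeric else NOMINAL_OPS)

def mutate_thresholds(tree, mrhs, pa, attributes, random_value):
    """With probability MRHS per condition, replace its right-hand side by another
    attribute (with probability PA) or by a randomly drawn value."""
    for node in internal_nodes(tree):
        if random.random() < mrhs:
            if random.random() < pa:
                node["threshold"] = random.choice(attributes)
            else:
                node["threshold"] = random_value(node["attribute"])

def maybe_prune(tree, pp, leaf_class_for):
    """With probability PP, delete the subtree of a random internal node and replace
    it by two leaves labelled by the supplied leaf_class_for callback."""
    if random.random() < pp:
        node = random.choice(internal_nodes(tree))
        node["left"] = {"class": leaf_class_for(node, "left")}
        node["right"] = {"class": leaf_class_for(node, "right")}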

The Algorithm

Algorithm BGP
    T := 0
    Select initial population
    Evaluate population P(T)
    while not ‘termination-condition’ do
        T := T + 1
        if ‘add_conditions_criterion’ then
            add condition to trees in population
        end if
        Select subpopulation P(T) from P(T-1)
        Apply recombination operators on individuals of P(T)
        Evaluate P(T)
        if ‘found_new_best_tree’ then
            store copy of new best tree
        end if
    end while
end algorithm

Two aspects of the BGP algorithm still need to be explained. The first is the condition which determines if a new building block should be added to each of the individuals in the current population. The following rule is used for this purpose:

IF ( (ad(t) + aw(t) ) - (ad(t-1) + aw(t-1) ) < L ) THEN ‘add_conditions’

where ad(t) means the average depth of the trees in the current generation t, aw(t) means the average width in the current generation, ad(t-1) is the average depth of the trees in the previous generation and aw(t-1) is the average width in the previous generation. L is a parameter of the algorithm and is usually set to 0. In short, if L = 0, this rule determines whether the sum of the average depth and width of the trees in a generation decreases, and adds conditions to the trees if this is the case. If the criterion for adding a condition to the decision trees is met, then all the trees in the population receive one randomly generated new condition. The newly generated condition replaces a randomly chosen leaf node of the decision tree.

Since the new condition to be added to a tree is generated randomly, there is a small probability that the newly generated node is a duplicate. Such duplicates are accepted within the tree; however, when rules are extracted from the best individual after convergence, a simplification step is used to remove redundant conditions from the extracted rules. A different approach could be to prevent adding a node to the tree which already exists within that tree. The same applies to the mutation and crossover operators. Furthermore, it is possible that a node added to a tree (by means of the growth operator, mutation or crossover) represents a condition that contradicts that of another node within the tree. This is not a problem, since such contradictions will result in a low accuracy of the tree, and hence a low fitness value.
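The growth criterion itself is a one-line test; a minimal Python sketch, with hypothetical parameter names following the notation above, is shown below.

# True when the combined average depth + width of the trees in the population
# did not grow relative to the previous generation, so new conditions are added.
def should_add_conditions(ad_t, aw_t, ad_prev, aw_prev, L=0):
    return (ad_t + aw_t) - (ad_prev + aw_prev) < L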

Finally, the evolutionary process stops when a satisfactory decision tree has been evolved. BGP uses a termination criterion similar to the temperature function used in simulated annealing (Aarts & Korst, 1989). It uses a function that calculates a goal for the fitness of the best rule set that depends on the temperature at that stage of the run. At the start of the algorithm, the temperature is very high and so is the goal for the fitness of the best rule set. With time, the temperature drops, which in turn makes the goal for the fitness of the best rule set easier to obtain. T(t) is the temperature at generation t defined by a very simple function: T(t) = T0 - t. T0 is the initial temperature, a parameter of the algorithm. Whether a rule set S (as represented by an individual) at generation t is satisfactory is determined by the following rule:

IF Fitness(S) ≥ e^( -c·trainsize / T(t) ) THEN ‘satisfactory’

where c is a parameter of the algorithm (usually set to 0.1 for our experiments), and trainsize is the number of training instances. When the temperature gets close to zero, the criterion for the fitness of the best tree quickly drops to zero too. This ensures that the algorithm will always terminate within T0 generations.
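Assuming the reconstructed goal e^(-c·trainsize/T(t)) given above, the stopping rule could be sketched as follows; the function names are illustrative rather than the chapter's.

import math

def temperature(t, t0):
    """T(t) = T0 - t, the simple linear cooling schedule used by BGP."""
    return t0 - t

def is_satisfactory(fitness_s, t, t0, trainsize, c=0.1):
    """A rule set is satisfactory once its fitness reaches the temperature-dependent goal."""
    temp = temperature(t, t0)
    if temp <= 0:
        return True        # the goal has dropped to zero, so the run always ends by T0
    return fitness_s >= math.exp(-c * trainsize / temp)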
