
OTHER ALGORITHMS FOR NUGGET DISCOVERY


In the following section we will present the results of applying the heuristic algorithms for nugget discovery to some problems. Other algorithms available for classification or for nugget discovery were applied to the same problems so that the suitability of different approaches to the task of nugget discovery could be assessed.

The algorithms tested ranged from complete classification algorithms to dedicated nugget discovery algorithms and association rule algorithms. A description of each algorithm is not possible here due to space constraints, so the reader is referred to the appropriate papers. For each algorithm, extensive parameter experimentation was carried out and a set of good parameters was chosen; this is likewise not reported here, see de la Iglesia (2001) for details.

The classification algorithms chosen were:

C5/C4.5: This is the most prominent decision tree induction algorithm, which also contains functionality to extract a non-mutually exclusive set of rules (Quinlan, 1986, 1993). Each rule is a nugget of the form described previously.

For categorical attributes, the tests contained in the nuggets formed by the C5 algorithm are either on a single value or on a subset of values. For numeric attributes, the tests are binary partitions.
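To make these test types concrete, the sketch below (in Python, with hypothetical names; none of the algorithms discussed share code) shows one way a conjunctive nugget built from such tests could be represented and matched against a record. The same four test kinds recur throughout the algorithm descriptions that follow.

def single_value(v):
    return lambda x: x == v                      # e.g. colour = red

def value_subset(vs):
    return lambda x: x in vs                     # e.g. colour in {red, blue}

def binary_partition(t):
    return lambda x: x <= t                      # e.g. age <= 40 (C5-style split)

def value_range(lo, hi):
    return lambda x: lo <= x <= hi               # e.g. 20 <= age <= 40 (CN2-style)

class Nugget:
    """A conjunction of attribute tests plus the class it describes."""
    def __init__(self, tests, target_class):
        self.tests = tests                       # attribute name -> test function
        self.target_class = target_class

    def covers(self, record):
        return all(test(record[a]) for a, test in self.tests.items())

rule = Nugget({"education": single_value("Bachelors"),
               "age": binary_partition(40)}, target_class="<=50K")
print(rule.covers({"education": "Bachelors", "age": 35, "sex": "F"}))   # True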

CN2: This is a rule induction algorithm for complete classification described by Clark and Boswell (1991) and Clark and Niblett (1989). If an unordered rule set is extracted, each of the rules is a nugget that can be assessed on its own. For categorical attributes, it uses single-value tests only, whereas for numerical attributes it uses tests containing a range of values.

Brute: This is a nugget discovery algorithm (Riddle et al., 1994) which performs an exhaustive depth-bounded search for conjunctive rules, guided by a chosen interest measure (the choices include Laplace accuracy, χ2, and a weighted average of accuracy and coverage). The nuggets extracted use binary partitions for numeric attributes and either single value, subset of values or inequality tests for categorical attributes.
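The Laplace accuracy mentioned here has a standard form in the rule induction literature; the sketch below computes it under that usual definition, which pulls the raw accuracy of a rule towards 1/k (k classes) when the rule covers few records.

def laplace_accuracy(correct, covered, num_classes=2):
    """Laplace-corrected accuracy of a rule: (c + 1) / (n + k), where c is
    the number of covered records of the target class, n the total number
    of covered records and k the number of classes. The correction
    penalises rules whose raw accuracy rests on little evidence."""
    return (correct + 1) / (covered + num_classes)

# A rule covering 50 records, 45 correctly, on a 2-class problem:
print(laplace_accuracy(45, 50))   # 0.8846..., slightly below the raw 0.9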

RIPPER: Repeated Incremental Pruning to Produce Error Reduction (Cohen, 1995) is another rule induction algorithm that produces a complete classification. The rules that form part of the classification are conjunctive nuggets. The nuggets obtained have binary partitions for numeric attributes and either single-value or inequality tests for categorical attributes.

KnowledgeSEEKER: The commercial package KnowledgeSEEKER (Biggs et al., 1991; de Ville, 1994) is another tree induction algorithm, based on the statistical approach to tree induction. Each branch (or partial branch up to any node) of the tree can be considered as a nugget. For categorical attributes, the nuggets can contain single-value or subset-of-values tests, whereas for numerical attributes the tests contain a range of values.

1R: This is a rule induction algorithm that produces rules basing the classification of examples on a single attribute. The algorithm (Holte, 1993; Nevill-Manning et al., 1995) assumes categorical attributes; numeric attributes are handled by discretising them in a preprocessing step. The classification produced by 1R can be interpreted as a set of nuggets in various ways. If the attribute chosen for the classification is categorical, then each value of the attribute can be interpreted as a conjunctive test, or a subset of values can be considered together. In the experiments that follow, all possibilities of individual tests or subsets of tests were considered as nuggets for the attribute chosen in each case, as enumerated in the sketch below. For a numerical attribute, each interval produced by the discretisation can be considered individually, or various intervals can be considered together as a disjunction. Both individual ranges and disjunctions of ranges were evaluated.
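As a sketch of the enumeration just described (not the published 1R code), the following generates every candidate value subset for a categorical attribute chosen by 1R. The count is 2**n - 1 for n values, which is only practical for the small value sets typical of such attributes; discretised numeric attributes would be handled analogously with contiguous ranges and disjunctions of ranges.

from itertools import combinations

def candidate_value_subsets(values):
    """Every non-empty subset of the attribute's values: singletons give
    single-value tests, larger subsets give 'attribute in {...}' tests."""
    values = list(values)
    for r in range(1, len(values) + 1):
        for subset in combinations(values, r):
            yield set(subset)

print(list(candidate_value_subsets(["a", "b", "c"])))
# [{'a'}, {'b'}, {'c'}, {'a','b'}, {'a','c'}, {'b','c'}, {'a','b','c'}]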

T2: This is a tree induction algorithm which produces a decision tree containing at most two levels (Auer et al., 1995). At the first level the tree contains a test on a single attribute, using a binary partition for numeric attributes and a single-value test for discrete attributes. If there is a second level, a test on a numeric attribute uses a range of values instead of a binary partition. Each branch or partial branch of a T2 tree can be interpreted as a nugget.

PART: This algorithm (Frank & Witten, 1998) generates rule sets and combines concepts from C4.5 and RIPPER. Each rule of the rule set can be considered as a nugget, but the rules need to be re-evaluated individually as they form part of an ordered rule set. PART uses single-value tests for categorical attributes and binary partitions for numerical attributes.

RISE: This algorithm (Domingos, 1995, 1996) combines rule induction with Instance Based Learning (IBL). Since IBL explicitly memorises some or all of the examples and attempts to find the cases most similar to the target one, the resulting classifier is made up of some abstractions formed by the rule induction part of the algorithm and some individual examples. Only the abstractions are worth considering as nuggets, and this is what was done in the following experiments. The abstractions are rules using ranges of values for tests on numeric attributes and single-value tests for categorical attributes.

Also, the following association rule algorithms were chosen:

GRI: Generalised Rule Induction (Mallen & Bramer, 1995) is described as an association rule algorithm, although it could also be considered a partial classification algorithm. It builds a table of the best N association rules, as ranked by the J measure, where N is a parameter set by the user. In GRI the output attribute can be chosen, and each rule produced can be used as a nugget describing that output. The rules contain binary partitions for numeric attributes and single-value tests for categorical attributes.
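The J measure used by GRI is due to Smyth and Goodman; the sketch below computes it in its commonly cited form. The bracketed cross-entropy term measures how much the rule body x shifts belief about the class y, and p(x) weighs that by how often the body applies.

from math import log2

def j_measure(p_x, p_y, p_y_given_x):
    """J measure of a rule 'if x then y':
    J = p(x) * [ p(y|x) log2(p(y|x)/p(y))
               + (1 - p(y|x)) log2((1 - p(y|x))/(1 - p(y))) ]"""
    def term(p, q):
        return 0.0 if p == 0.0 else p * log2(p / q)
    return p_x * (term(p_y_given_x, p_y) + term(1.0 - p_y_given_x, 1.0 - p_y))

# A rule covering 40% of records that lifts the class from a 75% prior
# to 95% accuracy scores about 0.083 bits:
print(j_measure(0.40, 0.75, 0.95))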

Apriori: The Apriori algorithm (Agrawal et al., 1993, 1996) is the most prominent association rule algorithm. Pre-discretisation of numeric attributes is necessary, since the algorithm can only handle categorical attributes; a simple equal-width discretisation scheme was used for this. The output of this algorithm is not constrained to rules for a particular attribute, hence only the nuggets relating to the class under scrutiny need to be analysed for the task of nugget discovery. The Apriori rules contain single-value tests for categorical attributes.
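Equal-width binning is the simplest discretisation scheme; a minimal sketch of the kind of preprocessing described (hypothetical function, not the authors' code):

def equal_width_bins(values, k):
    """Split [min, max] into k bins of equal width and replace each
    numeric value by a categorical bin label."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0                  # guard constant columns
    def bin_of(v):
        return min(int((v - lo) / width), k - 1)  # clamp the max into last bin
    return [f"[{lo + b * width:g},{lo + (b + 1) * width:g})"
            for b in map(bin_of, values)]

print(equal_width_bins([17, 23, 35, 44, 58, 90], 4))
# ['[17,35.25)', '[17,35.25)', '[17,35.25)', '[35.25,53.5)',
#  '[53.5,71.75)', '[71.75,90)']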

EXPERIMENTS

The heuristic algorithms, as well as the 11 algorithms mentioned above, were applied to the problem of nugget discovery using four databases extracted from the UCI repository (Merz & Murphy, 1998). The databases chosen are briefly summarised in Table 1.

Full details of all the experiments carried out, and more details on each database, are presented in de la Iglesia (2001). Here only a summary of the findings is presented.

For the first three databases, patterns were sought for all the classes. For the last database, patterns were sought only for two of the classes: the majority class, representing 48.8% of examples, and the minority class, representing 0.5%. For the Adult database, records with missing values were removed, and for the Mushroom database one of the attributes, which contains a high percentage of missing values, was also removed as a preprocessing step. The other datasets contained no missing values. Also, for the Apriori algorithm, numeric attributes were discretised using a simple equal-width bin scheme. In the case of the Forest Cover Type database, a balanced version of the data was produced for the extraction of nuggets for the minority class. This balanced version was a reduced dataset, with records belonging to classes other than the minority class removed at random until the minority class represented a higher percentage of the data (over 30%); a sketch of this balancing step follows.
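A minimal sketch of such random undersampling (assumed behaviour consistent with the description above, not the authors' exact procedure):

import random

def balance_by_undersampling(records, is_minority, target_share=0.3, seed=0):
    """Randomly discard records of the other classes until the minority
    class makes up target_share of the result."""
    rng = random.Random(seed)
    minority = [r for r in records if is_minority(r)]
    majority = [r for r in records if not is_minority(r)]
    # m / (m + keep) == target_share  =>  keep = m * (1 - share) / share
    keep = min(len(majority),
               int(len(minority) * (1 - target_share) / target_share))
    rng.shuffle(majority)
    return minority + majority[:keep]

# E.g. 100 minority records in a 10,000-record file: keep = 233 records
# from the other classes, so the minority class is ~30% of the result.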

Algorithms were tried on both the balanced and the complete dataset, and the best nuggets obtained by either were selected for comparison. The balanced version of the data was included because techniques such as C5 are known to give best results if balanced databases are used (de la Iglesia, Debuse and Rayward-Smith, 1996).

Name               Records   Attrib.   Classes   Numeric Attrib.   Categorical Attrib.
Adult              45,222    14        2         6                 8
Mushroom           8,124     21        2         0                 21
Contraception      1,473     9         3         2                 7
Forest Cover Type  581,012   54        7         10                14

Table 1: Databases from the UCI repository chosen for nugget discovery

The heuristics, however, did not need the balanced dataset to induce the rules of best quality.

The databases used posed different challenges for the algorithms: the Adult database can be considered large and has some strong patterns; the Mushroom database has very obvious and strong patterns; the Contraception database has very weak patterns, particularly for some classes; and the Forest Cover Type database is large and contains a very good example of a minority or exceptional class. The databases chosen have a mixture of categorical and numeric attributes.

For each database, the data was partitioned into a train/test split. The algorithms were then applied using a set of parameters chosen previously as a result of parameter experimentation. For the algorithms that produce a complete classification, or a set of nuggets or rules, each rule was evaluated under the fitness measure for a range of λ values. The λ values used were chosen to represent decreasing accuracy thresholds, and varied from dataset to dataset; at most six different λ values were tried for each dataset. Experimentation to find the right λ values for a particular problem is reasonably straightforward: start with a high accuracy threshold (close to 100%) and decrease it in approximately equal steps until the threshold equals the default accuracy of the class being targeted. The process of λ experimentation can be automated, as sketched below.
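The fitness measure itself is defined earlier in the chapter and is not restated in this excerpt. The sketch below uses the form f = λc - (c + d), with c and d the numbers of records a rule covers correctly and incorrectly; this form is consistent with the fitness figures and the "Acc T" thresholds in Table 2 (a rule has positive fitness exactly when its accuracy exceeds 1/λ, giving 91%, 83%, 77% and 71% for λ = 1.1 to 1.4), but the published definition in de la Iglesia (2001) should be treated as authoritative.

def fitness(c, d, lam):
    """Fitness of a nugget covering c records correctly and d incorrectly.
    Assumed form f = lam*c - (c + d): positive exactly when the accuracy
    c/(c+d) exceeds the threshold 1/lam."""
    return lam * c - (c + d)

def lambda_schedule(default_accuracy, steps=6):
    """Automate the lambda experimentation described above: accuracy
    thresholds from near 100% down to the default accuracy of the target
    class, in approximately equal steps, each mapped to lam = 1/threshold."""
    hi = 0.99
    thresholds = [hi - i * (hi - default_accuracy) / (steps - 1)
                  for i in range(steps)]
    return [1.0 / t for t in thresholds]

# Class '<=50K' makes up roughly 75% of the Adult data, so:
print([round(l, 2) for l in lambda_schedule(0.75)])
# -> [1.01, 1.06, 1.12, 1.18, 1.25, 1.33]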

From the set of rules, or individual rules, obtained by each algorithm, the rule with the highest fitness for each λ value was chosen for comparison with the rules produced by the other algorithms. For illustrative purposes, one set of result tables is reproduced in Table 2; N/R means that no rule was obtained. For each algorithm, the fitness, accuracy and coverage of the best (highest-fitness) nugget produced are recorded in the table, with the figures in brackets giving the accuracy and coverage obtained by testing the nugget on the test partition. The highest fitness results correspond in this case to the heuristics (the GA, SA and TA rows).

Similar tables are presented in de la Iglesia (2001) for each class examined in each database; not all results can be reproduced here due to space constraints. In summary, this is how the algorithms performed on the different problems:

Adult database: For this dataset, the heuristic techniques, C5, GRI and Brute produced fit nuggets for both classes consistently at different levels of accuracy and generality. The algorithms 1R and KnowledgeSEEKER produced good results in some cases, but not as consistently. The other algorithms produced results of lower fitness in most cases. The nuggets obtained by the best algorithms were simple rules that represented good descriptions of the classes, with high accuracy and coverage. The measures of accuracy and coverage of the same rules when applied to the test datasets showed that the patterns were not overfitting the training data, as the properties of the nuggets remained stable on new data.

Method    λ          1.1         1.2         1.3         1.4
          Acc T (%)  91          83          77          71
CN2       Fitness    455.6       1381.2      2306.8      2232.4
          Acc        95.2(95.3)  95.2(95.3)  95.2(95.3)  95.2(95.3)
          Cov        40.9(40.9)  40.9(40.9)  40.9(40.9)  40.9(40.9)
RIPPER    Fitness    455.6       1381.2      2306.8      3322.6
          Acc        95.2(95.3)  95.2(95.3)  95.2(95.3)  93(92.5)
          Cov        40.9(40.9)  40.9(40.9)  40.9(40.9)  45.2(44.1)
BRUTE     Fitness    785.2       2262.4      3739.6      5216.8
          Acc        95.5(95.7)  95.5(95.7)  95.5(95.7)  95.5(95.7)
          Cov        65.2(65.5)  65.2(65.5)  65.2(65.5)  65.2(65.5)
C5        Fitness    798.7       2296.4      3794.1      5291.8
          Acc        95.5(95.7)  95.5(95.7)  95.5(95.7)  95.5(95.7)
          Cov        66.1(66.3)  66.1(66.3)  66.1(66.3)  66.1(66.3)
KS        Fitness    873.4       2132        3603        5074
          Acc        97.3(97.6)  94.8(95)    94.8(95)    94.8(95)
          Cov        53.3(53)    64.9(65)    64.9(65)    64.9(65)
T2        Fitness    57.5        268.6       1088.4      1908.2
          Acc        96(96.1)    85.7(85.8)  85.7(85.8)  85.7(85.8)
          Cov        4.3(4.7)    36.2(36.4)  36.2(36.4)  36.2(36.4)
1R        Fitness    28.6        57.2        866.9       3128.2
          Acc        100(100)    100(100)    79.3(79.5)  79.3(79.5)
          Cov        1.3(1.1)    1.3(1.1)    99.8(99.8)  99.8(99.8)
PART      Fitness    476.7       991.8       1508.2      2024.6
          Acc        99.7(99.5)  99.2(99.2)  99.2(99.2)  99.2(99.2)
          Cov        21.7(21.2)  22.8(22.8)  22.8(22.8)  22.8(22.8)
RISE      Fitness    163.9       330.8       497.7       664.6
          Acc        99.8(99.4)  99.8(99.4)  99.8(99.4)  99.8(99.4)
          Cov        7.4(6.8)    7.4(6.8)    7.4(6.8)    7.4(6.8)
GRI       Fitness    586.3       1499.6      2412.9      3326.2
          Acc        96.5(96.6)  96.5(96.6)  96.5(96.6)  96.5(96.6)
          Cov        40.3(40.3)  40.3(40.3)  40.3(40.3)  40.3(40.3)
Apriori   Fitness    N/R         N/R         N/R         N/R
          Acc        -           -           -           -
          Cov        -           -           -           -
GA        Fitness    954.1       2328.6      3793.2      5293.6
          Acc        97.7(97.7)  96.3(96.7)  95.6(95.8)  95.7(95.9)
          Cov        55.4(55.2)  63.5(63.3)  66(66.2)    65.8(65.9)
SA        Fitness    954.7       2329.8      3803.2      5294.8
          Acc        97.7(97.7)  96.3(96.7)  95.6(95.8)  95.7(95.9)
          Cov        55.4(55.2)  63.5(63.3)  66(66.2)    65.8(65.9)
TA        Fitness    930.1       2329.8      3803.2      5294.8
          Acc        97.7(97.9)  96.3(96.6)  95.7(95.9)  95.6(95.8)
          Cov        53.7(53.4)  63.6(63.4)  65.8(65.9)  66.1(66.3)

Table 2: Results of nugget discovery for class '<=50K' in the Adult dataset

Mushroom database: This database contains very strong rules (nearly 100% accurate and 100% covering). The heuristics, together with Brute, C5, KnowledgeSEEKER and 1R, perform well for both classes, finding the strong rules present. Again, the rules found do not show signs of overfitting. RIPPER performs well for one class, but worse for the other. The other algorithms find rules of worse quality for both classes, in most cases rules that are very accurate but of considerably lower coverage.

Contraception database: This database does not contain strong patterns, especially for two of the classes. In particular, those classes do not seem to contain accurate rules of wide coverage, and the accurate rules found show signs of overfitting. The heuristics find good rules at different levels of accuracy/coverage for the more obvious class, and good rules of wide coverage for the other two classes, but no good accurate rules for those classes. C5 did not find very good rules, except for the class with the more obvious patterns. CN2 found some accurate rules, but did not find very general rules for any of the classes.

Brute produced some reasonably good accurate rules, in particular for the class with the more obvious patterns, but they did show some signs of overfitting. The other algorithms did well only in some isolated cases.

Forest Cover Type database: Some algorithms could not cope with the size of this file, so two random samples containing 10% and 5% of the data respectively were used as alternatives when an algorithm failed to run on the bigger file; in those cases, the rest of the data was used as a test set. For the majority class, the heuristics gave the best performance, with C5, Brute and 1R performing well in some cases. The other algorithms performed poorly in most cases. For the minority class, which presented a real challenge for the algorithms, the heuristics obtained the best rules for all except the highest level of accuracy sought. C5 found a good accurate rule for this class, and performed generally well. RIPPER and CN2 found good accurate rules, but the wide-coverage rules they found were of worse quality. Brute, KnowledgeSEEKER and 1R performed well for the wide-covering rules. The rest of the algorithms performed poorly in general.

The results obtained show that the heuristic techniques produce good nuggets for most problems and most λ values in terms of their fitness. This is perhaps not surprising, as they are designed to optimise the fitness measure. Other good performers are C5, Brute and KnowledgeSEEKER. Those algorithms extract rules of high coverage and accuracy and, in general, do not overfit the training data.

However, with the other algorithms, a way of selecting interesting nuggets from the set of nuggets obtained is necessary. We have used the fitness measure to rank the rules, and we have selected in each case the fittest rule. But how good is the fitness measure as an assessment of the interest of a rule? The patterns obtained confirm that a higher fitness measure implies a higher ranking in the partial ordering, for rules that are comparable under that ordering. This is definitely desirable.

It can also be seen that the λ parameter allows the search criteria to be varied between accurate and more general patterns. This also seems a good property, as it delivers patterns ranging from the accurate or very accurate to the very general, for all databases and classes. Hence ranking patterns according to the fitness measure seems to deliver interesting patterns, and using the heuristics as a means of finding those patterns tends to deliver the fittest patterns.
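As a check on the claim that higher fitness implies a higher position in the partial ordering, the sketch below assumes a natural reading of that ordering (rule r1 is at least as good as r2 when it covers at least as many records correctly and no more incorrectly; the chapter's formal definition is in the earlier sections not reproduced here) and verifies monotonicity empirically for the fitness form used above.

from itertools import product

def fitness(c, d, lam):
    return lam * c - (c + d)

def dominates(r1, r2):
    """Assumed partial ordering on rules represented as (correct, incorrect)."""
    (c1, d1), (c2, d2) = r1, r2
    return c1 >= c2 and d1 <= d2

# Empirical check over a grid of (correct, incorrect) counts: for lam > 1,
# higher in the ordering implies greater-or-equal fitness.
rules = [(c, d) for c in range(0, 50, 7) for d in range(0, 50, 7)]
assert all(fitness(*r1, 1.2) >= fitness(*r2, 1.2)
           for r1, r2 in product(rules, rules) if dominates(r1, r2))
print("fitness respects the assumed partial ordering")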

We have previously said that the heuristic algorithms produce good nuggets for most classes and most λ values, generally outperforming the other algorithms. In comparing the three heuristics, it is difficult to judge which is the best technique: SA, GA or TS. They all seem to perform well and mostly produce similarly good results, so any of them would be a good choice for nugget discovery. TS seems to offer very robust performance, with the best results in many of the experiments. This is very encouraging considering that the implementation of TS used is quite naive and does not include many of the advanced Tabu Search features that can improve the search process. A full implementation of TS, including for example aspiration criteria and good intensification and diversification techniques, may be the next step forward, given these results.

In terms of execution time, the three heuristic algorithms performed similarly for most of the experiments undertaken. All three were run with stopping conditions of a similar nature, such as a fixed number of generations without change for the GA, or a fixed number of proposed moves without change for the TA/SA algorithms, as sketched below. Statistics on when the best solution was reached were kept and compared, and showed similar performance for all three heuristics.
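A generic sketch of this kind of stagnation-based stopping rule (illustrative only; the chapter's SA, TA and GA each wrap it in their own acceptance or selection logic):

import random

def run_until_stagnant(propose_move, evaluate, initial, max_idle=10_000, seed=0):
    """Stop after max_idle consecutive proposed moves that fail to improve
    the best fitness seen. Acceptance here is plain hill-climbing."""
    rng = random.Random(seed)
    current = best = initial
    best_f, idle = evaluate(initial), 0
    while idle < max_idle:
        candidate = propose_move(current, rng)
        f = evaluate(candidate)
        if f > best_f:
            best, best_f, idle = candidate, f, 0
            current = candidate          # acceptance rule differs per heuristic
        else:
            idle += 1
    return best, best_f

# Toy demo: maximise -(x - 3)^2 over the integers by unit steps.
best, f = run_until_stagnant(lambda x, r: x + r.choice([-1, 1]),
                             lambda x: -(x - 3) ** 2, initial=0, max_idle=500)
print(best, f)   # typically: 3 0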

Execution times are difficult to compare, given that experiments were conducted on different machines and in competition with other users, which produced variations in performance. In general, however, the time to run a single TA, SA or GA experiment was comparable to the time taken to produce, for example, a C5 set of rules. Since C5 offered one of the best execution-time performances among the other algorithms (CN2, Brute, T2, etc.), the heuristics were competitive with all the other algorithms in terms of execution time.

CONCLUSIONS

In this chapter, we have established nugget discovery as a data mining task in its own right. We have shown that it needs to be guided by a good measure of interest, and have looked at the properties of a nugget that should be incorporated into such a measure. We have presented a partial ordering of rules that should be enforced by measures of interest for nugget discovery. We have then looked at the measures of interest that have traditionally been used by classification (complete or partial) algorithms, and established that many do not uphold the partial ordering and hence are not suitable for ordering and selecting nuggets. The fitness measure, a measure which establishes the partial ordering and allows the search criteria to be varied towards more accurate or more general rules, has been put forward.

The heuristic algorithms for nugget discovery have been reviewed. Other algorithms, which can also be used for this task, were briefly introduced. They use the different measures of interest reviewed to produce complete or partial classifications.

Using four databases from the UCI repository, nugget discovery has been performed with all the algorithms presented. In each case, the fitness measure has been used to choose the most interesting nuggets from the set of rules produced by the algorithms. Examination of the nuggets obtained has established that algorithms such as C5, Brute and KnowledgeSEEKER are capable of obtaining good nuggets using their own guiding criteria for the search, with the fitness measure used to choose the most interesting nugget from the set obtained. The heuristics have produced the best overall results, and since the delivery of interesting nuggets is their guiding search criterion, they are the best choice for this task.

Tabu Search has shown great potential, and more sophisticated implementations of Tabu Search for data mining must be a focus of future research. Some further work, in particular using the Simulated Annealing algorithm, is also being implemented in the commercial data mining toolkit Datalamp (Howard, 1999) (http://www.datalamp.com). Adaptation of the techniques to databases with many missing values is another area for future research. Algorithms that search for all conjunctive rules that are best according to some criteria of accuracy and applicability (Bayardo & Agrawal, 1999) have been proposed for categorical data. This is a promising area of research, which would also provide a good benchmark.
