
THE BYPASS ALGORITHM

In the document Data Mining: A Heuristic Approach (Pages 134-141)

Now that we have seen basic aspects of LCSs and how the BYPASS representation relates to other learning systems, in this section we review the remaining aspects of the BYPASS algorithm, namely, its initialization, matching, performance, reinforcement, rule discovery and rule deletion policies. We will establish several further connections with other ideas in the literature. Experimental results are presented in the next section.

The initial population is always generated according to a simple extension of EXM, the exemplar-based generalization procedure, also known as the cover detector (Frey & Slate, 1991; Robertson & Riolo, 1988; Wilson, 1998). EXM randomly selects a single data item (x,y) and builds a single classifier from it. The receptive field Q is constructed by first setting Q = x and then parsing through its coordinates l = 1,...,n: with some fixed generalization probability π, the current value xl is switched to #, otherwise Ql = xl is maintained. As regards R, the likelihood vector c is set to 0, whereas a places most of the mass, say a0, at the current y and distributes 1-a0 evenly among the remaining labels. Naturally, α = κ = 0; initial values for ρ and λ are set after the first match. This procedure is repeated until the user-input initial population size, say P0, is obtained. No exact duplicated receptive fields are allowed into the system.
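The EXM procedure just described can be sketched as follows. This is a minimal sketch, not the author's code: the function name, the dictionary fields and the '#' wildcard symbol are illustrative choices.

```python
import random

WILDCARD = "#"

def exm(x, y, k, pi, a0=0.7):
    """Exemplar-based cover detector (EXM): a minimal sketch.

    One classifier is built from the data item (x, y): each coordinate
    of the receptive field Q is generalized to '#' with probability pi,
    the likelihood vector c starts at zero, and the prior a places mass
    a0 on the observed label y, spreading 1 - a0 evenly over the
    remaining k - 1 labels.
    """
    Q = [WILDCARD if random.random() < pi else xl for xl in x]
    a = [(1.0 - a0) / (k - 1)] * k
    a[y] = a0
    return {"Q": Q, "a": a, "c": [0.0] * k, "alpha": 0, "kappa": 0}
```

Repeated calls with fresh random data items would build the initial population of size P0, discarding exact duplicates of existing receptive fields.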

Typically only the case π = .5 is considered. When EXM is used with larger π, the system is forced to work with more general rules (unless specific rules are produced by the rule discovery heuristic and maintained by the system). Thus, larger values of π introduce a faster Bayesian learning rate (since rules become active more often) as well as a higher degree of overlap between receptive fields. However, match set size also increases steadily with decreasing specificity h. Some pilot runs are usually needed to select the most productive value of π. For DM applications, n is typically large and a very large π is the only hope of achieving easy-to-interpret rules.

Table 1: Summary of the main BYPASS execution parameters. Other, less important parameters (and the associated values used below) are the initial population size P0 (50 or 100); the updating constant τ (1/50); the prior hyperparameter a0 (.7); and the GA parameters, namely, the type of crossover (uniform) and the mutation probability β (5.0 × 10^-5). Most system parameters can be changed online.

Parameter   Description
π           Generalization bias
µ0 (grace)  Utility thresholds
(p, γ)      Reward policy
θ           GA activity rate

We now look at matching issues. A receptive field is matched if all of its coordinates are matched. For Boolean predictors l, exact matching is required: |xl - Ql| < 1. Real-valued predictors are linearly transformed and rounded to a 0-15 integer scale prior to training (see Table 2). For integer-valued predictors, matching requires only |xl - Ql| < 2. This window size seems to provide adequate flexibility and remains fixed for all predictors and all classifiers throughout all runs; for another approach to integer handling, see Wilson (2000). Eventually, the match set M may be empty for some x. Note that this becomes less likely as π increases (it will hardly be a problem below). In any case, whenever the empty-M condition arises, it is immediately handled by a new call to EXM, after which a new input x is selected.
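A minimal sketch of this matching rule follows; tagging each coordinate with its predictor type (the `kinds` argument) is a bookkeeping device assumed here, not something specified in the text.

```python
def matches(Q, x, kinds):
    """Sketch of the BYPASS matching rule.

    'kinds' tags each coordinate as 'bool' or 'int' (the 0-15 integer
    coding).  Wildcard '#' coordinates match anything; Boolean
    coordinates require an exact match (|xl - Ql| < 1); integer-coded
    coordinates match within the fixed window |xl - Ql| < 2.
    """
    for ql, xl, kind in zip(Q, x, kinds):
        if ql == "#":
            continue                            # wildcard matches all
        window = 1 if kind == "bool" else 2     # matching tolerance
        if abs(xl - ql) >= window:
            return False
    return True
```

The match set M is then simply the subset of the population whose receptive fields satisfy this predicate at the current x.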

Once the (non-empty) M is ascertained, two system predictions are computed, namely, the single-winner (SW) and mixture-based (MIX) predictions. SW selects the classifier in M with the lowest uncertainty evaluation ρ as its best resource (and ignores the remaining classifiers; see the previous section). The maximum a posteriori (MAP) class label zSW is then determined from this single R as

zSW = argmax_{1≤j≤k} Rj.

On the other hand, MIX combines first the m matched Rs into the uniform mixture distribution

RMIX = (1/m) ∑_{1≤s≤m} R(s)

and then obtains the prediction zMIX by MAP selection as before.
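Both prediction modes can be sketched as follows; the dictionary representation of matched classifiers (fields "R" and "rho") is illustrative only.

```python
def predict(match_set):
    """SW and MIX predictions from a non-empty match set: a sketch.

    Each entry carries a predictive distribution R over the k labels
    and an uncertainty evaluation rho; lower rho is better.
    """
    k = len(match_set[0]["R"])

    # Single winner: lowest rho in M, then the MAP label of its R.
    winner = min(match_set, key=lambda cl: cl["rho"])
    z_sw = max(range(k), key=lambda j: winner["R"][j])

    # Mixture: uniform average of all matched Rs, then MAP as before.
    m = len(match_set)
    R_mix = [sum(cl["R"][j] for cl in match_set) / m for j in range(k)]
    z_mix = max(range(k), key=lambda j: R_mix[j])
    return z_sw, z_mix
```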

The MIX prediction is generally preferred over the SW alternative with regard to the system’s reinforcement and general guidance. The MIX prediction combines multiple sources of information and it does so in a cooperative way. It can be argued that the SW predictive mode tends to favor relatively specific classifiers and therefore defeats to some extent the quest for generality. The bias towards the MIX mode of operation can be seen throughout the design of BYPASS: several system decisions are based on whether zMIX is correct or not. For example, the rule-generation mechanism is always triggered by MIX failure (regardless of the success of zSW).

The accuracy ρ is updated at each step (no entropy calculations are needed) according to the familiar discounting scheme

ρ ← (1 - τ) ρ + τ Sy,

where Sy is the well-known score of each individual classifier in M, namely Sy = -log(Ry) > 0, and τ is a small positive number. As noted earlier, the SW prediction is deemed of secondary interest only, so there would be no essential loss if ρ were omitted altogether from the current discussion of performance.

Table 2: Artificial and real data sets used in the experiments below. n is the number of predictors and k is the number of output categories. Recall that real predictors are linearly transformed and rounded to a 0-15 integer scale prior to training; thus, in general, Q ∈ {0,1,2,...,9,A,B,C,D,E,F}^n. Uniform weights wj are used in the first case. For the satellite data, output labels 1, 3 and 6 have higher frequencies fj and therefore non-uniform weights wj are used.

              n   k   predictors   training   test
jmultiplexer  33  8   Boolean      10,000     10,000
satellite     36  6   Real         4,435      2,000
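The discounted update of ρ can be sketched in a few lines; the argument layout is hypothetical.

```python
import math

def update_rho(rho, R, y, tau=1 / 50):
    """Sketch of the discounted accuracy update:
    rho <- (1 - tau) * rho + tau * S_y, with S_y = -log(R_y) > 0."""
    s_y = -math.log(R[y])
    return (1 - tau) * rho + tau * s_y
```

Because τ is small, ρ behaves as an exponentially discounted average of the classifier's recent scores.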

On the other hand, the individual scores Sy of matched rules are central quantities in the BYPASS reinforcement policy. Clearly, the lower Sy, the better R at this particular x. Again, several decisions will be made depending on the absolute and relative magnitude of these Sy. Any upper bound set on Sy can be expressed equivalently as a lower bound for Ry. An important point regarding these bounds is that, no matter how they are expressed, they have an intrinsic meaning that can be cross-examined in a variety of related k-way classification problems (obviously, it does not make sense to relate scores based on different k).

Table 3: A single cycle based on the second multiplexer population (Figure 1). The first line is the input item (x,y). The second line simply places the eight output labels. Eight matched classifiers are listed next (only Q, R, and ρ are shown). Predictive distributions R are shown graphically by picking only some labels with the largest predicted probabilities. The SW mode of operation (involving the lowest accuracy) is shown to fail in the next few lines. Finally, the MIX mode is shown to succeed and the p = 3 rewarded units are reported. #s are replaced by dots (and subsets of 11 coordinates are mildly separated) for clarity.

01000111001 11011101010 11101010111 *
                        ABCDEFGH
0/ ... ... .11...1      * *   1.911
1/ ..0..1.1... ... ...  * *   1.936
2/ ... 1.0....1... ...  * *   1.894
3/ ... ... .1....1..1.  **    1.893
4/ 01...11.... ... ...  * *   1.384
5/ ... 110...1. ...     **    1.344
6/ ... 1.0....1.1. ...  **    1.347
7/ ... .10..1...1. ...  **    1.365

Single winner is:
                        ABCDEFGH
... 110...1. ...        **    1.344
Predicted category is: G
Single winner's score: 1.367

Combined prediction is:
                        ABCDEFGH
                        **
Predicted category is: H
Combined prediction's score: 1.494
Rewarded units: (5 6 7)

Reinforcement takes place every cycle and essentially involves the updating of each matched classifier’s raw utility κ in light of the system’s predicted label zMIX and the current observation y. Two cases are distinguished, success or failure, and in each case different subsets of classifiers are selected from M for reward. If zMIX succeeds, then classifiers with the lowest Sy should be the prime contributors. A simple idea is to reward the p lowest values. The κ counter of each of these units is updated as κ ← κ + wy, where wy depends in turn on whether all output labels are to be given the same weight or not. If so, then wj ≡ 1; otherwise, wj = (k fj)^-1, where fj denotes the relative frequency of the j-th output label in the training sample. The rationale is that widely different fj make it unfair to reward all classifiers evenly.

Even if all output labels show up equally often, we may have more interest in certain labels; an appropriate bias can then be introduced by appropriately chosen wj.

Table 4: Selected classifiers by MAP label for the satellite data. All these classifiers belong to a single population obtained under (p = 5, γ = 1.15). The three best accuracies in each team are extracted. Receptive fields are shown first (see text for details). The last row shows predictive distributions and accuracies (in the same order). The total number of classifiers in each class is 12, 10 and 15, respectively. Again, #s are transformed into dots for ease of reference.

Label j=4       Label j=5       Label j=6

9.. ..9 ... 5.. ... .4. ... ... ... ..7 ... ...
... 9.. ..7 ... .3. ... ... ... ... ... ... 4..
... ... ... ... .4. ... ... ... 6.. ... ..4 ...
... ... .7. ... ... ... ... ... ... ... .4. 3..
... ... ..7 ... ... .4. .4. ... ... ... ... 4..
..9 ... ... ... ... 5.. ... ... ... ..7 ..4 ...
..8 .8. ... ... .54 ... ... ... ... 6.. ... 3..
.88 ... .7. ... ... 4.. ... ... ... .8. ... ...
... ... ... ... ... .44 ... ... .6. ... ... ...

ABCDEF          ABCDEF          ABCDEF

* * 0.817   * * 0.256   ** 0.113
* * 0.975   ** 0.371    ** 0.146
* * 0.776   ** 0.113    * * 0.222

In the case of zMIX failure, it can be argued cogently that not all matched classifiers should be left without reward. Taking again Sy as the key quantity, a patient reinforcement policy reinforces all rules with scores below a certain system threshold γ > 0: their κ counters are increased just as if zMIX had been successful. A potentially important distinction is therefore made between, say, classifiers whose second MAP class is correct and classifiers assigning very low probability to the observed output label. The main idea is to help rules with promising low scores survive until a sufficient number of them are found and maintained. In that case they will hopefully begin to work together as a team and thus get their reward from correct zMIX! The resulting reward scheme is thus parameterized by p > 0 and γ ≥ 0. Since the number of matched classifiers per cycle (m) may be rather large, a reward policy reinforcing a single classifier might appear rather “greedy”. For this reason, higher values of p are typically tried out. The higher p, the easier the cooperation among classifiers (match sets are simply given more resources to establish themselves).

On the other hand, if p is too high then less useful units may begin to be rewarded and the population may become too large. Moderate values of p usually give good results in practice. Parameter γ must also be controlled by monitoring the actual number of units rewarded at a given γ. Again, too generous a γ may inflate the population excessively. It appears that some data sets benefit more from γ > 0 than others, the reasons having to do with the degree of overlap among data categories (see table).
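The complete (p, γ) reward policy can be sketched as follows; the classifier fields and the weight vector `w` are illustrative.

```python
import math

def reinforce(match_set, z_mix, y, p, gamma, w):
    """Sketch of the (p, gamma) reward policy.

    On zMIX success, the p matched classifiers with the lowest scores
    S_y = -log(R_y) receive kappa += w[y].  On failure, the patient
    policy rewards every matched classifier whose score is below gamma,
    just as if zMIX had been correct.  Returns the number rewarded.
    """
    def score(cl):
        return -math.log(cl["R"][y])

    if z_mix == y:
        rewarded = sorted(match_set, key=score)[:p]   # p best scores
    else:
        rewarded = [cl for cl in match_set if score(cl) < gamma]
    for cl in rewarded:
        cl["kappa"] += w[y]
    return len(rewarded)
```

Note that γ = 0 recovers the pure "reward only on success" scheme, since no score can fall below zero.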

Table 5: Selected classifiers by MAP label as in Table 4.

Label j=1 Label j=2 Label j=3

... ... ... ... ... ... ... ... ... ... ... ...

.6. ... ... ... ... ... ... ..D .D. ... ... ...

... .C. ... ... ... ... ... ... ... ... ... ...

.6. ... ... ... ... .0. ... ... ... ... ... ...

.6. ... B.. ... ... ... ... ... ... ... ... ...

... ... ... ... ... ... ... ... C.D ... ... ...

.6. ... C.. ... ... ... ... ... .C. B.. ... ...

... ... ... ... ... ... ... .B. ..C ... ... ..7
... ... ... ... ... ... ... ... ..D ... ... ...

ABCDEF ABCDEF ABCDEF

* 0.116 * 0.000 ** 0.112

* 0.044 * 0.027 ** 0.106

** 0.018 ** 0.005 ** 0.029

The BYPASS rule discovery sub-system includes a genetic algorithm (GA) and follows previous recommendations laid out in the literature. For example, it is important to restrict mating to rules that are known to be related in some way (Booker, 1989). A familiar solution is to use again the match set M as the basic niche: only rules that belong to the same M will ever be considered for mating (as opposed to a population-wide GA). To complement this idea, the GA is triggered by zMIX failure. A further control is introduced: at each failure time either the GA itself or the EXM routine will act, depending on the system’s score threshold θ. Specifically, the procedure first checks whether there are at least two scores Sy in M lower than θ. If so, standard crossover and mutation are applied as usual over the set of matched receptive fields (Goldberg, 1989). Otherwise, a single classifier is generated by EXM (on the basis of the current datum). The rationale is to restrict the mating process further: no recombination occurs unless there are rules in the match set that have seen a substantial number of instances of the target label y. The θ parameter can be used to strike a balance between the purely random search carried out by EXM and the more focused alternative provided by the GA. This balance matters because EXM supplies the raw variability that the GA can then exploit. Since rules created by the GA tend to dominate and often sweep out other useful rules, best results are obtained when θ is relatively demanding (Muruzábal, 1999).
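The dispatch between the GA and EXM on a failure cycle can be sketched as follows; the string return values are illustrative.

```python
import math

def discovery_action(match_set, y, theta):
    """Sketch of the GA / EXM dispatch on a zMIX failure.

    The GA is applied only when at least two matched classifiers have
    scores S_y = -log(R_y) below the threshold theta; otherwise the
    purely random EXM cover detector is invoked instead.
    """
    low = sum(1 for cl in match_set if -math.log(cl["R"][y]) < theta)
    return "GA" if low >= 2 else "EXM"
```

A demanding (small) θ thus keeps the GA quiet unless the niche already contains rules that predict the target label well.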

Both standard uniform and single-point crossover are implemented (Syswerda, 1989). In either case, a single receptive field is produced by crossover. Mutation acts on this offspring with some small coordinate-wise mutation probability β. The final Q is endowed with κ = 0, cj ≡ 0 and a prior vector a built from the current y, as in EXM above. Receptive fields are selected for mating according to the standard roulette-wheel procedure with weights given by the normalized inverses of their scores Sy. As before, exact copies of existing receptive fields are precluded.
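Uniform crossover with roulette-wheel selection can be sketched as below. The exact mutation operator is not specified in the text, so mutating a coordinate to '#' or a random symbol is an assumption of this sketch.

```python
import math
import random

def uniform_offspring(match_set, y, beta, alphabet):
    """Sketch of uniform crossover plus mutation over matched fields.

    Parents are drawn by roulette wheel with weights proportional to
    the inverses of their scores S_y = -log(R_y); each child coordinate
    comes from either parent with equal probability, then mutates with
    probability beta (to '#' or a random symbol -- an assumption).
    """
    inv = [1.0 / -math.log(cl["R"][y]) for cl in match_set]
    p1, p2 = random.choices(match_set, weights=inv, k=2)
    child = [a if random.random() < 0.5 else b
             for a, b in zip(p1["Q"], p2["Q"])]
    return [random.choice(list(alphabet) + ["#"])
            if random.random() < beta else ql
            for ql in child]
```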

Table 6: Bagging performance on the jmultiplexer problem. First three lines: S-PLUS fitting parameters. Next three lines: single (bagged) tree size, test success rate and edge (net advantage of the bagged rate over the single-tree rate). Figures given are averages over five runs. A total of fifty bagging trees were used in all cases. Results are presented in decreasing order of rule generality. Note that no pruning was necessary since trees of the desired size were grown directly in each case.

              Series 1  Series 2  Series 3  Series 4
minsize       20        10        8         6
mincut        10        5         4         3
mindev        .05       .01       .005      .0025
size          32        120       212       365
success rate  25.2      44.4      51.6      53.2
edge          +8.1      +21.8     +27.6     +27.1

Finally, consider the rule deletion policy. At the end of each cycle, all classifiers have their utility µ = κ/α checked for “services rendered to date”. Units become candidates for deletion as soon as their utility µ drops below a given system threshold µ0. This powerful parameter also helps to promote generality: if µ0 is relatively high, specific classifiers will surely become extinct no matter how low their accuracy ρ. It is convenient to view µ0 as µ0 = 1/v, with the interpretation (assuming wj ≡ 1) that classifiers must be rewarded once every v cycles on average in order to survive. Early versions of the system simply deleted all classifiers with µ < µ0 at once. It was later thought a good idea to avoid sudden shocks to the population as much as possible; therefore, only one classifier is actually deleted per cycle, namely the one exhibiting the largest λ. The idea is to keep all teams or match sets at about the same size. Also, because a relatively high µ0 is sometimes used, a mercy period of guaranteed survival α0 is granted to all classifiers; that is, no unit with α ≤ α0 is deleted (Frey & Slate, 1991). This gives classifiers some time to refine their likelihood vectors before they can be safely discarded. Again, to aid interpretation, parameter α0 is usually re-expressed as α0 = mercy × v, so values of mercy are provided in turn.
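The per-cycle deletion rule can be sketched as follows; the field names, including `lam` for λ, are illustrative.

```python
def deletion_step(population, mu0, alpha0):
    """Sketch of the per-cycle deletion rule.

    Classifiers past the mercy period (alpha > alpha0) whose utility
    mu = kappa / alpha falls below mu0 are candidates; only the one
    with the largest (estimated) match-set size lam is removed.
    """
    candidates = [cl for cl in population
                  if cl["alpha"] > alpha0
                  and cl["kappa"] / cl["alpha"] < mu0]
    if not candidates:
        return population
    doomed = max(candidates, key=lambda cl: cl["lam"])
    return [cl for cl in population if cl is not doomed]
```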

Two major modes of training operation are distinguished. During effective training, the system can produce as many new units as it needs to. As noted above, effective training can be carried out at a variety of θ values. During cooling, the rule-discovery sub-system is turned off and no new rules are generated (except those due to empty match sets); all other system operations continue as usual. Since populations typically contain a body of tentative rules enjoying the mercy period, utility constraints often reduce the size of the population considerably during the cooling phase. This downsizing often has noticeable effects on performance, and therefore some cooling time is recommended in all runs. Once cooled, the final population is ready for deployment on new data. During this testing phase, it is customary to “freeze” the population: classifiers no longer evolve (no summaries are updated), and no new rules are injected into the system.

One or several cooled populations may be input again to the system for consolidation and further learning. In this case, the new prior vector a is taken as the old c vector plus a vector of ones (added to prevent zero entries). This naturally ensures that the behavior learned by the older rules shows up from the very first match after re-initialization.
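This consolidation prior amounts to one line of code, sketched here for concreteness.

```python
def consolidated_prior(c):
    """Re-initialization sketch: the new prior a is the old likelihood
    vector c plus a vector of ones (preventing zero entries)."""
    return [cj + 1 for cj in c]
```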

To summarize, BYPASS is a versatile system with just a few, readily interpretable parameters; Table 1 summarizes them. These quantities determine many different search strategies. As noted earlier, some pilot runs are typically conducted to determine suitable values for these quantities in light of the target goal and all previous domain knowledge. In this chapter, these pilot runs and related discussion are omitted for brevity; no justification for the configurations used is given (and no claim of optimality is made either). Experience confirms that the system is robust in that it can achieve sensible behavior under a broad set of execution parameters.

In any case, a few specific situations that may cause system malfunctioning are singled out explicitly below.
