
THE BYPASS ALGORITHM

In the document Data Mining: A Heuristic Approach (Pages 134-141)

Now that we have seen basic aspects of LCSs and how the BYPASS representation relates to other learning systems, in this section we review the remaining aspects of the BYPASS algorithm, namely, its initialization, matching, performance, reinforcement, rule discovery and rule deletion policies. We will establish several further connections with other ideas in the literature. Experimental results are presented in the next section.

The initial population is always generated according to a simple extension of EXM, the exemplar-based generalization procedure, also known as the cover detector (Frey & Slate, 1991; Robertson & Riolo, 1988; Wilson, 1998). EXM randomly selects a single data item (x,y) and builds a single classifier from it. The receptive field Q is constructed by first setting Q = x and then parsing through its coordinates l = 1,...,n: with some fixed generalization probability π, the current value xl is switched to #, otherwise Ql = xl is maintained. As regards R, the likelihood vector c is set to 0, whereas a places most of the mass, say a0, at the current y and distributes 1-a0 evenly among the remaining labels. Naturally, α = κ = 0; initial values for ρ and λ are set after the first match. This procedure is repeated until the user-input initial population size, say P0, is obtained. No exact duplicated receptive fields are allowed into the system.
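The EXM procedure just described can be sketched as follows. This is a minimal sketch, not the author's code: the function name, the dictionary fields and the '#' wildcard symbol are illustrative choices.

```python
import random

WILDCARD = "#"

def exm(x, y, k, pi, a0=0.7):
    """Exemplar-based cover detector (EXM): a minimal sketch.

    One classifier is built from the data item (x, y): each coordinate
    of the receptive field Q is generalized to '#' with probability pi,
    the likelihood vector c starts at zero, and the prior a places mass
    a0 on the observed label y, spreading 1 - a0 evenly over the
    remaining k - 1 labels.
    """
    Q = [WILDCARD if random.random() < pi else xl for xl in x]
    a = [(1.0 - a0) / (k - 1)] * k
    a[y] = a0
    return {"Q": Q, "a": a, "c": [0.0] * k, "alpha": 0, "kappa": 0}
```

Repeated calls with fresh random data items would build the initial population of size P0, discarding exact duplicates of existing receptive fields.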

Typically only the case π = .5 is considered. When EXM is used with larger π, the system is forced to work with more general rules (unless specific rules are produced by the rule discovery heuristic and maintained by the system). Thus, larger values of π introduce a faster Bayesian learning rate (since rules become active more often) as well as a higher degree of overlap between receptive fields. However, match set size also increases steadily with decreasing specificity h. Some pilot runs are usually needed to select the most productive value of π. For DM applications, n is typically large and a very large π is the only hope of achieving easy-to-interpret rules.

Table 1: Summary of the main BYPASS execution parameters. Other, less important parameters (and the associated values used below) are the initial population size P0 (50 or 100); the updating constant τ (1/50); the prior hyperparameter a0 (.7); and the GA parameters, namely, the type of crossover (uniform) and the mutation probability β (5.0 × 10^-5). Most system parameters can be changed online.

Parameter   Description
π           Generalization bias
µ0 (grace)  Utility thresholds
(p, γ)      Reward policy
θ           GA activity rate

We now look at matching issues. A receptive field is matched if all of its coordinates are matched. For Boolean predictors l, exact matching is required: |xl - Ql| < 1. Real-valued predictors are linearly transformed and rounded to a 0-15 integer scale prior to training (see Table 2). For integer-valued predictors, matching requires only |xl - Ql| < 2. This window size seems to provide adequate flexibility and remains fixed for all predictors and all classifiers throughout all runs; for another approach to integer handling, see Wilson (2000). Eventually, the match set M may be empty for some x. Note that this becomes less likely as π increases (it will hardly be a problem below). In any case, whenever the empty-M condition arises, it is immediately handled by a new call to EXM, after which a new input x is selected.
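A minimal sketch of this matching rule follows; tagging each coordinate with its predictor type (the `kinds` argument) is a bookkeeping device assumed here, not something specified in the text.

```python
def matches(Q, x, kinds):
    """Sketch of the BYPASS matching rule.

    'kinds' tags each coordinate as 'bool' or 'int' (the 0-15 integer
    coding).  Wildcard '#' coordinates match anything; Boolean
    coordinates require an exact match (|xl - Ql| < 1); integer-coded
    coordinates match within the fixed window |xl - Ql| < 2.
    """
    for ql, xl, kind in zip(Q, x, kinds):
        if ql == "#":
            continue                            # wildcard matches all
        window = 1 if kind == "bool" else 2     # matching tolerance
        if abs(xl - ql) >= window:
            return False
    return True
```

The match set M is then simply the subset of the population whose receptive fields satisfy this predicate at the current x.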

Once the (non-empty) M is ascertained, two system predictions are computed, namely, the single-winner (SW) and mixture-based (MIX) predictions. SW selects the classifier in M with the lowest uncertainty evaluation ρ as its best resource (and ignores the remaining classifiers; see the previous section). The maximum a posteriori (MAP) class label zSW is then determined from this single R as

zSW = argmax_{1≤j≤k} Rj.

On the other hand, MIX combines first the m matched Rs into the uniform mixture distribution

RMIX = (1/m) ∑_{1≤s≤m} R(s)

and then obtains the prediction zMIX by MAP selection as before.
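Both prediction modes can be sketched as follows; the dictionary representation of matched classifiers (fields "R" and "rho") is illustrative only.

```python
def predict(match_set):
    """SW and MIX predictions from a non-empty match set: a sketch.

    Each entry carries a predictive distribution R over the k labels
    and an uncertainty evaluation rho; lower rho is better.
    """
    k = len(match_set[0]["R"])

    # Single winner: lowest rho in M, then the MAP label of its R.
    winner = min(match_set, key=lambda cl: cl["rho"])
    z_sw = max(range(k), key=lambda j: winner["R"][j])

    # Mixture: uniform average of all matched Rs, then MAP as before.
    m = len(match_set)
    R_mix = [sum(cl["R"][j] for cl in match_set) / m for j in range(k)]
    z_mix = max(range(k), key=lambda j: R_mix[j])
    return z_sw, z_mix
```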

The MIX prediction is generally preferred over the SW alternative with regard to the system’s reinforcement and general guidance. The MIX prediction combines multiple sources of information and it does so in a cooperative way. It can be argued that the SW predictive mode tends to favor relatively specific classifiers and therefore defeats to some extent the quest for generality. The bias towards the MIX mode of operation can be seen throughout the design of BYPASS: several system decisions are based on whether zMIX is correct or not. For example, the rule-generation mechanism is always triggered by MIX failure (regardless of the success of zSW).

The accuracy ρ is updated at each step (no entropy calculations are needed) according to the familiar discounting scheme

ρ ← (1 - τ) ρ + τ Sy,

where Sy is the well-known score of each individual classifier in M, namely Sy = -log(Ry) > 0, and τ is a small positive number. As noted earlier, the SW prediction is deemed of secondary interest only, so there would be no essential loss if ρ were omitted altogether from the current discussion of performance.

Table 2: Artificial and real data sets used in the experiments below. n is the number of predictors and k is the number of output categories. Recall that real predictors are linearly transformed and rounded to a 0-15 integer scale prior to training; thus, in general, Q ∈ {0,1,2,...,9,A,B,C,D,E,F}^n. Uniform weights wj are used in the first case. For the satellite data, output labels 1, 3 and 6 have higher frequencies fj and therefore non-uniform weights wj are used.

              n   k   predictors   training   test
jmultiplexer  33  8   Boolean      10,000     10,000
satellite     36  6   Real         4,435      2,000
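The discounted update of ρ can be sketched in a few lines; the argument layout is hypothetical.

```python
import math

def update_rho(rho, R, y, tau=1 / 50):
    """Sketch of the discounted accuracy update:
    rho <- (1 - tau) * rho + tau * S_y, with S_y = -log(R_y) > 0."""
    s_y = -math.log(R[y])
    return (1 - tau) * rho + tau * s_y
```

Because τ is small, ρ behaves as an exponentially discounted average of the classifier's recent scores.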

On the other hand, the individual scores Sy of matched rules are central quantities in the BYPASS reinforcement policy. Clearly, the lower Sy, the better R at this particular x. Again, several decisions will be made depending on the absolute and relative magnitude of these Sy. Any upper bound set on Sy can be expressed equivalently as a lower bound for Ry. An important point regarding these bounds is that, no matter how they are expressed, they have an intrinsic meaning that can be cross-examined in a variety of related k-way classification problems (obviously, it does not make sense to relate scores based on different k).

Table 3: A single cycle based on the second multiplexer population (Figure 1). The first line is the input item (x,y). The second line simply places the eight output labels. Eight matched classifiers are listed next (only Q, R, and ρ are shown). Predictive distributions R are shown graphically by picking only some labels with the largest predicted probabilities. The SW mode of operation (involving the lowest accuracy) is shown to fail in the next few lines. Finally, the MIX mode is shown to succeed and the p = 3 rewarded units are reported. #s are replaced by dots (and subsets of 11 coordinates are mildly separated) for clarity.

01000111001 11011101010 11101010111 *
                        ABCDEFGH
0/ ... ... .11...1      * *   1.911
1/ ..0..1.1... ... ...  * *   1.936
2/ ... 1.0....1... ...  * *   1.894
3/ ... ... .1....1..1.  **    1.893
4/ 01...11.... ... ...  * *   1.384
5/ ... 110...1. ...     **    1.344
6/ ... 1.0....1.1. ...  **    1.347
7/ ... .10..1...1. ...  **    1.365

Single winner is:
                        ABCDEFGH
... 110...1. ...        **    1.344
Predicted category is: G
Single winner's score: 1.367

Combined prediction is:
                        ABCDEFGH
                        **
Predicted category is: H
Combined prediction's score: 1.494
Rewarded units: (5 6 7)

Reinforcement takes place every cycle and essentially involves the updating of each matched classifier’s raw utility κ in light of the system’s predicted label zMIX and the current observation y. Two cases are distinguished, success or failure, and in each case different subsets of classifiers are selected from M for reward. If zMIX succeeds, then classifiers with the lowest Sy should be the prime contributors. A simple idea is to reward the p lowest values. The κ counter of each of these units is updated as κ ← κ + wy, where wy depends in turn on whether all output labels are to be given the same weight or not. If so, then wj ≡ 1; otherwise, wj = (k fj)^-1, where fj denotes the relative frequency of the j-th output label in the training sample. The rationale is that widely different fj make it unfair to reward all classifiers evenly.

Even if all output labels show up equally often, we may have more interest in certain labels; an appropriate bias can then be introduced by appropriately chosen wj.

Table 4: Selected classifiers by MAP label for the satellite data. All these classifiers belong to a single population obtained under (p = 5, γ = 1.15). The three best accuracies in each team are extracted. Receptive fields are shown first (see text for details). The last row shows predictive distributions and accuracies (in the same order). The total number of classifiers in each class is 12, 10 and 15, respectively. Again, #s are transformed into dots for ease of reference.

Label j=4       Label j=5       Label j=6

9.. ..9 ... 5.. ... .4. ... ... ... ..7 ... ...
... 9.. ..7 ... .3. ... ... ... ... ... ... 4..
... ... ... ... .4. ... ... ... 6.. ... ..4 ...
... ... .7. ... ... ... ... ... ... ... .4. 3..
... ... ..7 ... ... .4. .4. ... ... ... ... 4..
..9 ... ... ... ... 5.. ... ... ... ..7 ..4 ...
..8 .8. ... ... .54 ... ... ... ... 6.. ... 3..
.88 ... .7. ... ... 4.. ... ... ... .8. ... ...
... ... ... ... ... .44 ... ... .6. ... ... ...

ABCDEF          ABCDEF          ABCDEF

* * 0.817   * * 0.256   ** 0.113
* * 0.975   ** 0.371    ** 0.146
* * 0.776   ** 0.113    * * 0.222

In the case of zMIX failure, it can be argued cogently that not all matched classifiers should be left without reward. Taking again Sy as the key quantity, a patient reinforcement policy reinforces all rules with scores below a certain system threshold γ > 0: their κ counters are increased just as if zMIX had been successful. A potentially important distinction is therefore made between, say, classifiers whose second MAP class is correct and classifiers assigning very low probability to the observed output label. The main idea is to help rules with promising low scores survive until a sufficient number of them are found and maintained. In that case they will hopefully begin to work together as a team and thus get their reward from correct zMIX! The resulting reward scheme is thus parameterized by p > 0 and γ ≥ 0. Since the number of matched classifiers per cycle (m) may be rather large, a reward policy reinforcing a single classifier might appear rather “greedy”. For this reason, higher values of p are typically tried out. The higher p, the easier the cooperation among classifiers (match sets are simply given more resources to establish themselves).

On the other hand, if p is too high then less useful units may begin to be rewarded and the population may become too large. Moderate values of p usually give good results in practice. Parameter γ must also be controlled by monitoring the actual number of units rewarded at a given γ. Again, too generous a γ may inflate the population excessively. It appears that some data sets benefit more from γ > 0 than others, the reasons having to do with the degree of overlap among data categories (see table).
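The complete (p, γ) reward policy can be sketched as follows; the classifier fields and the weight vector `w` are illustrative.

```python
import math

def reinforce(match_set, z_mix, y, p, gamma, w):
    """Sketch of the (p, gamma) reward policy.

    On zMIX success, the p matched classifiers with the lowest scores
    S_y = -log(R_y) receive kappa += w[y].  On failure, the patient
    policy rewards every matched classifier whose score is below gamma,
    just as if zMIX had been correct.  Returns the number rewarded.
    """
    def score(cl):
        return -math.log(cl["R"][y])

    if z_mix == y:
        rewarded = sorted(match_set, key=score)[:p]   # p best scores
    else:
        rewarded = [cl for cl in match_set if score(cl) < gamma]
    for cl in rewarded:
        cl["kappa"] += w[y]
    return len(rewarded)
```

Note that γ = 0 recovers the pure "reward only on success" scheme, since no score can fall below zero.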

Table 5: Selected classifiers by MAP label as in Table 4.

Label j=1 Label j=2 Label j=3

... ... ... ... ... ... ... ... ... ... ... ...

.6. ... ... ... ... ... ... ..D .D. ... ... ...

... .C. ... ... ... ... ... ... ... ... ... ...

.6. ... ... ... ... .0. ... ... ... ... ... ...

.6. ... B.. ... ... ... ... ... ... ... ... ...

... ... ... ... ... ... ... ... C.D ... ... ...

.6. ... C.. ... ... ... ... ... .C. B.. ... ...

... ... ... ... ... ... ... .B. ..C ... ... ..7
... ... ... ... ... ... ... ... ..D ... ... ...

ABCDEF ABCDEF ABCDEF

* 0.116 * 0.000 ** 0.112

* 0.044 * 0.027 ** 0.106

** 0.018 ** 0.005 ** 0.029

The BYPASS rule discovery sub-system includes a genetic algorithm (GA) and follows previous recommendations laid out in the literature. For example, it is important to restrict mating to rules that are known to be related in some way (Booker, 1989). A familiar solution is to use again the match set M as the basic niche: only rules that belong to the same M will ever be considered for mating (as opposed to a population-wide GA). To complement this idea, the GA is triggered by zMIX failure. A further control is introduced: at each failure time either the GA itself or the EXM routine will act, depending on the system’s score threshold θ. Specifically, the procedure first checks whether there are at least two scores Sy in M lower than θ. If so, standard crossover and mutation are applied as usual over the set of matched receptive fields (Goldberg, 1989). Otherwise, a single classifier is generated by EXM (on the basis of the current datum). The rationale is to restrict the mating process further: no recombination occurs unless there are rules in the match set that have seen a substantial number of instances of the target label y. The θ parameter can be used to strike a balance between the purely random search carried out by EXM and the more focused alternative provided by the GA. This balance matters because EXM supplies the raw variability that the GA can then exploit. Since rules created by the GA tend to dominate and often sweep out other useful rules, best results are obtained when θ is relatively demanding (Muruzábal, 1999).
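The dispatch between the GA and EXM on a failure cycle can be sketched as follows; the string return values are illustrative.

```python
import math

def discovery_action(match_set, y, theta):
    """Sketch of the GA / EXM dispatch on a zMIX failure.

    The GA is applied only when at least two matched classifiers have
    scores S_y = -log(R_y) below the threshold theta; otherwise the
    purely random EXM cover detector is invoked instead.
    """
    low = sum(1 for cl in match_set if -math.log(cl["R"][y]) < theta)
    return "GA" if low >= 2 else "EXM"
```

A demanding (small) θ thus keeps the GA quiet unless the niche already contains rules that predict the target label well.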

Both standard uniform and single-point crossover are implemented (Syswerda, 1989). In either case, a single receptive field is produced by crossover. Mutation acts on this offspring with some small coordinate-wise mutation probability β. The final Q is endowed with κ = 0, cj ≡ 0 and a prior vector a built from the current y, as in EXM above. Receptive fields are selected for mating according to the standard roulette-wheel procedure with weights given by the normalized inverses of their scores Sy. As before, exact copies of existing receptive fields are precluded.
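Uniform crossover with roulette-wheel selection can be sketched as below. The exact mutation operator is not specified in the text, so mutating a coordinate to '#' or a random symbol is an assumption of this sketch.

```python
import math
import random

def uniform_offspring(match_set, y, beta, alphabet):
    """Sketch of uniform crossover plus mutation over matched fields.

    Parents are drawn by roulette wheel with weights proportional to
    the inverses of their scores S_y = -log(R_y); each child coordinate
    comes from either parent with equal probability, then mutates with
    probability beta (to '#' or a random symbol -- an assumption).
    """
    inv = [1.0 / -math.log(cl["R"][y]) for cl in match_set]
    p1, p2 = random.choices(match_set, weights=inv, k=2)
    child = [a if random.random() < 0.5 else b
             for a, b in zip(p1["Q"], p2["Q"])]
    return [random.choice(list(alphabet) + ["#"])
            if random.random() < beta else ql
            for ql in child]
```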

Table 6: Bagging performance on the jmultiplexer problem. First three lines: S-PLUS fitting parameters. Next three lines: single (bagged) tree size, test success rate and edge (net advantage of the bagged rate over the single-tree rate). Figures given are averages over five runs. A total of fifty bagging trees were used in all cases. Results are presented in decreasing order of rule generality. Note that no pruning was necessary since trees of the desired size were grown directly in each case.

              Series 1  Series 2  Series 3  Series 4
minsize       20        10        8         6
mincut        10        5         4         3
mindev        .05       .01       .005      .0025
size          32        120       212       365
success rate  25.2      44.4      51.6      53.2
edge          +8.1      +21.8     +27.6     +27.1

Finally, consider the rule deletion policy. At the end of each cycle, all classifiers have their utility µ = κ/α checked for “services rendered to date”. Units become candidates for deletion as soon as their utility µ drops below a given system threshold µ0. This powerful parameter also helps to promote generality: if µ0 is relatively high, specific classifiers will surely become extinct no matter how low their accuracy ρ. It is convenient to view µ0 as µ0 = 1/v, with the interpretation (assuming wj ≡ 1) that classifiers must be rewarded once every v cycles on average in order to survive. Early versions of the system simply deleted all classifiers with µ < µ0 at once. It was later thought a good idea to avoid sudden shocks to the population as much as possible; therefore, only one classifier is actually deleted per cycle, namely the one exhibiting the largest λ. The idea is to keep all teams or match sets at about the same size. Also, because a relatively high µ0 is sometimes used, a mercy period of guaranteed survival α0 is granted to all classifiers; that is, no unit with α ≤ α0 is deleted (Frey & Slate, 1991). This gives classifiers some time to refine their likelihood vectors before they can be safely discarded. Again, to aid interpretation, parameter α0 is usually re-expressed as α0 = mercy × v, so values of mercy are provided in turn.
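The per-cycle deletion rule can be sketched as follows; the field names, including `lam` for λ, are illustrative.

```python
def deletion_step(population, mu0, alpha0):
    """Sketch of the per-cycle deletion rule.

    Classifiers past the mercy period (alpha > alpha0) whose utility
    mu = kappa / alpha falls below mu0 are candidates; only the one
    with the largest (estimated) match-set size lam is removed.
    """
    candidates = [cl for cl in population
                  if cl["alpha"] > alpha0
                  and cl["kappa"] / cl["alpha"] < mu0]
    if not candidates:
        return population
    doomed = max(candidates, key=lambda cl: cl["lam"])
    return [cl for cl in population if cl is not doomed]
```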

Two major modes of training operation are distinguished. During effective training, the system can produce as many new units as it needs to. As noted above, effective training can be carried out at a variety of θ values. During cooling, the rule-discovery sub-system is turned off and no new rules are generated (except those due to empty match sets); all other system operations continue as usual. Since populations typically contain a body of tentative rules enjoying the mercy period, utility constraints often reduce the size of the population considerably during the cooling phase. This downsizing often has noticeable effects on performance, and therefore some cooling time is recommended in all runs. Once cooled, the final population is ready for deployment on new data. During this testing phase, it is customary to “freeze” the population: classifiers no longer evolve (no summaries are updated), and no new rules are injected into the system.

One or several cooled populations may be input again to the system for consolidation and further learning. In this case, the new prior vector a is taken as the old c vector plus a vector of ones (added to prevent zero entries). This naturally ensures that the behavior learned by the older rules shows up from the very first match after re-initialization.
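This consolidation prior amounts to one line of code, sketched here for concreteness.

```python
def consolidated_prior(c):
    """Re-initialization sketch: the new prior a is the old likelihood
    vector c plus a vector of ones (preventing zero entries)."""
    return [cj + 1 for cj in c]
```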

To summarize, BYPASS is a versatile system with just a few, readily interpretable parameters; Table 1 summarizes them. These quantities determine many different search strategies. As noted earlier, some pilot runs are typically conducted to determine suitable values for these quantities in light of the target goal and all previous domain knowledge. In this chapter, these pilot runs and related discussion are omitted for brevity; no justification for the configurations used is given (and no claim of optimality is made either). Experience confirms that the system is robust in that it can achieve sensible behavior under a broad set of execution parameters.

In any case, a few specific situations that may cause system malfunctioning are singled out explicitly below.
