
4.1 Classifier design for object data

In Section 1.1 we defined a classifier as any function $D: \Re^p \to N_{pc}$. The value $y = D(z)$ is the label vector for $z$ in $\Re^p$. D is a crisp classifier if $D[\Re^p] = N_{hc}$; otherwise, the classifier is fuzzy, possibilistic or probabilistic, which for convenience we lump together as soft classifiers. This chapter describes some of the most basic (and often most useful) classifier designs, along with some fuzzy generalizations and relatives.

Soft classifier functions $D: \Re^p \to N_{pc}$ are consistent with the principle of least commitment (Marr, 1982), which states that algorithms should avoid making crisp decisions as long as possible, since it is very difficult (if not impossible) to recover from a wrong crisp classification. This is particularly true in complex systems such as an automatic target recognition system, or a computer aided medical diagnostician that uses image data, because there are several stages where decisions are made, each affecting those that follow. For example, pixels in a raw image need to be classified as noise points for preprocessing, objects need to be segmented from the preprocessed images, features must be extracted and the objects classified, and the entire "scene" needs to be labeled. While we use mostly simple data sets to illustrate some of the algorithms in this chapter, keep in mind complex scenarios such as the ones just described to appreciate the potential benefits of fuzzy recognition approaches.

Many classifiers assign non-crisp labels to their arguments. When this happens, we often use the hardening function $H: N_{pc} \to N_{hc}$, defined at (1.15), to convert non-crisp labels into crisp ones; for c classes, $H \circ D(z) = H(D(z)) \in \{e_1, \ldots, e_c\}$.
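As a concrete illustration (ours, not the text's), the short Python sketch below hardens a soft label vector by sending it to the crisp vertex with the largest membership; we are assuming here that H at (1.15) is realized by the maximum-membership rule.

```python
import numpy as np

def harden(y):
    """Map a soft label vector y in N_pc to a crisp vertex e_i in N_hc.

    Assumes H is realized by the maximum-membership rule: the largest
    entry of y wins, with ties broken by the first maximum.
    """
    y = np.asarray(y, dtype=float)
    crisp = np.zeros_like(y)
    crisp[np.argmax(y)] = 1.0
    return crisp

# Example for c = 3 classes: a fuzzy label hardens to e_2.
print(harden([0.2, 0.5, 0.3]))   # -> [0. 1. 0.]
```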

Designing a classifier simply means "finding a good D ". When this is done with labeled training data, the process is called supervised learning. We pointed out in Chapter 2 that it is the labels of the data that supervise; we will meet other forms of supervision later in this chapter, and they are also appropriately called supervised learning.

D may be specified functionally (e.g., the Bayes classifier), or as a computer program (e.g., computational neural networks or fuzzy input-output systems). Both types of classifiers have parameters. When D is a function, it has constants that need to be "learned" during training. When D is a computer program, the model it implements has both control parameters and constants that must also be acquired by "learning". In either case the word learning means finding good parameters for D - and that's all it means.
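To make the point about parameters concrete, here is a minimal sketch of one possible functionally specified D, a nearest-prototype classifier whose only "constants to be learned" are the class prototypes; the class and method names are ours and are not taken from the text.

```python
import numpy as np

class NearestPrototypeClassifier:
    """A simple D: R^p -> class index, parameterized by c prototypes.

    "Learning" here just means estimating the prototypes (the labeled
    class means); classifying z means finding the closest prototype.
    """
    def fit(self, X_tr, labels):
        labels = np.asarray(labels)
        classes = np.unique(labels)
        self.prototypes_ = np.array(
            [X_tr[labels == k].mean(axis=0) for k in classes])
        return self

    def predict(self, Z):
        # Euclidean distances from each row of Z to each prototype
        d = np.linalg.norm(Z[:, None, :] - self.prototypes_[None, :, :], axis=2)
        return np.argmin(d, axis=1)
```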

In supervised classifier design X is usually crisply partitioned into a training (or design) set $X_{tr}$ with label matrix $U_{tr}$ and cardinality $|X_{tr}| = n_{tr}$; and a test set $X_{te} = (X - X_{tr})$ with label matrix $U_{te}$ and cardinality $|X_{te}| = n_{te}$. Columns of $U_{tr}$ and $U_{te}$ are label vectors in $N_{pc}$. Testing a classifier designed with $X_{tr}$ means estimating its error rate (or probability of misclassification). The standard method for doing this is to submit $X_{te}$ to D and count mistakes ($U_{te}$ must have crisp labels to do this). This yields the apparent error rate $E_D(X_{te}|X_{tr})$. Apparent error rates are conveniently tabulated using the $c \times c$ confusion matrix $C = [c_{ij}]$, where $c_{ij}$ is the number of samples labeled class j that were really class i. (Some writers call $C^T$ the confusion matrix.) More formally, the apparent error rate of D when trained with $X_{tr}$ and tested with $X_{te}$ is

$$E_D(X_{te}|X_{tr}) = \frac{\text{number of samples in } X_{te} \text{ misclassified by } D}{n_{te}} = \frac{n_{te} - \sum_{i=1}^{c} c_{ii}}{n_{te}} \; . \qquad (4.1)$$

Equation (4.1) gives, as a fraction in [0, 1], the number of errors committed on test. This number is a function not only of D, but of two specific data sets, and each time any of the three parameters changes, $E_D(X_{te}|X_{tr})$ will in all likelihood change too.
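A direct transcription of the confusion matrix and of (4.1), as reconstructed above, might look like the following sketch; crisp labels are assumed to be integer class indices 0, ..., c-1, a convention of ours rather than the text's.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, c):
    """C[i, j] = number of samples really in class i but labeled class j by D."""
    C = np.zeros((c, c), dtype=int)
    for i, j in zip(true_labels, predicted_labels):
        C[i, j] += 1
    return C

def apparent_error_rate(true_labels, predicted_labels, c):
    """E_D(X_te | X_tr) as in (4.1): fraction of test samples misclassified."""
    C = confusion_matrix(true_labels, predicted_labels, c)
    n_te = C.sum()
    return (n_te - np.trace(C)) / n_te   # off-diagonal entries count the errors
```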

Other common terms for the error rate $E_D(X_{te}|X_{tr})$ include test error and generalization error. Our notation indicates that D was trained with $X_{tr}$, and tested with $X_{te}$. $E_D(X_{te}|X_{tr})$ is often the performance index by which D is judged, because it measures the extent to which D generalizes to the test data. Some authors call $E_D(X_{te}|X_{tr})$ the "true" error rate of D, but to us, this term refers to a quantity that is not computable with estimates made using finite sets of data.

$E_D(X|X)$ is the resubstitution error rate (some authors use this term synonymously with apparent error rate, but we prefer to have separate terms for these two estimates). Other common terms for $E_D(X|X)$ include training error and recall error rate. Resubstitution uses the same data for training and testing, so it usually produces a somewhat optimistic error rate. That is, $E_D(X|X)$ is not as reliable as $E_D(X_{te}|X_{tr})$ for assessing generalization, but this is not an impediment to using $E_D(X|X)$ as a basis for comparison of different designs. Moreover, unless n is very large compared to p and c (an often used rule of thumb is $n \in [10pc, 100pc]$), the credibility of either error rate is questionable. An unfortunate terminology associated with algorithms that reproduce all the labels (i.e., make no errors) upon resubstitution of the training data is that some authors call such a method consistent (Dasarathy, 1994). Don't confuse this with other uses of the term, as for example, a consistent statistic.
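Using the hypothetical apparent_error_rate helper above, both estimates can be obtained from the same piece of bookkeeping; which pair of sets you pass in decides whether you are computing $E_D(X|X)$ or $E_D(X_{te}|X_{tr})$. This is only a sketch under those assumptions, not a prescription.

```python
def error_rate(D, X_design, u_design, X_eval, u_eval, c):
    """Train D on the design set, then compute (4.1) on the evaluation set.

    Passing the same set twice gives the resubstitution error E_D(X | X);
    passing a held-out pair gives the test error E_D(X_te | X_tr).
    D is assumed to expose fit/predict methods, as in the earlier sketch.
    """
    D.fit(X_design, u_design)
    return apparent_error_rate(u_eval, D.predict(X_eval), c)

# e_resub = error_rate(D, X, u, X, u, c)               # usually optimistic
# e_test  = error_rate(D, X_tr, u_tr, X_te, u_te, c)   # generalization estimate
```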

A third error rate that is sometimes used is called the validation error of D. This idea springs from the increasingly frequent practice of using $X_{te}$ to decide when D is "well trained", by repeatedly computing $E_D(X_{te}|X_{tr})$ while varying the parameters of D and/or $X_{tr}$. Knowing that they want the minimum test error rate, many investigators train D with $X_{tr}$, test it with $X_{te}$, and then repeat the training cycle with $X_{tr}$ for other choices (such as the number of nodes in a hidden layer of a neural network), until they achieve a minimal or acceptable test error. On doing this, however, $X_{te}$ unwittingly becomes part of the training data (this is called "training on the testing data" by Duda and Hart, 1973).

To overcome this complication, some researchers now subdivide X into three disjoint sets: $X = X_{tr} \cup X_{te} \cup X_{va}$, where $X_{va}$ is called a validation set. When this is done, $X_{tr} \cup X_{te}$ can be regarded as the "overall" training data, and $X_{va}$ as the "real" (or blind) test data. Some authors now report all three of these error rates for their classifiers: resubstitution, test and validation errors. Moreover, some authors interchange the terms test and validation as we have used them, so when you read about these error rates, just make sure you know what the authors mean by each term. We won't bother trying to find a sensible notation for what we call the validation error rate (it would be something like $E_D(X_{va}|X_{te}; X_{tr})$). For the few cases that we discuss in this chapter that have this feature, we will simply use the phrase "validation error" for this third error rate.
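One way to realize the three-way decomposition $X = X_{tr} \cup X_{te} \cup X_{va}$ is sketched below; the split fractions and the random shuffling are our illustrative choices, not requirements from the text.

```python
import numpy as np

def three_way_split(X, u, f_tr=0.6, f_te=0.2, seed=None):
    """Randomly partition (X, u) into disjoint X_tr, X_te and X_va subsets.

    f_tr and f_te are the training and test fractions; the remainder
    becomes the validation (blind test) set X_va.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_tr, n_te = int(f_tr * len(X)), int(f_te * len(X))
    tr, te, va = idx[:n_tr], idx[n_tr:n_tr + n_te], idx[n_tr + n_te:]
    return (X[tr], u[tr]), (X[te], u[te]), (X[va], u[va])
```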

Finally, don't confuse "validation error" with the term "cross-validation", which is a method for rotating (sometimes called jackknifing) through the pair of sets $X_{tr}$ and $X_{te}$ without using a third set such as $X_{va}$.

The small data sets used in some of our examples do not often justify worrying about the difference between $E_D(X|X)$ and $E_D(X_{te}|X_{tr})$, but in real systems, at least $E_D(X_{te}|X_{tr})$ should always be used, and the selection and manipulation of the three sets $\{X_{tr}, X_{te}, X_{va}\}$ is a very important aspect of system design. At the minimum, it is good practice to reverse the roles of $X_{tr}$ and $X_{te}$, redesign D, and compute (4.1) for the new design. If the two error rates obtained by this "cross validation" procedure are quite different, this indicates that the data used for design and test are somehow biased and should be tested and/or replaced before system design proceeds.

Cross validation is sometimes called "1-fold cross validation", in contrast to k-fold cross validation, where the cross validation cycle is repeated k > 1 times, using different pairs $(X_{tr}, X_{te})$ for each pair of cross validation tests. Terms for these training strategies are far from standard. Some writers use the term "k-fold cross validation" for rotation through the data k times without "crossing" - that is, the total number of training/test cycles is k; "crossing" each time in the sense used here results in 2k train/test cycles. And some authors use the term "cross validation" for the scheme based on the decomposition of X into $\{X_{tr}, X_{te}, X_{va}\}$ just discussed, e.g., (Haykin, 1996). There are a variety of more sophisticated schemes for constructing design and test procedures; see Toussaint (1974) or Lachenbruch (1975) for good discussions of the "rotation" and "leave-one-out" procedures.

There is another aspect to the handling of training and test data in the design of any real classifier system that is related to the fact that training is almost always based on some form of random initialization. This includes most classifiers built from, for example: clustering algorithms, single and multiple prototype decision functions, fuzzy integral classifiers, many variants of sequential classifier designs based on competitive learning models, decision tree models, fuzzy systems, and recognition systems based on neural networks. The problem arises because in practice training data are normally limited. So, given a set X of labeled data, the question is: how do you get a good error estimate and yet give the "customer" the best classifier? If the classifier can change due to random initialization (we will see this happen in this chapter), then you are faced with the training and testing dilemma:

• If you use all the data to produce (probably) the best classifier you can for your customer, you can only give the resubstitution error rate, which is almost always overly optimistic.

• If you split the data and rotate through different training sets to get better test statistics, then which of the classifiers built during training do you deliver to your customer?

Consider, for example, the leave-one-out estimate of the error rate, in which n classifiers $\{D_k\}$ are designed with n-1 of the data, and each design is then tested with the remaining datum, in sequence, n times. Since the $\{D_k\}$ can all behave differently, and certainly will have different parameters, it is not clear that the leave-one-out error rate is very realistic as far as estimating the performance of a