

2.5 Feature Analysis

Methods that explore and improve raw data are broadly characterized as feature analysis. This includes scaling, normalization, filtering, and smoothing. Any transformation $\Phi : \Re^p \mapsto \Re^q$ does feature extraction when applied to X. Usually $q \ll p$, but there are cases where $q \ge p$. For example, transformations of data can sometimes make them linearly separable in higher dimensions (cf. functional link nets, Zurada, 1992). For a second example where q > p, in image processing each pixel is often associated with a vector of many variables (gray level at the pixel, gradients, texture measures, entropy, average, standard deviation, etc.) built from, for example, intensity values in a rectangular window about the pixel. Examples of feature extraction transformations include Fourier transforms, principal components, and feature vectors built from window intensities in images.

The goals of extraction and selection are: to improve the data for solving a particular problem; to compress feature space to reduce time and space complexity; and to eliminate redundant (dependent) and unimportant (for the problem at hand) features. When p is large, it is often desirable to reduce it to $q \ll p$. Feature selection consists of choosing subsets of the original measured features. Features are selected by taking $\Phi$ to be a projection onto some coordinate subspace of $\Re^p$. If $q \ll p$, time and space complexity of algorithms that use the transformed data can be significantly reduced. Our next example uses a cartoon type illustration to convey the ideas of feature nomination, measurement, selection and the construction of object vector data.
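As a minimal sketch (not taken from the text), both operations can be written directly in NumPy; the toy array and the particular maps below are illustrative assumptions only.

```python
import numpy as np

# Toy object data: n = 4 samples, p = 3 features per sample.
X = np.array([[1.5, 2.5, 0.0],
              [1.7, 2.6, 0.0],
              [6.0, 1.2, 1.0],
              [7.1, 1.0, 1.0]])

# Feature selection: Phi is a projection onto a coordinate subspace,
# here the third coordinate only (q = 1).
selected = X[:, [2]]                  # shape (4, 1)

# Feature extraction: any map Phi: R^p -> R^q applied to X, e.g. a
# linear map that averages the first two coordinates (q = 1).
Phi = np.array([[0.5, 0.5, 0.0]])     # q x p matrix
extracted = X @ Phi.T                 # shape (4, 1)
```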

Example 2.19 Three fruits are shown in Figure 2.21: an apple, an orange and a pear. In order to ask and attempt to answer questions about these objects by computational means, we need an object vector representation of each fruit. A human must nominate features that seem capable of representing properties of each object that will be useful in solving some problem. The choices shown in column two are ones that allow us to formulate and answer some (but not all!) questions that might be posed about these fruits.

Once the features (mass, shape, texture) are nominated, sensors measure their values for each object in the sample. The mass of each fruit is readily obtainable, but the shape and texture values require more thought, more time, more work, and probably will be more expensive to collect.


Figure 2.21 Feature analysis and object data

A number of definitions could yield shape measures for the diameter. We might take the diameter as the maximum distance between any pair of points on the boundary of a silhouette of each fruit. This will be an expensive feature to measure, and it may not capture a property of the various classes that is useful for the purpose at hand. Finally, texture can be represented by a binary variable, say 0 = "smooth" and 1 = "rough". It may not be easy or cheap to automate the assignment of a texture value to each fruit, but it can be done. After setting up the measurement system, each fruit passes through it, generating an object vector of measurements. In Figure 2.21 each feature vector is in p = 3 space, $\mathbf{x}_k \in \Re^3$.

Suppose the problem is to separate the citrus fruits from the non-citrus fruits, samples being restricted to apples, oranges and pears.

Given this constraint, the only feature we need to inspect is the third one (texture). Oranges are the solution to the problem, and they will (almost) always have rough texture, whereas the apples and pears generally will not. Thus, as shown in Figure 2.21, we may select texture, and disregard the first and second features, when solving this problem. This reduces p from 3 to 1, and makes the computational solution simpler and possibly more accurate, since calculations involving all three features use measurements of the other variables that may make the data more mixed in higher dimensions. The feature selection function $\Phi$ that formally accomplishes this is the projection $\Phi(x_1, x_2, x_3) = x_3$. It is certainly possible for an aberrant sample to trick the system; that is, we cannot expect a 100% success rate, because real data exhibit noise (in this example noise corresponds to, say, a very rough apple).

Several further points. First, what if the data contained a pineapple? This fruit has a much rougher texture than oranges, but is not a citrus fruit, so in the first place, texture alone is insufficient.

Moreover, the texture measurement would have to be modified to, perhaps, a ternary variable: 0 = smooth, 1 = rough, and 2 = very rough. Although it is easy to say this verbally, remember that the system under design must convert the texture of each fruit into one of the numbers 0, 1 or 2. This is possible, but may be expensive.

Finally, the features selected depend on the question you are attempting to answer. For example, if the problem was to remove from a conveyor belt all fruits that were too small for a primary market, then texture would be useless. One or both of the first two variables would work better for this new problem. However, the diameter and weight of each fruit are probably correlated.

Statistical analysis might yield a functional relationship between these two features. One of the primary uses of feature analysis is to remove redundancy in measured features. In this example, the physical meaning of the variables suggests a solution; in more subtle cases, computational analysis is often the only recourse.

It is often unclear which features will be good at separating the clusters in an object data set. Hence, a large number - perhaps hundreds - of features are often proposed. While some are intuitive, such as those in Example 2.19, many useful features have no intuitive or physically plausible meaning. For example, the coefficients of the normalized Fourier descriptors of the outline of digitized shapes in the plane are often quite useful for shape analysis (Bezdek et al., 1981c), but these extracted features have no direct physical interpretation as properties of the objects whose shapes they represent. Even finding the "best" subset of selected features to use is computationally so expensive as to be prohibitive.
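As a rough illustration (not the construction of Bezdek et al., 1981c), one common way to compute normalized Fourier descriptors treats an ordered, closed boundary as a sequence of complex numbers; the boundary array and the normalization choices below are assumptions.

```python
import numpy as np

def fourier_descriptors(boundary_xy, n_coeffs=10):
    """Normalized Fourier descriptors of a closed 2-D boundary.

    boundary_xy : (n, 2) array of boundary points in traversal order.
    Returns magnitudes of the first n_coeffs harmonics, made
    translation-invariant (DC term dropped) and scale-invariant
    (divided by the magnitude of the first harmonic).
    """
    z = boundary_xy[:, 0] + 1j * boundary_xy[:, 1]   # boundary points as complex numbers
    F = np.fft.fft(z)
    mags = np.abs(F[1:n_coeffs + 1])                 # skip F[0]: removes translation
    return mags / mags[0]                            # scale normalization

# Example: descriptors of a coarse ellipse-like boundary.
t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
ellipse = np.column_stack([2.0 * np.cos(t), np.sin(t)])
print(fourier_descriptors(ellipse, n_coeffs=5))
```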

Any feature extraction method that produces $Y = \Phi[X] \subset \Re^q$ can be used to make visual displays by taking q = 1, 2, or 3 and plotting Y on a rectangular coordinate system. In this category, for example, are feature extraction functions such as the linear transformations defined by principal components matrices, and feature extraction algorithms such as Sammon's method (Sammon, 1969). A large class of transformations, however, produce only visual displays from X (and not data sets $Y \subset \Re^1$, $\Re^2$ or $\Re^3$) through devices other than scatterplots. In this category are functions such as trigonometric plots (Andrews, 1972) and pictogram algorithms such as Chernoff faces (Chernoff, 1973), and trees and castles (Kleiner and Hartigan, 1981).
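For instance, a principal components projection to q = 2 suitable for a scatterplot can be sketched as follows (a generic illustration, not the book's procedure; the random data are placeholders).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))            # n = 200 objects, p = 7 features

Xc = X - X.mean(axis=0)                  # center the data
# Principal directions from the SVD of the centered data matrix.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Y = Xc @ Vt[:2].T                        # q = 2 extracted features, ready to plot
```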

The simplest and most natural method for selecting 1, 2 or 3 features from a large feature set is to visually examine each possible feature combination. Even this can be computationally challenging, since p features, for example, offer p(p-1)/2 distinct two dimensional coordinate planes upon which to project the data. Moreover, visual assessment of projected subsets can be very deceptive, as we now illustrate.

Example 2.20 The center of Figure 2.22 is a scatterplot of 30 points $X = \{(x_{k1}, x_{k2})\}$ whose coordinates are listed in columns 2 and 3 of Table 2.14. The data are indexed so that points 1-10, 11-20 and 21-30 correspond to the three visually apparent clusters. Projection of X onto the first and second coordinate axes results in the one-dimensional data sets $X_1$ and $X_2$. This illustrates feature selection.

Figure 2.22 Feature selection and extraction: the scatterplot of $X = X_1 \times X_2 \subset \Re^2$, its axis projections $X_1 \subset \Re$ and $X_2 \subset \Re$, and the extracted data $\tfrac{1}{2}(X_1 + X_2) \subset \Re$

The one dimensional data $\tfrac{1}{2}(X_1 + X_2)$ in Figure 2.22 (plotted to the right of X, not to scale) is made by averaging the coordinates of each vector in X. Geometrically this amounts to orthogonal projection of X onto the line $x_1 = x_2$. This illustrates feature extraction.
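A minimal sketch of how the derived data sets of this example can be built from X, assuming X is held as an (n, 2) array (the construction is generic, not code from the text):

```python
import numpy as np

# X holds the points of Table 2.14 as an (n, 2) array (a few rows shown).
X = np.array([[1.5, 2.5], [6.0, 1.2], [10.1, 2.5]])

X1 = X[:, 0]                 # feature selection: first coordinate
X2 = X[:, 1]                 # feature selection: second coordinate
Xavg = 0.5 * (X1 + X2)       # feature extraction: average of the two coordinates
```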

Table 2.14 Data and terminal FCM cluster 1 memberships for four data sets

                          Init U^(0)       Terminal U_(1) for each data set
        x_k1   x_k2    (clusters 1 2 3)     X      X_1   ½(X_1+X_2)    X_2
x_1      1.5    2.5        1  0  0         0.99   1.00     1.00       0.00
x_2      1.7    2.6        0  1  0         0.99   1.00     0.99       0.03
x_3      1.2    2.2        0  0  1         0.99   0.99     0.98       0.96*
x_4      1.8    2.0        1  0  0         1.00   1.00     1.00       0.92*
x_5      1.7    2.1        0  1  0         1.00   1.00     1.00       0.99*
x_6      1.3    2.3        0  0  1         0.99   0.99     0.99       0.63*
x_7      2.1    2.0        1  0  0         0.99   0.99     1.00       0.92*
x_8      2.3    1.9        0  1  0         0.97   0.98     1.00       0.82*
x_9      2.0    2.4        0  0  1         0.99   1.00     0.98       0.17
x_10     1.9    2.2        1  0  0         1.00   1.00     1.00       0.96*
x_11     6.0    1.2        0  1  0         0.01   0.01     0.01       0.02
x_12     6.6    1.0        0  0  1         0.00   0.00     0.00       0.00
x_13     5.9    0.9        1  0  0         0.02   0.02     0.07       0.02
x_14     6.3    1.3        0  1  0         0.00   0.00     0.00       0.07
x_15     5.9    1.0        0  0  1         0.02   0.02     0.05       0.00
x_16     7.1    1.0        1  0  0         0.01   0.01     0.02       0.00
x_17     6.5    0.9        0  1  0         0.00   0.00     0.00       0.02
x_18     6.2    1.1        0  0  1         0.00   0.00     0.01       0.00
x_19     7.2    1.2        1  0  0         0.02   0.02     0.03       0.02
x_20     7.5    1.1        0  1  0         0.03   0.03     0.04       0.00
x_21    10.1    2.5        0  0  1         0.01   0.01     0.01       0.00
x_22    11.2    2.6        1  0  0         0.00   0.00     0.00       0.03
x_23    10.5    2.5        0  1  0         0.01   0.01     0.00       0.00
x_24    12.2    2.3        0  0  1         0.01   0.01     0.01       0.63*
x_25    10.5    2.2        1  0  0         0.01   0.01     0.01       0.96*
x_26    11.0    2.4        0  1  0         0.00   0.00     0.00       0.17
x_27    12.2    2.2        0  0  1         0.01   0.01     0.01       0.96*
x_28    10.2    2.1        1  0  0         0.01   0.01     0.02       0.99*
x_29    11.9    2.7        0  1  0         0.01   0.01     0.01       0.09
x_30    11.5    2.2        0  0  1         0.00   0.00     0.00       0.96*

(* marks the 12 points assigned to cluster 1 when the FCM partition of $X_2$ is hardened.)

Visual inspection should convince you that the three clusters seen in X, $X_1$ and $\tfrac{1}{2}(X_1 + X_2)$ will be properly detected by most clustering algorithms. Projection of X onto its second axis, however, mixes the data in the two upper clusters and results in just two clusters in $X_2$. This illustrates that projections of high dimensional data into lower (often visual) dimensions cannot be relied upon to show much about cluster structure in the original data, as explained next.

The results of applying FCM to these four data sets with c = 3, m = 2, $\varepsilon = 0.01$, and the Euclidean norm for both termination and $J_m$ are shown in Table 2.14, which also shows the (poor) initialization used. Only memberships in the first cluster are shown. In Table 2.14 the three clusters are blocked into their visually apparent subsets of 10 points each. As expected, FCM discovers three very distinct fuzzy clusters in X, $X_1$, and $\tfrac{1}{2}(X_1 + X_2)$ (not shown in Table 2.14). For X, $X_1$ and $\tfrac{1}{2}(X_1 + X_2)$ all memberships for the first ten points are at least 0.97, and memberships of the remaining 20 points in this cluster are less than or equal to 0.07. For $X_2$, however, this cluster has eight anomalies with respect to the original data.
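For readers who want to reproduce a run of this kind, a bare-bones FCM loop is sketched below. It is a generic implementation of the standard alternating-optimization updates, not the exact protocol of the example: the initialization is random, and termination uses the maximum change in U rather than the Euclidean norm named above.

```python
import numpy as np

def fcm(X, c=3, m=2.0, eps=1e-2, max_iter=100, seed=0):
    """Bare-bones fuzzy c-means. X: (n, p) data. Returns (U, V),
    U of shape (c, n) with columns summing to 1, V of shape (c, p)."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)                    # valid fuzzy partition

    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)     # cluster center update
        # Euclidean distances from each center to each point.
        D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        U_new = 1.0 / (D ** (2.0 / (m - 1.0)))
        U_new /= U_new.sum(axis=0, keepdims=True)        # membership update
        if np.abs(U_new - U).max() < eps:                # simple termination test
            return U_new, V
        U = U_new
    return U, V

# Hardening assigns each point to its maximum-membership cluster, e.g.
# U, V = fcm(X2.reshape(-1, 1))        # the one-dimensional data X_2
# labels = U.argmax(axis=0)
```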

When the columns of $U_{FCM}$ for $X_2$ are hardened, this cluster contains the 12 points (marked with * in Table 2.14) numbered 3, 4, 5, 6, 7, 8, 10, 24, 25, 27, 28 and 30; the last five of these belong to cluster 3 in X, and the points numbered 1, 2 and 9 should actually belong to this cluster, but do not.

Example 2.20 shows the effects of changing features and then clustering the transformed data. Conversely, clustering can sometimes be used to extract or select features. Bezdek and Castelaz (1977) illustrate how to use terminal point prototypes from FCM to select subsets of triples from a set of 11 (binary-valued) features for 300 stomach disease patients. Their experiments showed that the average error committed by a nearest prototype classifier (cf. Chapter 4) was nearly identical for the original data and the selected feature triples. We discuss this in more detail in Chapter 4, but mention it here simply to illustrate that fuzzy clustering can be used in the context of feature analysis.

Another possibility is to use the c distances from each point in the original data to the cluster centers as (c-dimensional) extracted features that replace the original p-dimensional features. We close Chapter 2 with an example that shows how important feature selection can be in the context of a real data application: segmentation of a digital image.
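A small sketch of this distance-based extraction (the function name is illustrative, and V is assumed to come from a clustering run such as the hypothetical fcm helper sketched earlier):

```python
import numpy as np

def distance_features(X, V):
    """Replace each p-dimensional point in X (n, p) by its c distances
    to the terminal cluster centers V (c, p). Returns an (n, c) array."""
    return np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
```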


Example 2.21 To show how feature selection can affect a real world pattern recognition problem, consider the segmentation of a 7 channel satellite image taken from (Barni et al., 1996, Krishnapuram and Keller, 1996). Figure 2.23(a) shows channel 1. Barni et al. applied FCM pixel-based clustering with c = 4 to this multispectral image, which had p = 7 bands with spatial dimensions 512x699. Thus, data set X contained n = 512x699 = 357,888 pixel vectors in p = 7-space. (Pixel vector $\mathbf{x}_{ij} = (x_{ij1}, \ldots, x_{ij7})^T \in X$ is the vector of 7 intensities taken from the spatial location in the image with address (i, j), 1 ≤ i ≤ 512; 1 ≤ j ≤ 699.) In this example we processed the image for two sets of features with FCM using c = 4, m = 2, the Euclidean norm for both termination and $J_m$, and $\varepsilon = 0.1$. FCM was initialized with the first four pixel vectors from the image as $V_0$.
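A sketch of the reshaping step that turns a multiband image into pixel-vector data for clustering (the array names, placeholder data and the hypothetical fcm helper from the earlier sketch are assumptions, not the cited implementation):

```python
import numpy as np

# image: (rows, cols, bands) array of intensities, e.g. (512, 699, 7).
image = np.zeros((512, 699, 7), dtype=np.float64)       # placeholder data

X = image.reshape(-1, image.shape[2])                    # (n, p) pixel vectors, n = 357,888
# U, V = fcm(X, c=4, m=2.0, eps=0.1)                     # cluster the pixel vectors
# labels = U.argmax(axis=0).reshape(image.shape[:2])     # hardened segmentation image
```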

Figure 2.23 (a) Channel 1 of a 7 band satellite image

While this image has 4 main clusters (water, woods, agricultural land, and urban areas), when viewed in proper resolution there are many regions that do not fit into any of the four main categories.

For example, what about beaches? Do we distinguish between roads, bridges and buildings or lump them all into the category of urban areas? In the latter case, do the features allow us to do that?


Figure 2.23 (b) FCM segmentation using all 7 features


Figure 2.23(c) FCM segmentation using only 2 features

The seven channels in this image are highly correlated. To illustrate this, we show the FCM segmentation when all 7 channels are used (Figure 2.23(b)), and when only channels 5 and 6 are used (Figure 2.23(c)). Visually similar results imply that channels 1-4 and 7 don't contribute much to the result in Figure 2.23(b). From Figure 2.23(b) it appears (visually) that the FCM misclassification rate is high. This is mainly due to the choice of these features, which are not sufficiently homogeneous within each class and distinct between classes to provide good discrimination.
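One quick way to check this kind of redundancy is to inspect the interband correlation matrix of the pixel vectors (a generic check, not taken from the cited study; the random placeholder stands in for the (n, 7) pixel-vector array built from the image):

```python
import numpy as np

# X: (n, 7) pixel-vector array built from the image (random placeholder here).
X = np.random.default_rng(0).normal(size=(1000, 7))

# Interband correlation matrix; entries near 1 flag nearly redundant channels.
corr = np.corrcoef(X, rowvar=False)        # shape (7, 7)
print(np.round(corr, 2))
```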

Figure 2.24 Scatter plot of channel 5 vs channel 6 of the satellite image

Figure 2.24 is a scatterplot of the two features (channels 5 and 6) used for the segmentation shown in Figure 2.23(c). Since the number of data points is very large (512x699), to prevent clutter, only a subsample of the data set is shown, and in the subsample only two distinct clusters can be seen. The water region appears as the smaller and denser cluster, because in this region there is relatively less variation in the intensity values in all 7 channels. The highly reflective areas that appear white in the image show up as outliers in this mapping.

The larger cluster includes samples from all the remaining regions, and it is hard if not impossible to distinguish the remaining three classes within this cluster. If this data were to be used for classifier design (instead of clustering), we could tell from the scatterplot that the features would not be sufficient to distinguish 4 classes. Other, more complex features, such as texture, would be needed.

2.6 Comments and bibliography