From the document PRACTICAL DATA MINING, Hancock (Pages 102-106)


4.6 Data Evaluation Case Study: Estimating the Information Content of Features

This case study describes how a simple clustering algorithm can be used in a supervised learning mode to assist with feature selection. Because this is a very important aspect of data mining that really needs some automation, we will go into more detail than usual.

Recall that features are just measurements in some problem domain. For this discussion, we will assume:

Figure 4.2 Pattern recognition and visualization. When "meaning" can be determined for "clusters," useful patterns and discriminators can sometimes be revealed.

1. The data is in a comma-delimited spreadsheet (a CSV file). Each row is a feature vector (one data sample); each cell is one feature.

2. The data are numeric.

3. Column 1 provides row numbers used to identify the records (1, 2, 3, etc.).

4. The last column contains a positive integer that is a ground truth assignment for the row. Therefore, this data can be used for supervised learning (Chapter 9).
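Under these assumptions, loading such a file is straightforward. The sketch below (the sample data and the helper name load_rows are ours, purely for illustration) parses each record into a row number, a numeric feature vector, and a ground truth class label:

```python
import csv
import io

# Made-up sample in the assumed layout: column 1 is a row number, the
# middle columns are numeric features, and the last column is a
# positive-integer ground-truth class label.
sample_csv = """1,0.5,1.2,3.4,1
2,0.7,1.1,3.9,1
3,2.5,0.2,1.0,2
"""

def load_rows(text):
    """Parse CSV text into (row_id, feature_vector, class_label) triples."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        row_id = int(rec[0])
        features = [float(x) for x in rec[1:-1]]
        label = int(rec[-1])
        rows.append((row_id, features, label))
    return rows

rows = load_rows(sample_csv)
print(rows[0])  # -> (1, [0.5, 1.2, 3.4], 1)
```

For a real data set, the same parse would be applied to a file opened from disk rather than an in-memory string.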

When many columns of data are available (i.e., each vector has many features), choosing the right ones to use is hard for a number of reasons:

1. Lots of columns means lots of dimensions when viewed geometrically.

2. The data in the columns can interact in complicated ways. For example, two weak pieces of evidence together sometimes provide more information than one strong piece of evidence alone.

3. There are a huge number of possible combinations in which columns could be chosen/rejected as features for a data mining project, so it is time-consuming (or impractical) to check them all. For example, if there are 30 columns, there are 2^30 - 1 > one billion ways to choose which subset of features to use for mining/modeling.
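The subset count in item 3 is easy to verify: each of the 30 columns is either in or out of a feature set, and the empty set is excluded.

```python
# Number of non-empty subsets of 30 candidate feature columns.
n_columns = 30
n_subsets = 2 ** n_columns - 1
print(n_subsets)  # -> 1073741823, just over one billion
```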

There are many ways to assess a subset of features for information content. A notional description of a Monte Carlo approach is now described.

Figure 4.3 Clustered representation of selected feature set.

The information assessment begins by reading in the data to be analyzed, and computing the mean and standard deviation of each feature for each of the ground truth classes. That is, the mean and standard deviation are computed for each column over all the rows in ground truth class 1, giving the center and variability of the class 1 data; then for the class 2 data, and so on.
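This per-class computation can be sketched as follows (the toy data and the helper name class_stats are ours, not the book's; the population standard deviation is assumed):

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical toy data: each tuple is (feature_vector, class_label).
data = [
    ([1.0, 10.0], 1),
    ([3.0, 12.0], 1),
    ([8.0, 2.0], 2),
    ([10.0, 4.0], 2),
]

def class_stats(data):
    """Per-class center (mean) and spread (standard deviation) per feature."""
    by_class = defaultdict(list)
    for vec, label in data:
        by_class[label].append(vec)
    stats = {}
    for label, vecs in by_class.items():
        cols = list(zip(*vecs))  # group values column by column
        stats[label] = ([mean(c) for c in cols], [pstdev(c) for c in cols])
    return stats

stats = class_stats(data)
print(stats[1])  # -> ([2.0, 11.0], [1.0, 1.0])
```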

To determine which columns contain information useful for classifying the data into its ground truth classes, we test many randomly selected subsets of the available columns, and keep track of which subset gives the best results for a weighted nearest-neighbor classifier (described later). The process operates as follows:

Algorithm Phase A

Step 1: Read in the data file.

Step 2: Segment into calibration, training, and validation files (row order randomized).

Step 3: Compute centers and standard deviations for each class in the calibration segment.

Algorithm Phase B

Step 1: Select a subset of the columns to test (a clique).

Step 2: Use the centers and standard deviations computed in Phase A for the clique to assign each data point in the training segment to a class (weighted nearest-neighbor classifier).

Step 3: Compute performance statistics for this clique, i.e., its classification accuracy (% correct) on the training segment.

Repeat Phase B for many feature cliques. The features in the best clique (the one with the highest accuracy score) are the ones that, as a group, carry the most useful information for classification among those tested. This winning team comprises our selected feature set. Figure 4.3 shows how cluster shape is related to how informative particular features are: cluster 1 is tall and thin, and cluster 2 is short and wide.
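The Phase A / Phase B loop can be sketched under simplifying assumptions: a tiny made-up data set stands in for the calibration and training segments, and the "weighted" distance is taken to divide each feature difference by that class's standard deviation before summing squares. All names and data here are illustrative, not from the book.

```python
import random
from statistics import mean, pstdev

random.seed(1)

# Hypothetical toy data: (feature_vector, class_label). Feature index 2
# carries no class information, so good cliques should not depend on it.
data = [([1.0, 10.0, 5.0], 1), ([2.0, 12.0, 5.1], 1), ([1.5, 11.0, 4.9], 1),
        ([8.0, 2.0, 5.0], 2), ([9.0, 4.0, 5.1], 2), ([8.5, 3.0, 4.9], 2)]

def class_stats(data):
    """Phase A: per-class center and spread for each feature column."""
    by_class = {}
    for vec, label in data:
        by_class.setdefault(label, []).append(vec)
    stats = {}
    for label, vecs in by_class.items():
        cols = list(zip(*vecs))
        # Floor the spread so the weighted distance stays finite.
        stats[label] = ([mean(c) for c in cols],
                        [max(pstdev(c), 1e-9) for c in cols])
    return stats

def classify(vec, stats, clique):
    """Assign vec to the class whose center is closest in sigma units,
    using only the features in the clique."""
    def wdist(label):
        center, sigma = stats[label]
        return sum(((vec[i] - center[i]) / sigma[i]) ** 2 for i in clique)
    return min(stats, key=wdist)

def accuracy(data, stats, clique):
    hits = sum(classify(vec, stats, clique) == label for vec, label in data)
    return hits / len(data)

stats = class_stats(data)
n_features = 3
best = (0.0, None)
for _ in range(20):  # Phase B, repeated: Monte Carlo trials over cliques
    clique = [i for i in range(n_features) if random.random() < 0.5]
    if not clique:
        continue
    best = max(best, (accuracy(data, stats, clique), tuple(clique)))
print(best)
```

Since features 0 and 1 separate the two classes cleanly, any clique containing either of them scores perfectly on this toy data, and the loop records one such clique as the winner.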

A weighted nearest-neighbor classifier is based upon well-known statistical principles. It was chosen for this application for several reasons, but the most important is that no retraining is required when a new feature clique is to be evaluated; features not selected are simply ignored in the calculation. This makes it possible to run a large number of clique tests very quickly.

Consider a point located at position A. Should it be assigned to cluster 1 or cluster 2? A slightly harder example is depicted in Figure 4.4, along with an objective measurement that will help us answer this question.

To take the shape of each class into account, the standard deviation, denoted by the Greek letter sigma (σ), is computed for each class along each feature (Figure 4.5). This is a built-in computation in many spreadsheets. The standard deviation expresses the variability in a data set, and is often used for data normalization (z-scores) and detection of outliers in data (discussed later in this chapter).
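As a small illustration of the z-score use mentioned above (the data values are made up), a point is converted to sigma units by subtracting the column mean and dividing by the column standard deviation; a large absolute z-score flags a likely outlier:

```python
from statistics import mean, pstdev

# One made-up feature column with a planted outlier at 25.0.
column = [10.0, 11.0, 9.0, 10.5, 9.5, 25.0]

mu, sigma = mean(column), pstdev(column)
z_scores = [(x - mu) / sigma for x in column]

# A threshold of 2 sigma is an illustrative choice; 3 sigma is also common.
outliers = [x for x, z in zip(column, z_scores) if abs(z) > 2]
print(outliers)  # -> [25.0]
```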

Figure 4.4 Tolerance of variation in clusters.

Using the standard deviations in each feature for a class cluster, a weighted distance can be computed from the center of that class to any point in the feature space (Figure 4.6). If the data cluster in a natural way, this weighted distance can be used to determine how a point should be classified: just assign it to the class whose cluster is closest in this weighted distance (Figure 4.7).
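A minimal sketch of this idea, assuming the weighted distance divides each feature difference by that class's sigma before the usual Euclidean sum (the centers and sigmas below are illustrative, not taken from the book's figures):

```python
import math

# Two hypothetical class clusters with per-feature centers and sigmas.
clusters = {
    1: {"center": (0.0, 0.0), "sigma": (1.0, 4.0)},  # tall, thin cluster
    2: {"center": (6.0, 0.0), "sigma": (4.0, 1.0)},  # short, wide cluster
}

def weighted_distance(point, cluster):
    """Distance from point to a cluster center, in sigma units per feature."""
    return math.sqrt(sum(((p - c) / s) ** 2
                         for p, c, s in zip(point, cluster["center"],
                                            cluster["sigma"])))

# This point is equidistant from both centers in raw units (3.0 either way),
# but cluster 2's large horizontal sigma makes it much closer in sigma units.
point_a = (3.0, 0.0)
assigned = min(clusters, key=lambda k: weighted_distance(point_a, clusters[k]))
print(assigned)  # -> 2
```

This shows why cluster shape matters: an unweighted distance would call the point a tie, while the sigma-weighted distance assigns it to the wide cluster.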

If the data classes do not form nice clusters, some pathological conditions can arise.

This method of evaluating features is fast and simple, but not perfect (Figure 4.8).

To create a numeric measure of the classification power of a subset of the available features, this very fast weighted nearest-neighbor classifier is run repeatedly on a calibration set with various sets of features, and the best collection is remembered. Also, if the same feature appears in many high-performing feature sets, we conclude that it is probably good. In this way, the clustering algorithm described here is used to evaluate feature sets in a Monte Carlo fashion.

This particular application prints out a spreadsheet report giving the classification power of various feature sets (Figure 4.9). In the figure, 1 means the column's feature was present in that set, while 0 means it was not. The application has tried all 32 (2^5) possible feature cliques for this 5-dimensional data. This output gives the performance measures for all of them, so the user can see the value of including or excluding the various feature combinations.

Figure 4.5 Standard deviation and variability of data.

The performance of each feature clique is shown in a row of Figure 4.9. For example, row 23 shows that it is possible to classify 86.13% of the points correctly using only features 2, 3, and 5.
