
In data mining, data sampling serves four purposes:

1. It can reduce the number of data cases submitted to the modeling algorithm. In many cases, you can build a relatively predictive model on 10–20% of the data cases. After that, the addition of more cases has sharply diminishing returns. In some cases, like retail market-basket analysis, you need all the cases (purchases) available. But usually, only a relatively small sample of data is necessary.

This kind of sampling is called simple random sampling. The theory underlying this method is that each case has the same chance of being selected as any other case. (A short code sketch illustrating all four sampling purposes appears after this list.)

2. It can help you select only those cases in which the response patterns are relatively homogeneous.

If you want to model telephone calling behavior patterns, for example, you might judge that calling behaviors are distinctly different in urban and rural areas. Dividing your data set into urban and rural segments is called partitioning the database. It is a good idea to build separate models on each partition.

When this partitioning is done, you should randomly select cases within each defined partition. Such a sampling is called stratified random sampling. The partitions are the "strata" that are sampled separately.

3. It can help you balance the occurrence of rare events for analysis by machine learning tools.

As mentioned earlier in this chapter, machine learning tools like neural nets and decision trees are very sensitive to unbalanced data sets. An unbalanced data set is one in which one category of the target variable is relatively rare compared to the other ones.

Balancing the data set involves sampling the rare categories more than average (oversampling) or sampling the common categories less often (undersampling).


4. Finally, simple random sampling can be used to divide the data set into three data sets for analysis:

a. Training set: These cases are randomly selected for use in training the model.

b. Testing set: These cases are used to assess the predictability of the model, before refining the model or adding model enhancements.

c. Validation set: These cases are used to test the final performance of the model after all modeling is done.

Some data miners define the second set as the validation set and the third set as the testing set. Whatever nomenclature you prefer, the testing process should proceed in the manner described. The third data set is needed because it is not appropriate to report model performance on the basis of the second data set, which was used in the process of creating the model. That situation would create a logical tautology, that is, using a thing to describe itself. The validation data set should be kept completely separate from the iterative modeling process. We describe this iterative process in greater detail in Chapter 6.
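The following is a minimal pandas sketch of all four sampling operations. The data set, column names, sampling fractions, and the 60/20/20 split are placeholders for illustration, not prescriptions.

```python
import pandas as pd

# Hypothetical data set; columns and proportions are placeholders.
df = pd.DataFrame({
    "region":  ["urban"] * 800 + ["rural"] * 200,
    "churn":   [1] * 50 + [0] * 950,
    "minutes": range(1000),
})

# 1. Simple random sampling: every case has the same chance of selection.
simple_sample = df.sample(frac=0.10, random_state=42)

# 2. Stratified random sampling: draw the same fraction within each partition (stratum).
stratified_sample = df.groupby("region").sample(frac=0.10, random_state=42)

# 3. Balancing a rare target category.
rare = df[df["churn"] == 1]
common = df[df["churn"] == 0]
undersampled = pd.concat(
    [rare, common.sample(n=len(rare), random_state=42)])                  # undersampling
oversampled = pd.concat(
    [common, rare.sample(n=len(common), replace=True, random_state=42)])  # oversampling

# 4. Splitting into training, testing, and validation sets (60/20/20 here).
shuffled = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
n = len(shuffled)
train = shuffled.iloc[: int(0.6 * n)]
test = shuffled.iloc[int(0.6 * n): int(0.8 * n)]
validate = shuffled.iloc[int(0.8 * n):]

print(len(simple_sample), len(stratified_sample),
      undersampled["churn"].value_counts().to_dict(),
      len(train), len(test), len(validate))
```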

Reduction of Dimensionality

Now that we have our Analytical Record (amplified, if necessary, with temporal abstractions), we can proceed to weed out unnecessary variables. But how do you determine which variables are unnecessary to the model before you train the model? This is almost a Catch-22 situation. Fortunately, there are several techniques that can give you some insight into which variables to submit to the algorithm and which ones to delete from your Analytical Record.

Correlation Coefficients

One of the simplest ways to assess variable relationships is to calculate the simple correlation coefficients between variables. Table 4.3 is a correlation matrix, showing pairwise correlation coefficients. These data have been used in many texts and papers as examples of predictor variables used to predict the target variable, crime rate.

From the correlation matrix in Table 4.3, we can learn two very useful things about our data set.

TABLE 4.3  Correlations of Some Variables in the Boston Housing Data Set. Correlations in Red Are Significant at the 95% Confidence Level

                      Crime Rate   Nonretail Bus Acres   Charles River   Dist to Empl Centers   Property Tax Rate
Crime Rate              1.000000              0.406583        0.055892               0.379670            0.582764
Nonretail Bus Acres     0.406583              1.000000        0.062938               0.708027            0.720760
Charles River           0.055892              0.062938        1.000000               0.099176            0.035587
Dist to Empl Centers    0.379670              0.708027        0.099176               1.000000            0.534432
Property Tax Rate       0.582764              0.720760        0.035587               0.534432            1.000000


First, we can see that the correlations of most of the variables with crime rate are relatively high and significant, but that for Charles River proximity is relatively low and insignificant. This means that Charles River may not be a good variable to include in the model. The other thing we can learn is that none of the correlations of the predictor variables is greater than 0.90. If a correlation between two variables exceeded 0.90 (a common rule-of-thumb threshold), their effects would be too collinear to include in the model. Collinearity occurs when plots of two variables against the target lie on almost the same line. Too much collinearity among variables in a model (multicollinearity) will render the solution ill-behaved, which means that there is no unique optimum solution. Rather, there will be too much overlap in the effects of the collinear variables, making interpretation of the results problematic.
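As a quick illustration, the pairwise correlations in Table 4.3 can be reproduced with a few lines of code, assuming the Boston Housing data are available locally; the file name and column names below are placeholders for whatever your copy uses.

```python
import pandas as pd

# Assumes a local copy of the Boston Housing data; the file name and
# column names are placeholders.
boston = pd.read_csv("boston_housing.csv")

predictors = ["crime_rate", "nonretail_bus_acres", "charles_river",
              "dist_to_empl_centers", "property_tax_rate"]

# Pairwise Pearson correlation coefficients, as in Table 4.3.
corr = boston[predictors].corr()
print(corr.round(3))

# Flag predictor pairs whose correlation exceeds the 0.90 rule-of-thumb threshold.
for i, a in enumerate(predictors):
    for b in predictors[i + 1:]:
        if abs(corr.loc[a, b]) > 0.90:
            print(f"Potential collinearity: {a} and {b} ({corr.loc[a, b]:.2f})")
```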

Chi-Square Automatic Interaction Detection (CHAID)

The CHAID algorithm is used occasionally as the final modeling algorithm, but it has a number of disadvantages that limit its effectiveness as a multivariate predictor. It is used more commonly for variable screening to reduce dimensionality. But even here, there is a problem of bias toward variables with more levels for splits, which can skew the interpretation of the relative importance of the predictors in explaining responses on the dependent variable (Breiman et al., 1984).

Despite the possible bias in variable selection, it is used commonly as a variable screening method in several data mining tools (e.g., STATISTICA Data Miner).
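CHAID itself is rarely available outside dedicated data mining tools, but the core screening idea, ranking categorical predictors by a chi-square test against the target, can be sketched roughly as follows. This is a simplification of that idea, not the CHAID algorithm, and the data and column names are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical categorical predictors and a categorical target.
df = pd.DataFrame({
    "region":   ["N", "S", "N", "E", "S", "E", "N", "S"] * 50,
    "segment":  ["A", "B", "A", "A", "B", "B", "A", "B"] * 50,
    "response": ["yes", "no", "yes", "no", "no", "yes", "yes", "no"] * 50,
})

# Test each predictor against the target; smaller p-values suggest stronger candidates.
for predictor in ["region", "segment"]:
    table = pd.crosstab(df[predictor], df["response"])
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{predictor}: chi-square = {chi2:.2f}, p = {p:.4f}")
```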

Principal Components Analysis (PCA)

Often, PCA is used to identify some of the strong predictor variables in a data set. PCA is a technique for revealing the relationships between variables in a data set by identifying and quantifying a group of principal components. These principal components are composed of transformations of specific combinations of input variables that relate to a given output (or target) variable (Jolliffe, 2002). Each principal component accounts for a decreasing amount of the variation in the raw data set. Consequently, the first few principal components express most of the underlying structure in the data set. Principal components have been used frequently in studies as a means to reduce the number of raw variables in a data set (Fodor, 2002; Hall and Holmes, 2003). When this is done, the original variables are replaced by the first several principal components. This approach to the analysis of variable relationships does not specifically relate the input variables to any target variable. Consequently, the principal components may hide class differences in relation to the target variable (Hand et al., 2001). In one study of the application of PCA to face recognition, the principal components tended to express characteristics that were not the ones best suited to face recognition (Belhumeur et al., 1997). Therefore, you may see varying success with the use of PCA for dimensionality reduction to create the proper set of variables to submit to a modeling algorithm.
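A minimal scikit-learn sketch of PCA for dimensionality reduction follows; the predictor matrix is simulated here, and the choice of three components is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

# Simulated numeric predictor matrix: 500 cases by 10 variables.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10))

# Standardize first, since PCA is sensitive to differences in variable scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_std)

# Each successive component accounts for a decreasing share of the variance.
print(pca.explained_variance_ratio_)
print(X_reduced.shape)  # (500, 3)
```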

Gini Index

The Gini Index was developed by the Italian statistician Corrado Gini in 1912, for the purpose of rating countries by income distribution. The maximum Gini Index = 1 would mean that all the income belongs to one country. The minimum Gini Index = 0 would mean that the income is evenly distributed among all countries. This index measures the degree of unevenness in the spread of values in the range of a variable. The theory is that a variable with a relatively large amount of unevenness in the frequency distribution of values in its range (a high Gini Index value) has a higher probability of serving as a predictor variable for another related variable. We will revisit the Gini Index in Chapter 5.
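A small sketch of how a Gini Index can be computed for the values of a single variable; the formula used is the standard one based on the ordered cumulative totals, and the example values are hypothetical.

```python
import numpy as np

def gini_index(values):
    """Gini Index of a set of non-negative values:
    0 means perfectly even; values near 1 mean nearly all of the total sits in one case."""
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    cum = np.cumsum(v)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

print(gini_index([10, 10, 10, 10]))  # 0.0  -> perfectly even distribution
print(gini_index([0, 0, 0, 100]))    # 0.75 -> highly uneven distribution
```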

Graphical Methods

You can look at correlation coefficients to gain some insight into relationships between numeric variables, but what about categorical variables? All is not lost. Some data mining tools have specialized graphics to help you determine the strength of relationships between categorical variables. For example, SPSS Clementine provides the Web node, which draws lines between categorical variables positioned on the periphery of a circle (Figure 4.4). The width of a connecting line represents the strength of the relationship between the two variables. The Web diagram shows a strong relationship between a preference of "No" for Diet Pepsi (located at about the 4 o'clock position on the periphery of the diagram) and "No" for Diet 7UP (at about 2 o'clock). There are no links between "Yes" for Diet Pepsi and "Yes" for Diet 7UP (at about 5:30 and 2:30, respectively). Therefore, you might expect a relatively strong relationship between the "No" preferences for these beverages.
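If your tool lacks a Web node, a plain cross-tabulation gives a rough numeric view of the same categorical association; the beverage-preference data below are hypothetical.

```python
import pandas as pd

# Hypothetical yes/no preference data for two beverages.
df = pd.DataFrame({
    "diet_pepsi": ["No", "No", "Yes", "No", "No", "Yes", "No", "No"] * 25,
    "diet_7up":   ["No", "No", "No",  "No", "Yes", "No", "No", "Yes"] * 25,
})

# Large counts in the ("No", "No") cell play the role of the thick line
# between the two "No" nodes in the Web diagram.
print(pd.crosstab(df["diet_pepsi"], df["diet_7up"]))
```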

Other common techniques used for reduction of dimensionality are

• Multidimensional scaling

• Factor analysis

• Singular value decomposition

• Employing the “Kernel Trick” to map data into higher-dimensional spaces. This approach is used in Support Vector Machines and other Kernel Learning Machines, like KXEN. (See Aizerman et al., 1964.)
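As one example from this list, a singular value decomposition can be sketched directly with NumPy; the data matrix is simulated and the number of retained components is arbitrary.

```python
import numpy as np

# Simulated numeric data matrix: 200 cases by 15 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15))

# Center the columns, then keep only the leading singular directions.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 3
X_reduced = U[:, :k] * s[:k]  # cases projected onto the first k singular vectors
print(X_reduced.shape)        # (200, 3)
```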

FIGURE 4.4 Web diagram from SPSS Clementine.


Recommendation. You can gain a lot of insight on relationships from simple operations, like calculating the means, standard deviations, minimum, maximum, and simple correlation coefficients. You can employ the Gini score easily by running your data through one of the Gini programs on the DVD. Finally, you can get a quick multivariate view of data relationships by training a simple decision tree and inferring business rules. Then you will be ready to proceed in preparing your data for modeling.

These techniques are described and included in many statistical and data mining packages.
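As a closing illustration of these quick looks, the sketch below computes univariate summaries and simple correlations and then prints the rules of a shallow decision tree; the data and column names are hypothetical placeholders.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data; the columns and target are placeholders.
df = pd.DataFrame({
    "age":    [23, 45, 31, 52, 36, 27, 61, 48] * 25,
    "income": [30, 80, 55, 90, 60, 35, 95, 75] * 25,
    "bought": [0, 1, 0, 1, 1, 0, 1, 1] * 25,
})

# Means, standard deviations, minimums, and maximums.
print(df.describe())

# Simple correlations with the target.
print(df.corr()["bought"].round(3))

# A shallow decision tree as a quick multivariate view; the printed
# splits can be read as draft business rules.
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(df[["age", "income"]], df["bought"])
print(export_text(tree, feature_names=["age", "income"]))
```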