
FIGURE 4.2 Relationship between the terms "accuracy" and "precision."

When data are missing in a variable for a particular case, it is very important, if possible, to fill that variable with a reasonable value. Adding a sensible estimate of a suitable data value is better than leaving the blank in place. The operation of deciding what data to use to fill these blanks is called data imputation. This term means that you assign data to the blank based on some reasonable heuristic (a rule or set of rules). In deciding what values to use to fill blanks in the record, you should follow the cardinal rule of data imputation, "Do the least harm" (Allison, 2002).

Selection of the proper technique for handling missing values depends on making the right assumption about the pattern of “missingness” in the data set. If there is a strong pattern among the missing values of a variable (e.g., caused by a broken sensor), the variable should be eliminated from the model.
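
To see how much is missing, and where, before choosing a technique, a quick profile of the data helps. The following is a minimal SAS sketch (the data set name WORK.MYDATA is hypothetical); PROC MEANS with the N and NMISS keywords reports, for every numeric variable, how many values are present and how many are missing:

   /* Profile missingness: N = nonmissing count, NMISS = missing count */
   /* for every numeric variable in the (hypothetical) input data set. */
   PROC MEANS DATA=work.mydata N NMISS;
   RUN;

Variables with a very large or strongly patterned NMISS count (e.g., those fed by a broken sensor) are candidates for elimination rather than imputation.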

Assumption of Missing Completely at Random (MCAR)

The assumption of Missing Completely at Random (MCAR) is satisfied when the probability of missing values in one variable is unrelated to the value of that variable itself or to the values of any other variable. If this assumption is satisfied, then the nonmissing values of each variable can be considered a random sample of all values of that variable in the underlying population from which the data set was drawn. This assumption may be unreasonable (may be violated) if, for example, older people refuse to list their ages more often than younger people do. On the other hand, the assumption may be reasonable when some variable is very expensive to measure and is measured for only a subset of the data set.

Assumption of Missing at Random (MAR)

The assumption of Missing at Random (MAR) is satisfied when the probability that a value is missing in a variable may be related to the observed values of other variables but is unrelated to the value of the variable itself, once those other variables are taken into account. Allison (2002) considers this to be a weaker assumption than MCAR. For example, MAR would be satisfied if the probability of missing income were related to marital status but, within each marital status, unrelated to income (Allison, 2002). If MAR is satisfied, the mechanism causing the missing data may be considered to be "ignorable." That is, it doesn't matter why the data are missing, only that the MAR assumption holds.

Techniques for Imputing Data

Most statistical and data mining packages have some facility for handling missing values. Often, this facility is limited to simple recoding (replacing the missing value with some value). Some statistical tool packages (like SPSS) have more complete missing value modules that provide some multivariate tools for filling missing values. We will describe briefly a few of the techniques. Following the discussion of methods, some guidelines will be presented to help you choose which method to use.

List-wise (or case-wise) deletion. This means that the entire record is deleted from the analysis. It is the default behavior of many statistical and machine learning algorithms. The technique has a number of advantages:

• It can be used for any kind of data mining analysis.

• No special statistical methods are needed to accomplish it.

• This is the safest method when data are MCAR.

• It is good for data with variables that are completely independent (the effect of each variable on the dependent variable is not affected by the effect of any other variable).

• Usually, it is applicable to data sets suitable for linear regression and is even more appropriate for use with logistic and Poisson regression.


Despite its safety and ease of use, list-wise deletion has some disadvantages:

• You lose the nonmissing information in the record, and the total information content of your data set will be reduced.

• If data are MAR, list-wise deletion can produce biased estimates (if salary level depends positively on education level—that is, salary level rises as education level rises—then list-wise deletion of cases with missing salary level data will bias the analysis toward lower education levels).
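
To make list-wise deletion concrete, here is a minimal SAS sketch; the data set WORK.MYDATA and the analysis variables AGE, INCOME, and EDUC are hypothetical:

   /* List-wise (case-wise) deletion: delete every record that has a */
   /* missing value in any of the analysis variables.                */
   DATA work.complete_cases;
      SET work.mydata;
      IF cmiss(age, income, educ) > 0 THEN DELETE;  /* CMISS counts missing arguments */
   RUN;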

Pair-wise deletion. This means that all the cases with values for a variable will be used to calculate the covariance of that variable. The advantage of this approach is that a linear regression can be estimated from only the sample means and a covariance matrix (listing covariances for each pair of variables). Regression algorithms in some statistical packages use this method to preserve the inclusion of all cases (e.g., PROC CORR in SAS and napredict in R). A major drawback of pair-wise deletion, however, is that the resulting summary statistics (e.g., correlation matrices) may not be internally consistent, because each entry is computed from a different subset of cases. This approach is justified only if the data are MCAR.

If data are only MAR, this approach can lead to significant bias in the estimators.
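
As an illustration (hypothetical data set and variable names), PROC CORR in SAS applies pair-wise deletion by default when it computes correlations and covariances; adding the NOMISS option would force list-wise deletion instead:

   /* Pair-wise deletion: each correlation and covariance is computed */
   /* from all cases that have values for that particular pair of     */
   /* variables. The COV option also prints the covariance matrix.    */
   PROC CORR DATA=work.mydata COV;
      VAR age income educ;
   RUN;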

Reasonable value imputation. Imputation of missing values with the mean of the nonmissing cases is often referred to as mean substitution. If you can safely apply some decision rule to supply a specific value for the missing value, that value may be closer to the true value than even the mean substitution would be. For example, it is more reasonable to replace a missing value for number of children with zero than to replace it with the mean or the median number of children based on all the other records (many couples are childless). For some variables, filling blanks with means might make sense; in other cases, use of the median might be more appropriate. So, you may have a variety of missing value situations, and you must have some way to decide which values to use for imputation. In SAS, this approach is facilitated by the availability of 28 missing value codes, which can be dedicated to different reasons for the missing value. Cases (rows) for each of these codes can be imputed with a different reasonable value.
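
A brief SAS sketch of these two ideas, with hypothetical data set and variable names: a decision rule fills a blank number of children with zero, and PROC STDIZE with the REPONLY option performs mean substitution for a variable where the mean is an acceptable fill value.

   /* Rule-based fill: a blank number of children is interpreted as zero. */
   DATA work.rule_filled;
      SET work.mydata;
      IF missing(n_children) THEN n_children = 0;
   RUN;

   /* Mean substitution: REPONLY replaces only the missing values with */
   /* the mean and leaves the nonmissing values unchanged.             */
   PROC STDIZE DATA=work.rule_filled OUT=work.mean_filled
               METHOD=mean REPONLY;
      VAR income;
   RUN;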

Maximum Likelihood Imputation

The technique of maximum likelihood imputation assumes that the predictor variables are independent. It uses a function that describes the probability density map (analogous to a topographic map) to calculate the likelihood of a given missing value, using cases where the value is not missing. A second routine maximizes this likelihood, analogous to finding the highest point on the topographic map.
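
One practical route in SAS is the EM statement of PROC MI, shown below as a hedged sketch with hypothetical names; note that it maximizes the likelihood under a multivariate normal model rather than assuming independent predictors, and it writes out a data set in which missing values are replaced by their expected values at the maximum likelihood solution:

   /* NIMPUTE=0 suppresses the usual multiple imputations; the EM        */
   /* statement runs the EM algorithm and writes the EM-filled data set. */
   PROC MI DATA=work.mydata NIMPUTE=0;
      EM OUT=work.em_filled;
      VAR age income educ;
   RUN;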

Multiple Imputation

Rather than just picking a single value (like the mean) to fill blanks, a more robust approach is to let the data decide what value to use. This multiple imputation approach uses multiple variables to predict which values for the missing data are most probable.

Simple Random Imputation. This technique calculates a regression on all the nonmissing values in all of the variables to estimate the value that is missing. This approach tends to underestimate standard error estimates. A better approach is to do this multiple times.
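
A minimal SAS sketch of a single regression imputation (hypothetical names): the regression is fit on the complete cases, and its predictions are used to fill the blanks in the dependent variable.

   /* Fit the regression on cases with nonmissing INCOME; the OUTPUT  */
   /* statement computes predicted values for every case that has the */
   /* predictors, even when INCOME itself is missing.                 */
   PROC REG DATA=work.mydata NOPRINT;
      MODEL income = age educ;
      OUTPUT OUT=work.predicted P=pred_income;
   RUN;
   QUIT;

   /* Replace only the missing incomes with the regression predictions. */
   DATA work.reg_imputed;
      SET work.predicted;
      IF missing(income) THEN income = pred_income;
   RUN;

Because the filled-in values fall exactly on the regression line, they understate the natural variability of the data, which is why the standard errors are underestimated and why the repeated, randomized versions described next are preferred.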


Multiple Random Imputation. In these techniques, a simple random imputation is repeated multiple times (n times). This method is more realistic because it treats regression parameters (e.g., means) as sample values within a data distribution. An elaboration of this approach is to perform the multiple random imputation m times with different data samples. The global mean imputed value is then calculated across the multiple samples and multiple imputations.

SAS includes a procedure MI to perform multiple imputations. The following is a simple SAS program for multiple imputation, using PROC MI:

PROC MI data=<your data set>
        out=<your output file>;
   var <your variable list>;
run;

Output to <your output file> are five data sets (the PROC MI default) collated into one data set, each characterized by its own value of a new variable, _Imputation_.

Another SAS procedure, PROC MIANALYZE, can be used to analyze the output of PROC MI. It provides an output column titled "Fraction Missing Information," which shows an estimate of how much information was lost by imputing the missing values. Allison (2002) provides much greater detail and gives very practical instructions for how to use these two SAS procedures for multiple imputation of missing values.
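
As a hedged sketch of how the two procedures fit together (the model and variable names are hypothetical): each imputed data set produced by PROC MI is analyzed separately using BY _Imputation_, and PROC MIANALYZE then combines the resulting parameter estimates.

   /* Analyze each imputed data set separately; OUTEST= and COVOUT save */
   /* the parameter estimates and their covariance matrices.            */
   PROC REG DATA=<your output file> OUTEST=work.reg_est COVOUT NOPRINT;
      MODEL income = age educ;
      BY _imputation_;
   RUN;
   QUIT;

   /* Combine the estimates and standard errors across the imputations. */
   PROC MIANALYZE DATA=work.reg_est;
      MODELEFFECTS intercept age educ;
   RUN;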

See Table 4.1 for guidelines to follow for choosing the right imputation techniques for a given variable.

Recommendations

1. If you have a lot of cases, delete records with missing values.

2. If you are using linear regression as a modeling algorithm and have few missing values, use pair-wise deletion.

3. If you are using SAS, use PROC MI.

4. If you have any insight as to what the value ought to be, fill missing values with reasonable values.

5. Otherwise, use mean imputation.

Most data mining and statistical packages provide a facility for imputation with mean values.