
Now that you have your data in some form (let's assume it is in flat-file format), how do you put all the pieces together? The challenge before you now is to create a combined data structure suitable for input to the data mining tool. Data mining tools require that all variables be presented as fields in a record. Consider the following data extracts from a Name & Address table and a Product table in the data warehouse:

File #1: Name & Address

Name         Address        City      State   Zipcode
John Brown   1234 E St.     Chicago   IL      60610
Jean Blois   300 Day St.    Houston   TX      77091
Neal Smith   325 Clay St.   Portland  OR      97201

File #2: Product

Name         Address        Product   Sales Date
John Brown   1234 E. St.    Mower     1/3/2007
John Brown   1234 E. St.    Rake      4/16/2006
Neal Smith   325 Clay St.   Shovel    8/23/2005
Jean Blois   300 Day St.    Hoe       9/28/2007

For the data mining tool to analyze names and products in the same analysis, you must organize data from each file to list all relevant items of information (fields) for a given customer on the same line of the output file. Notice that there are two records for John Brown in the Product table, each one with a different sales date. To integrate these records for data mining, you can create separate output records for each product sold to John Brown. This approach will work if you don't need to use Product as a predictor of the buying behavior of John Brown. Usually, however, you do want to include fields like Product as predictors in the model. In this case, you must create separate fields for each record and copy the relevant data into them. This second process is called flattening or denormalizing the database.

The resulting records look like this:

Name         Address        City      State   Zipcode   Product1   Product2
John Brown   1234 E. St.    Chicago   IL      60610     Mower      Rake
Neal Smith   325 Clay St.   Portland  OR      97201     Shovel
Jean Blois   300 Day St.    Houston   TX      77091     Hoe
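If your tool does not flatten records for you, a few lines of code will do it. The sketch below is one way to produce the record layout above with pandas; it is illustrative only, and it assumes the join keys (Name and Address) match exactly across extracts, which real extracts often require cleaning to achieve:

    import pandas as pd

    # Hypothetical extracts mirroring File #1 and File #2 above
    # (addresses written consistently so the join keys match)
    names = pd.DataFrame({
        "Name":    ["John Brown", "Jean Blois", "Neal Smith"],
        "Address": ["1234 E. St.", "300 Day St.", "325 Clay St."],
        "City":    ["Chicago", "Houston", "Portland"],
        "State":   ["IL", "TX", "OR"],
        "Zipcode": ["60610", "77091", "97201"],
    })
    products = pd.DataFrame({
        "Name":    ["John Brown", "John Brown", "Neal Smith", "Jean Blois"],
        "Address": ["1234 E. St.", "1234 E. St.", "325 Clay St.", "300 Day St."],
        "Product": ["Mower", "Rake", "Shovel", "Hoe"],
    })

    # Number each customer's products 1, 2, ... so each can become its own column
    products["slot"] = "Product" + (
        products.groupby(["Name", "Address"]).cumcount() + 1
    ).astype(str)

    # Pivot one-row-per-sale into one-column-per-product (flattening/denormalizing)
    wide = products.pivot(index=["Name", "Address"], columns="slot", values="Product")

    # Join back to the name-and-address data: one record per customer
    car = names.merge(wide.reset_index(), on=["Name", "Address"], how="left")
    print(car)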


In this case, the sales date was not extracted. All relevant data for each customer are listed in the same record. Sometimes, this is called the Customer Analytical Record, or CAR. In other industries (or for modeling entities other than customers), this data structure may be referred to as just the Analytical Record. We will use the term Analytical Record from here on.

To create the Analytical Record, you might have to combine data from several different fields, possibly drawn from several different data extract files, to form one field in the output file. Most data mining tools provide some data integration capabilities (e.g., merging, lookups, record concatenation).
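For example, separate street, city, state, and zip fields might be combined into a single mailing-address field. A minimal sketch (the field names and the output field are hypothetical):

    import pandas as pd

    df = pd.DataFrame({
        "Address": ["1234 E. St."], "City": ["Chicago"],
        "State": ["IL"], "Zipcode": ["60610"],
    })

    # Combine several input fields into one output field
    df["FullAddress"] = df[["Address", "City", "State", "Zipcode"]].agg(", ".join, axis=1)
    print(df["FullAddress"].iloc[0])  # 1234 E. St., Chicago, IL, 60610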

A second activity in data integration is the transformation of existing data to meet your modeling needs. For example, you might want to recode variables or transform them mathematically to fit a different data distribution (for working with specific statistical algorithms). Most data mining tools have some capability to do this to prepare data sets for modeling.
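For instance, a log transform can pull a long-tailed variable toward a more symmetric shape, and a numeric variable can be recoded into ordered bands. A minimal sketch with pandas and NumPy (the field name and cut points are hypothetical):

    import numpy as np
    import pandas as pd

    # Hypothetical sales field with a long right tail
    df = pd.DataFrame({"sales": [1.0, 3.0, 10.0, 48.0, 454.0]})

    # Mathematical transform: log1p compresses the long tail so the variable
    # better fits the assumptions of some statistical algorithms
    df["log_sales"] = np.log1p(df["sales"])

    # Recoding: bin the numeric variable into ordered categories
    df["sales_band"] = pd.cut(
        df["sales"],
        bins=[0, 5, 50, np.inf],
        labels=["low", "medium", "high"],
    )
    print(df)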

Both of these activities are similar to processes followed in building data warehouses.

But there is a third activity followed in building data warehouses that is missing in data preparation for data mining: loading the data warehouse. The extract, transform, and load activities in data warehousing are referred to by the acronym ETL. If your data integration needs are rather complex, you may decide to use one of the many ETL tools designed for data warehousing (e.g., Informatica, Ab Initio, DataFlux). These tools can process complex data preparation tasks with relative ease.

Recommendation: ETL tools are very expensive. Unless you plan to use such tools heavily, they will not be a cost-effective choice. Most data mining tools have some extract and transform functions (data integration), but not load functions. It is probable that you can do most of your data integration with the data mining tool, or with Excel and a good text editor (MS Notepad will work).

Data Description

This activity is composed of the analysis of descriptive statistical metrics for individual variables (univariate analysis), assessment of relationships between pairs of variables (bivariate analysis), and visual/graphical techniques for viewing more complex relationships between variables. Many books have been written on each of these subjects; they are a better source of detailed descriptions and examples of these techniques. In this handbook, we will provide an overview of these techniques sufficient to permit you to get started with the process of data preparation for data mining. Basic descriptive statistics include:

• Mean: Shows the average value; shows the central tendency of a data set

• Standard deviation: Shows the distribution of data around the mean

• Minimum: Shows the lowest value

• Maximum: Shows the highest value

• Frequency tables: Show the frequency distribution of values in variables

• Histograms: Provide a graphical technique to show the frequency distribution of values in a variable

Analysis and evaluation of these descriptive statistical metrics permit you to determine how to prepare your data for modeling. For example, a variable with a relatively low mean and a relatively high standard deviation has relatively low potential for predicting the target. Analysis of the minimum, maximum, and mean may alert you to the fact that you have some significant outlier values in the data set. For example, you might include the following data in a descriptive statistical report.

N       Mean       Min    Max        StDev
10000   9.769700   0.00   454.0000   15.10153
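A report like this can be produced in a couple of lines once the data are loaded. The sketch below uses pandas on synthetic stand-in data; the real NUM_SP1 values are not reproduced here:

    import numpy as np
    import pandas as pd

    # Synthetic stand-in for a long-tailed variable such as NUM_SP1
    rng = np.random.default_rng(0)
    num_sp1 = pd.Series(rng.exponential(scale=10.0, size=10_000), name="NUM_SP1")

    # N, mean, minimum, maximum, and standard deviation in one call
    print(num_sp1.agg(["count", "mean", "min", "max", "std"]))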

The maximum is 454, but the mean is only about 9.7, and the standard deviation is only about 15. This is a very suspicious situation. The next step is to look at a frequency table or histogram to see how the data values are distributed between 0 and 454. Figure 4.1 shows a histogram of this variable.

[Figure 4.1 here: histogram of NUM_SP1 (10,000 cases); x-axis: NUM_SP1, y-axis: No of obs; overlaid normal curve NUM_SP1 = 10000*9.1*normal(x, 11.1865, 15.3251)]

FIGURE 4.1 Histogram of the distribution of the NUM_SP1 variable, compared to a normal distribution.
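A histogram with an overlaid normal curve, in the style of Figure 4.1, can be drawn with matplotlib and SciPy. This is a sketch using the same kind of synthetic stand-in data, not the actual NUM_SP1 extract:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    # Synthetic stand-in for NUM_SP1: right-skewed, long-tailed
    rng = np.random.default_rng(0)
    num_sp1 = rng.exponential(scale=10.0, size=10_000)

    # Histogram of observed counts
    counts, bins, _ = plt.hist(num_sp1, bins=50, label="NUM_SP1")

    # Normal curve implied by the sample mean and standard deviation,
    # scaled to the same total count and bin width as the histogram
    x = np.linspace(bins[0], bins[-1], 200)
    bin_width = bins[1] - bins[0]
    curve = stats.norm.pdf(x, num_sp1.mean(), num_sp1.std()) * num_sp1.size * bin_width
    plt.plot(x, curve, "r-", label="normal curve")

    plt.xlabel("NUM_SP1")
    plt.ylabel("No of obs")
    plt.legend()
    plt.show()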


The maximum value of 454 is certainly an outlier, even in this long-tailed distribution.

You might be justified in deleting it from the data set, if your interest is to use this variable as a predictor. But this is not all you can learn from this graphic. This distribution is skewed significantly to the right and forms a negative exponential curve. Distributions like this cannot be analyzed adequately by standard statistical modeling algorithms (like ordinary linear regression), which assume the normal (bell-shaped) distribution shown in red on the histogram. This distribution certainly is not normal. The distribution can be modeled adequately, however, with logistic regression, which assumes a distribution like this. Thus, you can see that descriptive statistical analysis can provide valuable information to help you choose your data values and your modeling algorithm.

Recommendation: Acquire a good statistical package that can calculate the statistical metrics listed in this section. Most data mining packages will have some descriptive statistical capabilities, but you may want to go beyond those capabilities. For simple descriptive metrics, the Microsoft Excel Analysis ToolPak add-in will be sufficient.

To add the Analysis ToolPak, click on Tools, select Add-Ins, and check the Analysis ToolPak box; an option for Data Analysis will then be added to the Tools menu. Open your data set in Excel, click on the Data Analysis option in the Tools menu, and fill in the boxes for input range, output range, and descriptive statistics. For more robust descriptive tools, look at STATISTICA and SPSS.