Data Abstraction - HANDBOOK OF STATISTICAL ANALYSIS AND DATA MINING APPLICATIONS

Data preparation for data mining may include some very complex rearrangements of your data set. For example, suppose you extract a year of monthly call records for a group of customers into an intermediate data set. This data set will have up to 12 records for each customer, which makes it a time-series data set. Analyzing time-series data directly is complex, but there is an indirect way to do it: perform a reverse pivot on your intermediate data set.

Consider the telephone usage data in Table 4.2. These records are similar to those that could be extracted from a telephone company billing system. These records constitute a time-series of up to 12 monthly billings for each customer. You will notice the first customer was active for the entire 12 months (hence 12 records). But the second customer has only 9 records because this person left the company. In many industries (including telecommunications), this loss of business is called attrition or churn. One of the authors of this book led the team that developed one of the first applications of data mining technology to the telecommunications industry, churn modeling in 1998. This model was based on Analytical Records created by reverse pivoting the time-series data set extracted from the billing systems of telephone companies.

66 4. DATA UNDERSTANDING AND PREPARATION

This data set could be analyzed by standard statistical or machine learning time-series tools. Alternatively, you can create Analytical Records from this data set by doing a reverse pivot. This operation copies data from the rows for a given customer and installs them in columns of the output record. A normal pivot does just the opposite by copying column data to separate rows. For the first 9 months, the output record of the reverse pivot for customer 1 would look like this:

Cust_ID  MOU1  MOU2  MOU3  MOU4  MOU5  MOU6  MOU7  MOU8  MOU9
1        26    91    43    74    87    99    60    99    68
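The reverse pivot described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' program: the function name `reverse_pivot` and the `(cust_id, month, mou)` row layout are hypothetical, customer 1 uses the nine MOU values from the record above, and customer 2's values are made up for the example.

```python
from collections import defaultdict


def reverse_pivot(rows, n_months):
    """Turn (cust_id, month, mou) rows into one wide record per customer.

    Months with no billing row are filled with None.
    """
    by_customer = defaultdict(dict)
    for cust_id, month, mou in rows:
        by_customer[cust_id][month] = mou

    records = []
    for cust_id in sorted(by_customer):
        record = {"Cust_ID": cust_id}
        for month in range(1, n_months + 1):
            record[f"MOU{month}"] = by_customer[cust_id].get(month)
        records.append(record)
    return records


# Customer 1: the nine MOU values from the record above.
cust1 = [26, 91, 43, 74, 87, 99, 60, 99, 68]
# Customer 2: hypothetical values, with only 7 months of billing.
cust2 = [50, 48, 40, 33, 20, 11, 4]

rows = [(1, month, mou) for month, mou in enumerate(cust1, start=1)]
rows += [(2, month, mou) for month, mou in enumerate(cust2, start=1)]

wide = reverse_pivot(rows, n_months=9)
for record in wide:
    print(record)
```

Each customer ends up as a single row with one MOU column per month, so a customer who left early simply has empty trailing columns.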

The reverse-pivoted record is suitable for analysis by standard statistical and machine learning tools. Now that we have the time-dimensional data flattened out into an Analytical Record, we can re-express the elements of this record in a manner that will show churn patterns in the data. In its present format, however, various customers could churn in any month. If we submit the current form of the Analytical Record to the modeling algorithm, there may not be enough signal to relate churn to MOU and other variables in a specific month.

TABLE 4.2 Telephone Company Billing System Data Extract. (Note: The “churn month” for customer #2 is 9, or September.)

Cust_ID  Month  MOU  Due     $Paid   $Balance
1        1      26   $21.19  $21.19  0

I. HISTORY OF PHASES OF DATA ANALYSIS, BASIC THEORY, AND THE DATA MINING PROCESS

How can we rearrange our data to intensify the signal of the churn patterns? We can take a page out of the playbook for analyzing radio signals by performing an operation analogous to signal amplification. We do this with our data set by deriving a set of temporal abstractions, in which the values in each variable are related to the churn month. Instead of analyzing the relationship of churn (in whatever month it occurred) to specific monthly values (e.g., January MOU), we relate the churn month to the MOU value for the previous month, the month before that, and so forth. Figure 4.3 illustrates the power of these temporal abstractions for clarifying the churn signal to our visual senses, and the signal will appear more clearly to the data mining algorithm as well.
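Lining records up by churn month can be sketched as follows. This is a minimal illustration, not the authors' method: it assumes each customer is a list of monthly MOU values with None after the customer has left, and it takes the churn month to be the last month with a billing record (as in the note to Table 4.2). The function name, the `MOU_T-k` column names, and all MOU values are hypothetical.

```python
def align_to_churn_month(mou_by_month, n_lags=2):
    """Re-express monthly MOU values relative to the churn month T.

    mou_by_month: MOU values for months 1..N, None where no bill exists.
    Returns None for customers who were still active in the final month.
    """
    observed = [i for i, v in enumerate(mou_by_month) if v is not None]
    if not observed or observed[-1] == len(mou_by_month) - 1:
        return None  # no data, or no churn inside the window

    last = observed[-1]  # zero-based index of the churn month
    record = {"churn_month": last + 1, "MOU_T": mou_by_month[last]}
    for k in range(1, n_lags + 1):
        idx = last - k
        # Lags that reach before month 1 are left empty.
        record[f"MOU_T-{k}"] = mou_by_month[idx] if idx >= 0 else None
    return record


# Hypothetical customers: A churns in month 5, B in month 8, C never churns.
cust_a = [80, 75, 60, 30, 5, None, None, None, None]
cust_b = [90, 88, 85, 86, 70, 40, 20, 3, None]
cust_c = [55, 60, 58, 61, 57, 59, 62, 60, 58]

aligned_a = align_to_churn_month(cust_a)
aligned_b = align_to_churn_month(cust_b)
aligned_c = align_to_churn_month(cust_c)
print(aligned_a)
print(aligned_b)
print(aligned_c)
```

In the calendar-month columns the two churners decline in different months, but once the values are expressed as T, T-1, and T-2, both show the same falling usage just before churn.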

Other Data Abstractions

In the experience of many data miners, some of the most predictive variables are those you derive yourself. The temporal abstractions discussed in the previous sections are an example of these derived variables. Other data abstractions can be created also. These abstractions can be classified into four groups (Lavrac et al., 2000):

• Qualitative abstraction: A numeric expression is mapped to a qualitative expression. For example, in an analysis of teenage customer demand, compared to that of others, customers with ages between 13 and 19 could be abstracted as a value of 1 to a variable “teenager,” while others are abstracted to a value of 0.

• Generalization abstraction: An instance of an occurrence is mapped to its class. For example, in an analysis of Asian preferences, compared to non-Asian, listings of “Chinese,” “Japanese,” and “Korean” in the Race variable could be abstracted to 1 in the Asian variable, while others are abstracted to a value of 0.

• Definitional abstraction: An instance in which one data element from one conceptual category is mapped to its counterpart in another conceptual category. For example, when combining data sets from different sources for an analysis of customer demand among African-Americans, you might want to map “Caucasian” in a demographic data set and “White Anglo-Saxon Protestant” in a sociological data set to a separate variable of “Non-black.”

• Temporal abstraction: See the preceding discussion.
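The first three abstractions in the list above amount to simple recoding rules. The sketch below is only an illustration: the function names, the sample inputs, and the choice to pass unmapped labels through unchanged in the definitional mapping are all assumptions, while the category values themselves come from the examples in the list.

```python
ASIAN_LISTINGS = {"Chinese", "Japanese", "Korean"}

# Labels from different source data sets mapped to one shared category.
DEFINITIONAL_MAP = {
    "Caucasian": "Non-black",
    "White Anglo-Saxon Protestant": "Non-black",
}


def qualitative_teenager(age):
    """Qualitative abstraction: numeric age -> 1 if teenager, else 0."""
    return 1 if 13 <= age <= 19 else 0


def generalization_asian(race):
    """Generalization abstraction: a specific listing -> its class flag."""
    return 1 if race in ASIAN_LISTINGS else 0


def definitional_group(label):
    """Definitional abstraction: map a source label to its counterpart.

    Assumption: labels without a mapping are returned unchanged.
    """
    return DEFINITIONAL_MAP.get(label, label)


# Hypothetical inputs.
print(qualitative_teenager(16), qualitative_teenager(42))
print(generalization_asian("Korean"), generalization_asian("Brazilian"))
print(definitional_group("Caucasian"))
print(definitional_group("White Anglo-Saxon Protestant"))
```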

FIGURE 4.3 Demonstration of the power of temporal abstractions for “amplifying” the churn signal to the analysis system (our eyes and brains). Panel labels: months Jan through Oct; “Slice across records” (variable = July month) shows no pattern; “Line up by churn month” (variable = T-1 month) lets the pattern emerge.


Data miners usually refer to the first three types of data abstractions as forms of recoding. The fourth type, temporal abstraction, is not commonly used. However, the methodologies of several data mining tool vendors do have forms of temporal abstractions integrated into their design.

Recommendation: If you want to model time-series data but don’t want to use time-series algorithms (e.g., ARIMA), you can create temporal abstractions and model on them. Some tools (e.g., KXEN) provide a facility for creating these variables. Sometimes, they are called lag variables because the response modeled lags behind the causes of the response. The DVD contains a Perl program you can modify; it will permit you to create temporal abstractions from your time-series data.

From the document HANDBOOK OF STATISTICAL ANALYSIS AND DATA MINING APPLICATIONS (pages 102-105)