CROSS-INDUSTRY STANDARD PROCESS FOR DATA MINING

Like data mining, web usage mining may be viewed in the context of the Cross-Industry Standard Process for Data Mining (CRISP–DM). According to CRISP–DM, a given data mining project has a life cycle consisting of six phases, as illustrated in Figure 6.1. Note that the phase sequence is adaptive. That is, the next phase in the sequence often depends on the outcomes associated with the previous phase.

The most significant dependencies between phases are indicated by the arrows. For example, suppose that we are in the modeling phase. Depending on the behavior and characteristics of the model, we may have to return to the data preparation phase for further refinement before moving forward to the model evaluation phase. The six phases are as follows.

Figure 6.1 CRISP–DM is an iterative, adaptive process.

1. Business understanding phase. The first phase in the CRISP–DM standard process may also be termed the research understanding phase.

a. Clearly enunciate the project objectives and requirements in terms of the business or research unit as a whole.

b. Translate these goals and restrictions into the formulation of a data mining problem definition.

c. Prepare a preliminary strategy for achieving these objectives.

2. Data understanding phase

a. Collect the data.

b. Use exploratory data analysis to familiarize yourself with the data, and discover initial insights.

c. Evaluate the quality of the data.

d. If desired, select interesting subsets that may contain actionable patterns.

3. Data preparation phase

a. This labor-intensive phase covers all aspects of preparing the final data set, used for all subsequent phases, from the initial raw data.

b. Select the cases and variables that you want to analyze and that are appropriate for your analysis.

c. Perform transformations on certain variables, if needed.

d. Clean the raw data so that they are ready for the modeling tools.

4. Modeling phase

a. Select and apply appropriate modeling techniques.

b. Calibrate model settings to optimize results.

c. Often, different techniques may be used for the same data mining problem.

d. Looping back to the data preparation phase may be required to bring the form of the data into line with the specific requirements of a particular data mining technique.

5. Evaluation phase

a. The modeling phase has delivered one or more models. These models must be evaluated for quality and effectiveness before we deploy them for use in the field.

b. Determine whether the model in fact achieves the objectives set for it in phase 1.

c. Establish whether some important facet of the business or research problem has not been accounted for sufficiently.

d. Finally, come to a decision regarding use of the data mining results.

6. Deployment phase

a. Model creation does not signify completion of the project. We need to make use of the models created, in accordance with the business objectives.

b. Example of a simple deployment: generate a report.

c. Example of a more complex deployment: implement a parallel data mining process in another department.

d. For businesses, the customer often carries out the deployment based on the model.
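The adaptive phase sequence can be sketched in code. The following Python sketch is illustrative only: the names `Phase`, `TRANSITIONS`, `next_phase`, and `walk`, and the `needs_more_preparation` flag, are hypothetical and not part of CRISP–DM. It encodes only the forward order of the six phases and the one backward step described in the text (from modeling back to data preparation); other dependencies shown in Figure 6.1 are not modeled.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Allowed moves between phases. Forward steps follow the listed order.
# The only backward step encoded here is modeling -> data preparation,
# the loop described in the text. This is an assumption-limited sketch,
# not a full rendering of the arrows in Figure 6.1.
TRANSITIONS = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.DEPLOYMENT},
    Phase.DEPLOYMENT: set(),
}


def next_phase(current, needs_more_preparation=False):
    """Choose the next phase given the outcome of the current one.

    `needs_more_preparation` is a hypothetical flag meaning that the
    model's behavior calls for further data refinement.
    """
    if current is Phase.MODELING and needs_more_preparation:
        return Phase.DATA_PREPARATION
    forward = [p for p in TRANSITIONS[current] if p.value == current.value + 1]
    return forward[0] if forward else None


def walk(modeling_outcomes):
    """Return the list of phases visited from start to deployment.

    `modeling_outcomes` holds one boolean per pass through the modeling
    phase: True means we must loop back to data preparation. When the
    list is exhausted, modeling is assumed to succeed.
    """
    outcomes = iter(modeling_outcomes)
    path = [Phase.BUSINESS_UNDERSTANDING]
    while path[-1] is not Phase.DEPLOYMENT:
        current = path[-1]
        redo = current is Phase.MODELING and next(outcomes, False)
        path.append(next_phase(current, redo))
    return path
```

For instance, `walk([True, False])` passes through data preparation and modeling twice before moving on to evaluation and deployment, whereas `walk([])` visits each of the six phases exactly once.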

For more on CRISP–DM, see Chapman et al. [1], Larose [2], or www.crisp-dm.org. In this section we demonstrate web usage mining through the CRISP–DM context. In this chapter we examine the types of web log data that web usage miners usually work with; this is part of the data understanding phase in CRISP–DM. In Chapter 7, we discuss data preparation for web usage mining, which is clearly part of the data preparation phase. In Chapter 8, we examine exploratory data analysis for web usage mining, which is also part of the data understanding phase. In Chapter 9, we look at several different modeling methods and briefly discuss evaluative methods; these are part of the modeling and evaluation phases.

Another framework for web usage mining is that proposed by Srivastava et al. [3]. This process consists of four phases: the input stage, the preprocessing stage, the pattern discovery stage, and the pattern analysis stage.

1. Input stage. At the input stage, three types of raw web log files are retrieved (access logs, referrer logs, and agent logs), as well as registration information (if any) and information concerning the site topology. In this chapter we discuss these data sources and become familiar with the type of web log data used in web usage mining.

2. Preprocessing stage. The raw web logs do not arrive in a format conducive to fruitful data mining. Therefore, substantial data preprocessing must be applied. The most common preprocessing tasks are (1) data cleaning and filtering, (2) de-spidering, (3) user identification, (4) session identification, and (5) path completion. In Chapter 7, we look more closely at each of these tasks.

3. Pattern discovery stage. Once these tasks have been accomplished, the web data are ready for the application of statistical and data mining methods for the purpose of discovering patterns. These methods include (1) standard statistical analysis, (2) clustering algorithms, (3) association rules, (4) classification algorithms, and (5) sequential patterns. We examine methods and models for pattern discovery in Chapters 8 and 9.

4. Pattern analysis stage. Not all of the patterns uncovered in the pattern discovery stage would be considered interesting or useful. For example, an association rule for an online movie database that found “If Page=Sound of Music then Section=Musicals” would not be useful, even with 100% confidence, since this wonderful movie is, of course, a musical. Hence, in the pattern analysis stage,
