
It is the best book on data mining so far, and I would definitely adopt it for my course. The book is very comprehensive and covers all of the topics and algorithms of which I am aware. The depth of coverage of each topic or method is exactly right and appropriate. Each algorithm is presented in pseudocode that is sufficient for any interested readers to develop a working implementation in a computer language of their choice.

-Michael H. Huhns, University of South Carolina

Discussion on distributed, parallel, and incremental algorithms is outstanding.

-Zoran Obradovic, Temple University

Margaret Dunham offers the experienced database professional or graduate level Computer Science student an introduction to the full spectrum of Data Mining concepts and algorithms. Using a database perspective throughout, Professor Dunham examines algorithms, data structures, data types, and complexity of algorithms and space. This text emphasizes the use of data mining concepts in real-world applications with large database components.

KEY FEATURES:

• Covers advanced topics such as Web Mining and Spatial/Temporal mining

• Includes succinct coverage of Data Warehousing, OLAP, Multidimensional Data, and Preprocessing

• Provides case studies

• Offers clearly written algorithms to better understand techniques

• Includes a reference on how to use Prototypes and DM products

Prentice Hall

Upper Saddle River, NJ 07458

www.prenhall.com

Introductory and Advanced Topics

MARGARET H. DUNHAM


Part Two Core Topics

4 Classification
  4.1 Introduction
    4.1.1 Issues in Classification
  4.2 Statistical-Based Algorithms
    4.2.1 Regression
    4.2.2 Bayesian Classification
  4.3 Distance-Based Algorithms
    4.3.1 Simple Approach
    4.3.2 K Nearest Neighbors
  4.4 Decision Tree-Based Algorithms
    4.4.1 ID3
    4.4.2 C4.5 and C5.0
    4.4.3 CART
    4.4.4 Scalable DT Techniques
  4.5 Neural Network-Based Algorithms
    4.5.1 Propagation
    4.5.2 NN Supervised Learning
    4.5.3 Radial Basis Function Networks
    4.5.4 Perceptrons
  4.6 Rule-Based Algorithms
    4.6.1 Generating Rules from a DT
    4.6.2 Generating Rules from a Neural Net
    4.6.3 Generating Rules Without a DT or NN
  4.7 Combining Techniques
  4.8 Summary
  4.9 Exercises
  4.10 Bibliographic Notes

5 Clustering
  5.1 Introduction
  5.2 Similarity and Distance Measures
  5.3 Outliers
  5.4 Hierarchical Algorithms
    5.4.1 Agglomerative Algorithms
    5.4.2 Divisive Clustering
  5.5 Partitional Algorithms
    5.5.1 Minimum Spanning Tree
    5.5.2 Squared Error Clustering Algorithm
    5.5.3 K-Means Clustering
    5.5.4 Nearest Neighbor Algorithm
    5.5.5 PAM Algorithm
    5.5.6 Bond Energy Algorithm
    5.5.7 Clustering with Genetic Algorithms
    5.5.8 Clustering with Neural Networks
  5.6 Clustering Large Databases
    5.6.1 BIRCH
    5.6.2 DBSCAN
    5.6.3 CURE Algorithm
  5.7 Clustering with Categorical Attributes
  5.8 Comparison
  5.9 Exercises
  5.10 Bibliographic Notes

6 Association Rules
  6.1 Introduction
  6.2 Large Itemsets
  6.3 Basic Algorithms
    6.3.1 Apriori Algorithm
    6.3.2 Sampling Algorithm
    6.3.3 Partitioning
  6.4 Parallel and Distributed Algorithms
    6.4.1 Data Parallelism
    6.4.2 Task Parallelism
  6.5 Comparing Approaches
  6.6 Incremental Rules
  6.7 Advanced Association Rule Techniques
    6.7.1 Generalized Association Rules
    6.7.2 Multiple-Level Association Rules
    6.7.3 Quantitative Association Rules
    6.7.4 Using Multiple Minimum Supports
    6.7.5 Correlation Rules
  6.8 Measuring the Quality of Rules
  6.9 Exercises
  6.10 Bibliographic Notes

Part Three Advanced Topics

7 Web Mining
  7.1 Introduction
  7.2 Web Content Mining
    7.2.1 Crawlers
    7.2.2 Harvest System
    7.2.3 Virtual Web View
    7.2.4 Personalization
  7.3 Web Structure Mining
    7.3.1 PageRank
    7.3.2 Clever
  7.4 Web Usage Mining
    7.4.1 Preprocessing
    7.4.2 Data Structures
    7.4.3 Pattern Discovery
    7.4.4 Pattern Analysis
  7.5 Exercises
  7.6 Bibliographic Notes

8 Spatial Mining
  8.1 Introduction
  8.2 Spatial Data Overview
    8.2.1 Spatial Queries
    8.2.2 Spatial Data Structures
    8.2.3 Thematic Maps
    8.2.4 Image Databases
  8.3 Spatial Data Mining Primitives
  8.4 Generalization and Specialization
    8.4.1 Progressive Refinement
    8.4.2 Generalization
    8.4.3 Nearest Neighbor
    8.4.4 STING
  8.5 Spatial Rules
    8.5.1 Spatial Association Rules
  8.6 Spatial Classification Algorithm
    8.6.1 ID3 Extension
    8.6.2 Spatial Decision Tree
  8.7 Spatial Clustering Algorithms
    8.7.1 CLARANS Extensions
    8.7.2 SD(CLARANS)
    8.7.3 DBCLASD
    8.7.4 BANG
    8.7.5 WaveCluster
    8.7.6 Approximation
  8.8 Exercises
  8.9 Bibliographic Notes

9 Temporal Mining
  9.1 Introduction
  9.2 Modeling Temporal Events
  9.3 Time Series
    9.3.1 Time Series Analysis
    9.3.2 Trend Analysis
    9.3.3 Transformation
    9.3.4 Similarity
    9.3.5 Prediction
  9.4 Pattern Detection
    9.4.1 String Matching
  9.5 Sequences
    9.5.1 AprioriAll
    9.5.2 SPADE
    9.5.3 Generalization
    9.5.4 Feature Extraction
  9.6 Temporal Association Rules
    9.6.1 Intertransaction Rules
    9.6.2 Episode Rules
    9.6.3 Trend Dependencies
    9.6.4 Sequence Association Rules
    9.6.5 Calendric Association Rules
  9.7 Exercises
  9.8 Bibliographic Notes

APPENDICES

A Data Mining Products
  A.1 Bibliographic Notes

B Bibliography

Index

About the Author

Preface

Data doubles about every year, but useful information seems to be decreasing. The area of data mining has arisen over the last decade to address this problem. It has become not only an important research area, but also one with large potential in the real world.

Current business users of data mining products achieve millions of dollars a year in savings by using data mining techniques to reduce the cost of day-to-day business operations. Data mining techniques are proving to be extremely useful in detecting and predicting terrorism.

The purpose of this book is to introduce the reader to various data mining concepts and algorithms. The book is concise yet thorough in its coverage of the many data mining topics. Clearly written algorithms with accompanying pseudocode are used to describe approaches. A database perspective is used throughout. This means that I examine algorithms, data structures, data types, and complexity of algorithms and space.

The emphasis is on the use of data mining concepts in real-world applications with large database components.

Data mining research and practice is in a state similar to that of databases in the 1960s. At that time applications programmers had to create an entire database environment each time they wrote a program. With the development of the relational data model, query processing and optimization techniques, transaction management strategies, and ad hoc query languages (SQL) and interfaces, the current environment is drastically different. The evolution of data mining techniques may take a similar path over the next few decades, making data mining techniques easier to use and develop. The objective of this book is to help in this process.

The intended audience of this book is either the experienced database professional who wishes to learn more about data mining or graduate level computer science students who have completed at least an introductory database course. The book is meant to be used as the basis of a one-semester graduate level course covering the basic data mining concepts. It may also be used as a reference book for computer professionals and researchers.

[Figure: organization of the book. Introduction: Ch1 Introduction, Ch2 Related Concepts, Ch3 Data Mining Techniques. Core Topics: Ch4 Classification, Ch5 Clustering, Ch6 Association Rules. Advanced Topics: Ch7 Web Mining, Ch8 Spatial Mining, Ch9 Temporal Mining. Appendix: Data Mining Products.]

The book is divided into four major parts: Introduction, Core Topics, Advanced Topics, and Appendix. The introduction covers background information needed to understand the later material. In addition, it examines topics related to data mining such as OLAP, data warehousing, information retrieval, and machine learning. In the first chapter of the introduction I provide a very cursory overview of data mining and how it relates to the complete KDD process. The second chapter surveys topics related to data mining. While this is not crucial to the coverage of data mining and need not be read to understand later chapters, it provides the interested reader with an understanding and appreciation of how data mining concepts relate to other areas. To thoroughly understand and appreciate the data mining algorithms presented in subsequent chapters, it is important that the reader realize that data mining is not an isolated subject. It has its basis in many related disciplines that are equally important on their own. The third chapter in this part surveys some techniques used to implement data mining algorithms. These include statistical techniques, neural networks, and decision trees. This part of the book provides the reader with an understanding of the basic data mining concepts. It also serves as a standalone survey of the entire data mining area.

The Core Topics covered are classification, clustering, and association rules. I view these as the major data mining functions. Other data mining concepts (such as prediction, regression, and pattern matching) may be viewed as special cases of these three. In each of these chapters I concentrate on coverage of the most commonly used algorithms of each type. Our coverage includes pseudocode for these algorithms, an explanation of them and examples illustrating their use.

The advanced topics part looks at various concepts that complicate data mining applications. I concentrate on temporal data, spatial data, and Web mining. Again, algorithms and pseudocode are provided.

In the appendix, production data mining systems are surveyed. I will keep a more up-to-date list on the Web page for the book. I thank all the representatives of the various companies who helped me correct and update my descriptions of their products.

All chapters include exercises covering the material in that chapter. In addition to conventional types of exercises that either test the student's understanding of the material or require him to apply what he has learned, I also include some exercises that require implementation (coding) and research. A one-semester course would cover the core topics and one or more of the advanced ones.

ACKNOWLEDGMENTS

Many people have helped with the completion of this book. Tamer Ozsu provided initial advice and inspiration. My dear friend Bob Korfhage introduced me to much of computer science, including pattern matching and information retrieval. Bob, I think of you often.

I particularly thank my graduate students for contributing a great deal to some of the original wording and editing. Their assistance in reading and commenting on earlier drafts has been invaluable. Matt McBride helped me prepare most of the original slides, many of which are still available as a companion to the book. Yongqiao Xiao helped write much of the material in the Web mining chapter. He also meticulously reviewed an earlier draft of the book and corrected many mistakes. Le Gruenwald, Zahid Hossain, Yasemin Seydim, and Al Xiao performed much of the research behind the material on association rules. Mario Nascimento introduced me to the world of temporal databases, and I have used some of the information from his dissertation in the temporal mining chapter. Nat Ayewah has been very patient with his explanations of hidden Markov models and helped improve the wording of that section. Zhigang Li has introduced me to the complex world of time series and helped write the solutions manual. I've learned a lot, but still feel a novice in many of these areas.

The students in my CSE 8331 class (Spring 1999, Fall 2000, and Spring 2002) at SMU have had to endure a great deal. I never realized how difficult it is to clearly word algorithm descriptions and exercises until I wrote this book. I hope they learned something even though at times the continual revisions necessary were, I'm sure, frustrating. Torsten Staab wins the prize for finding and correcting the most errors. Students in my CSE 8331 class during Spring 2002 helped me prepare class notes and solutions to the exercises. I thank them for their input.

My family has been extremely supportive in this endeavor. My husband, Jim, has been (as always) understanding and patient with my odd work hours and lack of sleep.

A more patient and supportive husband could not be found. My daughter Stephanie has put up with my moodiness caused by lack of sleep. Sweetie, I hope I haven't been too short-tempered with you (ILYMMTYLM). At times I have been impatient with Kristina but you know how much I love you. My Mom, sister Martha, and brother Dave as always are there to provide support and love.

Some of the research required for this book was supported by the National Science Foundation under Grant No. IIS-9820841. I would finally like to thank the reviewers (Michael Huhns, Julia Rodger, Bob Cimikowski, Greg Speegle, Zoran Obradovic, T.Y. Lin, and James Buckly) for their many constructive comments. I tried to implement as many of these as I could.


PART ONE

INTRODUCTION


CHAPTER 1

Introduction

1.1 BASIC DATA MINING TASKS

1.2 DATA MINING VERSUS KNOWLEDGE DISCOVERY IN DATABASES

1.3 DATA MINING ISSUES

1.4 DATA MINING METRICS

1.5 SOCIAL IMPLICATIONS OF DATA MINING

1.6 DATA MINING FROM A DATABASE PERSPECTIVE

1.7 THE FUTURE

1.8 EXERCISES

1.9 BIBLIOGRAPHIC NOTES

The amount of data kept in computer files and databases is growing at a phenomenal rate. At the same time, the users of these data are expecting more sophisticated information from them. A marketing manager is no longer satisfied with a simple listing of marketing contacts, but wants detailed information about customers' past purchases as well as predictions of future purchases. Simple structured query language (SQL) queries are not adequate to support these increased demands for information. Data mining steps in to solve these needs. Data mining is often defined as finding hidden information in a database. Alternatively, it has been called exploratory data analysis, data driven discovery, and deductive learning.

Traditional database queries (Figure 1.1) access a database using a well-defined query stated in a language such as SQL. The output of the query consists of the data from the database that satisfy the query. The output is usually a subset of the database, but it may also be an extracted view or may contain aggregations. Data mining access of a database differs from this traditional access in several ways:

• Query: The query might not be well formed or precisely stated. The data miner might not even be exactly sure of what he wants to see.

• Data: The data accessed is usually a different version from that of the original operational database. The data have been cleansed and modified to better support the mining process.

• Output: The output of the data mining query probably is not a subset of the database. Instead it is the output of some analysis of the contents of the database.

The current state of the art of data mining is similar to that of database query processing in the late 1960s and early 1970s. Over the next decade there undoubtedly will be great strides in extending the state of the art with respect to data mining. We probably will see the development of "query processing" models, standards, and algorithms targeting the data mining applications. We probably will also see new data structures designed for the storage of databases being used for data mining applications. Although data mining is currently in its infancy, over the last decade we have seen a proliferation of mining algorithms, applications, and algorithmic approaches. Example 1.1 illustrates one such application.

[Figure: an SQL query is submitted to the DBMS, which accesses the database (DB) and returns the results.]

FIGURE 1.1: Database access.

EXAMPLE 1.1

Credit card companies must determine whether to authorize credit card purchases. Suppose that based on past historical information about purchases, each purchase is placed into one of four classes: (1) authorize, (2) ask for further identification before authorization, (3) do not authorize, and (4) do not authorize but contact police. The data mining functions here are twofold. First the historical data must be examined to determine how the data fit into the four classes. Then the problem is to apply this model to each new purchase. Although the second part indeed may be stated as a simple database query, the first part cannot be.

Data mining involves many different algorithms to accomplish different tasks. All of these algorithms attempt to fit a model to the data. The algorithms examine the data and determine a model that is closest to the characteristics of the data being examined.

Data mining algorithms can be characterized as consisting of three parts:

• Model: The purpose of the algorithm is to fit a model to the data.

• Preference: Some criteria must be used to fit one model over another.

• Search: All algorithms require some technique to search the data.

In Example 1.1 the data are modeled as divided into four classes. The search requires examining past data about credit card purchases and their outcome to determine what criteria should be used to define the class structure. The preference will be given to criteria that seem to fit the data best. For example, we probably would want to authorize a credit card purchase for a small amount of money with a credit card belonging to a long-standing customer. Conversely, we would not want to authorize the use of a credit card to purchase anything if the card has been reported as stolen. The search process requires that the criteria needed to fit the data to the classes be properly defined.
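The three parts can be made concrete with a toy, two-class simplification of Example 1.1. The sketch below is only an illustration under assumed data: the purchase records, the single-threshold model family, and the names `misclassified` and `fit_threshold` are hypothetical and do not come from the text. The model is "authorize when the amount is at or below a threshold," the preference is the number of misclassified historical purchases, and the search is an exhaustive scan over candidate thresholds.

```python
# Toy illustration of model, preference, and search (all data hypothetical).
# Each record: (purchase amount in dollars, True if the purchase was legitimate).
history = [(20, True), (35, True), (60, True), (80, True),
           (150, False), (90, True), (400, False), (250, False)]


def misclassified(threshold, records):
    """Preference criterion: count records the threshold model gets wrong."""
    errors = 0
    for amount, legitimate in records:
        predicted = amount <= threshold  # model: authorize small purchases
        if predicted != legitimate:
            errors += 1
    return errors


def fit_threshold(records):
    """Search: try every observed amount as a threshold, keep the best one."""
    candidates = sorted(amount for amount, _ in records)
    return min(candidates, key=lambda t: misclassified(t, records))


best = fit_threshold(history)
print(best, misclassified(best, history))
```

A realistic model would use many attributes and four classes; the single threshold is chosen here only because it keeps the three parts easy to tell apart.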

As seen in Figure 1.2, the model that is created can be either predictive or descriptive in nature. In this figure, we show under each model type some of the most common data mining tasks that use that type of model.

[Figure: data mining models divide into predictive (classification, regression, time series analysis, prediction) and descriptive (clustering, summarization, association rules, sequence discovery).]

FIGURE 1.2: Data mining models and tasks.

A predictive model makes a prediction about values of data using known results found from different data. Predictive modeling may be made based on the use of other historical data. For example, a credit card use might be refused not because of the user's own credit history, but because the current purchase is similar to earlier purchases that were subsequently found to be made with stolen cards. Example 1.1 uses predictive modeling to predict the credit risk. Predictive model data mining tasks include classification, regression, time series analysis, and prediction. Prediction may also be used to indicate a specific type of data mining function, as is explained in Section 1.1.4.

A descriptive model identifies patterns or relationships in data. Unlike the predictive model, a descriptive model serves as a way to explore the properties of the data examined, not to predict new properties. Clustering, summarization, association rules, and sequence discovery are usually viewed as descriptive in nature.

1.1 BASIC DATA MINING TASKS

In the following paragraphs we briefly explore some of the data mining functions. We follow the basic outline of tasks shown in Figure 1.2. This list is not intended to be exhaustive, but rather illustrative. Of course, these individual tasks may be combined to obtain more sophisticated data mining applications.

1.1.1 Classification

Classification maps data into predefined groups or classes. It is often referred to as supervised learning because the classes are determined before examining the data. Two examples of classification applications are determining whether to make a bank loan and identifying credit risks. Classification algorithms require that the classes be defined based on data attribute values. They often describe these classes by looking at the characteristics of data already known to belong to the classes. Pattern recognition is a type of classification where an input pattern is classified into one of several classes based on its similarity to these predefined classes. Example 1.1 illustrates a general classification problem. Example 1.2 shows a simple example of pattern recognition.
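Pattern recognition by similarity, as described above, can be sketched as a nearest-match lookup. This is a minimal sketch under stated assumptions: the stored feature vectors, the labels, the distance cutoff, and the function name `closest_pattern` are all hypothetical, and Euclidean distance is only one of many possible similarity measures.

```python
import math

# Hypothetical stored patterns: (eye distance, mouth width, head height) in cm.
known_patterns = {
    "pattern_17": (6.1, 5.0, 23.5),
    "pattern_42": (6.8, 4.2, 21.0),
}


def closest_pattern(scan, patterns, max_distance):
    """Return the label of the nearest stored pattern, or None if no stored
    pattern lies within max_distance (Euclidean distance over the features)."""
    best_label, best_dist = None, max_distance
    for label, features in patterns.items():
        dist = math.dist(scan, features)
        if dist <= best_dist:
            best_label, best_dist = label, dist
    return best_label


print(closest_pattern((6.2, 5.1, 23.4), known_patterns, max_distance=0.5))
print(closest_pattern((7.5, 6.0, 25.0), known_patterns, max_distance=0.5))
```

The cutoff matters as much as the distance: without it, every input would be assigned to some stored class, however poor the match.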

EXAMPLE 1.2

An airport security screening station is used to determine if passengers are potential terrorists or criminals. To do this, the face of each passenger is scanned and its basic pattern (distance between eyes, size and shape of mouth, shape of head, etc.) is identified. This pattern is compared to entries in a database to see if it matches any patterns that are associated with known offenders.

1.1.2 Regression

Regression is used to map a data item to a real valued prediction variable. In actuality, regression involves the learning of the function that does this mapping. Regression assumes that the target data fit into some known type of function (e.g., linear, logistic, etc.) and then determines the best function of this type that models the given data. Some type of error analysis is used to determine which function is "best." Standard linear regression, as illustrated in Example 1.3, is a simple example of regression.

EXAMPLE 1.3

A college professor wishes to reach a certain level of savings before her retirement. Periodically, she predicts what her retirement savings will be based on its current value and several past values. She uses a simple linear regression formula to predict this value by fitting past behavior to a linear function and then using this function to predict the values at points in the future. Based on these values, she then alters her investment portfolio.
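The fitting step in Example 1.3 can be sketched with ordinary least squares. The savings figures and the name `fit_line` below are hypothetical and chosen only for illustration; the example in the text does not give numbers. The error measure assumed here is the sum of squared differences between the line and the observed values.

```python
# Least-squares fit of savings = intercept + slope * year (hypothetical figures).
years = [2019, 2020, 2021, 2022, 2023]
savings = [100_000, 112_000, 124_000, 136_000, 148_000]  # dollars


def fit_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(years, savings)
predicted_2030 = intercept + slope * 2030  # extrapolate the fitted line
print(round(slope), round(predicted_2030))
```

Extrapolating this far assumes the linear trend continues, which is exactly the "known type of function" assumption that regression makes.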

1.1.3 Time Series Analysis

With time series analysis, the value of an attribute is examined as it varies over time. The values usually are obtained as evenly spaced time points (daily, weekly, hourly, etc.). A time series plot (Figure 1.3) is used to visualize the time series. In this figure you can easily see that the plots for Y and Z have similar behavior, while X appears to have less volatility. There are three basic functions performed in time series analysis. In one case, distance measures are used to determine the similarity between different time series. In the second case, the structure of the line is examined to determine (and perhaps classify) its behavior. A third application would be to use the historical time series plot to predict future values. A time series example is given in Example 1.4.
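The first two functions can be sketched on small made-up series. The prices below and the helper names `euclidean` and `volatility` are assumptions for illustration; Euclidean distance and the standard deviation of day-to-day changes are common choices, not measures prescribed by the text.

```python
import math
import statistics

# Hypothetical daily closing prices for three stocks.
x = [50.0, 50.5, 51.0, 51.5, 52.0, 52.5]
y = [30.0, 33.0, 29.0, 34.0, 28.0, 35.0]
z = [30.5, 32.5, 29.5, 33.5, 28.5, 34.5]


def euclidean(a, b):
    """Distance between two equal-length series; smaller means more similar."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def volatility(series):
    """Standard deviation of day-to-day changes, one simple volatility proxy."""
    changes = [b - a for a, b in zip(series, series[1:])]
    return statistics.pstdev(changes)


print(euclidean(y, z) < euclidean(x, y))  # is Y closer to Z than to X?
print(volatility(x) < volatility(y))      # is X less volatile than Y?
```

Note that raw Euclidean distance also reflects price level, so series are often normalized before comparing their shapes.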

EXAMPLE 1.4

Mr. Smith is trying to determine whether to purchase stock from Companies X, Y, or Z. For a period of one month he charts the daily stock price for each company. Figure 1.3 shows the time series plot that Mr. Smith has generated. Using this and similar information available from his stockbroker, Mr. Smith decides to purchase stock X because it is less volatile while overall showing a slightly larger relative amount of growth than either of the other stocks. As a matter of fact, the stocks for Y and Z have a similar behavior. The behavior of Y between days 6 and 20 is identical to that for Z between days 13 and 27.

[Figure: daily stock prices for companies X, Y, and Z plotted over one month.]

FIGURE 1.3: Time series plots.

1.1.4 Prediction

Many real-world data mining applications can be seen as predicting future data states based on past and current data. Prediction can be viewed as a type of classification. (Note: This is a data mining task that is different from the prediction model, although the prediction task is a type of prediction model.) The difference is that prediction is predicting a future state rather than a current state. Here we are referring to a type of application rather than to a type of data mining modeling approach, as discussed earlier. Prediction applications include flooding, speech recognition, machine learning, and pattern recognition. Although future values may be predicted using time series analysis or regression techniques, other approaches may be used as well. Example 1.5 illustrates the process.

EXAMPLE 1.5

Predicting flooding is a difficult problem. One approach uses monitors placed at various points in the river. These monitors collect data relevant to flood prediction: water level, rain amount, time, humidity, and so on. Then the water level at a potential flooding point in the river can be predicted based on the data collected by the sensors upriver from this point. The prediction must be made with respect to the time the data were collected.

1.1.5 Clustering

Clustering is similar to classification except that the groups are not predefined, but rather defined by the data alone. Clustering is alternatively referred to as unsupervised learning or segmentation. It can be thought of as partitioning or segmenting the data into groups that might or might not be disjoint. The clustering is usually accomplished by determining the similarity among the data on predefined attributes. The most similar data are grouped into clusters. Example 1.6 provides a simple clustering example. Since the clusters are not predefined, a domain expert is often required to interpret the meaning of the created clusters.


EXAMPLE 1.6

A certain national department store chain creates special catalogs targeted to various demographic groups based on attributes such as income, location, and physical characteristics of potential customers (age, height, weight, etc.). To determine the target mailings of the various catalogs and to assist in the creation of new, more specific catalogs, the company performs a clustering of potential customers based on the determined attribute values. The results of the clustering exercise are then used by management to create special catalogs and distribute them to the correct target population based on the cluster for that catalog.
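A clustering like the one in Example 1.6 could be sketched with k-means, which is one possible algorithm and not one the example itself names. The customer records, the choice of two attributes, and the function name `k_means` are hypothetical, and the initial centers are simply the first k points so that the run is deterministic.

```python
import math

# Hypothetical customers: (age in years, income in thousands of dollars).
customers = [(22, 25), (25, 30), (24, 28), (47, 90), (52, 95), (50, 88)]


def k_means(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest center, then move each
    center to the mean of its points. Initial centers are the first k points,
    a simplification chosen to keep the run deterministic."""
    centers = list(points[:k])
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters


clusters = k_means(customers, k=2)
for cluster in clusters:
    print(cluster)
```

The algorithm returns groups without names; deciding that one group means "young, lower income" is the interpretation step left to a domain expert.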

A special type of clustering is called segmentation. With segmentation a database is partitioned into disjoint groupings of similar tuples called segments. Segmentation is often viewed as being identical to clustering. In other circles segmentation is viewed as a specific type of clustering applied to a database itself. In this text we use the two terms, clustering and segmentation, interchangeably.

1.1.6 Summarization

Summarization maps data into subsets with associated simple descriptions. Summarization is also called characterization or generalization. It extracts or derives representative information about the database. This may be accomplished by actually retrieving portions of the data. Alternatively, summary type information (such as the mean of some numeric attribute) can be derived from the data. The summarization succinctly characterizes the contents of the database. Example 1.7 illustrates this process.

EXAMPLE 1.7

One of the many criteria used to compare universities by the U.S. News & World Report is the average SAT or ACT score [GM99]. This is a summarization used to estimate the type and intellectual level of the student body.
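Deriving such a summary value is a one-line computation. The scores and university names below are invented for illustration only.

```python
import statistics

# Hypothetical SAT scores of admitted students, keyed by university.
scores = {
    "University A": [1180, 1250, 1320, 1210],
    "University B": [1390, 1450, 1410],
}

# Summarization: replace each list of raw scores with one descriptive value.
summary = {name: statistics.mean(values) for name, values in scores.items()}
print(summary)
```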

1.1.7 Association Rules

Link analysis, alternatively referred to as affinity analysis or association, refers to the data mining task of uncovering relationships among data. The best example of this type of application is to determine association rules. An association rule is a model that identifies specific types of data associations. These associations are often used in the retail sales community to identify items that are frequently purchased together. Example 1.8 illustrates the use of association rules in market basket analysis. Here the data analyzed consist of information about what items a customer purchases. Associations are also used in many other applications such as predicting the failure of telecommunication switches.

EXAMPLE 1.8

A grocery store retailer is trying to decide whether to put bread on sale. To help determine the impact of this decision, the retailer generates association rules that show what other products are frequently purchased with bread. He finds that 60% of the time that bread is sold so are pretzels and that 70% of the time jelly is also sold. Based on these facts, he tries to capitalize on the association between bread, pretzels, and jelly by placing some pretzels and jelly at the end of the aisle where the bread is placed. In addition, he decides not to place either of these items on sale at the same time.

Users of association rules must be cautioned that these are not causal relationships. They do not represent any relationship inherent in the actual data (as is true with functional dependencies) or in the real world. There probably is no relationship between bread and pretzels that causes them to be purchased together. And there is no guarantee that this association will apply in the future. However, association rules can be used to assist retail store management in effective advertising, marketing, and inventory control.
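Percentages like those in Example 1.8 are what is commonly called the confidence of a rule: of the transactions that contain one itemset, the fraction that also contain another. The sketch below computes it over a handful of invented baskets; the data and the function name `confidence` are assumptions, and the resulting figures are not meant to reproduce the 60% and 70% of the example.

```python
# Hypothetical market baskets; each set is one customer transaction.
baskets = [
    {"bread", "jelly", "pretzels"},
    {"bread", "jelly"},
    {"bread", "pretzels", "milk"},
    {"bread", "jelly", "pretzels"},
    {"milk", "eggs"},
    {"bread", "jelly"},
]


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing `antecedent` that also contain
    `consequent` (the confidence of the rule antecedent => consequent)."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    both = [t for t in with_antecedent if consequent <= t]
    return len(both) / len(with_antecedent)


print(confidence({"bread"}, {"pretzels"}, baskets))
print(confidence({"bread"}, {"jelly"}, baskets))
```

The calculation only counts co-occurrence in past transactions, which is why a high value says nothing about cause or about future behavior.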

1.1.8 Sequence Discovery

Sequential analysis or sequence discovery is used to determine sequential patterns in data. These patterns are based on a time sequence of actions. These patterns are similar to associations in that data (or events) are found to be related, but the relationship is based on time. Unlike a market basket analysis, which requires the items to be purchased at the same time, in sequence discovery the items are purchased over time in some order. Example 1.9 illustrates the discovery of some simple patterns. A similar type of discovery can be seen in the sequence within which items are purchased. For example, most people who purchase CD players may be found to purchase CDs within one week. As we will see, temporal association rules really fall into this category.

EXAMPLE 1.9

The Webmaster at the XYZ Corp. periodically analyzes the Web log data to determine how users of the XYZ's Web pages access them. He is interested in determining what sequences of pages are frequently accessed. He determines that 70 percent of the users of page A follow one of the following patterns of behavior: (A, B, C) or (A, D, B, C) or (A, E, B, C). He then determines to add a link directly from page A to page C.
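The Webmaster's measurement can be sketched as counting sessions that contain one of the page sequences. The session data below are invented so that the fraction comes out at 70 percent, and the helper name `contains_run` is hypothetical; real Web logs would first need to be split into per-user sessions.

```python
# Hypothetical Web-log sessions: the ordered pages each visitor requested.
sessions = [
    ["A", "B", "C"],
    ["A", "D", "B", "C"],
    ["A", "E", "B", "C"],
    ["A", "F"],
    ["A", "B", "C"],
    ["G", "A", "D"],
    ["A", "D", "B", "C"],
    ["A", "B"],
    ["A", "E", "B", "C"],
    ["A", "B", "C"],
    ["H", "B"],
]
patterns = [["A", "B", "C"], ["A", "D", "B", "C"], ["A", "E", "B", "C"]]


def contains_run(session, pattern):
    """True if `pattern` occurs in `session` as consecutive page requests."""
    n = len(pattern)
    return any(session[i:i + n] == pattern for i in range(len(session) - n + 1))


visits_a = [s for s in sessions if "A" in s]
matching = [s for s in visits_a if any(contains_run(s, p) for p in patterns)]
print(len(matching) / len(visits_a))
```

Unlike the association-rule count, order matters here: a session that visits C before B does not match any of the three patterns.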

1.2 DATA MINING VERSUS KNOWLEDGE DISCOVERY IN DATABASES

The terms knowledge discovery in databases (KDD) and data mining are often used interchangeably. In fact, there have been many other names given to this process of discovering useful (hidden) patterns in data: knowledge extraction, information discovery, exploratory data analysis, information harvesting, and unsupervised pattern recognition.

Over the last few years KDD has been used to refer to a process consisting of many steps, while data mining is only one of these steps. This is the approach taken in this book. The following definitions are modified from those found in [FPSS96c, FPSS96a].

DEFINITION 1.1. Knowledge discovery in databases (KDD) is the process of finding useful information and patterns in data.

DEFINITION 1.2. Data mining is the use of algorithms to extract the information and patterns derived by the KDD process.

The KDD process is often said to be nontrivial; however, we take the larger view that KDD is an all-encompassing concept. A traditional SQL database query can be viewed as the data mining part of a KDD process. Indeed, this may be viewed as somewhat simple and trivial. However, this was not the case 30 years ago. If we were to advance 30 years into the future, we might find that processes thought of today as nontrivial and complex will be viewed as equally simple. The definition of KDD includes the keyword useful. Although some definitions have included the term "potentially useful," we believe that if the information found in the process is not useful, then it really is not information. Of course, the idea of being useful is relative and depends on the individuals involved.

KDD is a process that involves many different steps. The input to this process is the data, and the output is the useful information desired by the users. However, the objective may be unclear or inexact. The process itself is interactive and may require much elapsed time. To ensure the usefulness and accuracy of the results of the process, interaction throughout the process with both domain experts and technical experts might be needed. Figure 1.4 (modified from [FPSS96c]) illustrates the overall KDD process.

The KDD process consists of the following five steps [FPSS96c]:

• Selection: The data needed for the data mining process may be obtained from many different and heterogeneous data sources. This first step obtains the data from various databases, files, and nonelectronic sources.

• Preprocessing: The data to be used by the process may have incorrect or missing data. There may be anomalous data from multiple sources involving different data types and metrics. There may be many different activities performed at this time. Erroneous data may be corrected or removed, whereas missing data must be supplied or predicted (often using data mining tools).

• Transformation: Data from different sources must be converted into a common format for processing. Some data may be encoded or transformed into more usable formats. Data reduction may be used to reduce the number of possible data values being considered.

• Data mining: Based on the data mining task being performed, this step applies algorithms to the transformed data to generate the desired results.

• Interpretation/evaluation: How the data mining results are presented to the users is extremely important because the usefulness of the results is dependent on it. Various visualization and GUI strategies are used at this last step.
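The five steps above can be sketched as a chain of functions, each consuming the output of the previous one. This is a toy illustration under stated assumptions, not a prescribed implementation: the records, field names, and function names are all hypothetical, missing ages are filled with the mean of the known ages (one possible way to "supply" a value), and the "model" is nothing more than an average per group.

```python
from statistics import mean

# Hypothetical records from two sources; field names and values are invented.
source_a = [{"id": 1, "age": 34, "spend": 120.0}, {"id": 2, "age": None, "spend": 80.0}]
source_b = [{"id": 3, "age": 51, "spend": 300.0}, {"id": 4, "age": 29, "spend": -5.0}]


def select(*sources):
    """Selection: gather records from several sources into one target data set."""
    return [dict(row) for source in sources for row in source]


def preprocess(rows):
    """Preprocessing: remove erroneous rows, then fill missing ages with the mean."""
    valid = [r for r in rows if r["spend"] >= 0]  # negative spend treated as an error
    fill = mean(r["age"] for r in valid if r["age"] is not None)
    return [{**r, "age": fill if r["age"] is None else r["age"]} for r in valid]


def transform(rows):
    """Transformation: reduce age to a coarse range label (data reduction)."""
    return [{**r, "age_group": "under 40" if r["age"] < 40 else "40 and over"} for r in rows]


def mine(rows):
    """Data mining: a deliberately trivial model, average spend per age group."""
    groups = {}
    for r in rows:
        groups.setdefault(r["age_group"], []).append(r["spend"])
    return {group: mean(values) for group, values in groups.items()}


def interpret(model):
    """Interpretation/evaluation: present the result in a readable form."""
    return [f"{group}: average spend {avg:.2f}" for group, avg in sorted(model.items())]


report = interpret(mine(transform(preprocess(select(source_a, source_b)))))
for line in report:
    print(line)
```

In practice the steps are rarely run once in a straight line. Because the process is interactive, the output of a later step often sends the analyst back to adjust an earlier one.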

[Figure: Initial data -> Selection -> Target data -> Preprocessing -> Preprocessed data -> Transformation -> Transformed data -> Data mining -> Model -> Interpretation -> Knowledge]

FIGURE 1.4: KDD process (modified from [FPSS96c]).

Transformation techniques are used to make the data easier to mine and more useful, and to provide more meaningful results. The actual distribution of the data may be modified to facilitate use by techniques that require specific types of data distributions. Some attribute values may be combined to provide new values, thus reducing the complexity of the data. For example, current date and birth date could be replaced by age. One attribute could be substituted for another. An example would be replacing a sequence of actual attribute values with the differences between consecutive values. Real-valued attributes may be more easily handled by partitioning the values into ranges and using these discrete range values. Some data values may actually be removed; outliers, extreme values that occur infrequently, are one example. The data may be transformed by applying a function to the values. A common transformation function is to use the log of the value rather than the value itself. These techniques make the mining task easier by reducing the dimensionality (number of attributes) or by reducing the variability of the data values. The removal of outliers can actually improve the quality of the results. As with all steps in the KDD process, however, care must be used in performing transformation. If used incorrectly, the transformation could actually change the data such that the results of the data mining step are inaccurate.
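Each of the transformations just described amounts to a small function applied to attribute values. The sketch below shows one possible form of each; the function names, the sample values, the choice of base-10 logarithm, and the outlier cutoffs are all assumptions made for illustration.

```python
import math
from datetime import date


def age_in_years(birth, today):
    """Combine two attributes (current date, birth date) into one: age."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


def differences(values):
    """Replace a sequence of values by the differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


def to_range(value, boundaries):
    """Discretize a real-valued attribute: index of the range that holds `value`."""
    return sum(1 for b in boundaries if value >= b)


def log_transform(values):
    """Use the log (base 10 here) of each positive value instead of the value."""
    return [math.log10(v) for v in values]


def drop_outliers(values, low, high):
    """Remove values outside [low, high]; the cutoffs are chosen by the analyst."""
    return [v for v in values if low <= v <= high]


print(age_in_years(date(1990, 6, 15), date(2024, 6, 14)))  # 33 (birthday not yet reached)
print(differences([10, 12, 15, 15]))                        # [2, 3, 0]
print(to_range(37.5, [18, 40, 65]))                         # 1 (the range from 18 up to 40)
print(log_transform([1, 10, 1000]))
print(drop_outliers([3, 4, 5, 250], 0, 100))                # [3, 4, 5]
```

The outlier filter illustrates the caution in the text: a cutoff chosen carelessly would silently discard legitimate values, and the mining step would then operate on data that no longer reflect the source.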

Visualization refers to the visual presentation of data. The old expression "a picture is worth a thousand words" certainly is true when examining the structure of data. For example, a line graph that shows the distribution of a data variable is easier to understand and perhaps more informative than the formula for the corresponding distribution. The use of visualization techniques allows users to summarize, extract, and grasp more complex results than more mathematical or text type descriptions of the results. Visualization techniques include:

• Graphical: Traditional graph structures including bar charts, pie charts, histograms, and line graphs may be used.

• Geometric: Geometric techniques include the box plot and scatter diagram techniques.

• Icon-based: Using figures, colors, or other icons can improve the presentation of the results.

• Pixel-based: With these techniques each data value is shown as a uniquely colored pixel.

• Hierarchical: These techniques hierarchically divide the display area (screen) into regions based on data values.

• Hybrid: The preceding approaches can be combined into one display.

Any of these approaches may be two-dimensional or three-dimensional. Visualization tools can be used to summarize data as a data mining technique itself. In addition, visualization can be used to show the complex results of data mining tasks.
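As a small illustration of the graphical category, a histogram can summarize the distribution of a single variable even in plain text. The following sketch assumes non-negative integer values and equal-width bins; the sample ages and the function name `text_histogram` are invented for the example.

```python
from collections import Counter


def text_histogram(values, bin_width):
    """Return one line per bin, with one '#' for each value that falls in the bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    # Walk every bin between the smallest and largest so that empty bins still appear.
    for start in range(min(counts), max(counts) + bin_width, bin_width):
        lines.append(f"{start}-{start + bin_width - 1}: {'#' * counts[start]}")
    return lines


ages = [23, 25, 31, 34, 35, 38, 41, 47, 52]  # invented sample
lines = text_histogram(ages, 10)
for line in lines:
    print(line)
```

Even this crude display shows at a glance where the values concentrate, which is the point made above about a picture of a distribution being easier to grasp than its formula.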

The data mining process itself is complex. As we will see in later chapters, there are many different data mining applications and algorithms. These algorithms must be carefully applied to be effective. Discovered patterns must be correctly interpreted and properly evaluated to ensure that the resulting information is meaningful and accurate.