
KNOWLEDGE DISCOVERY AND DATA MINING

In the document Data Mining (Pages 24-29)

Knowledge discovery in databases (KDD) is defined as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [17, 19]. The overall process consists of turning low-level data into high-level knowledge. The KDD process is outlined in Fig. 1.1. It is interactive and iterative involving, more or less, the following steps [20]:

6 INTRODUCTION TO DATA MINING

1. Understanding the application domain: This includes relevant prior knowledge and goals of the application.

2. Extracting the target dataset: This involves selecting a dataset, or focusing on a subset of variables, using feature ranking and selection techniques.

3. Data preprocessing: This is required to improve the quality of the actual data for mining. It also increases the mining efficiency by reducing the time required for mining the preprocessed data. Data preprocessing involves data cleaning, data transformation, data integration, data reduction or data compression for compact representation, etc.

(a) Data cleaning: It consists of some basic operations, such as normalization, noise removal and handling of missing data, reduction of redundancy, etc. Data from real-world sources are often erroneous, incomplete, and inconsistent, perhaps due to operational error or system implementation flaws. Such low-quality data needs to be cleaned prior to data mining.

(b) Data integration: Integration plays an important role in KDD. This operation includes integrating multiple, heterogeneous datasets generated from different sources.

(c) Data reduction and projection: This includes finding useful features to represent the data (depending on the goal of the task) and using dimensionality reduction, feature discretization, and feature extraction (or transformation) methods. Application of the principles of data compression can play an important role in data reduction and is a possible area of future development, particularly in the area of knowledge discovery from multimedia datasets.

4. Data mining: Data mining constitutes one or more of the following functions, namely, classification, regression, clustering, summarization, image retrieval, discovering association rules and functional dependencies, rule extraction, etc.

5. Interpretation: This includes interpreting the discovered patterns, as well as the possible (low-dimensional) visualization of the extracted patterns. Visualization is an important aid that increases understandability from the perspective of humans. One can evaluate the mined patterns automatically or semiautomatically to identify the truly interesting or useful patterns for the user.

6. Using discovered knowledge: This includes incorporating the discovered knowledge into the performance system and taking actions based on it.
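The cleaning and transformation operations of step 3 can be sketched in a few lines. The sketch below is illustrative, not prescribed by the text: mean imputation and min-max normalization are one common choice among many, and the sample values are made up.

```python
from statistics import mean

def clean(values):
    """Impute missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def normalize(values):
    """Min-max normalization: rescale values into the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

raw = [12.0, None, 8.0, 20.0, 16.0]
cleaned = clean(raw)         # missing entry replaced by the mean, 14.0
scaled = normalize(cleaned)  # 8.0 maps to 0.0, 20.0 maps to 1.0
```

In a real pipeline these steps would be followed by integration and reduction before any mining function is applied.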

In other words, given huge volumes of heterogeneous data, the objective is to efficiently extract meaningful patterns that can be of interest and hence useful to the user. The role of interestingness is to threshold or filter the large number of discovered patterns, and report only those which may be of some use. There are two approaches to designing a measure of interestingness of a pattern, namely, objective and subjective. The former uses the structure of the pattern and is generally quantitative; often it fails to capture all the complexities of the pattern discovery process. The subjective approach, on the other hand, depends additionally on the user who examines the pattern. Two major reasons why a pattern is interesting from the subjective (user-oriented) point of view are as follows [21].
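An objective, quantitative measure of interestingness can be sketched with the standard support and confidence measures from association rule mining, used here to filter rules against a threshold. The basket data and the 0.6 cutoff are illustrative assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Conditional frequency of the consequent given the antecedent."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

baskets = [{"diapers", "baby food"}, {"diapers", "baby food", "beer"},
           {"diapers"}, {"bread"}]

# Rule: diapers -> baby food
sup = support(baskets, {"diapers", "baby food"})        # 2/4 = 0.5
conf = confidence(baskets, {"diapers"}, {"baby food"})  # 0.5 / 0.75 = 2/3
interesting = conf >= 0.6   # keep only rules above a confidence threshold
```

Subjective interestingness, by contrast, cannot be computed from the pattern alone: it requires the user's expectations as an additional input.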

• Unexpectedness: When a pattern is "surprising" to the user, and thus potentially delivers new information.

• Actionability: When the user can act on it to her/his advantage to fulfill the goal.

Though both these concepts are important, it has often been observed that actionability and unexpectedness are correlated. In the literature, unexpectedness is often defined in terms of the dissimilarity of a discovered pattern from a predefined vocabulary provided by the user.

As an example, let us consider a database of student evaluations of different courses offered at some university. This can be defined as EVALUATE (TERM, YEAR, COURSE, SECTION, INSTRUCTOR, INSTRUCT-RATING, COURSE-RATING). We describe two patterns that are interesting in terms of actionability and unexpectedness, respectively. The pattern that Professor X is consistently getting the overall INSTRUCT-RATING below the overall COURSE-RATING can be of interest to the chairperson, because this shows that Professor X has room for improvement. If, on the other hand, in most of the course evaluations the overall INSTRUCT-RATING is higher than COURSE-RATING and it turns out that in most of Professor X's ratings the overall INSTRUCT-RATING is lower than COURSE-RATING, then such a pattern is unexpected and hence interesting.
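The actionable pattern about Professor X can be checked with a simple scan over the relation. The records below mirror only the fields of the EVALUATE schema needed here, and the ratings are invented for illustration.

```python
# Each record mirrors part of the EVALUATE schema; the values are made up.
evals = [
    {"INSTRUCTOR": "X", "INSTRUCT-RATING": 3.1, "COURSE-RATING": 4.0},
    {"INSTRUCTOR": "X", "INSTRUCT-RATING": 3.4, "COURSE-RATING": 3.9},
    {"INSTRUCTOR": "Y", "INSTRUCT-RATING": 4.5, "COURSE-RATING": 4.0},
]

def below_course_rating(records, instructor):
    """Fraction of an instructor's evaluations in which the instructor
    rating falls below the course rating."""
    own = [r for r in records if r["INSTRUCTOR"] == instructor]
    low = [r for r in own if r["INSTRUCT-RATING"] < r["COURSE-RATING"]]
    return len(low) / len(own)

below_course_rating(evals, "X")  # 1.0: consistently below, an actionable pattern
```

A value near 1.0 for one instructor, against a campus-wide tendency in the other direction, is exactly the unexpected pattern described above.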

Data mining is a step in the KDD process consisting of a particular enumeration of patterns over the data, subject to some computational limitations.

The term pattern goes beyond its traditional sense to include models or structures in the data. Historical data are used to discover regularities and improve future decisions [22]. The data can consist of (say) a collection of time series descriptions that can be learned to predict later events in the series.
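Learning from a historical series to predict later events can be illustrated with a moving-average predictor, one of the simplest possible choices; both the predictor and the sample series are our own illustrative assumptions, not taken from the text.

```python
def predict_next(series, window=3):
    """Predict the next value as the mean of the last `window` observations."""
    recent = series[-window:]
    return sum(recent) / len(recent)

sales = [10.0, 12.0, 11.0, 13.0, 12.0]
predict_next(sales)  # (11.0 + 13.0 + 12.0) / 3 = 12.0
```

More sophisticated sequence models (discussed under sequence analysis below) replace the fixed average with a learned model of the generating process.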

Data mining involves fitting models to or determining patterns from observed data. The fitted models play the role of inferred knowledge. Deciding whether the model reflects useful knowledge or not is a part of the overall KDD process for which subjective human judgment is usually required. Typically, a data mining algorithm constitutes some combination of the following three components.

• The model: The function of the model (e.g., classification, clustering) and its representational form (e.g., linear discriminants, decision trees). A model contains parameters that are to be determined from data for the chosen function using the particular representational form or tool.

• The preference criterion: A basis for preferring one model or set of parameters over another, depending on the given data. The criterion is usually some form of goodness-of-fit function of the model to the data, perhaps tempered by a smoothing term to avoid overfitting, that is, generating a model with too many degrees of freedom to be constrained by the given data.

• The search algorithm: The specification of an algorithm for finding particular models or patterns and parameters, given the data, model(s), and a preference criterion.
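The three components can be made concrete with a deliberately tiny instantiation, entirely of our own construction: the model is a one-dimensional threshold rule, the preference criterion is training accuracy, and the search is an exhaustive scan over candidate thresholds.

```python
def fit_threshold(xs, labels):
    """Model: predict 1 iff x >= t (a threshold rule).
    Preference criterion: training accuracy.
    Search algorithm: exhaustive scan over observed values as candidates."""
    def accuracy(t):
        preds = [1 if x >= t else 0 for x in xs]
        return sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return max(xs, key=accuracy)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
labels = [0, 0, 0, 1, 1]
t = fit_threshold(xs, labels)  # 4.0 separates the two classes perfectly
```

Swapping any one component (say, exhaustive search for gradient-based search, or accuracy for a penalized criterion) yields a different data mining algorithm over the same data.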

A particular data mining algorithm is usually an instantiation of the model-preference-search components. Some of the common model functions in current data mining practice include [13, 14]:

1. Classification: This model function classifies a data item into one of several predefined categorical classes.

2. Regression: The purpose of this model function is to map a data item to a real-valued prediction variable.

3. Clustering: This function maps a data item into one of several clusters, where clusters are natural groupings of data items based on similarity metrics or probability density models.

4. Rule generation: Here one mines or generates rules from the data. Association rule mining refers to discovering association relationships among different attributes. Dependency modeling corresponds to extracting significant dependencies among variables.

5. Summarization or condensation: This function provides a compact description for a subset of data. Data compression may play a significant role here, particularly for multimedia data, because of the advantage it offers to compactly represent the data with a reduced number of bits, thereby increasing the database storage bandwidth.

6. Sequence analysis: This models sequential patterns, such as time series and gene sequences. The goal is to model the states of the process generating the sequence, or to extract and report deviations and trends over time.
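The clustering function (item 3 above) can be sketched with a minimal one-dimensional k-means, in which clusters are natural groupings under a distance metric; the routine and the sample data are illustrative, not a reference implementation.

```python
def kmeans_1d(points, centers, iters=10):
    """Minimal 1-D k-means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iters):
        groups = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            groups[nearest].append(p)
        centers = [sum(g) / len(g) for g in groups.values() if g]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
kmeans_1d(data, [0.0, 10.0])  # centers converge near 1.0 and 9.0
```

Each of the other model functions (classification, regression, rule generation, and so on) admits an analogous small instantiation of the model-preference-search scheme.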

The rapid growth of interest in data mining [22] is due to the (i) advancement of Internet technology and wide interest in multimedia applications in this domain, (ii) falling cost of large storage devices and increasing ease of collecting data over networks, (iii) sharing and distribution of data over the network, along with the addition of new data to existing data repositories, (iv) development of robust and efficient machine learning algorithms to process this data, (v) advancement of computer architecture and falling cost of computational power, enabling use of computationally intensive methods for data analysis, (vi) inadequate scaling of conventional querying or analysis methods, prompting the need for new ways of interaction, and (vii) strong competitive pressures from available commercial products.

The notion of scalability relates to the efficient processing of large datasets, while generating from them the best possible models. The most commonly cited reason for scaling up is that increasing the size of the training set often increases the accuracy of learned classification models. In many cases, the degradation in accuracy when learning from smaller samples stems from overfitting, the presence of noise, and the existence of a large number of features. Again, scaling up to very large datasets implies that fast learning algorithms must be developed. Finally, the goal of the learning (say, classification accuracy) must not be substantially sacrificed by a scaling algorithm. The three main approaches to scaling up include [23]

• designing a fast algorithm: improving computational complexity, optimizing the search and representation, finding approximate solutions to computationally complex (NP-complete or NP-hard) problems, or taking advantage of the task's inherent parallelism;

• partitioning the data: dividing the data into subsets (based on instances or features), learning from one or more of the selected subsets, and possibly combining the results; and

• using a relational representation: addresses data that cannot feasibly be treated as a single flat file.
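The partitioning approach above can be sketched by training a learner on each subset of instances and combining the results by majority vote. The majority-class "learner" below is a deliberately trivial stand-in for a real classifier, and the data are synthetic.

```python
from collections import Counter

def chunk(data, n):
    """Split the data into n roughly equal subsets of instances."""
    k = (len(data) + n - 1) // n
    return [data[i:i + k] for i in range(0, len(data), k)]

def train_majority(subset):
    """Toy learner: always predict the subset's most common label."""
    label = Counter(y for _, y in subset).most_common(1)[0][0]
    return lambda x: label

def vote(models, x):
    """Combine the per-partition models by majority vote."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

data = [(i, "spam" if i % 3 else "ham") for i in range(12)]
models = [train_majority(s) for s in chunk(data, 3)]
vote(models, 99)  # "spam", the majority label across partitions
```

Replacing the toy learner with any real classifier turns this sketch into a basic distributed-learning scheme, at the cost of the accuracy considerations discussed above.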

Some examples of mined or discovered patterns include:

1. Classification:

(a) People with age less than 25 and salary > 40K drive sports cars.

(b) Set of images that contain a car as an object.

2. Association rules:

(a) 80% of the images containing a car as an object also contain a blue sky.

(b) 98% of people who purchase diapers also buy baby food.

3. Similar time sequences:

(a) Stocks of companies A and B perform similarly.

(b) Sale of furniture increases with the improvement of the real estate business.


4. Outlier detection:

(a) Residential customers for a telecom company, with businesses at home.

(b) Digital radiographs of lungs, with suspicious spots.
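Outlier patterns such as the telecom example above can be found with a simple statistical rule. Z-score thresholding is one common choice, used here as our own illustration; the call volumes and the threshold of 2 standard deviations are assumptions.

```python
from statistics import mean, stdev

def outliers(values, z_thresh=2.0):
    """Flag values lying more than z_thresh standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > z_thresh]

# Monthly call counts: one residential line shows business-level usage.
monthly_calls = [110, 95, 102, 98, 105, 100, 480]
outliers(monthly_calls)  # [480]
```

In practice, outlier detection in domains like radiography combines such statistical screening with domain-specific models rather than a single global threshold.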
