
Enterprise Data Mining: A Review and Research Directions

2. The Basics of Data Mining and Knowledge Discovery

2.2 Data Mining Algorithms/Methodologies

Methodologies are necessary in most steps of the data mining and knowledge discovery process. Numerous algorithms/techniques have been developed for both descriptive mining and predictive mining.

Descriptive mining relies heavily on descriptive statistics, the data cube (or OLAP, short for On-Line Analytical Processing) approach, and the attribute-oriented induction approach.

The OLAP approach provides a number of operators, such as roll-up, drill-down, slice and dice, and rotate, in a user-friendly environment for interactive querying and analysis of the data stored in a multi-dimensional database. The attribute-oriented induction approach is a relational database query-oriented, generalization-based, on-line data analysis technique. The general idea is first to collect the task-relevant data using a relational database query and then to perform generalization, either by attribute removal or by attribute aggregation, based on the number of distinct values of each attribute in the relevant set of data.
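To make the OLAP operators concrete, the following is a minimal sketch, in Python with pandas, of how roll-up, drill-down, slice, and dice might be expressed against a small illustrative sales table; the dimension names, measure, and data are assumptions made purely for the example, not taken from any system described here.

```python
import pandas as pd

# Illustrative fact table: (region, city, quarter) dimensions with a sales measure.
sales = pd.DataFrame({
    "region":  ["East", "East", "East", "West", "West", "West"],
    "city":    ["Boston", "Boston", "NYC", "LA", "LA", "Seattle"],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2", "Q2"],
    "sales":   [100, 120, 90, 150, 130, 80],
})

# Roll-up: aggregate from the city level up to the region level.
rollup = sales.groupby("region", as_index=False)["sales"].sum()

# Drill-down: move back to a finer granularity (region, city, and quarter).
drilldown = sales.groupby(["region", "city", "quarter"], as_index=False)["sales"].sum()

# Slice: fix one dimension value (quarter == "Q1").
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select a sub-cube on several dimensions at once.
dice = sales[(sales["region"] == "East") & (sales["quarter"].isin(["Q1", "Q2"]))]

print(rollup, drilldown, slice_q1, dice, sep="\n\n")
```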

On the other hand, the purpose of predictive mining is to find useful patterns in the data in order to make nontrivial predictions on new data. Predictive mining techniques fall into two major categories: those that express the mined results as a black box, whose innards are effectively incomprehensible to non-experts, and those that represent the mined results as a transparent box, whose construction reveals the structure of the pattern. Neural networks are the major technique in the former category.
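The distinction can be illustrated with a small sketch using scikit-learn: a fitted decision tree (a transparent box) can be printed as readable rules, whereas a fitted neural network (a black box) exposes only weight matrices. The dataset and model settings below are arbitrary choices for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

# "Transparent box": the fitted tree can be printed as human-readable rules.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))

# "Black box": the fitted network is a set of weight matrices that reveal
# little about the structure of the learned pattern.
mlp = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0).fit(X, y)
print([w.shape for w in mlp.coefs_])
```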

Focusing on the latter, the book by Witten and Frank (2000) includes methods for constructing decision trees, classification rules, association rules, clusters, and instance-based learning; the book edited by Triantaphyllou and Felici (2006) covers many rule induction techniques; and the recent monograph by Triantaphyllou (2007) is devoted mostly to the learning of Boolean functions.

Hand et al. (2001) discussed regression models with linear structures, piecewise linear spline/tree models that represent a complex global model for nonlinear phenomena by simple local linear components, and nonparametric kernel models. The spline/tree models replace the data points by a function estimated from a neighborhood of data points. Kernel methods and nearest neighbor methods are alternative local modeling methods that do not replace the data by a function, but retain the data points and defer estimation of the predicted value until a prediction is actually required. Kernel methods define the degree of smoothing in terms of a kernel function and a bandwidth, whereas nearest neighbor methods let the data determine the bandwidth by defining it in terms of the number of nearest neighbors. Two major weaknesses of local methods are that they scale poorly to high-dimensional data and that the models they build lack interpretability.
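As an illustration of this difference in how the bandwidth is set, the following NumPy sketch contrasts a Gaussian kernel smoother, whose smoothing is fixed by a bandwidth parameter, with a k-nearest-neighbor estimate, whose local bandwidth is set by the data through k; the one-dimensional toy data and parameter values are assumptions made for the example.

```python
import numpy as np

def kernel_predict(x0, X, y, bandwidth=0.5):
    """Nadaraya-Watson estimate at x0: smoothing is controlled by a fixed bandwidth."""
    w = np.exp(-0.5 * ((X - x0) / bandwidth) ** 2)   # Gaussian kernel weights
    return np.sum(w * y) / np.sum(w)

def knn_predict(x0, X, y, k=5):
    """k-nearest-neighbor estimate at x0: the data set the local bandwidth via k."""
    idx = np.argsort(np.abs(X - x0))[:k]
    return y[idx].mean()

# Toy one-dimensional data; estimation is deferred until a prediction is requested.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 200)
y = np.sin(X) + rng.normal(scale=0.2, size=X.size)

x0 = 3.0
print(kernel_predict(x0, X, y), knn_predict(x0, X, y), np.sin(x0))
```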

Soft computing methodologies such as fuzzy sets, neural networks, genetic algorithms, rough sets, and hybrids of these are often used in the data mining step of the overall knowledge discovery process. This consortium of methodologies works synergistically and provides, in one form or another, flexible information processing capability for handling real-life ambiguous situations. Mitra et al. (2002) surveyed the available literature on using soft computing methodologies for data mining, not necessarily related to enterprise systems.

Support vector machines (SVMs), originally designed for binary classification (Cortes and Vapnik, 1995) and later extended to multi-class classification (Hsu and Lin, 2002), have gained wide acceptance for many classification and pattern recognition problems due to their high generalization ability (Burges, 1998). SVMs are, however, known to be very sensitive to outliers and noise. Hence, Huang and Liu (2002) proposed a fuzzy support vector machine to address the problem. The central concept of their fuzzy SVM is not to treat every data point equally, but to assign each data point a membership value in accordance with its relative importance in its class.
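The general weighting idea behind a fuzzy SVM can be sketched as follows, assuming a simple heuristic membership function based on distance from the class centroid and using scikit-learn's per-sample weights; this is only an illustrative approximation of the concept, not the specific formulation of Huang and Liu (2002).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class data; a large cluster spread produces overlapping, noisy points.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

# Heuristic membership: points far from their class centroid get lower weight,
# so outliers and noisy points contribute less to the decision boundary.
memberships = np.empty(len(X))
for label in np.unique(y):
    mask = y == label
    d = np.linalg.norm(X[mask] - X[mask].mean(axis=0), axis=1)
    memberships[mask] = 1.0 - d / (d.max() + 1e-9)

plain_svm = SVC(kernel="rbf").fit(X, y)
fuzzy_like_svm = SVC(kernel="rbf").fit(X, y, sample_weight=memberships)

print(plain_svm.n_support_, fuzzy_like_svm.n_support_)
```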

A high degree of interactivity is often desirable, especially in the initial exploratory phase of the data mining and knowledge discovery process. This emphasis calls for the visualization of data as well as the analytical results. Visual exploration techniques are thus indispensable in conjunction with automatic data mining techniques. Oliveira and Levkowitz (2003) surveyed past studies on the different uses of graphical mapping and interaction techniques for visual data mining of large datasets represented as table data.

Keim (2002) proposed a classification of information visualization and visual data mining techniques based on the data type to be visualized, the visualization technique, and the interaction and distortion technique.

The data type to be visualized may be one-dimensional data, two-dimensional data, multidimensional data, text or hypertext, hierarchies or graphs, or algorithms and software. The visualization techniques used may be classified into standard 2D/3D displays, geometrically transformed displays, icon-based displays, dense pixel displays, and stacked displays.

The interaction and distortion techniques may be classified into interactive projection, interactive filtering, interactive distortion, and interactive linking and brushing.

One of the challenges to effective data mining is how to handle vast volumes of data. One solution is to reduce the data before mining. Data reduction can be achieved in many ways: by feature (or attribute) selection, by discretizing continuous feature values, and by selecting instances. Feature selection is the process of identifying and removing as much irrelevant and redundant information as possible. Feature selection is important because the inclusion of irrelevant, redundant, and noisy attributes in the model-building process can result in poor predictive performance as well as increased computation. Hall and Holmes (2003) presented a benchmark comparison of six attribute selection methods for supervised classification using 15 standard machine learning datasets from the widely known UCI collection (http://kdd.ics.uci.edu/). Since some attribute selection methods only operate on discrete-valued features, numeric-valued features must be discretized first, using a method such as the one developed by Fayyad and Irani (1993). Liu and Motoda (2001) edited a book on instance selection and construction, which includes a set of techniques that reduce the quantity of data by selecting a subset of the data and/or constructing a reduced set of data that resembles the original data. Jankowski and Grochowski (2004a, b) compared several strategies to shrink a training dataset using different neural and machine learning classification algorithms. In their study, nearly all tests were performed on databases included in the UCI collection (Merz and Murphy, 1998).
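The three data-reduction routes mentioned above can be sketched together as follows, using scikit-learn utilities as generic stand-ins (mutual-information feature selection, quantile binning in place of the entropy-based discretization of Fayyad and Irani, and random sampling as a crude form of instance selection); the dataset and parameter choices are illustrative assumptions, not those of the cited studies.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_breast_cancer(return_X_y=True)

# 1. Feature selection: keep the 10 attributes most informative about the class.
X_fs = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

# 2. Discretization: map numeric features onto a small number of intervals
#    (a simple stand-in for entropy-based methods such as Fayyad and Irani's).
X_disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile").fit_transform(X_fs)

# 3. Instance selection: retain a random subset of rows as a reduced dataset.
rng = np.random.default_rng(0)
keep = rng.choice(len(X_disc), size=len(X_disc) // 4, replace=False)
X_small, y_small = X_disc[keep], y[keep]

print(X.shape, "->", X_small.shape)
```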

Scaling up data mining algorithms to run in high-performance parallel and distributed computing environments offers an alternative solution for effective data mining. Darlington et al. (1997) presented preliminary results on their experiments in parallelizing C4.5, a classification-rule learning system that represents the learned knowledge as decision trees. Pizzuti and Talia (2003) described the design and implementation of P-AutoClass, a parallel version of the AutoClass system based upon the Bayesian model for determining optimal classes in large datasets. Anglano et al. (1999) presented G-Net, a distributed algorithm able to infer classifiers from pre-collected data. G-Net was implemented on Networks of Workstations (NOWs) and incorporated a set of dynamic load distribution techniques to profitably exploit the available computing power. Hall et al. (2000) discussed a three-step approach for generating rules in parallel: first creating disjoint subsets of a large training set, then allowing rules to be created on each subset, and finally merging the rules. An empirical study showed that good performance could be achieved, although it might degrade as the number of processors increased. Johnson and Kargupta (2000) presented the Collective Hierarchical Clustering (CHC) algorithm for analyzing distributed, heterogeneous data.
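A minimal sketch of the general shape of such a three-step parallel approach is given below, assuming Python's multiprocessing pool, per-subset decision trees, and majority voting in place of true rule merging; the dataset, worker count, and combination strategy are illustrative assumptions rather than the scheme of Hall et al. (2000).

```python
from multiprocessing import Pool

import numpy as np
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier

def train_on_subset(args):
    """Step 2: learn a classifier on one disjoint subset of the training data."""
    X_part, y_part = args
    return DecisionTreeClassifier(random_state=0).fit(X_part, y_part)

if __name__ == "__main__":
    X, y = load_digits(return_X_y=True)

    # Step 1: split the training set into disjoint subsets, one per worker.
    n_workers = 4
    parts = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))

    # Step 2: train on the subsets in parallel processes.
    with Pool(n_workers) as pool:
        models = pool.map(train_on_subset, parts)

    # Step 3: combine the per-subset learners; here by majority vote rather than
    # merging symbolic rules, which keeps the sketch short.
    votes = np.stack([m.predict(X) for m in models])
    combined = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
    print("training-set agreement:", (combined == y).mean())
```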