
1.1.4 Sequential Mining

A sequence is an ordered set of item-sets. All the transactions of a particular customer, made at different times, can be taken as a sequence. The term support carries a different meaning here: it is incremented only once per customer, even if the customer has bought the same item several times in different transactions. Web and scientific data are usually sequential in nature. Finding patterns in such data helps to predict future activities, interpret recurring phenomena, extract outstanding comparisons for close attention, compress data and detect intrusions.
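To make this support definition concrete, here is a minimal sketch in Python of counting sequence support, where each customer contributes at most one to the count (the pattern, data and helper names are illustrative assumptions, not from a specific library):

def is_subsequence(pattern, sequence):
    # Each itemset of the pattern must be contained, in order,
    # in successive itemsets of the customer's sequence.
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset.issubset(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, customer_sequences):
    # A customer is counted once, even if the pattern occurs
    # several times within that customer's sequence.
    return sum(1 for seq in customer_sequences if is_subsequence(pattern, seq))

# Each customer's transactions, ordered by time; each transaction is an itemset.
customers = [
    [{"bread"}, {"milk", "bread"}, {"butter"}],   # buys bread twice, counted once
    [{"milk"}, {"bread"}, {"butter", "milk"}],
    [{"butter"}, {"milk"}],
]
print(support([{"bread"}, {"butter"}], customers))  # 2

Note that the first customer bought bread in two transactions yet still contributes only one to the support count.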

The incremental mining of sequential data helps in computing only the difference, by accessing the updated part of the database and data structure. Sequential data include text, music notes, satellite data, stock prices, DNA sequences, weather data, medical record histories, log files, etc. The applications of sequential mining include analysis of customer purchase patterns, stock market analysis, DNA sequence analysis, computational biology, scientific experiments, disease treatments, Web access patterns, telecommunications, biomedical research, prediction of natural disasters and system performance analysis.

1.1.5 Clustering

Clustering is the process of grouping the data into classes so that objects within a cluster are similar to one another, but are dissimilar to objects in other clusters.

Various distance functions are used to make a quantitative determination of similarity, and an objective function is defined with respect to this distance function to measure the quality of a partition. Clustering is an example of unsupervised learning. It can be defined as follows: given n data points in a d-dimensional metric space, partition the data points into k clusters such that the points within a cluster are more similar to one another than to points in other clusters.
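As a concrete instance of such an objective function (one choice among many, assumed here rather than fixed by the text), the within-cluster sum of squared distances minimized by k-means can be written in LaTeX as

E = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2

where C_1, ..., C_k are the clusters and \mu_j is the mean of the points in C_j; smaller E means a tighter, better partition under this criterion.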

Clustering has roots in data mining, biology and machine learning. Once the clusters are decided, the objects are labeled with their corresponding clusters, and common features of the objects in a cluster are summarized to form the class description.

For example, a set of new diseases can be grouped into several categories based on the similarities in their symptoms, and the common symptoms of the diseases in a category can be used to describe that group of diseases. Clustering is a useful technique for the discovery of data distribution and patterns in the underlying database.

It has been studied in considerable detail by both statistics and database researchers for different domains of data. As huge amounts of data are collected in databases, cluster analysis has recently become a highly active topic in data mining research.

Applications of this method include data warehousing, market research, seismology, minefield detection, astronomy, customer segmentation, computational biology (for analyzing DNA microarray data) and the World Wide Web.

Some of the requirements of clustering in data mining are scalability, the ability to handle high dimensionality, the ability to handle noisy data, and the ability to handle different types of data. Clustering analysis helps to construct a meaningful partitioning of a large set of objects based on a divide and conquer methodology. Given a large set of multidimensional data points, the data space is usually not uniformly occupied; clustering therefore identifies the sparse and the crowded areas to discover the overall distribution patterns of the dataset. Numerous applications involving data warehousing, trend analysis, market research, customer segmentation and pattern recognition are high dimensional and dynamic in nature. They provide an opportunity for performing dynamic data mining tasks such as incremental mining and association rule mining. It is challenging to cluster high dimensional data objects when they are skewed and sparse. Updates are quite common in dynamic databases, and they are usually processed in batch mode. In very large databases, it is efficient to perform cluster analysis incrementally, on the updates alone. There are five methods of clustering: (i) partitioning method, (ii) grid based method, (iii) model based method, (iv) density based method and (v) hierarchical method.

Partition Method: Given a database of N data points, this method tries to form k clusters, where k ≤ N. It attempts to improve the quality of the clusters or partition by moving data points from one group to another. Three popular algorithms in this category are k-means, where each cluster is represented by the mean value of the data points in the cluster; k-medoids, where each cluster is represented by one of the objects situated near the center of the cluster; and k-modes, which extends k-means to categorical attributes. The k-means and k-modes methods can be combined to cluster data with both numerical and categorical values; this combination is called the k-prototypes method. One disadvantage of these methods is that they are good only at creating spherical-shaped clusters in small databases.
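As a concrete illustration, here is a minimal sketch of the k-means iteration described above, in plain Python (the initialization, fixed iteration count and data are simplifying assumptions):

def k_means(points, k, iterations=10):
    centroids = points[:k]  # naive initialization: the first k points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters

points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
print(k_means(points, k=2))

Each pass moves points between groups and recomputes the cluster means, which is exactly the "improve the partition by moving data points" step the text refers to.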

Grid Based Method: This method quantizes the database into a finite number of grid cells, which makes it very fast, since all the operations are performed on this grid structure rather than on the individual data points.

Model Based Method: This is a robust clustering method. It locates clusters by constructing a density function that denotes the spatial distribution of the data points. It determines the number of clusters based on standard statistics, taking outliers into consideration.

Density Based Method: This method finds clusters of arbitrary shape. It grows the clusters with as many points as possible till some threshold is met. The ε-neighborhood of a point is used to find dense regions in the database.
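The text does not name a particular algorithm; DBSCAN is the classic example of this approach. A minimal sketch of the ε-neighborhood test it builds on (eps, min_pts and the data are illustrative assumptions):

from math import dist  # Python 3.8+

def eps_neighborhood(points, p, eps):
    # All points within distance eps of p (including p itself).
    return [q for q in points if dist(p, q) <= eps]

def core_points(points, eps, min_pts):
    # A point lies in a dense region when its eps-neighborhood holds
    # at least min_pts points; clusters are grown outward from such points.
    return [p for p in points if len(eps_neighborhood(points, p, eps)) >= min_pts]

points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (5.0, 5.0)]
print(core_points(points, eps=0.5, min_pts=3))  # the three nearby points; (5.0, 5.0) is not dense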

Hierarchical Method: In this method, the database is decomposed into several levels of partitioning, represented by a dendrogram. A dendrogram is a tree that iteratively splits the given database into smaller subsets until each subset contains only one object. Here each group of size greater than one is in turn composed of smaller groups. This method is qualitatively effective, but practically infeasible for large databases, since its performance is at least quadratic in the number of database points. Consequently, random sampling is often used to reduce the size of the dataset. There are two types of hierarchical clustering algorithms:

(i) Divisive methods work by recursively partitioning the set of data points S until singleton sets are obtained. (ii) Agglomerative algorithms work by starting with singleton sets and then merging them until S is covered. Agglomerative methods cannot be used directly on large databases, as they scale quadratically with the number of data points.

Hierarchical methods usually generate spherical clusters rather than clusters of arbitrary shape.
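A minimal sketch of single-linkage agglomerative clustering, the bottom-up variant in (ii) above (the linkage rule, Euclidean distance and data are illustrative assumptions):

from math import dist

def single_linkage(points, target_k):
    # Start from singleton clusters and repeatedly merge the two clusters
    # whose closest pair of members is nearest, until target_k remain.
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(dist(a, b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))  # merge cluster j into cluster i
    return clusters

points = [(0.0, 0.0), (0.2, 0.1), (4.0, 4.0), (4.1, 3.9)]
print(single_linkage(points, target_k=2))

The scan over all cluster pairs at every merge is what makes the method scale at least quadratically with the number of points, as noted above.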

The data points which do not belong to any cluster are called outliers or noise.

The detection of outliers is an important data mining issue and is called outlier mining. Outlier mining finds applications in fraud detection, medical treatment, etc.
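As one simple illustration (a swapped-in statistical rule, not a method named by the text), values far from the mean can be flagged with a z-score test; the cutoff and data below are illustrative assumptions:

from statistics import mean, stdev

def z_score_outliers(values, threshold=2.5):
    # Flag values lying more than `threshold` standard deviations from
    # the mean; the cutoff is chosen per application.
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

amounts = [102, 98, 100, 105, 97, 101, 99, 100, 103, 950]  # one suspicious amount
print(z_score_outliers(amounts))  # [950]

Distance-based and density-based outlier definitions are also common, and fit naturally with the clustering methods above.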

1.1.6 Classification

Classification is the process of labeling data into a set of known classes. A set of training data whose class labels are known is given and analyzed, and a classification model is built from this training set. A decision tree or a set of classification rules is generated from the classification model, which can be used for better understanding of each class in the database and for classification of new data. For example, classification rules about diseases can be extracted from known cases and used to diagnose new patients based on their symptoms. Classification methods have been widely developed in the fields of machine learning, statistics, databases, neural networks and rough sets, and classification is an important theme in data mining. It is used in customer segmentation, business modeling and credit analysis.
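A minimal sketch of this train-then-classify workflow using scikit-learn's decision tree (the symptom encoding and toy data are illustrative assumptions, not from the text):

from sklearn.tree import DecisionTreeClassifier

# Training cases: [fever, cough, rash], each 0 (absent) or 1 (present).
symptoms = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 1]]
diseases = ["flu", "flu", "cold", "measles", "measles"]

# Build the classification model (a decision tree) from the training set.
model = DecisionTreeClassifier().fit(symptoms, diseases)

# Diagnose a new patient from their symptoms.
print(model.predict([[1, 1, 0]]))  # e.g. ['flu'] on this toy data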

1.1.7 Characterization

Characterization is the summarization of a set of task-relevant data into a relation, called a generalized relation, which can be used for the extraction of characteristic rules.

For example, the symptoms of a specific disease can be summarized by a set of characteristic rules. Methods for efficient and flexible generalization of large data sets can be categorized into two approaches: the data cube approach and the attribute-oriented induction approach.

In the data cube approach, a multidimensional database is constructed, consisting of a set of dimensions and measures. A dimension is usually defined by a set of attributes which form a hierarchy or a lattice structure. A data cube can store precomputed aggregates for all or some of its dimensions. Generalization and specialization can be performed on a multidimensional data cube by roll-up or drill-down operations. A roll-up operation reduces the number of dimensions in a data cube, or generalizes attribute values to higher level concepts. A drill-down operation does the reverse. Since many aggregate values may need to be used repeatedly in data analysis, the storage of precomputed aggregates in a multidimensional data cube ensures fast response time and offers flexible views of the data from different angles and at different levels of abstraction. The attribute-oriented induction approach may handle complex types of data and perform attribute relevance analysis during characterization.
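A minimal sketch of a roll-up on a tiny sales cube using pandas (the table, the column names and the city-to-region hierarchy are illustrative assumptions):

import pandas as pd

# Base cuboid: one row per (region, city, quarter) cell with a sales measure.
cube = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "city":    ["Oslo", "Bergen", "Rome", "Naples"],
    "quarter": ["Q1", "Q1", "Q1", "Q1"],
    "sales":   [120, 80, 200, 150],
})

# Roll-up: drop the city dimension, generalizing to region-level totals.
rolled_up = cube.groupby(["region", "quarter"], as_index=False)["sales"].sum()
print(rolled_up)

# Drill-down is the reverse: return to the finer, city-level view.
print(cube)

Storing such aggregates once, instead of recomputing them for every query, is what gives the data cube its fast response time.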

1.1.8 Discrimination

Discrimination is the discovery of features or properties that distinguish the class being examined (the target class) from other classes (the contrasting classes). The method for mining discriminant rules is similar to that for mining characteristic rules, except that mining is performed in both the target class and the contrasting classes synchronously, to ensure that the comparison is made at comparable levels of abstraction. For example, to distinguish one disease from others, a discriminant rule summarizes the symptoms that set this disease apart from the others.