



1.4.1 Collecting Observations about the Behavior of the System

A key problem with any analysis of data is deciding what information to consider when collecting observations in order to study a system of interest. There is no easy way to answer this question. The observations should describe the behavior of the system under consideration in such a way that, when they are analyzed by DM&KD means, the extracted patterns will be effective in meeting the goals of the analysis.

Each data point is assumed to be a vector of some dimension n that describes the behavior of the system in terms of n attributes. It may also be the case that these attributes (or just some of them) are organized in a tree-like hierarchy. After the analysis, some of the attributes may be dropped as insignificant or simply irrelevant. Furthermore, collecting information about different attributes may involve entirely different costs. For instance, in a medical setting the cost of collecting information about a patient’s blood pressure is far less than the cost of performing a biopsy of a lesion from, say, the liver or the brain of the patient.
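As a concrete illustration only, the sketch below shows one possible way to represent such observations in code; the attribute names, the cost values, and the Python representation itself are illustrative assumptions and not part of any particular DM&KD method.

    from dataclasses import dataclass

    @dataclass
    class Attribute:
        """One of the n attributes describing the system.  The collection_cost
        field illustrates that measuring different attributes can cost very
        different amounts (e.g., blood pressure vs. a biopsy)."""
        name: str
        collection_cost: float  # hypothetical units (money, time, risk, ...)

    @dataclass
    class Observation:
        """A single data point: a vector of n attribute values."""
        values: dict  # maps attribute name -> measured value (None if not collected)

    # Hypothetical medical attributes echoing the example in the text.
    attributes = [
        Attribute("blood_pressure", collection_cost=1.0),
        Attribute("liver_biopsy", collection_cost=500.0),
    ]
    patient = Observation(values={"blood_pressure": 120.0, "liver_biopsy": None})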

The analyst may wish to first collect information about easily obtainable attributes. If the inferred model is not sufficiently accurate and/or not easily interpretable, then the analyst may wish to consider more attributes and augment the analysis.
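The loop below sketches this staged strategy under stated assumptions: the attribute groups are ordered from cheapest to most expensive, and fit_model and is_acceptable are hypothetical callables supplied by the analyst.

    def staged_analysis(attribute_groups, fit_model, is_acceptable):
        """Fit a model on the cheapest attributes first and add costlier
        attribute groups only while the current model is not acceptable."""
        used_attributes = []
        model = None
        for group in attribute_groups:  # ordered from cheapest to most expensive
            used_attributes.extend(group)
            model = fit_model(used_attributes)
            if is_acceptable(model):
                break
        return model, used_attributes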

Another key problem is how to identify noise in the data and clean it. The danger here is that what appears to be noise may in reality be a set of legitimate outliers; discarding such points means ignoring an excellent and rare opportunity to find evidence of some rare but very important aspects of the system under consideration. In other words, such noise might be disguised nuggets of valuable new knowledge. As with outliers in a traditional statistical study, one has to be very careful when deciding whether to remove or keep such data.
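One simple way to flag suspicious points, shown below as a minimal sketch, is a z-score threshold on a single attribute; the threshold value is an assumption, and flagged points should be reviewed by the analyst rather than discarded automatically, since they may be legitimate outliers.

    import statistics

    def flag_candidate_outliers(values, z_threshold=3.0):
        """Return indices of values whose z-score exceeds the threshold.
        These are candidates for noise only; they may instead be rare but
        legitimate observations, so they should be inspected, not deleted."""
        mean = statistics.mean(values)
        spread = statistics.pstdev(values)
        if spread == 0:
            return []
        return [i for i, v in enumerate(values) if abs(v - mean) / spread > z_threshold]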

Figure 1.4. A Random Sample of Observations Classified in Two Classes.

1.4.2 Identifying Patterns from Collections of Data

In order to help fix ideas, consider the observations depicted in Figure 1.4. These observations are defined in terms of the two attributes A1 and A2. Each observation is represented by either a gray circle or a solid dark circle.

The main question that any DM&KD analysis tries to answer is: what can one learn from these data? Such knowledge may next be used to accurately predict the class membership (in this case, whether it is a “gray” or a “solid dark” observation) of new and unclassified observations. Such knowledge may also lead to a better understanding of the inner workings of the system under consideration, the design of experiments to refine the current understanding of the system, and so on.

There are many ways one can define the concept of knowledge given a set of observations grouped into different classes. Most experts seem to agree that, given observations of data grouped into different categories (classes), knowledge is any pattern implied by these data which has the potential to answer the previous main question. This understanding of knowledge makes the quest of acquiring new knowledge an ill-defined problem, as there might be more than one pattern implied by the data. Thus, what is the best pattern? The direction adopted in the following developments is that the best pattern among a set of candidate patterns is the simplest one that is still sufficient to answer the previous main question. This kind of philosophy is not new. As the medieval philosopher William of Occam (also known as Ockham) stated in his famous “razor”: Entia non sunt multiplicanda praeter necessitatem (plurality should not be assumed without necessity).

Many DM&KD approaches interpret the above need as an attempt to minimize the complexity of the derived pattern. Such patterns can be expressed in terms of a decision tree, a set of separating planes, statistical models defined on some parameters, classification rules, etc. The need is then to derive a decision tree with a minimum number of branches and/or nodes; in the case of separating planes, a minimum number of such planes; in the case of a statistical model, a model with the minimum number of parameters; or the minimum number of classification rules, and so on.

However, obtaining a minimum number of the previous pattern entities may be a computationally very difficult, if not impossible, task. Thus, a more practical approach oftentimes is to develop fast heuristics that derive a very small number of such pattern entities. In the majority of the DM&KD methods to be discussed in the following chapters, the above general objective is interpreted as deriving a minimum or near-minimum number of classification rules. This objective is also consistent with the well-known set covering problem.
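To make the connection with set covering concrete, the sketch below shows a generic greedy heuristic that selects a small (not necessarily minimum) number of classification rules. The rules are assumed to be simple Boolean predicates over an observation; this is an illustrative assumption, not the specific algorithms developed in later chapters.

    def greedy_rule_selection(candidate_rules, observations):
        """Greedy set-covering heuristic: repeatedly pick the candidate rule
        that covers the most still-uncovered observations."""
        uncovered = set(range(len(observations)))
        selected = []
        while uncovered and candidate_rules:
            best = max(candidate_rules,
                       key=lambda rule: sum(1 for i in uncovered if rule(observations[i])))
            newly_covered = {i for i in uncovered if best(observations[i])}
            if not newly_covered:
                break  # no remaining rule covers any uncovered observation
            selected.append(best)
            uncovered -= newly_covered
        return selected

    # Hypothetical usage: two axis-aligned "box" rules over attributes (A1, A2).
    rules = [lambda p: 0 <= p[0] <= 2 and 0 <= p[1] <= 2,
             lambda p: 3 <= p[0] <= 5 and 0 <= p[1] <= 2]
    points = [(1.0, 1.0), (4.0, 1.0), (1.5, 0.5)]
    selected_rules = greedy_rule_selection(rules, points)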

Next, suppose that instead of the rather complex data set depicted in Figure 1.4 we now have the much simpler data set depicted in Figure 1.5. What pattern in the form of classification rules can be implied by these data?

The answer to this question may not be unique. Figure 1.6 presents a possible answer to this question. This answer is represented by the single square block that encloses all the solid dark points without including any of the gray points. One may consider a number of blocks that, collectively, achieve the same goal. Similarly, one may consider a number of different single-box solutions that meet the same goal.

Figure 1.5. A Simple Sample of Observations Classified in Two Categories.

Figure 1.6. A Single Classification Rule as Implied by the Data.

Such a box implies a classification rule defined in terms of the two attributes A1 and A2. In general, such boxes are convex polyhedra defined on the space of the attributes. For the single box in Figure 1.6 this is indicated by the dotted lines that define the ranges of values [a1, a2] and [b1, b2] for the attributes A1 and A2, respectively. If the coordinates of a new observation fall inside both of these ranges, then according to the rule depicted by the solid box in Figure 1.6, this observation belongs to the “solid dark” class. In the majority of the methods described in this book we will attempt to derive solutions like the one in Figure 1.6. That is, we will employ approaches that minimize the complexity of the derived pattern when such a pattern is expressed in terms of a compact Boolean function or, equivalently, in terms of a few classification rules.

For instance, for the classification rule depicted as the solid box in Figure 1.6, this Boolean function has the form (where “∧” indicates the logical “and” operator)

F = (a2 ≥ A1 ≥ a1) ∧ (b2 ≥ A2 ≥ b1).

Thus, the corresponding classification rule is

IF (the value of A1 is between a1 and a2) and (the value of A2 is between b1 and b2)

THEN the observation belongs to the “solid dark” class.
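As a minimal sketch, the rule above can be coded directly as a Boolean predicate; the numeric bounds below are hypothetical stand-ins for the values a1, a2, b1, and b2 read off Figure 1.6.

    def make_box_rule(a1, a2, b1, b2):
        """Return a predicate implementing F = (a2 >= A1 >= a1) AND (b2 >= A2 >= b1);
        a True result assigns the observation to the "solid dark" class."""
        def rule_fires(A1, A2):
            return (a1 <= A1 <= a2) and (b1 <= A2 <= b2)
        return rule_fires

    rule = make_box_rule(a1=2.0, a2=5.0, b1=1.0, b2=4.0)  # hypothetical bounds
    rule(3.2, 2.7)   # True: inside both ranges, so "solid dark"
    rule(6.0, 2.7)   # False: the rule does not fire for this observation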

Similarly to the above simple case, when the data depicted in Figure 1.4 are treated according to the previous minimization objective, a possible set of classification rules is the set of boxes depicted in Figure 1.7. This figure depicts rules both for the gray and also for the solid dark sampled observations (depicted as dotted and solid-line boxes, respectively).

Figure 1.7. Some Possible Classification Rules for the Data Depicted in Figure 1.4.