
3.3 Three off-centered measures designed for imbalanced data

3.3.1 Residual effect induced by the off-centering operation

Among the individuals of the sample, let us suppose that 20 people experienced an undesired situation while 80 people did not. The initial distribution is therefore (0.2, 0.8). During the induction of the tree, the algorithm tries to move away from this distribution. For a researcher in the social sciences, the interest of using the tree is to extract factors able to explain the fact of having experienced the undesired situation. In other words, the researcher is interested in finding a particular variable split, possibly with an interaction effect with another variable, that leads to a greater proportion of the minority class in at least one of the nodes. In this context, a situation that produces a node with the distribution (0.3, 0.7) has to receive a better utility score than a situation that produces a node with the distribution (0.1, 0.9). Both situations lie at the same distance of 0.1 (10 individuals) from the initial distribution, but one in favor of the minority class and the other in favor of the majority class.

Table 3.4 – Illustration of the residual effect induced by off-centering an entropy measure. When the objective is to extract a node with a high proportion of the minority class, the best split is S1. However, according to the values of the entropy gains, the split S2 is chosen. Entropy values are computed with the Shannon entropy off-centered with Lallich's method and an initial distribution of (0.2, 0.8).

                         Split S1                     Split S2
                         Distribution    Entropy      Distribution    Entropy
Root/parent node         (0.20, 0.80)    1.0000       (0.20, 0.80)    1.0000
Child 1                  (0.30, 0.70)    0.9887       (0.10, 0.90)    0.8113
Child 2                  (0.15, 0.85)    0.9544       (0.25, 0.75)    0.9972
Child 3                  (0.15, 0.85)    0.9544       (0.25, 0.75)    0.9972
Split entropy                            0.9659                       0.9352
Split entropy gain                       0.0341                       0.0648
Split with best child    x
Split performed                                       x

Now, let us consider the Shannon entropy off-centered with Lallich's method around the initial imbalance (ω_m = 0.2, 0.8). By assumption, this initial distribution is the least relevant one; therefore, the entropy measure takes its maximal value at it. The resulting entropy is illustrated in Figure 3.3. The graphic shows that the entropy value taken at the point (0.3, 0.7) is greater than the value taken at the point (0.1, 0.9). It means that, for the same distance to the initial distribution, the off-centering favors the distribution that reduces the proportion of the infrequent class.
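A minimal sketch in Python may help fix ideas. It assumes the piecewise-linear change of variable that sends the initial minority proportion ω_m onto 1/2 before applying the normalized Shannon entropy, which is consistent with the values reported in Figure 3.3 and Table 3.4; the function name and its interface are mine, not part of the original development.

    import math

    def offcentered_entropy(p, w_m=0.2):
        """Shannon entropy off-centered so that its maximum sits at p = w_m.

        p   : proportion of the minority class in a node
        w_m : minority proportion of the initial (root) distribution
        """
        # Piecewise-linear change of variable:
        # [0, w_m] is mapped onto [0, 1/2] and [w_m, 1] onto [1/2, 1].
        if p <= w_m:
            q = p / (2 * w_m)
        else:
            q = (p + 1 - 2 * w_m) / (2 * (1 - w_m))
        if q in (0.0, 1.0):          # pure node after transformation
            return 0.0
        # Normalized (base-2) Shannon entropy of the transformed distribution (q, 1 - q).
        return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

    # Same distance of 0.1 from the initial distribution (0.2, 0.8), opposite directions:
    print(round(offcentered_entropy(0.30), 4))   # 0.9887 for (0.30, 0.70)
    print(round(offcentered_entropy(0.10), 4))   # 0.8113 for (0.10, 0.90)

Since the tree retains the split with the lowest weighted entropy, the lower value obtained at (0.10, 0.90) is precisely what makes this distribution look preferable to the algorithm, which is the residual effect at stake here.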

To illustrate, let us consider two possible splits S1 and S2. To keep things simple, both splits are assumed to have three children, each containing 1/3 of the observations of the parent node. Let us assume that the first split produces a node with the distribution (0.3, 0.7) and the second split produces a node with the distribution (0.1, 0.9), the individuals being equally distributed among the other nodes. This situation is represented in Table 3.4 with the corresponding entropy values (a short numerical check is sketched below). The results show that the weighted entropy of the split S1 is higher than the weighted entropy of the split S2. As a result, the best entropy gain is achieved by the split S2. Therefore, the classification tree chooses the split S2 although the split S1 gives the best child. This result goes against the objective of identifying subgroups in which the infrequent class is highly visible.

I recall that, in the present work, classification trees are intended to be used in the exploratory stage of a study. In this context, the strategy that I propose is to explore the space of predictors by constructing several models successively and interactively, in order to identify subgroups of observations with the highest possible proportion of the minority class.
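The figures of Table 3.4 can be checked with a few lines reusing the offcentered_entropy sketch above (again an illustration of mine, not code from the original method): with three children of equal weight, the split entropy is simply the average of the child entropies.

    def split_entropy(minority_props, weights):
        """Weighted off-centered entropy of a split.

        minority_props : minority-class proportion of each child node
        weights        : share of the parent's observations in each child
        """
        return sum(w * offcentered_entropy(p)
                   for p, w in zip(minority_props, weights))

    root = offcentered_entropy(0.20)                    # 1.0000 at the parent node
    weights = [1/3, 1/3, 1/3]                           # children of equal size
    s1 = split_entropy([0.30, 0.15, 0.15], weights)     # 0.9659
    s2 = split_entropy([0.10, 0.25, 0.25], weights)     # 0.9352
    print(round(root - s1, 4), round(root - s2, 4))     # gains: 0.0341 vs 0.0648
    # S2 yields the larger entropy gain and is selected, even though S1
    # is the split that produces the best child, (0.30, 0.70).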

[Figure: the off-centered entropy plotted over the probability distributions (x, 1−x), x ∈ [0, 1], with its maximum at ω_m = 0.2; the two distributions located at a distance of 0.1 on either side of (0.2, 0.8) are marked.]

Figure 3.3 – Emergence of a residual effect due to an off-centering: illustration on the Lallich et al. (2007) change of variable (see Figure 2.7)

Therefore, a model that identifies a subgroup whose recall of the minority class is 30% will be considered a better model than another model that identifies two subgroups whose recall of the minority class is 25% each, even if the second model actually gives a higher overall recall of the minority class. An extreme case of this reasoning would be to ignore all misclassifications of the classes other than the minority class. However, such a strategy would not account for the fact that correctly identifying observations of the other classes also contributes to improving the identification of the minority class. This point especially concerns binary response variables because, with only two classes, any action on one class acts by complementarity on the other. Therefore, a balance has to be struck between improving the recall of the class to predict and retaining very good classifications on the other classes, which also serve this objective.

The residual effect caused by off-centering an entropy measure occurs only for splits with three or more children. Indeed, in the specific case of a two-child split with children of equal size, the probability distributions of the two child nodes are complementary: whatever one child gains on the minority class, the other loses. Therefore, at a distance of 0.1 from the initial distribution, there is only one possible split, consisting of one node whose minority proportion is increased by 0.1 and another node whose minority proportion is decreased by 0.1.
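A tiny numerical check, reusing the same hypothetical offcentered_entropy function, makes this constraint explicit for children of equal size:

    # Two equal-sized children of a parent at (0.2, 0.8): the minority
    # proportions are tied together, p2 = 2 * 0.2 - p1, so the split at
    # distance 0.1 is unique: one child at (0.3, 0.7), the other at (0.1, 0.9).
    p1 = 0.30
    p2 = 2 * 0.20 - p1                                    # forced to 0.10
    gain = offcentered_entropy(0.20) - 0.5 * (offcentered_entropy(p1)
                                              + offcentered_entropy(p2))
    print(round(p2, 2), round(gain, 4))                   # 0.1 0.1 (single candidate split)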

In the next sections, I put forward three different measures to reduce this residual effect. When growing a classification tree, the choice of the best split is made by looking at the entropy reduction. The strategy I have in mind when putting forward the following tree-growing measures is to design the entropy measure so that reducing the value of the entropy leads to improving the proportion of the infrequent class. To this purpose, I act on the shape of the entropy in such a way that, for the same rise in the frequency x ∈ [0, 1] of a class, a split in favor of the infrequent class is preferred over a split in favor of another class. To reach this goal, I rely on two approaches: (1) relaxing the concavity of the entropy, or (2) accepting that the measure is not null on at least one pure node. The first approach reduces the residual effect induced by the off-centering but does not eliminate it entirely. The second approach makes it possible to remove the residual effect entirely, but it departs from the concept of an entropy measure.