
3.3 Three off-centered measures designed for imbalanced data


(3.2)

where h is a classical entropy measure and w̄ (Equation 3.3) accounts for the distance of each element to the balanced distribution. We add 1 to ensure the positivity of the vector:

\[
\bar{w} = (\bar{w}_1, \ldots, \bar{w}_\ell) = \left(1 + \left(\frac{1}{\ell} - w_1\right), \ldots, 1 + \left(\frac{1}{\ell} - w_\ell\right)\right) \qquad (3.3)
\]

As this new measure takes a positive value on at least one pure node, it is no longer an entropy. I therefore refer to it as a score, the maximal distance off-centered score. This proposition is consistent with the maximal distance assignment rule (Eq. 2.25) discussed in Section 2.2.5. Figure 3.5 illustrates the maximal distance (MD) off-centered score built on the Shannon entropy for different levels of data imbalance.
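To make the change of variable concrete, here is a minimal sketch of the computation of the off-centering vector w̄ (Equation 3.3), assuming a class distribution w that sums to 1; how w̄ then enters the entropy h is given by Equation 3.2 and is not sketched here.

```python
import numpy as np

def md_weights(w):
    """Off-centering vector w_bar of Equation 3.3.

    Each component is the distance of the observed class proportion w_i
    to the balanced proportion 1/l, shifted by 1 so that every component
    stays strictly positive.
    """
    w = np.asarray(w, dtype=float)
    l = len(w)  # number of classes
    return 1.0 + (1.0 / l - w)

# Example: a 5%/95% imbalanced binary problem.
print(md_weights([0.05, 0.95]))  # -> [1.45 0.55]
```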

Figure 3.5 – Illustration of the maximal distance (MD) off-centered score for several levels w_m of data imbalance (w_m = 0.05, 0.1, 0.2, 0.3, 0.5). Axes: probability distribution (x, 1−x), x ∈ [0, 1]; score.

3.3.4 The MDC score

Based on the sample size sensitivity strategy introduced in Marcellin (2008) and Zighed et al. (2010), I propose to extend the method put forward in Section 3.3.3 by adding the consistency property. To keep the flexibility of adding parameters or properties to this consistency mechanism, I decided not to use the Laplace estimator but to define a dedicated function. I therefore set up the following increasing function, which keeps the entropy close to 0 when a node contains few instances and returns the entropy itself when the number of instances n becomes large:

\[
\varphi : \mathbb{N} \times \left]0, 1\right] \to [0, 1], \qquad
(n, \alpha) \mapsto \frac{\alpha}{e^{-n + \log\left(\frac{1}{\alpha} - 1\right)} + 1}
\]

where n is the number of instances in a particular node and α is the entropy or score value for a particular distribution p. Thus, ϕ(0, α) ≈ 0 and lim_{n→∞} ϕ(n, α) = α. This function is not defined for α = 0, but the problem is avoided by setting α, in this case, to the smallest number above 0 the computer can represent (2^{-32} for a 32-bit processor, 2^{-64} for a 64-bit processor).
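A minimal numerical sketch of ϕ under this reading of the formula (an assumed reconstruction; only the qualitative behaviour matters here):

```python
import math

def phi(n, alpha):
    """Consistency function: damps a score value alpha when the node
    holds few instances n, and tends to alpha as n grows (assumed
    sigmoid-style reconstruction of the formula above)."""
    if alpha <= 0.0:
        # Workaround suggested in the text for alpha = 0: use a tiny
        # positive value instead.
        alpha = 2.0 ** -64
    return alpha / (math.exp(-n + math.log(1.0 / alpha - 1.0)) + 1.0)

# Behaviour for a score value alpha = 0.3:
print(phi(0, 0.3))   # close to 0 (0.09)
print(phi(5, 0.3))   # ~0.295, already close to alpha
print(phi(50, 0.3))  # ~0.3, the limit value
```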

Then, I consider the total number of instances N of the dataset to normalize the function:

\[
\phi : \mathbb{N} \times \mathbb{N} \times [0, 1] \to [0, 1], \qquad
(n, N, \alpha) \mapsto \frac{\alpha}{e^{-\frac{n}{N} + \log\left(\frac{1}{\alpha} - 1\right)} + 1}
\]

I combine this consistency function with a change of variable similar to the one defined in Section 3.3.3. Figure 3.6 shows the resulting score function for several distributions w, whereas Figure 3.7 shows it for several sample sizes at w = (0.3, 0.7).
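Under the same assumed reading of the formula, the normalized version can be sketched and composed with an off-centered score. The composition shown here (applying φ to the score value) only illustrates the idea; off_centered_score is a caller-supplied placeholder for Equation 3.2, and the exact combination used in the thesis may differ.

```python
import math

def phi_normalized(n, N, alpha):
    """Consistency function normalized by the total dataset size N
    (assumed sigmoid-style reconstruction of the formula above)."""
    if alpha <= 0.0:
        alpha = 2.0 ** -64  # workaround for alpha = 0, as in the text
    return alpha / (math.exp(-n / N + math.log(1.0 / alpha - 1.0)) + 1.0)

def mdc_score(p, n, N, off_centered_score):
    """Hypothetical composition: the off-centered score of the node
    distribution p (Equation 3.2, supplied by the caller) is damped
    according to the node size n relative to the dataset size N."""
    return phi_normalized(n, N, off_centered_score(p))

# Damping of a fixed score value alpha = 0.8 for growing node sizes, N = 1000;
# the value grows towards alpha as n grows.
for n in (1, 10, 100, 1000):
    print(n, round(phi_normalized(n, 1000, 0.8), 3))
```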

Figure 3.6 – The maximal distance and consistent (MDC) off-centered score for several levels w_m of data imbalance (0.05, 0.10, 0.20, 0.30, 0.50), N = 1000. Axes: probability distribution (x, 1−x), x ∈ [0, 1]; score.


Figure 3.7 – The maximal distance and consistent (MDC) off-centered score for several node sample sizes (nInstancesNode = 1, 10, 50, 100, 1000), w = (0.3, 0.7). Axes: probability distribution (x, 1−x), x ∈ [0, 1]; score.

Contrary to Figure 3.5, we can see in Figure 3.6 that I adjusted the score function on two points: (1) I strictly respect the fact that increasing the proportion of the minority class is better than increasing the proportion of the majority class, and (2) I adjust the entropy so that it can take low values whenever possible. Point (1) means that, when we move away from the initial distribution, it is more interesting to do so by increasing the proportion of the minority class. Point (2) means that the entropy varies over the largest possible range in [0, 1]; we thus have more margin to set the stopping criterion (i.e. the threshold on the entropy gain). The function is also more consistent with the measures defined previously.

In addition, this method seems to provide a natural stopping criterion: when there are few instances in the leaves, the entropy gain is low and the development of the tree stops.

Another stopping criterion that allows adjusting the tree depth is the minimum number of instances child nodes have to contain for a candidate split to be accepted.

This criterion is widely used in classification trees that do not use a pruning procedure. However, I think this criterion does not suit a class imbalance context. To illustrate, let us consider a minimum threshold of 3,000 instances for a dataset having 50,000 instances and a 5%/95% imbalance. A pure node on the minority class contains at most 2,500 instances. Therefore, a 3,000-instance minimum threshold makes it impossible to find a pure node on the minority class. This comment addresses the absolute lack of data issue. While off-centering split goodness measures addresses relative data imbalance, I advocate that the stopping criterion has to consider absolute data imbalance. For this purpose, my strategy is to use, instead of a fixed minimum threshold on the number of instances in child nodes, a variable threshold. This threshold is expected to vary according to the observed size of each class, to allow smaller child nodes when they contain more observations of infrequent classes. Let us consider a response variable with ℓ classes and N_t a minimum threshold on the number of instances in child nodes. I consider that this threshold applies when a node is distributed as in the initially observed situation. Otherwise, the threshold is decreased proportionally to the number of observations of the less frequent classes. I define the imbalanced data consistent (IDC) stopping criterion for a binary response variable as follows:

\[
\psi : [0, 1]^{\ell} \times [0, 1]^{\ell} \times \mathbb{N} \to \mathbb{R}
\]
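The sketch below only illustrates the idea of such a variable threshold; idc_min_instances and its scaling rule are hypothetical assumptions for illustration and do not reproduce the exact definition of ψ.

```python
import numpy as np

def idc_min_instances(child_dist, prior_dist, n_t):
    """Hypothetical variable minimum-instances threshold (illustration only).

    child_dist : class proportions observed in the candidate child node
    prior_dist : class proportions observed on the whole training set
    n_t        : base threshold, applied when the child follows the prior

    The threshold shrinks when rare classes are over-represented in the
    child node, so that small but minority-rich nodes remain acceptable.
    """
    child = np.asarray(child_dist, dtype=float)
    prior = np.asarray(prior_dist, dtype=float)
    over_rep = np.max(child / prior)  # largest over-representation ratio
    return n_t / max(over_rep, 1.0)

# A child made mostly of the 5% minority class gets a much smaller threshold.
print(idc_min_instances([0.9, 0.1], [0.05, 0.95], 3000))  # 3000 / 18 ~= 167
```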

This method aims to deal with the class imbalance issue (absolute and relative lack of data) and with the fact that the expert is more interested in increasing the proportion of the minority class than that of the majority class (distance to the worst distribution w). Furthermore, I think the method can deal with the induction margin issue: indeed, the unbalanced stopping criterion limits the growth of the tree for the majority class but lets it continue for the minority class.

Thus, the method deals both with the relative lack of data (even with the entropy gain measure, because the score function does not vary over the same range for each class) and with the absolute lack of data, thanks to the previous stopping criterion.

However, a possible limitation of this method is its computational cost, which is higher than for the previous methods. Experiments will allow a better understanding of the behavior of this score function and will show whether or not it is of interest for dealing with the class imbalance issue. Indeed, the behavior of this score function might differ from that of an entropy.