
2.2 Classification tree learning with infrequent outcomes

2.2.4 Classification tree specific approaches

2.2.4.1 Off-centered entropy measures

In the context of data imbalance, two main criticisms are generally raised against classical entropies (Marcellin et al., 2006; Marcellin, 2008; Lallich et al., 2007; Ritschard et al., 2007a; Zighed et al., 2010). Firstly, in contrast to statistical methods that consider the departure from the marginal distribution of the response variable, classical entropies consider that the "worst" situation is the equidistribution, that is, the distribution where all classes are equally represented. Indeed, considering a categorical variable with $\ell$ classes, the axioms characterizing entropy functions require them to take a unique maximal value in $(\frac{1}{\ell}, \ldots, \frac{1}{\ell})$. When working on a binary variable, this places the worst situation at the 50/50 distribution. Regarding the purity of nodes, this is indeed the worst possible situation. However, let us consider a critical event which is experienced by only 1% of individuals in the full sample. In this situation, succeeding in extracting a set of values of the covariates that leads to experiencing this critical event in 50% of the cases provides much more information than the initial situation (marginal distribution of the response variable) to explain why this critical event happens. Secondly, classical entropy functions are assumed to be symmetric. In mathematics, a symmetric function¹ is a function whose value is the same for every permutation of its variables (David et al., 1967; Kung et al., 2009). Let $f(x_1, \ldots, x_\ell)$ be a symmetric function; then $f$ satisfies the following conditions:

¹ In contrast, a function for which at least one permutation of its arguments leads to a different value is called asymmetric.


$$
\begin{aligned}
f(x_1, x_2, x_3, \ldots, x_{\ell-1}, x_\ell) &= f(x_2, x_1, x_3, \ldots, x_{\ell-1}, x_\ell) \\
&= f(x_3, x_2, x_1, \ldots, x_{\ell-1}, x_\ell) \\
&= \ldots \\
&= f(x_\ell, x_2, x_3, \ldots, x_{\ell-1}, x_1) \\
&= f(x_1, x_3, x_2, \ldots, x_{\ell-1}, x_\ell) \\
&= \ldots \\
&= f(x_1, x_2, x_3, \ldots, x_\ell, x_{\ell-1})
\end{aligned}
\tag{2.12}
$$

Let us consider $h$ a two-class entropy function. This function is symmetric by design, and as a result $h(x, 1-x) = h(1-x, x)$, $\forall x \in [0, 1]$. When growing classification trees, this property implies that two splits leading respectively to the distributions $(0.6, 0.4)$ and $(0.4, 0.6)$ will be considered as equally informative. This does not take into account that we have a specific interest in one of the classes. If our interest concerns a specific class, we surely prefer to obtain a node with 60% of cases of this class rather than only 40%. Marcellin et al. (2006) study the use of entropy measures for growing classification trees on imbalanced data. In this study, the authors define new axioms to characterize entropies relevant in this specific context. In particular, they claim that a relevant entropy has (1) to take its maximal value in a reference distribution (not necessarily the equidistribution), (2) to be stable to permutations occurring simultaneously in the distribution and the reference distribution, (3) to take the value 0 on pure nodes when the sample size is sufficiently large, and (4) to be strictly concave.² Axioms 1 and 2 make it possible to address data imbalance by defining off-centered entropy measures. Axiom 3 addresses the sensitivity to sample size. The authors postulate that for two identical distributions coming from two samples of different sizes, the value of the entropy should be lower on the larger sample. The justification of this property is that a larger sample brings more empirical evidence and therefore more confidence. Although such a feature is reasonable in principle, it can be seen as a probability estimation issue rather than an axiomatic one. Moreover, an entropy is, by definition, defined on a distribution and not on a sample. Therefore, a distinction should be made when using the term entropy on a distribution or on a sample. The term entropy should be reserved for a measure defined on a distribution, while the term empirical entropy could be used when referring to an entropy measure integrating a probability estimation step and defined on a sample. Axiom 4 also calls for a justification. Indeed, as previously stated, one of the axioms a measure has to satisfy to be called an entropy is to have a unique global maximum. However, asking for strict concavity is too strong an assumption, as for any continuous function defined over a compact domain, having a unique global maximum is already guaranteed by strict quasiconcavity (Sundaram, 1996). Still, concavity seems to have an impact on the induction process. For example, Simovici and Jaroszewicz (2003) showed that tree depth can be reduced by adjusting the concavity.

² See Hencin (1957), Forte (1973), and Aczél and Daróczy (1975) for different examples of characterization of an entropy function.

In assessing the Dietterich-Kearns-Mansour (DKM) criterion, a purity-based splitting criterion introduced by Dietterich et al. (1996) and designed for a two-class response variable, the authors showed that a higher concavity of the DKM criterion yields better results. A few years later, Drummond and Holte (2000) confirmed this observation in a class imbalance context. Therefore, the justification given by Marcellin (2008) for requiring strict concavity is that making the entropy decrease faster as purity increases is expected to enhance classification capability.
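To make the symmetry criticism concrete, the short Python sketch below (an illustration of mine, not taken from the cited works) evaluates the two-class Shannon entropy on the two splits discussed above: both yield exactly the same value, and the maximum always sits at the 50/50 distribution, regardless of how rare the class of interest is.

```python
import math

def shannon_entropy(p):
    """Two-class Shannon entropy of the distribution (p, 1 - p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Symmetry: the splits (0.6, 0.4) and (0.4, 0.6) are judged equally informative.
print(shannon_entropy(0.6), shannon_entropy(0.4))   # identical values (~0.971)

# The "worst" situation is always the 50/50 distribution,
# even when the class of interest covers only 1% of the sample.
print(shannon_entropy(0.5))                         # maximal value: 1.0
```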

The authors then set up an off-centered entropy measure specifically designed to satisfy these axioms. The formula of this entropy is given in Equation 2.13, where $w = (w_1, \ldots, w_\ell)$ is the user-defined distribution on which the entropy has to take its unique maximal value. Practically, this "worst" situation can be set to the expected distribution of the response variable: we want to discover situations that significantly differ from what is expected in the population. In practice, the expected probabilities $w_i$ are estimated by the observed empirical probabilities $\hat{w}_i = \frac{n_i}{n}$.

$$
h_{A,1}(p_1, \ldots, p_\ell; w_1, \ldots, w_\ell) = \sum_{i=1}^{\ell} \frac{p_i (1 - p_i)}{(-2 w_i + 1)\, p_i + w_i^2}
\tag{2.13}
$$

In Marcellin (2008) and Zighed et al. (2010), the authors claim that for growing classification trees, an entropy measure has to be sample sensitive. They note that entropy measures currently used for growing classification trees only account for the distribution of the sample, but not for the sample size. Actually, an entropy measure is by definition defined on a probability distribution; it is the estimation of the probabilities on a particular dataset that has to be sample sensitive. This being said, it is reasonable to consider that a split which leads to a distribution $x$ is less relevant than a split which leads to the same distribution $x$ but is verified on a larger number of individuals. Indeed, as this last split receives more empirical evidence, we can be more confident in its validity. The authors call this property the consistency regarding the data, and it concerns, of course, the empirical entropy. To implement this consistency in practice, the authors suggest estimating probabilities in each node with the Laplace smoothing estimator (Schütze et al., 2008). The formula of the Laplace smoothing estimator is given in Equation 2.14, where $p_i$ is the frequency of the class $c_i$ in the node $N_j$ and $n_j$ is the number of cases in this node.

$$
\hat{p}_i = \frac{n_j\, p_i + 1}{n_j + \ell}
\tag{2.14}
$$
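As a minimal sketch of Equation 2.14 (the function name and list-based interface are mine, not taken from the cited works), the estimator can be written directly from the class counts of a node:

```python
def laplace_smoothing(counts):
    """Laplace-smoothed class probabilities in a node (Equation 2.14).

    counts[i] is the number of cases of class c_i in the node,
    so n_j * p_i = counts[i] and n_j = sum(counts)."""
    n_j = sum(counts)
    n_classes = len(counts)
    return [(c + 1) / (n_j + n_classes) for c in counts]

# The same distribution (0.2, 0.8) is pulled more strongly toward the
# uniform distribution in the small node than in the large one.
print(laplace_smoothing([2, 8]))      # [0.25, 0.75]
print(laplace_smoothing([200, 800]))  # [~0.2006, ~0.7994]
```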

By combining this off-centered entropy with their methodology for making an entropy measure sensitive to sample size, the authors put forward a new entropy measure for growing classification trees (Equation 2.15).

$$
h_{A,2}(n_j, p_1, p_2, \ldots, p_\ell; w_1, \ldots, w_\ell) = \sum_{i=1}^{\ell} \frac{\hat{p}_i (1 - \hat{p}_i)}{(-2 w_i + 1)\, \hat{p}_i + w_i^2}
\tag{2.15}
$$
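The entropy $h_{A,2}$ of Equation 2.15 is the off-centered entropy of Equation 2.13 evaluated on the Laplace-smoothed probabilities of Equation 2.14. A possible implementation (a sketch under my own naming, not the authors' code) is:

```python
def off_centered_entropy(p, w):
    """Off-centered entropy h_A1 (Equation 2.13): maximal (value len(p))
    when p equals the reference distribution w, and 0 on pure distributions."""
    return sum(p_i * (1 - p_i) / ((-2 * w_i + 1) * p_i + w_i ** 2)
               for p_i, w_i in zip(p, w))

def h_A2(counts, w):
    """Sample-sensitive off-centered entropy h_A2 (Equation 2.15):
    Equation 2.13 applied to Laplace-smoothed probabilities (Equation 2.14)."""
    n_j, n_classes = sum(counts), len(counts)
    p_hat = [(c + 1) / (n_j + n_classes) for c in counts]
    return off_centered_entropy(p_hat, w)

w = (0.1, 0.9)              # reference distribution (e.g. the marginal one)
print(h_A2([9, 1], w))      # ~0.41
print(h_A2([900, 100], w))  # ~0.25: same distribution, larger node, lower entropy
```

The second call illustrates the consistency property discussed above: for the same observed distribution, the larger node receives a lower empirical entropy.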

In Ritschard et al. (2009b, p. 7), the authors suggest a new version of this off-centered entropy by adding a normalization coefficient. The purpose of this normalization is to force the entropy to take its values in the range $[0, 1]$. This final version of the entropy, $h_{A,3}$, is given in Equation 2.16.

$$
h_{A,3}(n_j, p_1, p_2, \ldots, p_\ell; w_1, \ldots, w_\ell) = \frac{1}{\ell} \sum_{i=1}^{\ell} \frac{\hat{p}_i (1 - \hat{p}_i)}{(-2 w_i + 1)\, \hat{p}_i + w_i^2}
\tag{2.16}
$$

Figure 2.5 plots the entropy $h_{A,3}$ for different levels of data imbalance but the same sample size. Figure 2.6 plots the entropy $h_{A,3}$ for different sample sizes but the same level of data imbalance.
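A short numerical reading of Equation 2.16 (again a sketch with my own naming): since the off-centered sum of Equation 2.13 reaches its maximum $\ell$ at $p = w$, the $\frac{1}{\ell}$ factor rescales the measure so that it equals 1 when the smoothed node distribution matches the reference distribution $w$, and stays close to 0 on (almost) pure nodes.

```python
def h_A3(counts, w):
    """Normalized off-centered entropy (Equation 2.16), with values in [0, 1]."""
    n_j, n_classes = sum(counts), len(counts)
    p_hat = [(c + 1) / (n_j + n_classes) for c in counts]  # Laplace smoothing (Eq. 2.14)
    return sum(p * (1 - p) / ((-2 * wi + 1) * p + wi ** 2)
               for p, wi in zip(p_hat, w)) / n_classes

# With w = (0.3, 0.7), a large node whose smoothed distribution is close to w
# is close to the maximal value 1, while an (almost) pure node is close to 0.
print(h_A3([300, 700], w=(0.3, 0.7)))   # ~1.0
print(h_A3([0, 1000],  w=(0.3, 0.7)))   # ~0.01
```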

Figure 2.5 – Zighed et al. (2010) and Ritschard et al. (2009b) off-centered entropy for several levels $w_m$ of data imbalance ($w_m \in \{0.05, 0.1, 0.2, 0.3, 0.5\}$), $N = 1000$. The entropy is plotted against the probability distribution $(x, 1-x)$, $x \in [0, 1]$.

Another proposition of an entropy measure for imbalanced data is put forward by Lallich et al. (2007). The authors define a change of variable to off-center an entropy from the equidistribution to a user-defined distribution. As this change of variable can be applied to any entropy measure, users can choose the initial shape of the entropy they want to off-center. Furthermore, this change of variable ensures that the axioms defined in Marcellin (2008) are satisfied.

Figure 2.6 – Zighed et al. (2010) and Ritschard et al. (2009b) off-centered entropy for several sample sizes ($N \in \{5, 10, 50, 100, 1000\}$), $w = (0.3, 0.7)$. The entropy is plotted against the probability distribution $(x, 1-x)$, $x \in [0, 1]$.

In particular, a key objective of this change of variable is to preserve concavity. The change of variable was initially presented for a two-class problem, but Lenca et al. (2010) generalized it to the multiclass case. However, as the notations become complex, I prefer to introduce it for two classes only, for the sake of clarity. Let $w = (w_m, 1 - w_m)$ be the distribution on which we want to off-center an entropy measure. For example, $w$ can be the initial marginal distribution (marginal distribution of the root node). With this notation, $w_m$ refers to the frequency of the minority class. The change of variable aims at moving this $w_m$ to $\frac{1}{2}$. Let $p_m$ be the frequency of the minority class in a particular node. We are looking for a change of variable $\pi(p_m)$ satisfying:

• $\pi$ increases from 0 to $\frac{1}{2}$ when $p_m$ increases from 0 to $w_m$;

• $\pi$ increases from $\frac{1}{2}$ to 1 when $p_m$ increases from $w_m$ to 1.

For this purpose, the authors put forward the change of variable given in Equation 2.17.

$$
\pi(p_m) =
\begin{cases}
\dfrac{p_m}{2 w_m} & \text{when } 0 \le p_m \le w_m \\[2ex]
\dfrac{p_m + 1 - 2 w_m}{2 (1 - w_m)} & \text{when } w_m \le p_m \le 1
\end{cases}
\tag{2.17}
$$

An entropy measure $h$ is then off-centered by using the formula given in Equation 2.18.

$$
h_{w_m}(p_m) = h\big(\pi(p_m),\, 1 - \pi(p_m)\big)
\tag{2.18}
$$

The off-centered version $h_{w_m}$ of $h$ takes its unique maximal value in $w_m$. To illustrate this change of variable, we apply it to the Shannon entropy for different levels of data imbalance (Figure 2.7). We note that the shape of the Shannon entropy off-centered by the change of variable proposed by Lallich et al. (2007) is close to the one of the Zighed et al. (2010) and Ritschard et al. (2009b) off-centered entropy, but shows a stronger concavity and is not differentiable at $w_m$ when $w_m \neq \frac{1}{2}$.
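A minimal sketch of Equations 2.17 and 2.18 applied to the two-class Shannon entropy (function names are mine, not from the cited works):

```python
import math

def pi_transform(p_m, w_m):
    """Change of variable of Equation 2.17: maps w_m onto 1/2."""
    if p_m <= w_m:
        return p_m / (2 * w_m)
    return (p_m + 1 - 2 * w_m) / (2 * (1 - w_m))

def shannon(p):
    """Two-class Shannon entropy of (p, 1 - p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def off_centered_shannon(p_m, w_m):
    """Off-centered Shannon entropy (Equation 2.18)."""
    return shannon(pi_transform(p_m, w_m))

w_m = 0.1                                 # marginal frequency of the minority class
print(off_centered_shannon(0.1, w_m))     # 1.0: the maximum is moved from 0.5 to w_m
print(off_centered_shannon(0.5, w_m))     # ~0.85: a 50/50 node is no longer the worst case
print(off_centered_shannon(0.6, w_m),     # ~0.76 vs ~0.92:
      off_centered_shannon(0.4, w_m))     # the symmetry around 0.5 is broken
```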

Figure 2.7 – Lallich et al. (2007) off-centering of the Shannon entropy for several levels $w_m$ of data imbalance ($w_m \in \{0.05, 0.1, 0.2, 0.3, 0.5\}$). The entropy is plotted against the probability distribution $(x, 1-x)$, $x \in [0, 1]$.

2.2.4.2 The implication index

The methods introduced in the previous Section transform a splitting measure based on uncertainty reduction, such as the Shannon entropy or the Gini index, which are considered to be skew sensitive, so that it can take the marginal distribution of the response variable into account. In contrast, the approach introduced in the series of articles Ritschard et al. (2007b, 2008, 2009b) is to focus on a measure that still addresses uncertainty reduction, but is based on a statistical criterion and is therefore natively built on the marginal distribution of the response variable. For this purpose, the authors focus on the implication index. As with the methods previously introduced, only the relative lack of data is addressed, not the absolute lack of data.

The implication index is a statistical measure that is part of the Implicative Statistical Analysis (ISA) framework introduced by Gras (1979).³ This mathematical framework provides various methodologies to quantify statistical implication. ISA is particularly interested in the evaluation and identification of rules having the form $A \implies B$. In such rules, $A$ is called the antecedent and $B$ the consequent.

ISA especially aims to identify rules that are relevant, i.e., that provide practitioners with useful information, but non-obvious, i.e., that are not already known by a business expert and not easy to identify with standard analysis methods (Couturier and Ag Almouloud, 2009). Based on a set of rules, a classifier is built by selecting all or a subset of the rules that take one of the classes of the response variable as consequent. Considering a classification tree, the set of rules refers to the leaves of the tree. The class assignment performed in each leaf of the tree can be made in various ways, as subsequently discussed in Section 2.2.5.

In the context of ISA, rules are not expected to be true logical rules, i.e., rules that are systematically verified. Instead, they refer to the notion of association rules as defined in association rule mining (Agrawal et al., 1993; J. Han et al., 2000; Wu et al., 2004). These rules mean "if we observe A then we should also observe B" (Suzuki and Kodratoff, 1998). For example, consider the rule "if u has a foreign family name and a low level of education, then u is unemployed". In a particular dataset, this rule may have some examples and some exceptions (counterexamples). The implication index is based on the number of exceptions to the rule (Gras et al., 2004).

Table 2.2 – Contingency table between the response variable $Y$ and the $k$ child nodes $\{R_{c_1}, \ldots, R_{c_k}\}$ of a particular split $S$.

Y \ Nodes    R_{c_1}    ...    R_{c_j}    ...    R_{c_k}    Total
y_1          n_{11}     ...    n_{1j}     ...    n_{1k}     n_{1.}
...          ...               ...               ...        ...
y_i          n_{i1}     ...    n_{ij}     ...    n_{ik}     n_{i.}
...          ...               ...               ...        ...
y_l          n_{l1}     ...    n_{lj}     ...    n_{lk}     n_{l.}
Total        n_{.1}     ...    n_{.j}     ...    n_{.k}     n

Considering classification trees, the number of exceptions in a particular node refers to the number of observations that are not in the class that would be assigned to the node if the node were a leaf. To assess the quality of a node, Ritschard et al. (2009b) compare its number of exceptions with the number of exceptions that would be obtained in a node of the same size but distributed independently of the antecedent of the rule, that is to say, distributed according to the distribution $w = (w_1, \ldots, w_\ell)$ of $Y$ estimated by its marginal distribution in the root node. In other words, the rule is compared to an assignment by chance.

³ A comprehensive introduction to this framework is given by Gras and Regnier (2009).

Practically, when all the possible splits perform less well than an assignment by chance, the induction process should be stopped.

Let $S = (R_p; R_{c_1}, \ldots, R_{c_k})$ be a candidate split, where $R_p$ is the parent node and $R_{c_j}$, $j \in \{1, \ldots, k\}$, are the child nodes. Then, let $\{b_1, \ldots, b_k\}$ be the classes respectively assigned to the nodes $\{R_{c_1}, \ldots, R_{c_k}\}$, and let the observations in each node be distributed according to the contingency table given in Table 2.2. Let us denote by $n_{\bar{b}_j .}$ the number of exceptions (number of observations whose class differs from $b_j$) in the root node, by $n_{\bar{b}_j j}$ the number of exceptions in the node $R_{c_j}$, and by $n^e_{\bar{b}_j j}$ the expected number of exceptions in the node $R_{c_j}$ under the assumption of an independent repartition of the observations. We have $n_{\bar{b}_j .} = n - n_{b_j .}$, $n_{\bar{b}_j j} = n_{. j} - n_{b_j j}$, and, according to I. C. Lerman et al. (1981) and Ritschard et al. (2007b), $n^e_{\bar{b}_j j} = \frac{n_{\bar{b}_j .}\, n_{. j}}{n}$. The implication index $\hat{\upsilon}_0(R_{c_j}, b_j)$ of the rule derived from the node $R_{c_j}$ with regard to the class $b_j$ is given by the difference between the empirical and the expected number of exceptions, standardized by the standard error. The formula of the implication index is given in Equation 2.19.

$$
\hat{\upsilon}_0(R_{c_j}, b_j) = \frac{n_{\bar{b}_j j} - n^e_{\bar{b}_j j}}{\sqrt{n^e_{\bar{b}_j j}}}
\tag{2.19}
$$

A positive value of the implication index indicates that the rule performs less well than an assignment by chance and therefore does not provide any implicative information. In contrast, a negative value of the implication index means that the rule provides implicative information. Considered in absolute value, the higher the implication index, the stronger the implicative force of the rule. Also, like the off-centered entropy introduced in Equation 2.15, the implication index is sensitive to sample size. Indeed, it is easy to deduce from Equation 2.19 that, for a fixed marginal distribution of the response variable in the nodes, the more individuals there are in the nodes, the stronger the implicative force of the rule. This sensitivity to sample size provides a natural stopping criterion for the tree induction process. This feature is similar to the notion of consistency with sample size.

Ritschard et al. (2007b) introduce a correction for continuity of the implication index. This correction for continuity allows the implication index to be compared with a normal distribution. The implication index corrected for continuity is given in Equation 2.20.

$$
\hat{\upsilon}(R_{c_j}, b_j) = \frac{n_{\bar{b}_j j} - n^e_{\bar{b}_j j} + 0.5}{\sqrt{n^e_{\bar{b}_j j}}}
\tag{2.20}
$$

As shown in both Equations 2.19 and 2.20, the value of the implication index of a node depends on the class assigned to the node. A choice has to be made on how the class is assigned. Ritschard et al. (2009b) suggest using the class that maximizes the implication intensity. Under this assumption, the implication index of a particular node $R_{c_j}$ is given in Equation 2.21.

$$
\hat{\upsilon}(R_{c_j}) = \min_{b_j \in \{y_1, \ldots, y_\ell\}} \hat{\upsilon}(R_{c_j}, b_j)
\tag{2.21}
$$

To be able to select a "best" split among the candidate splits, a measure of the implication gain has to be defined. For this purpose, Ritschard et al. (2009b) introduce three strategies. The first strategy is to use the mean of the implication indexes computed in each child node, weighted by the proportion of observations in each node. The formula of the weighted implication gain $\hat{\Upsilon}_w(S)$ is given by Equation 2.22.

$$
\hat{\Upsilon}_w(S) = \hat{\upsilon}(R_p) - \sum_{j \in \{1, \ldots, k\}} \frac{n_{. j}}{n}\, \hat{\upsilon}(R_{c_j})
\tag{2.22}
$$

The second strategy is to consider only the maximal value of the implication index computed in each child node. Such an approach can be applied when looking for nodes able to generate at least one high-intensity rule, while accepting that some other nodes may have poor implication intensity. The formula of the maximal implication gain is given by Equation 2.23.

$$
\hat{\Upsilon}_m(S) = \hat{\upsilon}(R_p) - \max_{j \in \{1, \ldots, k\}} \hat{\upsilon}(R_{c_j})
\tag{2.23}
$$

The third strategy is to aggregate at the level of observations instead of aggregating at the level of the nodes, using the formula of the implication index. Again, the authors standardize the resulting expression and apply the correction for continuity. The formula of the total implication gain is given by Equation 2.24.

$$
\hat{\Upsilon}_t(S) = \hat{\upsilon}(R_p) - \frac{\sum_{j \in \{1, \ldots, k\}} \left( n_{\bar{b}_j j} - n^e_{\bar{b}_j j} \right) + 0.5}{\sqrt{\sum_{j \in \{1, \ldots, k\}} n^e_{\bar{b}_j j}}}
\tag{2.24}
$$
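Equations 2.21 to 2.24 can be assembled as in the sketch below (helper names are mine; I assume, following Equation 2.21, that the index of a node is computed with respect to the root-node counts and that each child is assigned the class minimizing its corrected index). `node_counts[i]` is the number of cases of class $y_i$ in the node, as in Table 2.2.

```python
import math

def node_index(node_counts, root_counts):
    """Implication index of a node (Eq. 2.21): minimal corrected index
    (Eq. 2.20) over the candidate assigned classes."""
    n, n_j = sum(root_counts), sum(node_counts)
    values = []
    for i in range(len(root_counts)):            # candidate assigned class y_i
        n_exc = n_j - node_counts[i]
        n_exc_exp = (n - root_counts[i]) * n_j / n
        values.append((n_exc - n_exc_exp + 0.5) / math.sqrt(n_exc_exp))
    return min(values)

def implication_gains(parent_counts, children_counts, root_counts):
    """Weighted, maximal and total implication gains (Eqs. 2.22 to 2.24)."""
    n = sum(root_counts)
    v_parent = node_index(parent_counts, root_counts)
    v_children = [node_index(c, root_counts) for c in children_counts]
    sizes = [sum(c) for c in children_counts]

    gain_w = v_parent - sum(s / n * v for s, v in zip(sizes, v_children))
    gain_m = v_parent - max(v_children)

    # Total gain: aggregate observed and expected exceptions over observations,
    # using for each child the class that minimizes its corrected index.
    num, denom = 0.0, 0.0
    for counts, n_j in zip(children_counts, sizes):
        best = min(range(len(root_counts)),
                   key=lambda i: (n_j - counts[i]
                                  - (n - root_counts[i]) * n_j / n + 0.5)
                                 / math.sqrt((n - root_counts[i]) * n_j / n))
        n_exc_exp = (n - root_counts[best]) * n_j / n
        num += (n_j - counts[best]) - n_exc_exp
        denom += n_exc_exp
    gain_t = v_parent - (num + 0.5) / math.sqrt(denom)
    return gain_w, gain_m, gain_t

# Root node: 50 cases of the rare class out of 1000; the root is split into
# two children of 100 and 900 cases.
root = [50, 950]
children = [[40, 60], [10, 890]]
print(implication_gains(parent_counts=root, children_counts=children,
                        root_counts=root))
```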

According to the authors, an advantage of the implication index is that it relies on a binary comparison (examples vs. exceptions), which is a less dispersed criterion than criteria directly based on the marginal distribution. As a result, the classification trees induced using the implication index are expected to be more robust. Another advantage of the implication index as standardized by Ritschard et al. (2009b) is the possibility of comparing it with a normal distribution. This feature makes it possible to compute the statistical significance of the rules induced by the leaves of the tree. More information about good practices for making comparisons with a normal distribution is given in Ritschard (2005).

2.2.4.3 The skew-insensitive Hellinger distance

Like the approach based on the implication index introduced in the previous Section, the approach of Cieslak and Chawla (2008) is to focus on a natively skew-insensitive measure. For this purpose, the authors focus on the Hellinger distance

(Kailath, 1967; Rao, 1995). The Hellinger distance is a measure of divergence between two distributions. The distance admits a definition on both continuous and countable spaces. Let $\Lambda$ denote a measurable space, for example $\Lambda = \mathbb{R}$, and $P$ and $Q$ two continuous distributions with respect to the parameter $\lambda$. The definition of the Hellinger distance can be given as:

$$
d_H(P, Q) = \sqrt{\int_\Lambda \left( \sqrt{P(\lambda)} - \sqrt{Q(\lambda)} \right)^2 \, d\lambda}
$$

To be used in classification, a formulation of the measure for a countable space is required. For a countable space $\Theta$, for example $\Theta = \mathbb{N}$ or $\Theta = \{1, \ldots, \ell\}$, the distance becomes:

$$
d_H(P, Q) = \sqrt{\sum_{\theta \in \Theta} \left( \sqrt{P(\theta)} - \sqrt{Q(\theta)} \right)^2}
$$

The Hellinger distance takes its values in $[0, \sqrt{2}]$. If $P = Q$, then $d_H(P, Q) = 0$, and if $P$ and $Q$ are completely disjoint, then $d_H(P, Q) = \sqrt{2}$. To define a classification tree method, the formulation of the Hellinger distance within a splitting criterion has to capture the propensity of a feature to separate the class distributions. As the Hellinger distance is defined as a distance between two distributions, defining a splitting criterion based on this measure leads to limiting the study to a two-class response variable. In addition, according to the authors, the Hellinger distance is not trivially extensible to multiple classes. In the remainder of this Section, the two classes of the response variable are referred to respectively as the minority class and the majority class. To define a splitting criterion, the Hellinger distance can be calculated between the distributions of a descriptive variable $X$ conditioned on the minority class and on the majority class. The authors assume a countable space, so if $X$ is a quantitative variable, it has to be discretized into $q$ classes $c_i$ before distance computation. Considering a two-class variable $Y$ whose classes are respectively noted $m$ for the minority class and $M$ for the majority class, the formulation of the distance between both classes is:

$$
d_H(m, M) = \sqrt{\sum_{i=1}^{q} \left( \sqrt{P(X \in c_i \mid Y = m)} - \sqrt{P(X \in c_i \mid Y = M)} \right)^2}
$$

The formula is easily adaptable when considering several descriptive variables: each descriptive variable $X_j$ is discretized into its own $k_j$ classes, and the distance is computed on the associated partition, with probabilities conditioned over the full set of feature values.

The skew insensitivity of the measure can be read in the equation from the fact that only within-class relative frequencies of the feature values appear: the marginal distribution of the response variable does not enter the computation, so the value of the criterion is unaffected by the degree of imbalance between the minority and the majority class.
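As a minimal sketch of this splitting criterion (the counting interface is mine and assumes $X$ has already been discretized; the formula follows the class-conditional formulation given above), the distance can be computed from a per-bin contingency table of the two classes:

```python
import math

def hellinger_split_value(bin_counts):
    """Hellinger distance between the class-conditional distributions of a
    discretized feature X.

    bin_counts[i] = (n_min_i, n_maj_i) gives, for the i-th value (bin) of X,
    the number of minority-class and majority-class cases taking that value."""
    n_min = sum(c[0] for c in bin_counts)
    n_maj = sum(c[1] for c in bin_counts)
    return math.sqrt(sum(
        (math.sqrt(c[0] / n_min) - math.sqrt(c[1] / n_maj)) ** 2
        for c in bin_counts))

# Two bins of X; the minority class (20 cases) concentrates in the first bin
# while the majority class (980 cases) concentrates in the second one.
print(hellinger_split_value([(18, 80), (2, 900)]))    # ~0.92, out of a max of sqrt(2) ~1.41

# Replicating the majority class tenfold leaves the value unchanged:
# only within-class relative frequencies are used (skew insensitivity).
print(hellinger_split_value([(18, 800), (2, 9000)]))  # same value
```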