Assignment rules - Classification tree learning with infrequent outcomes

2.2 Classification tree learning with infrequent outcomes

2.2.5 Assignment rules

Another strategy to address class imbalance in classification tree learning is to act on the rule that assigns the final class to observations. Class assignment is done once the tree has been grown, for example with one of the splitting criteria studied previously. Each instance of a data set belongs to one (and only one) leaf of the tree. Thus, assigning a label to each observation amounts to assigning a label to each leaf.

When a leaf is pure, there is no ambiguity: the leaf is assigned the only class represented. When several classes are represented in the leaf, it becomes necessary to justify the choice of the target class.

As in the tree growing process, the first step is to define exactly what we expect from the model. If the aim is to classify new instances, a rule that minimizes the misclassification rate can be considered. If the aim is to analyze the profiles of a particular class, a rule targeting a specific class of the response variable, such as those based on off-centered measures, can be considered. If the aim is to identify non-obvious assignment rules, the implication intensity index can be considered. Another strategy is to make no decision at all and simply display informative measures that allow practitioners to analyze the situation and make their own judgment.

The most common strategy is to assign all observations of a leaf to its most frequent class. This assignment rule is called the “most frequent rule”. In a two-class classification setting, it is also called the “majority rule”. This rule assumes that the criterion to optimize is the misclassification rate. However, if we want to increase the recall of the minority class in a two-class example where that class represents only 1% of the data set, it may be relevant to assign the minority class to a leaf where its frequency reaches 30%. Moreover, under the majority rule, the minority class may never be assigned (see Figure 2.3), even though it is the one we want to predict.
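To make this limitation concrete, the following minimal sketch applies the majority rule to three hypothetical leaves of a two-class tree in which the minority class makes up 1% of the data. The leaf counts and the function name `majority_rule` are illustrative assumptions and are not taken from Figure 2.3.

```python
def majority_rule(leaf_counts):
    """Return the index of the most frequent class in a leaf."""
    return max(range(len(leaf_counts)), key=lambda i: leaf_counts[i])

# Hypothetical leaves; counts are [majority class, minority class].
# Overall: 990 vs 10 observations, i.e. a 1% minority class.
leaves = [[700, 2], [276, 2], [14, 6]]

labels = [majority_rule(counts) for counts in leaves]
minority_share = [counts[1] / sum(counts) for counts in leaves]

print(labels)                                  # class assigned to each leaf
print([round(s, 3) for s in minority_share])   # minority share in each leaf
```

In this sketch every leaf receives class 0, including the last one, where the minority class reaches 30% against 1% overall, so the recall of the minority class is zero.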

Here, I place the analysis in the opposite case, that is, where one class is significantly underrepresented relative to the others. Several rules have been proposed for this case; they take the initial imbalance into account when making the decision. These rules are introduced in Marcellin (2008, p. 68). In the following, I give only a brief overview of the main proposals. Note that in the binary case, these rules lead to the same decisions.

The threshold rule states that the class to be selected is the one that exceeds its initial proportion. In other words, we select the class i for which p_i > w_i, where p_i is the proportion of class i in the leaf and w_i its initial proportion. However, with more than two classes, two or more classes may exceed their initial proportions. Thus, this rule can be seen as a selector of candidate classes to be applied before a well-defined assignment rule. In a binary context, this rule is well-defined because one and only one class can exceed its w_i. Note that this rule is consistent with the use of an off-centered entropy as a splitting criterion.
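The threshold rule can be sketched as a filter that returns every class whose leaf proportion exceeds its initial proportion. The function name `threshold_candidates` and the numeric distributions are hypothetical and chosen only for illustration.

```python
def threshold_candidates(p, w):
    """Return the classes whose leaf proportion p_i exceeds the initial proportion w_i."""
    return [i for i, (p_i, w_i) in enumerate(zip(p, w)) if p_i > w_i]

# Binary case: exactly one class exceeds its initial proportion.
binary = threshold_candidates(p=[0.70, 0.30], w=[0.99, 0.01])

# Three classes: two candidates remain, so a further rule is needed to decide.
multi = threshold_candidates(p=[0.40, 0.35, 0.25], w=[0.60, 0.30, 0.10])

print(binary, multi)
```

Returning a list rather than a single label reflects the role of the rule as a candidate selector when more than two classes are present.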

The idea of the maximal distance rule is to select the class that achieves the best recall relative to its initial proportion. This rule can be seen as an extension of the threshold rule. Its formula is given in Equation 2.25. Indeed, in a two-class context, it behaves exactly like the threshold rule.

$$\Gamma(p) = \arg\max_{i \in \{1,\dots,\ell\}} \,(p_i - w_i) \tag{2.25}$$
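A minimal sketch of Equation 2.25 follows, together with a check that it agrees with the threshold rule in the binary case. The helper names and the example distributions are assumptions made for illustration.

```python
def maximal_distance_rule(p, w):
    """Equation 2.25: select the class i that maximizes p_i - w_i."""
    return max(range(len(p)), key=lambda i: p[i] - w[i])

def threshold_candidates(p, w):
    """Threshold rule: classes whose leaf proportion exceeds the initial one."""
    return [i for i in range(len(p)) if p[i] > w[i]]

# Binary case: since the proportions sum to 1, p_0 - w_0 = -(p_1 - w_1),
# so the class with the positive difference is also the maximizer.
w = [0.99, 0.01]
binary_agreement = [
    threshold_candidates(p, w) == [maximal_distance_rule(p, w)]
    for p in ([0.70, 0.30], [0.995, 0.005])
]

# Three classes: the rule picks one class among the threshold candidates.
chosen = maximal_distance_rule(p=[0.40, 0.35, 0.25], w=[0.60, 0.30, 0.10])

print(binary_agreement, chosen)
```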

Another way to choose the class most distant from its threshold is to use a ratio. According to Marcellin (2008, p. 74), this formulation makes the rule more sensitive to the minority classes. The formula of the maximal ratio rule is given in Equation 2.26.

$$\Gamma(p) = \arg\max_{i \in \{1,\dots,\ell\}} \,\frac{p_i}{w_i} \tag{2.26}$$
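The sketch below contrasts Equations 2.25 and 2.26 on a hypothetical three-class leaf where the two rules disagree. The distributions are invented for illustration, and the ratio rule assumes every initial proportion w_i is strictly positive.

```python
def maximal_distance_rule(p, w):
    """Equation 2.25: select the class i that maximizes p_i - w_i."""
    return max(range(len(p)), key=lambda i: p[i] - w[i])

def maximal_ratio_rule(p, w):
    """Equation 2.26: select the class i that maximizes p_i / w_i (w_i > 0 assumed)."""
    return max(range(len(p)), key=lambda i: p[i] / w[i])

w = [0.50, 0.40, 0.10]   # initial proportions (hypothetical)
p = [0.20, 0.62, 0.18]   # proportions in the leaf (hypothetical)

by_distance = maximal_distance_rule(p, w)   # differences: -0.30, 0.22, 0.08
by_ratio = maximal_ratio_rule(p, w)         # ratios: 0.40, 1.55, 1.80

print(by_distance, by_ratio)
```

Here the distance rule selects class 1, while the ratio rule selects the rarest class 2, whose share has grown by the largest factor, which illustrates the greater sensitivity to minority classes noted above.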

Another assignment rule is the minimal entropy contribution rule. The idea behind this rule differs markedly from the previous ones. When the splitting criterion used to grow the tree is an information measure, the uncertainty of a node can be decomposed as the sum of the uncertainty contributions of the ℓ classes:

$$h(p) = \sum_{i=1}^{\ell} h^{*}(p_i) \tag{2.27}$$

where h^∗(p_i) is the contribution of class i to the uncertainty of the distribution. The least uncertain class is the one with the minimal contribution to the entropy. The formula of the rule is given in Equation 2.28.

$$\Gamma(p) = \arg\min_{i \in \{1,\dots,\ell\}} \,h^{*}(p_i) \tag{2.28}$$

where h^∗(p_i) is the contribution of class i to the uncertainty of p. This rule makes it possible to use the splitting criterion itself to decide which class to assign to a leaf, so that the selection is consistent with the way the tree was grown. In a two-class context, this rule behaves like the previous rules.
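As an illustration of Equations 2.27 and 2.28, the sketch below uses the Shannon entropy, whose per-class contribution is −p_i log2 p_i. The choice of Shannon entropy is an assumption made for the example; the rule applies to whichever information measure was used to grow the tree. Because −p log2 p also tends to 0 as p tends to 0, the sketch restricts the minimum to classes present in the leaf, which is a further assumption of this example.

```python
import math

def shannon_contribution(p_i):
    """Contribution h*(p_i) = -p_i * log2(p_i) of one class to the Shannon entropy."""
    return -p_i * math.log2(p_i) if p_i > 0 else 0.0

def minimal_entropy_contribution_rule(p, contribution=shannon_contribution):
    """Equation 2.28, restricted to classes present in the leaf (sketch assumption)."""
    present = [i for i, p_i in enumerate(p) if p_i > 0]
    return min(present, key=lambda i: contribution(p[i]))

p = [0.7, 0.3]                                   # hypothetical leaf distribution
contributions = [shannon_contribution(p_i) for p_i in p]
node_entropy = sum(contributions)                # Equation 2.27
assigned = minimal_entropy_contribution_rule(p)

print(contributions, node_entropy, assigned)
```

Passing another `contribution` function, for instance one derived from an off-centered entropy, would reuse the same selection step.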

Another strategy consists in using the implication index introduced in Section 2.2.4.2. Such an assignment strategy was first introduced by Zighed and Rakotomalala (2000, p. 282-287) and then discussed in Ritschard (2005) and Ritschard et al. (2009b). Consider a node R. Assuming that the practitioner wants to maximize the implication intensity, the formula of the implication intensity rule is given in Equation 2.29.

$$\Gamma(R) = \arg\min_{i \in \{1,\dots,\ell\}} \,\hat{\upsilon}(R, y_i) \tag{2.29}$$
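The following sketch illustrates Equation 2.29 under a stated assumption: the implication index is taken to be the standardized gap between the observed and the expected number of counter-examples of the rule "node R implies class i", with the expectation computed under independence. The exact definition is the one given in Section 2.2.4.2, which is not reproduced here, so the formula, the function names and the counts below should be read as hypothetical.

```python
import math

def implication_index(leaf_counts, root_counts, i):
    """Assumed form of the index: (observed - expected) / sqrt(expected),
    computed on the counter-examples of the rule 'leaf -> class i'."""
    n = sum(root_counts)
    n_leaf = sum(leaf_counts)
    observed = n_leaf - leaf_counts[i]             # counter-examples in the leaf
    expected = (n - root_counts[i]) * n_leaf / n   # counter-examples under independence
    return (observed - expected) / math.sqrt(expected)

def implication_rule(leaf_counts, root_counts):
    """Equation 2.29: select the class with the smallest implication index."""
    classes = range(len(leaf_counts))
    return min(classes, key=lambda i: implication_index(leaf_counts, root_counts, i))

root = [990, 10]   # hypothetical class counts in the whole data set
leaf = [14, 6]     # hypothetical class counts in node R

indices = [implication_index(leaf, root, i) for i in range(2)]
assigned = implication_rule(leaf, root)

print(indices, assigned)
```

The sketch requires that no class covers the whole data set, since the expected number of counter-examples would then be zero. In this example the minority class has far fewer counter-examples than expected, so it obtains the smallest index and is assigned to the node.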

2.2.6 Conclusion

The second research question of the present work focuses on the identification of variable interactions under the assumption of class imbalance due to infrequent outcomes. To this end, this Section reviewed, on the one hand, standard classification tree learning approaches. Indeed, classification tree methods are known for their ability to identify interaction effects. On the other hand, this Section reviewed standard approaches designed to deal with class imbalance, as well as several approaches designed specifically to deal with class imbalance in classification tree methods.

This review showed that building a classification tree is a two-stage algorithm that starts with a tree induction process and ends with a class assignment process. As the identification of interaction effects takes place in the tree induction process, the literature review focused more specifically on this first stage. In this stage, common approaches perform a recursive partitioning of the observations based on the values of the descriptor variables. Conventional methods use an axis-parallel strategy, splitting on a single descriptor at a time. These approaches can be divided into those based on statistical inference and those based on uncertainty reduction. The axis-parallel strategy can be extended to a broader set of tree methods called oblique trees. These methods perform splits according to a linear combination of the descriptors instead of a single descriptor. In theory, oblique trees can be smaller and more accurate than their axis-parallel counterparts, but at the price of a substantial increase in computational cost. As a consequence, and although various heuristics have been proposed to reduce this cost, they are less popular.
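The difference between the two kinds of split recalled above can be sketched as two test functions applied to an observation. The function names, the weights and the thresholds are hypothetical.

```python
def axis_parallel_split(x, feature, threshold):
    """Test a single descriptor: x[feature] <= threshold."""
    return x[feature] <= threshold

def oblique_split(x, weights, threshold):
    """Test a linear combination of descriptors: sum(a_j * x_j) <= threshold."""
    return sum(a * v for a, v in zip(weights, x)) <= threshold

x = [2.0, 5.0]   # one observation described by two numeric descriptors

goes_left_axis = axis_parallel_split(x, feature=0, threshold=3.0)
goes_left_oblique = oblique_split(x, weights=[1.0, 1.0], threshold=6.0)

print(goes_left_axis, goes_left_oblique)
```

An axis-parallel learner only has to search over one descriptor and one threshold at a time, whereas an oblique learner must also search over the weight vector, which is where the additional computational cost comes from.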

With regard to class imbalance, the literature review showed that class imbalance leads to two main issues: a lack of observations that hampers the ability to explain the underlying probability model that generates the class observations, called the absolute lack of data, and a learning bias toward the most frequent classes, called the relative lack of data. To address the class imbalance issue in classification, the common strategies are to balance data using resampling methods, to balance error costs by adjusting the error cost of each class according to its frequency, or to use ensemble methods that naturally overcome data imbalance through aggregation or iterative strategies. Another strategy is to adapt the classification method itself so that it can address data imbalance. With regard to classification trees based on uncertainty reduction, several off-centering strategies have been proposed to make entropy measures able to deal with class imbalance.

A practitioner who is interested in using a classification tree based on a statistical criterion can use the implication index, which naturally considers data imbalance. The Hellinger distance is another strategy that naturally considers data imbalance. However, this measure is designed for two-class problems and currently has no extension to a multi-class response variable.
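Among the general strategies recalled above, balancing error costs according to class frequency can be sketched with inverse-frequency weights. The specific weighting n / (k · n_c), where n is the sample size, k the number of classes and n_c the size of class c, is one common heuristic assumed here for illustration; it is not a formula taken from this review.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by n / (k * n_c), so that rarer classes cost more to misclassify."""
    n = sum(class_counts)
    k = len(class_counts)
    return [n / (k * n_c) for n_c in class_counts]

counts = [990, 10]   # hypothetical class sizes, 1% minority class
weights = inverse_frequency_weights(counts)

# After weighting, each class carries the same total weight.
weighted_totals = [w * c for w, c in zip(weights, counts)]

print(weights, weighted_totals)
```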

This literature review showed that the partitioning criterion used in the tree induction process can be adapted to match the exploration needs of the practitioner and to deal with class imbalance. This feature confirms the suitability of classification tree methods in the context of studying vulnerabilities in the life course.

Now, the question that matters in the context of this study is the customization of tree methods and their practical use by practitioners to maximize the ability to identify relevant interaction effects. In the next Chapter, the proposition given by the conceptual model of the thesis is an attempt to address this question.

Chapter 3

Conceptual Model

3.1 A modelisation of vulnerability in life courses
    3.1.1 Connecting life course and vulnerability terminologies
    3.1.2 Vulnerability and vulnerabilization
    3.1.3 Underlying factors of vulnerability
    3.1.4 Manifest and latent vulnerability
3.2 Exploring underlying factors in a semi-automatic way
    3.2.1 Mixing up selected weak and strong descriptors
    3.2.2 Interactive exploration
3.3 Three off-centered measures designed for imbalanced data
    3.3.1 Residual effect induced by the off-centering operation
    3.3.2 The ratio entropy
    3.3.3 The MD score
    3.3.4 The MDC score
3.4 Discussion

My first research question, introduced in Section 1.2.2.1, focuses on clarifying the conceptual distinction between studying a particular undesired life situation and studying vulnerability to this undesired situation. For example, how does studying vulnerability to poverty differ from studying poverty? In addition, the boundaries of the concepts mobilized by the life course and vulnerability frameworks are not disjoint. Thus, the same life situation can be referred to by different terms depending on whether it is seen as a life course process or a vulnerability process. For example, the life situation “to be paid on a fixed-term contract” refers, in the terminology of the life course, to a possible role in the career trajectory. The same situation refers, in the terminology of vulnerability, to a resource level that provides some ability to moderate or alter the vulnerability of the individual. Therefore, unifying the life course and vulnerability frameworks by linking together the concepts they respectively mobilize shows, on the one hand, that these theories can be used together in a coherent way and, on the other hand, brings more clarity to the use of a conceptual model that covers both frameworks.

Figure 3.1 – Process layer of the proposed information system formalized in BPMN (Chinosi and Trombetta, 2012). Columns refer to the steps of the quantitative research process introduced in Section 1.2.1 and illustrated in Figure 1.1. Rows refer to the software of the application layer that supports researchers in the corresponding activities. The application layer of the proposed information system is introduced in Section 5.2 and illustrated in Figure 5.3.

Therefore, I focus on the possibility of integrating the concepts and terminologies of the life course perspective and of the framework of vulnerability, and on putting forward a conceptual model of the process of vulnerability in life courses. Such a model aims to provide researchers in information science with a holistic business model of vulnerability in life courses as well as the conceptual building blocks on which they can base operational solutions. However, the scope of the model is constrained by several simplifying assumptions, which were introduced in Section 1.2.2.1. Several conceptualizations of vulnerability in life courses have emerged in the past decades, including Chambers (1989), Schröder-Butterfill and Marianti (2006), and Spini et al. (2013), but as Schröder-Butterfill and Marianti (2006) and Spini et al. (2013) note, there is currently no consensus on a formal definition of the concept of vulnerability in life courses. In particular, as discussed in Section 2.1.1, the life course perspective and the framework of vulnerability each have their own set of concepts. To address the first research question, I conducted a literature review on both the life course perspective and the framework of vulnerability. This literature review is presented in Section 2.1.

Table 3.1 – Modelisation of the proposed information system in 5 layers based on the layer model of Longépé (2009).

| Layer | Description |
|---|---|
| Strategic | Refers to the identification of interactive factors of vulnerability in life courses. Such an activity is performed by a research team or a single researcher who has literature reviews, datasets, statistical methods, and statistical software at their disposal. My contribution to this layer is a conceptual model of vulnerability to a specific outcome. |
| Process | Refers to the process designed for the identification of interactive factors of vulnerability in life courses introduced in Figure 3.1. This process extends the general process of a quantitative research study introduced in Figure 1.1. |
| Functional | Refers to a research team in social sciences, composed of social scientists and data analysis practitioners. A single individual can also assume the roles of social scientist and data analysis practitioner. |
| Applicative | Refers to the set of four packages Rsocialdata, Rsocialdata.panel, Rsocialdata.network and Trim, as well as their dependencies such as R and the contributed packages imported by the packages. The applicative layer is described in Figure 5.3 and the class diagram of internal data storage structures is given in Figure 5.4. |
| Infrastructure | Refers to the personal computer of the social scientist or practitioner. |

In Section 3.1, I introduce my answer to this first research question by setting up a model of vulnerability to a particular outcome and of how vulnerabilization processes diffuse in life courses.

My second research question, introduced in Section 1.2.2.2, focuses on the use of the classification tree method for exploring interaction effects. As I postulate that, in modern societies, vulnerability situations are infrequent compared to desired life situations, I focus on exploring interaction effects in a context of class imbalance. To address this research question, I conducted a literature review on the growing of classification trees in an imbalanced data context. I answer the second research question by providing two complementary methodologies for exploring interaction effects in such a context. The first methodology focuses on the exploratory process, while the second focuses on the tree growing process. The first methodology consists of a preliminary bivariate analysis step that separates strong from weak associations with the response variable, followed by an exploration of those associations in which the classification tree growing parameters are tuned dynamically through an interactive interface. By mixing descriptors weakly associated with the response variable with descriptors strongly associated with it, I expect to detect underlying or unexpected interaction effects. This methodology is introduced in Section 3.2. The second methodology consists of better separating vulnerability outcomes in an imbalanced data context. To this end, I put forward three classification tree growing measures. These measures are introduced in Section 3.3.

Figure 3.2 – A conceptual model of vulnerability to a life course outcome. Rectangular elements refer to resources. Individual resources are represented by a solid border. Contextual resources are represented by a non-solid border, for instance a dashed or dotted line. Rounded rectangles refer to vulnerability components.

These contributions are expected to be used jointly. Together, they constitute my response to the issue addressed in this study, namely supporting the discovery of underlying factors of vulnerability in life courses. The conceptual model of vulnerability and vulnerabilization in life courses defines the concept of underlying factor of vulnerability and provides it with a theoretical anchoring. This definition involves the notion of interactions between variables representing life trajectories. By focusing on the detection of interactions in an imbalanced data context, the exploratory strategies I put forward support researchers in the identification of underlying factors of vulnerability. From an information system point of view, and based on the layer model of Longépé (2009), the modelisation of the proposed solution is given in Table 3.1. In practice, when starting a study on vulnerability to a particular outcome, for example vulnerability to poverty, the first step is to instantiate the model introduced in Section 3.1. Instantiating the model means defining, for each concept involved in the model, what it refers to in the study. Once this theoretical activity is done, researchers can start identifying the underlying factors of vulnerability by using the methodologies introduced in Sections 3.2 and 3.3. I discuss the joint use of the contributions in more detail in Section 3.4.

3.1 A modelisation of vulnerability in life courses

In agreement with Weijer (1999), the objective pursued by my model is to be able to identify individuals who are sufficiently vulnerable to require protections above and beyond those available for all individuals. As Weijer (1999) notes, the tasks of deciding who are vulnerable and determining available and appropriate welfare are matters of judgment. In this context, my model aims to provide guidelines to help in making informed judgments.

Besides, as I use both the terms framework and model in this Chapter, I take the opportunity to clarify their meaning here. Following the definitions proposed by Schlager (2007) and Van Bon (2008), the term framework refers to a representation of the relations between every aspect of inquiry with regard to a scientific theory or research. A framework provides a metatheoretical language that can be used to compare theories by using a common language. It attempts to identify the universal elements that any theory relevant to the same kind of phenomena would need to include. Fitting into a framework, a model is also a theoretical construct; it represents a particular system using a set of concepts and the logical relationships among them. A model formalizes some aspects of the framework by providing explanations for behaviors or predictions of outcomes. By design, a framework is a more general theoretical construct than a model. In particular, a model is intended to be tested and verified. However, not all models describe the system they represent at the same level of detail. Depending on the domain and the objective pursued, a model can refer to a formal description of some aspects of the physical and social world around us for the purposes of understanding and communication (Mylopoulos, 1992). In this case, the model conveys the fundamental principles and basic functionality of the system it represents. However, a model can also be fully developed by specifying all requirements, making assumptions about parameters explicit and defining quantitative relations among items, which allows the implementation of an executable model (Sokolowski and Banks, 2010). In this Chapter, I limit the scope of the work to the former definition of a model, that is to say, to formalizing a model that describes the principles and dynamics of vulnerability in life courses.

3.1.1 Connecting life course and vulnerability terminologies

From the document Modelisation and Information System Tools to Support the Discovery of Interactive Factors of Vulnerabilities in Life Courses (pages 81-91)