
5.2 Software architecture


net.leger_link_son_daughter_xTOy[1:10]

## [1] NA NA FALSE NA TRUE NA NA NA NA NA

In the result, the value TRUE is reported when daughters provide sons with emotional support and the value FALSE is reported in the opposite case. The value NA is reported when there is neither a son (S) nor a daughter (D) in the network.

Notice that this operation requires linking demographic information about network members with information about the relationship structure of the network. This operation is integrated within the net query operator and therefore spares the user several data management operations.

5.2.5 The package Trim

The package Trim (TRees for IMbalanced data) aims to provide an implementation of both the main off-centered classification trees introduced in the literature review and the methods I introduced as new contributions in Section 3.3. Like the Rsocialdata package, its goal is twofold. On the one hand, it aims to allow a comparison of the methods. On the other hand, it aims to make these classification tree methods available to practitioners who would like to use them in a study involving an imbalanced distribution of the response variable. To satisfy this second purpose, front-end functions are provided that allow the available tree methods to be used conveniently and with a limited risk of misuse. Finally, the package is designed to let users supply their own entropy functions and assignment rules in order to build their own classification tree methods. To control for overfitting when adjusting parameters, a k-fold cross-validation method is provided.

Currently, the package does not support missing values. Cases containing at least one missing value are removed before inducing a tree.
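In practice this amounts to a standard complete-case filter. A minimal sketch, assuming a descriptor data frame X and a response vector y (Trim applies this step internally, so users do not have to run it themselves):

# Complete-case filtering, as performed internally before tree induction
keep <- complete.cases(X) & !is.na(y)
X <- X[keep, , drop = FALSE]
y <- y[keep]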

5.2.5.1 Package design

To date, there is no native structure in R for implementing classification trees. The multiple packages available on CRAN that implement tree methods use different solutions with relatively little reuse of code. Recently, Hothorn and Zeileis (2015) put forward the package partykit, which aims to mitigate this issue by providing a common unified infrastructure for recursive partitioning in the R software. In particular, partykit provides tools for representing, printing, and plotting trees, and for computing predictions. The various methods available can be used to develop both classification trees and regression trees. A practical illustration of how to use these methods is given to the user through a reimplementation within partykit of the conditional inference tree and other recursive partitioning methods initially released in party (Hothorn et al., 2006b).

The package Trim can be seen as an extension of the package partykit that brings the creation of classification tree methods closer to the user. While the package partykit provides the building blocks for designing tree methods, the package Trim provides a fully working classification tree skeleton. This skeleton implements all the essential features required in the classification tree induction process, such as the discretization of quantitative variables, the generation of all the possible j-ary splits of nominal and ordered categorical variables, the recursive partitioning, and the minimal stopping criteria such as stopping when no descriptor remains available or when a node is pure. In addition, and as the key feature of the package, the goodness assessment of a particular split and the class assignment rule – the elements that constitute the core of a classification tree method – are customizable by the user. Regarding the goodness assessment of a particular split, I decided to distinguish between the goodness assessment of a particular node and the goodness gain assessment. The goodness of a particular node can be assessed, for example, by an entropy measure (taking a discrete probability distribution as input) or an empirical entropy measure (taking as input a discrete probability distribution as well as some parameters, such as the number of cases, used to adjust the probabilities with regard to a probability estimation process). The goodness gain can be assessed, for example, by an entropy reduction measure such as the information gain or the gain ratio. Currently, the entropy reduction measure is hard-coded and therefore not customizable.
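To make the node goodness interface concrete, the sketch below shows the kind of function a user could supply: a normalized Shannon entropy taking a discrete probability distribution as input and returning a value between 0 and 1. The function name and the way it would be registered with Trim are illustrative assumptions; only the input convention (a discrete probability distribution) comes from the description above.

# A user-defined node goodness measure: normalized Shannon entropy
# (illustrative sketch; the name and its registration with Trim are assumptions)
entropy.Shannon.normalized <- function(p) {
  k <- length(p)                # number of classes
  if (k <= 1) return(0)         # a single-class node is pure
  p <- p[p > 0]                 # convention: 0 * log2(0) = 0
  -sum(p * log2(p)) / log2(k)   # scale by the maximal entropy
}
entropy.Shannon.normalized(c(0.5, 0.5))   # 1, a maximally impure node
entropy.Shannon.normalized(c(0.9, 0.1))   # about 0.47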

Table ?? shows the main internal methods of the Trim package. A first method checks the input data and, where necessary, coerces it to an acceptable data format; the data are then cleaned to be ready for use in a tree induction process. The initialization of the induction process is done by the buildTree method. The search for the best split is performed by the nested methods findBestAttribute, findBestSplit, and computeSplitQuality. Once the best split is found, the buildSplit method creates the split and fills it with the corresponding child nodes created by the buildNode method.
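To make this control flow concrete, the following self-contained toy sketch mirrors the roles (and some of the names) of these internal methods. It is not the Trim implementation: it only handles categorical descriptors, splits on every category of the chosen attribute, and uses plain information gain, whereas Trim also discretizes quantitative variables, generates j-ary splits, and lets the user customize the goodness measures.

# Toy sketch of the induction scheme (NOT the Trim code; for illustration only)
shannon <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}
findBestAttribute <- function(X, y) {
  gains <- sapply(names(X), function(a) {
    parts <- split(y, X[[a]], drop = TRUE)
    shannon(y) - sum(sapply(parts, function(g) length(g) / length(y) * shannon(g)))
  })
  if (max(gains) <= 0) NULL else names(which.max(gains))
}
buildNode <- function(y) {
  list(label = names(which.max(table(y))), split = NULL, children = NULL)
}
buildTree <- function(X, y, maxDepth = 3, depth = 1) {
  node <- buildNode(y)                  # majority label of the current node
  if (length(unique(y)) == 1 || ncol(X) == 0 || depth > maxDepth) return(node)
  attr <- findBestAttribute(X, y)       # attribute with the largest gain
  if (is.null(attr)) return(node)       # no admissible split: stop here
  node$split <- attr
  idx <- split(seq_along(y), X[[attr]], drop = TRUE)
  node$children <- lapply(idx, function(i)
    buildTree(X[i, setdiff(names(X), attr), drop = FALSE], y[i], maxDepth, depth + 1))
  node
}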

5.2.5.2 Available methods and practical examples

Table 5.11 lists the user methods provided by the package. Table 5.12 lists the goodness measures implemented to date in the package Trim for assessing node quality. Table 5.13 lists the class assignment methods provided by the package. To illustrate the use of the package, I use the “Saturday morning” dataset introduced in Quinlan (1986). This dataset describes the weather conditions of several Saturday mornings using four descriptive variables and indicates whether each morning is suitable for doing some unspecified activity.


Table 5.11 – Main front-end methods provided by the Trim package

create.tree.metacontrol: Create a well-formed named list of relative tree control parameters, such as relative values of increase in split quality and minimal numbers of cases in parent and child nodes, that can be used across multiple datasets.

create.tree.control: Create a well-formed named list of absolute tree control parameters given a list of relative tree control parameters and a dataset.

resample: Resample a labelled dataset to match a given distribution.

balance: Balance a labelled dataset by “subsampling” or “oversampling” according to the degree of imbalance. The method is a wrapper around the resample method.

tree.learn: Generic method to learn/train a tree.

tree.crossValidation: Compute a k-fold cross-validation for a tree.

tree.score: Interface with the package ROCR for scoring instances according to a tree and plotting a ROC curve.

tree.predict: Predict the class label of new instances according to a tree classifier.

Table 5.12 – Main node goodness measures provided by the Trim package

entropy.Shannon: The Shannon entropy (Shannon, 1948).

entropy.asymmetric.Zighed: The off-centered entropy introduced by Marcellin (2008), Zighed et al. (2010), and Ritschard et al. (2009b).

entropy.offCentered.Lallich: The off-centered entropy introduced by Lallich et al. (2007).

entropy.offCentered.ratio: The off-centered entropy introduced in Section 3.3.2.

score.offCentered.MD: The MD off-centered score introduced in Section 3.3.3.

score.offCentered.MDC: The MDC off-centered score introduced in Section 3.3.4.

We start by loading the package Trim in the workspace.

library(Trim)

Then, we load the “Saturday morning” dataset. This dataset is embedded in the package Trim to allow users to practice with the following examples.

data(weather.nominal)
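A quick look at the dataset structure before building a tree can be helpful; for example:

str(weather.nominal)          # the four nominal descriptors plus the target variable play
table(weather.nominal$play)   # distribution of the response variable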

Table 5.13 – Main assignment rules provided by the Trim package

rule.majority: Implementation of the majority rule.

rule.maxRatio: Implementation of the maxRatio rule.

rule.minContributionEntropy: Implementation of the minimal contribution to entropy rule.

rule.thresholds: Implementation of the thresholds rule.

Before training a tree, we need to separate the set of descriptive attributes and the target variable.

xweather <- weather.nominal[-5]

yweather <- weather.nominal$play

Then we train a first tree by using the command tree.learn with its default parameters.

learn1 <- tree.learn(X = xweather, y = yweather)

We plot the resulting tree with the command plot. The graphic output is given in Figure 5.12.

plot(learn1$tree)

[Tree diagram: root split on outlook; the overcast branch is a terminal node (yes), the rainy branch splits on windy (FALSE: yes, TRUE: no), and the sunny branch splits on humidity (high: no, normal: yes).]

Figure 5.12 – Illustration of a tree grown with the Trim package by using the Shannon entropy.

In the previous example, we used the default parameters of the command tree.learn. These default parameters are: (1) the use of the Shannon entropy to assess node quality, (2) the use of the Gain Ratio to assess the entropy reduction induced by a split, (3) no resampling of data, (4) the default tuning parameters defined by the tree.metaControl command, and (5) the use of the majority rule for performing class assignment. How to change the node quality measure used to grow the tree is illustrated in Section 5.3.3. Below I illustrate how to change the tuning parameters and the assignment rule.

The set of tuning parameters used to grow the tree is defined in a tree_metaControl object. Such an object is created with the tree.metaControl command. The default parameters used by this command are: (1) a minimum number of instances required in a leaf equal to 2, (2) a minimum entropy reduction equal to 0.01, and (3) a maximum tree depth equal to 3. For testing purposes, the imbalanced data consistent (IDC) stopping criterion introduced in Section 3.3.4 can be enabled through this command by using the parameter minbucketFitToSkew = TRUE (a sketch is given after Figure 5.13). When computing a tree, only the parameters that differ from the default parameters have to be submitted. For example, to change the minimum number of instances in leaves to 5 while keeping all the other default parameters, the user can use the following code:

learn2 <- tree.learn(
  X = xweather, y = yweather,
  metactrl = tree.metaControl(minBucket = 5)
)

We plot the resulting tree with the command plot. The graphic output is given in Figure 5.13. It shows that, to satisfy the constraint set on the minimum number of instances in leaves, the growing process stopped at the second level of the tree.

plot(learn2$tree)

[Tree diagram: single split on humidity; high: no, normal: yes.]

Figure 5.13 – Illustration of a tree grown with the Trim package by using the Shannon entropy and a minimum number of instances in leaves of 5.
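As mentioned above, the IDC stopping criterion of Section 3.3.4 is enabled through the same command. A minimal sketch, using only the parameter minbucketFitToSkew named above and assuming the other parameters keep their default values:

# Enabling the imbalanced data consistent (IDC) stopping criterion (sketch)
learnIDC <- tree.learn(
  X = xweather, y = yweather,
  metactrl = tree.metaControl(minbucketFitToSkew = TRUE)
)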

To set the class assignment rule to use when performing class assignment in the leaves of the tree, the user can use the ruleList argument. This argument expects a list with the function to use to perform class assignment as the first list element and the optional parameters required by the function in the remaining elements of the list. For off-centered class assignment rules, one of the optional parameters will be the empirical distribution associated with the situation to avoid, called the “worst distribution”. The code below illustrates how to use the maximal ratio rule introduced in Section 2.2.5:

learn3 <- tree.learn(
  X = xweather, y = yweather,
  ruleList = list(
    rule.maxRatio,
    worstDistribution = c(0.26, 0.74)
  )
)

We plot the resulting tree with the command plot. The graphic output is given in Figure 5.14. It shows that the class assignment performed by using the maximum ratio rule gives the same classification as the one obtained in Figure 5.12 by using the majority rule.

plot(learn3$tree)

[Tree diagram: same structure as in Figure 5.12, with the root split on outlook, the rainy branch split on windy, and the sunny branch split on humidity; the leaf labels are identical.]

Figure 5.14 – Illustration of a tree grown with the Trim package by using the Shannon entropy and the maximum ratio assignment rule.
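The remaining front-end methods of Table 5.11 fit into the same workflow. The sketch below shows how a cross-validation and a prediction call might look; apart from the X and y arguments already used above, the argument names are assumptions rather than the documented signatures of tree.crossValidation and tree.predict:

# Sketch only: argument names beyond X and y are assumptions
cv <- tree.crossValidation(X = xweather, y = yweather, k = 10)

newday <- xweather[1, , drop = FALSE]   # reuse an existing morning as a “new” instance
tree.predict(learn3, newdata = newday)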