
3.2 Exploring underlying factors in a semi-automatic way

3.2.2 Interactive exploration

When building a classification tree, the practitioner selects a classification tree method and the data, and specifies the descriptive variables, the response variable, and some tuning/threshold parameters. The tree is then trained according to these settings, and the grown tree is returned to the practitioner.
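To make this workflow concrete, the sketch below mimics it with scikit-learn's DecisionTreeClassifier; the toy dataset, variable names, and parameter values are illustrative assumptions, not the actual setting used in this thesis.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the practitioner's dataset.
data = pd.DataFrame({
    "age":            [23, 45, 31, 52, 60, 38, 27, 49],
    "income":         [18, 42, 30, 55, 61, 35, 22, 47],
    "household_size": [1, 3, 2, 4, 2, 3, 1, 5],
    "vulnerable":     [1, 0, 1, 0, 0, 1, 1, 0],
})
descriptive = ["age", "income", "household_size"]  # descriptive variables
response = "vulnerable"                            # response variable

tree = DecisionTreeClassifier(
    criterion="entropy",         # tree growing measure
    max_depth=4,                 # tuning parameter: maximum depth
    min_samples_leaf=2,          # tuning parameter: minimum leaf size
    min_impurity_decrease=0.01,  # threshold parameter: minimum gain to split
).fit(data[descriptive], data[response])           # the grown tree
```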

This tree may, hopefully, provide the practitioner with relevant interactions. In this setting, the set of interaction effects identified is automatically extracted by the tree learning algorithm.

However, this automatic extraction of interaction effects represents only a single view of the attribute space. By changing the set of variables used to grow the tree, the practitioner obtains another selection of interaction effects, located in another part of the attribute space. Similarly, by changing the measure used in the growing process, as well as the values of the tuning/threshold parameters, the practitioner obtains complementary views of the associations and interaction effects present in the data. Indeed, as a classification tree is grown recursively, the splits performed at level n+1 depend directly on the splits performed at level n. This means that a change in the first levels of the tree, caused for example by adding or removing a variable strongly associated with the response, can drastically change the set of interaction effects provided to the practitioner.
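As a rough illustration of how complementary views arise, the sketch below, using a public dataset rather than the vulnerability data, grows one tree on all variables and another after removing the variable chosen for the first split; the second tree typically exhibits a different cascade of splits.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Tree grown on the full variable set.
full = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
root = X.columns[full.tree_.feature[0]]   # variable used for the first split

# Tree grown after removing the dominant variable: the first split changes,
# and so does everything below it.
X_reduced = X.drop(columns=[root])
reduced = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_reduced, y)

print(export_text(full, feature_names=list(X.columns)))
print(export_text(reduced, feature_names=list(X_reduced.columns)))
```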

Consequently, to conduct an exhaustive exploratory study of the potential underlying factors of vulnerability, researchers must be provided with a tool that makes it easier to explore the various potential interaction effects that exist in the attribute space.

In addition, classification trees are sensitive to small perturbations in the data.

To illustrate this, consider two independent variables that are both strongly associated with the dependent variable. Such a situation can occur, for instance, when the two variables are collinear. Two variables both strongly associated with the dependent variable are likely to produce splits of very similar quality. Depending on sampling variations, one variable can be the best predictor in one sample while the other turns out to be the best predictor in another sample drawn from the same population. As only one variable is used for splitting, the one leading to the best utility according to the growing measure, the trees computed on the two samples will differ. Sensitivity to small perturbations in the data also concerns the tuning/threshold parameters used to grow the tree. Consider, for instance, the minimum gain threshold on the utility measure required to authorize an additional split, which is the main parameter used when growing a tree. Suppose there exists a possible split of quality 0.951 and the minimum splitting quality threshold is set to 0.95. Under this setting, the tree method accepts the split. However, owing to sampling variations, the utility of the same split in another sample drawn from the same population might be assessed at 0.949. In that case, the tree growing process will either stop or choose another split, and the resulting tree will differ from the first one.
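One way to observe this sensitivity empirically is sketched below, assuming scikit-learn and a public dataset: the same tree is refitted on bootstrap resamples and the winning root variable is recorded. With several near-collinear strong predictors, the winner often changes from one resample to the next.

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
rng = np.random.default_rng(0)

roots = Counter()
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap resample
    t = DecisionTreeClassifier(max_depth=2).fit(X.iloc[idx], y.iloc[idx])
    roots[X.columns[t.tree_.feature[0]]] += 1    # variable of the first split

# Several different root variables appear across resamples.
print(roots.most_common())
```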

This sensitivity to sampling variations is especially significant for classification tree models. Indeed, as the tree growing process is greedy, using only the best variable and partition for each split and doing so recursively, a slight change in the first splits might result in a drastically different tree. The set of interactions identified by the classification tree is therefore likely to be affected by a slight perturbation of the data. In addition, there is often no strategy to guide the choice of the tuning/threshold parameters, which are often set arbitrarily.

To overcome the sensitivity issue, some tree growing measures are based on a statistical test. The tree growing measure is associated with a theoretical distribution, and a p-value is computed to assess the statistical significance of a particular split. This strategy is used, for example, in the CHAID method, for which the tree growing measure follows a Pearson's chi-squared distribution (Kass, 1980). However, in the process of selecting the best variable and the best split, multiple testing occurs. To account for it, a multiple-testing adjustment strategy should be set up (Bender and Lange, 2001). Two common strategies for multiple comparisons are Bonferroni's p-value adjustment (Bonferroni, 1935, 1936; Bland and Altman, 1995) and Holm's procedure (Holm, 1979).⁴ When the tree growing measure does not follow a known theoretical distribution, a strategy is to use a bootstrap-based multiple-testing adjustment (Stelzer et al., 2013). However, as shown in Table 2.1, most of the tree growing measures considered here are not based on a statistical criterion. This is particularly the case for entropy-based measures.
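A minimal sketch of such a test-based splitting criterion with multiple-testing adjustment, assuming Python with SciPy and statsmodels and a toy dataset, could look like this:

```python
import pandas as pd
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests

# Hypothetical toy data: two candidate categorical splitting variables.
df = pd.DataFrame({
    "region":     ["N", "S", "N", "S", "N", "S", "N", "S"] * 10,
    "sex":        ["F", "M", "M", "F", "F", "M", "F", "M"] * 10,
    "vulnerable": [1, 0, 1, 0, 1, 1, 1, 0] * 10,
})
candidates = ["region", "sex"]

def split_pvalue(var):
    """p-value of Pearson's chi-squared test for one candidate split."""
    table = pd.crosstab(df[var], df["vulnerable"])
    return chi2_contingency(table)[1]

pvals = [split_pvalue(v) for v in candidates]

# Holm's step-down adjustment across the candidate splits
# (method="bonferroni" would give the classical Bonferroni correction).
reject, adjusted, _, _ = multipletests(pvals, alpha=0.05, method="holm")
print(dict(zip(candidates, adjusted)), reject)
```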

To overcome both the exhaustiveness of the exploration issue and the sensitivity to the training data issue, my proposal is to grow classification trees through a parameterized interactive interface. Instead of being automatic, the discovery of interaction effects thus becomes a semi-automatic process: the exploration is guided by the practitioner, who controls and acts on the parameters to explore the data.

Once the practitioner defines a set of parameters, a new automatic extraction is performed by the classification tree method.
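In its simplest form, this semi-automatic loop can be sketched as a parameterized refit function; the function name, parameters, and display choices below are assumptions for illustration, not the actual tool.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

def explore(X, y, criterion="gini", max_depth=3, min_gain=0.0):
    """One iteration of the loop: grow a tree under the current parameters
    and display it so the practitioner can decide on the next setting."""
    tree = DecisionTreeClassifier(
        criterion=criterion,
        max_depth=max_depth,
        min_impurity_decrease=min_gain,
    ).fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))
    return tree

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
explore(X, y)  # first automatic extraction
# Between calls, the practitioner varies the parameters to obtain
# complementary views, e.g.:
#   explore(X, y, criterion="entropy", max_depth=4, min_gain=0.01)
```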

The use of interactive graphical interfaces for growing classification trees has already been proposed in several statistical software packages, but with a different scope. For growing classification trees such as CART, QUEST or CHAID, the SPSS software (IBM Corporation, 2013a) provides an option to start an interactive session that allows the user to build the tree level by level. At each level, the user can manually edit and prune the splits of the tree to end up with a manually customized model.

The motivation for letting users control the tree growing process is to allow them to apply their business knowledge to refine or simplify the tree. The SAS Enterprise Miner software provides a similar functionality, but limited to the CHAID method: an interactive mode allows users to force the variable to use for splitting, to define split values, and to prune branches and leaves manually (Rush, 2014). In the R software, the prp() command from the rpart.plot package (Milborrow, 2016) allows starting an interactive session for manually pruning a tree.

The user can click on unwanted nodes to remove them. However, it is not possible to act on the parameter values used to grow the tree. Still in the R software, the collapseTree() command from the phytools package (Revell, 2012) starts an interactive session in which the user can collapse and expand the subtrees of a tree to adapt its depth to their needs.

The scope of these various interactive tree growing implementations is therefore to let practitioners manually decide on the splits and the complexity of a single tree. The objective is to refine a tree model to adapt it to the practitioner's needs. This motivation is different from the motivation of exploring the attribute space to find sound underlying interaction effects.

To meet the objective of exploring the attribute space to find sound underlying

⁴ Although frequently used, Bonferroni's adjustment is criticised for insufficient conceptual foundations and a loss of power (Perneger, 1998). There is now strong evidence that Holm's procedure should be used instead (Aickin and Gensler, 1996).
