
2.2 Classification tree learning with infrequent outcomes

2.2.3 Standard approaches for dealing with class imbalance

Two main strategies exist: (1) one is to work at the data level, (2) the other is to work at the algorithmic level. At the data level, a common strategy is to balance the original dataset by resampling. The options are to oversample the minority class, to undersample the majority class (Kubat and Matwin, 1997; Japkowicz, 2000a; Ling and Li, 1998), or to do both (Chawla et al., 2002). At the data level, another strategy is to assign distinct costs to training examples.

At the algorithmic level, ensemble methods are a strategy that naturally concentrates on the cases that are hardest to classify, which in an imbalanced data context belong to the less frequent classes. Another strategy is to design an ad hoc variant of a specific classifier, for example naive Bayes, logistic regression, or classification trees. As this thesis focuses on classification tree methods, we review the methods specifically designed for inducing classification trees in an imbalanced data context in Section 2.2.4.

2.2.3.1 Resampling methods

At the data level, the strategy is to rectify the imbalance by using resampling methods. In a binary context, we have two choices: (1) undersampling the majority class or (2) oversampling the minority class.
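As a minimal illustration of these two options, the sketch below balances a binary dataset either by randomly dropping majority instances or by randomly replicating minority instances; the function names, the use of NumPy, and the 50/50 balancing target are choices made here for illustration only.

```python
import numpy as np

def random_undersample(X, y, majority_label, seed=None):
    """Randomly drop majority-class rows until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    keep_maj = rng.choice(maj_idx, size=min_idx.size, replace=False)
    keep = np.concatenate([keep_maj, min_idx])
    return X[keep], y[keep]

def random_oversample(X, y, minority_label, seed=None):
    """Randomly replicate minority-class rows until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    extra = rng.choice(min_idx, size=maj_idx.size - min_idx.size, replace=True)
    keep = np.concatenate([maj_idx, min_idx, extra])
    return X[keep], y[keep]
```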

Undersampling consists in removing some instances of the majority class. Obviously, fewer instances mean less information and may lead to poor learning. Furthermore, by removing instances of the majority class, we may remove instances that are important for defining the concept of that class. Some methods therefore remove the less informative instances first (Kubat and Matwin, 1997). Another interesting option is to remove majority instances that are close to instances of the minority class: by enlarging the margin between the two classes, we can expect better recall of the minority class (Thomas et al., 2008). This kind of method is called handed resampling (Breiman, 2001; Zhang and Mani, 2003).
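One simple way to realize this margin-enlarging idea, sketched below under the assumption of numeric features and Euclidean distance, is to drop the majority instances that lie closest to a minority instance; this is only an illustration of the principle, not the specific procedure of Thomas et al. (2008).

```python
import numpy as np

def margin_guided_undersample(X, y, minority_label, n_remove):
    """Drop the n_remove majority instances closest to the minority class,
    which enlarges the empirical margin between the two classes."""
    min_mask = (y == minority_label)
    X_min = X[min_mask]
    maj_idx = np.flatnonzero(~min_mask)
    # distance from each majority instance to its nearest minority instance
    d = np.linalg.norm(X[maj_idx][:, None, :] - X_min[None, :, :], axis=2).min(axis=1)
    drop = maj_idx[np.argsort(d)[:n_remove]]       # closest majority points first
    keep = np.setdiff1d(np.arange(len(y)), drop)
    return X[keep], y[keep]
```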

On the other hand, oversampling consists in increasing the number of instances of the minority class. Obviously, increasing the number of instances increases computation time. A quite naive method consists in randomly replicating instances, but giving more importance to the same information may lead to overfitting. More sophisticated methods have been proposed. Nickerson et al. (2001) put forward a method that detects, using an unsupervised method, the infrequent cases existing within each class and oversamples them so that each case has the same number of instances.

One of the most popular methods is SMOTE (Chawla et al., 2002), which artificially creates instances to fill out the neighborhood of the minority class instances in the space of the descriptive variables.
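The core of SMOTE is an interpolation step: each synthetic point is placed on the segment joining a minority instance to one of its k nearest minority neighbours. The sketch below reproduces only this interpolation for continuous features; the parameter names and defaults are mine, and the published algorithm includes further details (e.g., how many points to generate per instance).

```python
import numpy as np

def smote_like(X_min, n_synthetic, k=5, seed=None):
    """Create synthetic minority points by interpolating between a minority
    instance and one of its k nearest minority neighbours (requires k < len(X_min))."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                       # ignore self-distances
    neighbours = np.argsort(d, axis=1)[:, :k]         # k nearest minority neighbours
    base = rng.integers(0, n, size=n_synthetic)       # seed instances
    nb = neighbours[base, rng.integers(0, k, size=n_synthetic)]
    gap = rng.random((n_synthetic, 1))                # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])
```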

However, these resampling approaches leave much open to interpretation: should the negative class be undersampled, should the positive class be oversampled, or should a combination of both be used to reach the balance point? To address this question, Chawla et al. (2008) and Cieslak and Chawla (2008) suggest searching a larger sampling space (which includes several potential balance points) using a wrapper method that guides the identification of the optimal class proportions.
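A rough sketch of such a wrapper search is given below: candidate minority proportions are tried one by one, the training split is oversampled to each proportion, and the proportion giving the best held-out F1 on the minority class is kept. The candidate grid, the single train/validation split, the decision-tree base learner and the F1 criterion are simplifying assumptions; the cited papers use a more elaborate search.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def wrapper_select_ratio(X, y, minority_label, ratios=(0.2, 0.3, 0.4, 0.5), seed=0):
    """Return the candidate minority proportion with the best validation F1,
    reached here by oversampling the minority class with replacement."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    min_mask = (y_tr == minority_label)
    X_min, y_min = X_tr[min_mask], y_tr[min_mask]
    X_maj, y_maj = X_tr[~min_mask], y_tr[~min_mask]
    best_ratio, best_score = None, -np.inf
    for r in ratios:
        n_min = int(r / (1 - r) * len(X_maj))         # minority size giving ratio r
        X_up, y_up = resample(X_min, y_min, replace=True,
                              n_samples=n_min, random_state=seed)
        clf = DecisionTreeClassifier(random_state=seed)
        clf.fit(np.vstack([X_maj, X_up]), np.concatenate([y_maj, y_up]))
        score = f1_score(y_val, clf.predict(X_val), pos_label=minority_label)
        if score > best_score:
            best_ratio, best_score = r, score
    return best_ratio, best_score
```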

In fact, the relative performance of undersampling and oversampling varies from one dataset to another. Nevertheless, Barandela and Van Hulse recommend undersampling when the imbalance is low (i.e., less than 10% according to them) and oversampling when the imbalance is strong (Barandela et al., 2003; Van Hulse et al., 2007). However, it is important to note that Weiss and Provost showed that class balance does not necessarily produce the best results (Weiss and Provost, 2003).

2.2.3.2 Cost methods

The idea of the first strategy is to set unequal, fixed costs on the different types of classification error (Pazzani et al., 1994). Sometimes the costs are given; for example, they could correspond to the price paid by a company for each misclassification. But in most cases, practitioners have to set them by hand. For example, one can set the cost of misclassifying an instance of the majority class to 1 while the cost for the minority class is set to 4. Most often, the cost of a correct classification is set to zero. Let $c_m$ be the misclassification cost for the minority class and $c_M$ the misclassification cost for the majority class. The misclassification cost of a given error distribution is then $C = c_M p_M + c_m p_m$, where $p_M$ and $p_m$ are the proportions of misclassified majority and minority instances. Instead of trying to minimize the classification error rate during the learning process, we then seek to minimize this misclassification cost. A more general approach is put forward by Domingos (1999): the method developed, called MetaCost, allows any supervised learning algorithm to be made cost-sensitive and is based on bootstrapping and relabeling. Cost methods can be used jointly with resampling methods.
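To make the MetaCost idea more concrete, the sketch below estimates class probabilities with a small bagged ensemble, relabels every training instance with the class of minimum expected cost, and refits the base learner on the relabelled data. This is only a rough illustration of the bootstrap-and-relabel principle; the bag construction, the number of bags and the helper names do not follow the published algorithm exactly.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def metacost_like(X, y, cost, base=DecisionTreeClassifier(), n_bags=10, seed=0):
    """Relabel each instance with the class of minimum expected cost, estimated
    from bagged probability estimates, then refit the base learner.
    cost[i, j] is the cost of predicting class i when the true class is j."""
    classes = np.unique(y)
    proba = np.zeros((len(y), len(classes)))
    for b in range(n_bags):
        Xb, yb = resample(X, y, replace=True, random_state=seed + b)
        clf = clone(base).fit(Xb, yb)
        cols = np.searchsorted(classes, clf.classes_)  # align class columns
        proba[:, cols] += clf.predict_proba(X)
    proba /= n_bags
    expected_cost = proba @ cost.T                     # E[cost] of predicting each class
    y_relabelled = classes[np.argmin(expected_cost, axis=1)]
    return clone(base).fit(X, y_relabelled)
```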

Elkan discusses the interaction of cost and class imbalance (Elkan, 2001), proposing a simple method to calculate optimal sampling levels.
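One concrete consequence, in the notation used above and assuming zero cost for correct classifications, is the probability threshold beyond which predicting the minority class has the lower expected cost; the helper below is a minimal sketch of that rule.

```python
def cost_sensitive_threshold(c_M, c_m):
    """Probability threshold above which predicting the minority class has the
    lower expected cost, assuming correct classifications cost zero."""
    return c_M / (c_M + c_m)

# With c_M = 1 and c_m = 4, the minority class should be predicted whenever the
# estimated P(minority | x) exceeds 1 / (1 + 4) = 0.2, rather than the usual 0.5.
```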

A legitimate question is in which situations it is relevant to use one method rather than another. Weiss (2004) offers a detailed article on this subject and, moreover, a comparative study between the cost-sensitive approach and oversampling or undersampling methods (Weiss et al., 2007). He concludes that no method always dominates the others; depending on certain characteristics of the dataset, however, one choice or the other may be better. In particular, on large datasets (over 10,000 observations) cost-sensitive methods consistently outperform resampling methods, while oversampling appears to be the best method for small datasets.

2.2.3.3 Ensemble methods

Ensemble methods consist in training several models and aggregating them to improve prediction accuracy. For classification trees, which suffer from an inherent instability, since due to their hierarchical nature the effect of an error in the top splits propagates down to all of the splits below, ensemble methods are also an effective strategy to gain stability (Abellán and Masegosa, 2009).

Boosting (Becker and Schürmann, 1972; Haussler, 1989; Schapire, 1990) is a generic strategy that consists in training a series of classifiers, where the training set used for each member of the series is chosen based on the performance of the earlier classifier(s) in the series. One of the most popular boosting algorithms is AdaBoost, short for “Adaptive Boosting”, introduced by Freund (1995). AdaBoost assigns a weight to each instance. After each iteration, the weights of misclassified instances increase whereas the weights of correctly classified instances decrease, so that the next iteration of the classification process focuses much more on the instances with high weights. Instead of acting on weights, another strategy is to act on the selection probability for the training set. For example, in arcing, this probability depends on how often an example was misclassified by the previous classifiers of the series (Breiman, 1998; Opitz and Maclin, 1999).
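The sketch below shows a minimal AdaBoost-style loop with decision stumps, assuming labels coded as -1/+1; the exponential weight update is the standard one, but the fixed number of rounds, the clipping constant and the helper names are simplifications made here.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_like(X, y, n_rounds=50):
    """Minimal AdaBoost-style loop with decision stumps (labels must be -1/+1):
    misclassified instances get larger weights before the next round."""
    n = len(y)
    w = np.full(n, 1.0 / n)                          # uniform initial weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)        # vote of this classifier
        w *= np.exp(-alpha * y * pred)               # up-weight mistakes, down-weight hits
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    """Weighted vote of the series of stumps."""
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))
```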

The imbalanced data issue can be addressed with a boosting strategy by further increasing the weights of the minority class instances. As the instances of the minority class are harder to classify correctly, they naturally see their weights increase at each iteration. An example is AdaCost (Fan et al., 1999).
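In the spirit of this idea, and without reproducing AdaCost's exact cost-adjustment functions, one could modify the weight update of the previous sketch so that misclassified minority instances are boosted more aggressively, as in the illustrative helper below (the `boost` factor is an arbitrary choice).

```python
import numpy as np

def cost_aware_reweight(w, y, pred, alpha, minority_label, boost=2.0):
    """Illustrative cost-aware variant of the boosting weight update:
    misclassified minority instances are up-weighted `boost` times more strongly."""
    wrong = (pred != y)
    factor = np.where(wrong, np.exp(alpha), np.exp(-alpha))
    factor = np.where(wrong & (y == minority_label), factor * boost, factor)
    w = w * factor
    return w / w.sum()
```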

Another way to proceed is to use the true positive and true negative rates to set the weights, as done by Joshi et al. (2001) with RareBoost. If the number of true positive instances exceeds the number of false positive ones, the weights of correctly classified instances decrease; likewise, if the number of true negative instances is higher than that of false negative ones, the weights of misclassified instances increase. However, increasing the weights of minority class instances can lead to overfitting. Therefore, Chawla et al. proposed SMOTEBoost (Chawla et al., 2003), which adds new synthetic instances using the SMOTE method instead of simply increasing the weights of minority class individuals.

Bagging (or Bootstrap Aggregating) (Breiman, 1994, 1996) is another ensemble method which consists of drawing random samples with replacement (bootstraps), training a classifier on each, and combining them. A basic bagging method combines these classifiers by assigning to each instance the label most often predicted by them. This method improves classification accuracy and also helps to avoid overfitting. Breiman put forward a bagging method for classification trees, the so-called random forests, in which each tree is grown on a bootstrap sample and each split considers a random subset of the features (Breiman, 2001, 2002). To deal with imbalanced data, two variants exist: (1) Balanced Random Forest (BRF), which takes the same number of instances from the minority and majority classes in the bootstrap step, and (2) Weighted Random Forest (WRF), which uses a cost-sensitive method in the learning step (Chen et al., 2004).
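A minimal sketch of the BRF variant is given below: every bootstrap draws the same number of minority and majority instances (here, the minority class size), and each tree considers a random subset of features at each split via `max_features`; the tree parameters, the voting rule and the assumption of non-negative integer labels are simplifications made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def balanced_random_forest(X, y, minority_label, n_trees=100, seed=0):
    """Balanced Random Forest sketch: each bootstrap contains as many majority
    as minority instances; each split considers a random subset of features."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    trees = []
    for t in range(n_trees):
        boot = np.concatenate([
            rng.choice(min_idx, size=min_idx.size, replace=True),
            rng.choice(maj_idx, size=min_idx.size, replace=True),  # downsampled majority
        ])
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=seed + t).fit(X[boot], y[boot])
        trees.append(tree)
    return trees

def forest_predict(trees, X):
    """Majority vote over the ensemble (assumes non-negative integer labels)."""
    votes = np.stack([t.predict(X) for t in trees])
    return np.apply_along_axis(lambda c: np.bincount(c.astype(int)).argmax(), 0, votes)
```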