

4.6.4 Overview of Machine Learning Algorithms

Algorithms

As the goal of our study is a feasibility assessment of delay predictions, we select multiple machine learning algorithms with clearly distinct bias and variance characteristics and aim to compare the prediction accuracy of their models.

Algorithms that have a high bias generate simple, highly constrained models, which are quite insensitive to data fluctuations, so their variance is low; these are not likely to overfit the training data. Algorithms that have a high variance can generate arbitrarily complex models which fit data variations more readily; these are very likely to overfit the training data.
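To make this contrast concrete, the following small sketch (illustrative only; it uses the scikit-learn library and synthetic data, neither of which is part of our study) fits an unconstrained decision tree and a depth-limited one to noisy data. The unconstrained tree reproduces the training data almost perfectly but generalizes worse, while the constrained tree behaves similarly on training and test data.

# Illustrative only: high-variance vs. high-bias behaviour on synthetic, noisy data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=1)          # noisy synthetic data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

models = [("high variance: unpruned tree", DecisionTreeClassifier(random_state=1)),
          ("high bias: depth-2 tree", DecisionTreeClassifier(max_depth=2, random_state=1))]
for name, model in models:
    model.fit(X_tr, y_tr)
    print(name,
          "train:", round(accuracy_score(y_tr, model.predict(X_tr)), 2),
          "test:",  round(accuracy_score(y_te, model.predict(X_te)), 2))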

The names of the algorithms follow those used in the WEKA software package (Witten and Frank 2005).

The algorithms with distinct bias and variance characteristics that we selected for our study were grouped into two groups:



A. logic-based algorithms: high-variance algorithms that build their models as (possibly complex) decision trees and rules, by considering relations between predictive variables sequentially, i.e., on a variable-by-variable basis:

1. J48 (Witten and Frank 2005) decision trees and PART (PA) rules (Frank and Witten 1998), variants of the C4.5 tree and C4.5 rules algorithms respectively (Quinlan 1993); PART derives rules from partial decision trees built with J48: in each iteration it builds a tree and transforms its best leaf (i.e., the leaf with the highest prediction accuracy) into a rule (a simplified sketch of this loop is given after this list);

2. JRip rule induction, implementing the Repeated Incremental Pruning to Produce Error Reduction (RIPPER) learner; the classes of the target variable are examined in order of increasing size and an initial set of rules for a given class is generated using incremental reduced-error pruning (each rule must contribute to a reduction of the description length), after which the algorithm performs a global optimization of the rule sets (Cohen 1995);

B. density-estimation based algorithms: high-bias algorithms that build their models by considering the relations between all predictive variables in parallel:

3. the Naïve Bayes family of algorithms (NB (John and Langley 1995)), modelling causal dependencies between variables (Charniak 1991; Spiegelhalter, Dawid et al. 1993; Heckerman 1996). This family of algorithms has a high bias, because it assumes that the learning dataset can be summarized by a single probability distribution and that such a model is sufficient to discriminate between classes.
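The following sketch illustrates, in a deliberately simplified form, the "best leaf into a rule" loop mentioned for PART in item 1 above. It uses trivial one-split stumps on a toy dataset instead of the partial J48 (C4.5) trees that PART actually builds, so it only conveys the separate-and-conquer control flow, not the real algorithm.

# Simplified sketch of PART's separate-and-conquer loop: build a (here trivial)
# partial tree, turn its purest leaf into a rule, remove the covered instances,
# and repeat on the remainder. Real PART builds partial C4.5/J48 trees.
from collections import Counter

def best_stump_leaf(instances):
    """Try every (variable, value) split and return the purest resulting leaf."""
    best = None  # (purity, variable, value, majority_class, covered_instances)
    n_vars = len(instances[0][0])
    for f in range(n_vars):
        for v in {x[f] for x, _ in instances}:
            covered = [(x, y) for x, y in instances if x[f] == v]
            majority, count = Counter(y for _, y in covered).most_common(1)[0]
            purity = count / len(covered)
            if best is None or purity > best[0]:
                best = (purity, f, v, majority, covered)
    return best

def part_like_rules(instances):
    rules, remaining = [], list(instances)
    while remaining:
        _, f, v, cls, covered = best_stump_leaf(remaining)
        rules.append((f, v, cls))                               # IF variable f == v THEN cls
        remaining = [i for i in remaining if i not in covered]  # drop covered instances
    return rules

# Toy dataset: (predictive variables, class label)
data = [((1, 0), "late"), ((1, 1), "late"), ((0, 0), "on-time"), ((0, 1), "on-time")]
print(part_like_rules(data))  # two rules on the first variable, one per class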

This set of basic algorithms was complemented by the Random Forest (RF) ensemble algorithm (Breiman 2001). Ensemble algorithms build multiple models and classify new instances by combining the decisions of the different models (usually via some form of weighted voting). The RF algorithm uses decision trees as its base learners. Its randomization operates on predictive variables rather than on learning dataset instances: at each node, RF randomly draws a subset of K (a user-defined complexity parameter) predictive variables and then selects the test variable from this typically much smaller subset.
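A minimal sketch of this node-level randomization follows; it is not WEKA's implementation, and the purity score and the toy data are assumptions made here purely for illustration.

# At each tree node, RF scores only a random subset of K predictive variables
# and splits on the best of that subset, instead of scoring all variables.
import random
from collections import Counter

def purity_score(instances, f):
    """Toy split quality: average majority-class purity of the groups induced by variable f."""
    groups = {}
    for x, y in instances:
        groups.setdefault(x[f], []).append(y)
    return sum(Counter(ys).most_common(1)[0][1] / len(ys) for ys in groups.values()) / len(groups)

def choose_split_variable(instances, K, rng=random):
    n_vars = len(instances[0][0])
    candidates = rng.sample(range(n_vars), K)   # K is the user-defined complexity parameter
    return max(candidates, key=lambda f: purity_score(instances, f))

data = [((1, 0, 3), "late"), ((1, 1, 3), "late"), ((0, 0, 2), "on-time"), ((0, 1, 5), "on-time")]
print(choose_split_variable(data, K=2))  # index of the variable chosen at this node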

In our research we have also experimented with non-linear, high-bias machine learning algorithms, such as neural networks (Multilayer Perceptron Neural Network (NN)) and support vector machines (SVM, also called sequential minimal optimization (SMO)), as well as with distance-based ('lazy' (Alpaydin 2004)) solutions such as k-nearest neighbour (kNN). We discarded these algorithms very early in our research, because their learning and prediction times were extremely long (on the order of days) and the models derived with them did not exhibit an advantage in prediction accuracy over the ones we retained for further study. However, the results of this part of the research are not reported in this thesis.

Details on each of the machine learning algorithms mentioned in this section can be found in, e.g., (Duda, Hart et al. 2000; Jain, Duin et al. 2000; Alpaydin 2004).

Algorithms Complexity Parameters

In order to build a set of models with different accuracy, the algorithms should be assessed under a variety of user-defined complexity parameter settings. Again, as the goal of our study is a feasibility assessment of delay predictions, we adopted a random walk in parameter space approach to choose the complexity parameters of the algorithms.
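One way to read this approach, sketched below under the assumption that it simply means drawing parameter settings at random from the space of admissible values rather than optimizing them, is the following; the parameter space shown is a hypothetical J48-style example, and the settings actually used in the study are those listed in the remainder of this section.

# Hypothetical illustration: sample complexity-parameter settings at random from
# a space of admissible values instead of searching it exhaustively.
import random

parameter_space = {"C": [0.15, 0.25, 0.35],   # confidence factor for pruning
                   "B": [False, True]}        # binary splits on nominal variables

def sample_settings(space, n, seed=1):
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n)]

for setting in sample_settings(parameter_space, n=4):
    print(setting)  # e.g. {'C': 0.25, 'B': True}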

From the set of candidates described above, only the high-bias algorithm, i.e., the NB family, has no complexity parameters; however, the NB family has algorithm variants based on whether continuous predictive variables are discretized (D) or not; when they are not discretized, measurement instance probabilities are computed either by assuming normality or via kernel-based density estimation. An additional variant in the NB family we consider is NB Simple (NBS), which does not discretize continuous predictive variables but assumes their normality. For all the remaining algorithms, we select a number of user-defined complexity parameter settings, as described below.
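To illustrate the difference between the two likelihood estimates used when continuous predictive variables are not discretized, the sketch below compares a single fitted normal distribution with a kernel-based density estimate on a handful of toy values; the numbers and the bandwidth are arbitrary, and this is not WEKA code.

# Class-conditional likelihood of a continuous variable: single normal vs. kernel density.
import math

values = [2.0, 2.5, 3.0, 8.0, 8.5]   # toy observations of one variable within one class

def gaussian_likelihood(x, data):
    mu = sum(data) / len(data)
    var = sum((v - mu) ** 2 for v in data) / (len(data) - 1)
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def kernel_likelihood(x, data, bandwidth=0.5):
    gauss = lambda u: math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
    return sum(gauss((x - v) / bandwidth) for v in data) / (len(data) * bandwidth)

print(gaussian_likelihood(2.5, values))  # one normal fitted to all values
print(kernel_likelihood(2.5, values))    # follows the two clusters in the data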

The main complexity parameter of the recursive partitioning algorithms (J48, PA) is the C parameter, which governs the amount of post-pruning performed on decision trees and rules. The lower the value of C, the less complex the tree (the more pruning). Its default value is C = 0.25; we tried values of C = 0.15, 0.25 and 0.35. Moreover, via the parameter B, we instructed the algorithm whether or not to use binary splits on categorical predictive variables when building the trees.

The complexity parameters for JRip rules govern the minimum total weight of the measurement instances covered by a rule (N) and the number of optimization runs (O). We tried values of N = 2, 5, 10, 100 and 200 (the weight of each instance being 1) and values of O = 2 and 5. For all combinations we instructed the algorithm either to prune the rules (P) or not.

The RF algorithm is governed by two main parameters: the number of trees which form the 'committee of experts' (I) and the number of predictive variables (K) to randomly select at each node. We explored combinations of I = 10, 50 and 100 and K = 2, 3, 6 and 10.

In total we investigated 47 different algorithms, each with its distinctive complexity parameter settings, as enumerated below. In the parameter lists below we point to the corresponding option of the WEKA software package (Witten and Frank 2005), whenever applicable.


1. J48 and PA

- C = 0.15, 0.25, 0.35: confidence factor for pruning; the lower the value, the more pruning (-C),
- B = FALSE/TRUE: use binary splits on nominal predictive variables when building the trees (-B),
- 2: minimum number of instances per leaf,
- 3: number of folds into which the learning dataset is split; two folds for growing the tree and one for pruning,
- TRUE: perform subtree raising when pruning,
- FALSE: smooth counts at the leaves using the Laplace correction.

That results in the following twelve algorithms being considered for J48 and PA: "J48 -C 0.15", "J48 -C 0.25", "J48 -C 0.35", "J48 -C 0.15 -B", "J48 -C 0.25 -B", "J48 -C 0.35 -B", "PA -C 0.15", "PA -C 0.25", "PA -C 0.35", "PA -C 0.15 -B", "PA -C 0.25 -B" and "PA -C 0.35 -B".

2. JRip

- 2, 5, 10, 100, 200: minimum total weight of the instances covered by a rule (within a split); if each instance has a default weight of 1, a new rule (a split) is created only if it accurately covers at least 2, 5, 10, 100 or 200 instances; the higher the value of N, the fewer rules are made; N should be chosen as a function of the total number of instances (-N),

- 2, 5: number of optimization runs (-O),
- FALSE/TRUE: use pruning,
- TRUE: check for accuracy level (< 50 %) as a stopping criterion,
- 3: number of folds into which the learning dataset is split; two folds for growing the tree and one for pruning.

That results in the following twenty algorithms being considered for JRip: "JRip -N 2 -O 2", "JRip -N 5 -O 2", "JRip -N 10 -O 2", "JRip -N 100 -O 2", "JRip -N 200 -O 2", the corresponding five settings with -O 5, and the same ten settings with pruning switched off.

3. NB

- FALSE/TRUE: use supervised discretization (-D),
- Naïve Bayes Simple (NBS).

That results in the following three algorithms being considered for NB: "NB -D", "NB" and "NBS".

4. RF

- 10, 50, 100: number of trees to be generated (-I),

- 2, 3, 6, 10: number of predictive variables randomly selected at each tree node (-K).

That results in the following twelve algorithms being considered for RF:

"RF I 10 K 2", "RF I 10 K 3", "RF I 10 K 6", "RF I 10 K 10", "RF

-PREDICTION EXPERIMENTS PREPARATION 93

I 50 -K 2", "RF -I 50 -K 3", "RF -I 50 -K 6", "RF -I 50 -K 10", "RF -I 100 -K 2", "RF -I 100 -K 3", "RF -I 100 -K 6" and "RF -I 100 -K 10".

For all algorithms, the random seed parameter was set to S = 1.
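The enumeration above can be reproduced programmatically; the short sketch below regenerates the four parameter grids and confirms the total of 47 (12 + 20 + 3 + 12). The option letters follow those given in the text; the "-P" flag used here to mark the JRip variants with pruning switched off is our own placeholder, since the text does not name one.

# Regenerate the algorithm/parameter combinations and check the total of 47.
from itertools import product

j48_pa = [f"{alg} -C {c}{' -B' if b else ''}"
          for alg, c, b in product(["J48", "PA"], [0.15, 0.25, 0.35], [False, True])]
jrip = [f"JRip -N {n} -O {o}{'' if prune else ' -P'}"   # '-P' = hypothetical no-pruning marker
        for n, o, prune in product([2, 5, 10, 100, 200], [2, 5], [True, False])]
nb = ["NB -D", "NB", "NBS"]
rf = [f"RF -I {i} -K {k}" for i, k in product([10, 50, 100], [2, 3, 6, 10])]

all_settings = j48_pa + jrip + nb + rf
print(len(j48_pa), len(jrip), len(nb), len(rf), len(all_settings))  # 12 20 3 12 47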

In the rest of this thesis, when using the term "machine learning algorithm", or simply "algorithm", we refer to a given algorithm with a given setting of its complexity parameters.