
3 Binary Classification

3.4 Experimental settings

3.4.1 Setting the baseline

Tuning a machine-learning algorithm is a complex process, as many parameters influence the outcome of the classification. Obviously, we have to choose a learning algorithm, but this is only one of the many parameters to be considered. The feature selection method, the weighting schema and the algorithm-specific parameters also strongly influence the outcome of the classification.

Finding the best performance for our problem by making an exhaustive search over every combination of parameters would be too time-consuming; we cannot test all the combinations. Therefore, we perform a limited number of experiments that help us skim through the investigated space. Instead of spending a lot of effort searching the full space of parameters (#P1 × #P2 × … × #Pn combinations), we split the search space by tuning the parameters independently (#P1 + #P2 + … + #Pn combinations). For instance, with three parameters of ten candidate values each, the exhaustive search requires 10 × 10 × 10 = 1,000 experiments, whereas the independent tuning requires only 10 + 10 + 10 = 30.

The parameters are tuned in the following order:

1) We tune the internal parameters of the classification algorithm first,

2) Then, we choose the best feature selection algorithm,

3) Finally, we select the most suitable weighting schema.

The experiments are performed separately on the training set and on the test set. To evaluate the performance on the training set, we perform a 10-fold cross-validation. To evaluate the test set, we build the model on the complete training set and then apply this model to the test set.
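As an illustration only, the following sketch reproduces this evaluation protocol with scikit-learn's LibSVM-based SVC; the synthetic data, split and default parameters are placeholders and do not reproduce the actual experimental setup.

    # Sketch of the evaluation protocol: 10-fold cross-validation on the
    # training set, then a single model built on the full training set and
    # applied to the held-out test set. Synthetic data stands in for the
    # real feature vectors.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    clf = SVC(kernel="rbf")

    # Training-set evaluation: 10-fold cross-validation on the training data only.
    cv_scores = cross_val_score(clf, X_train, y_train, cv=10, scoring="accuracy")
    print("10-fold CV accuracy on the training set: %.3f" % cv_scores.mean())

    # Test-set evaluation: fit on the complete training set, predict once on the test set.
    clf.fit(X_train, y_train)
    print("Accuracy on the test set: %.3f" % clf.score(X_test, y_test))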

3.4.1.1 Classification Algorithm

There are many classification algorithms, but not all of them are relevant for text classification. A quick review of the literature shows that only a small subset of classifiers provides suitable results on text classification tasks. We choose to perform our experiments using a Support Vector Machine (SVM) algorithm because it is a state-of-the-art algorithm that has already shown satisfactory results in the literature for similar tasks.

3.4.1.1.1 Search for optimal parameters on LibSVM

Several parameters can influence the performance obtained with an SVM. The first one is the type of kernel that is employed. As we have not investigated the inner workings of the SVM in depth, we decide to use one of the most commonly employed kernels, the radial basis function (RBF) kernel.
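For reference, the RBF kernel is the source of the γ parameter tuned together with C below; its standard definition is:

\[
K(x_i, x_j) = \exp\!\left(-\gamma \,\lVert x_i - x_j \rVert^2\right), \qquad \gamma > 0 .
\]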

We define the problem of SVM as follows:

\[
\min_{w,\,b,\,\xi} \;\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i} \xi_i \tag{3}
\]

The term on the left represents the margin and the term on the right the classification error. To minimize the objective, the larger C is, the smaller the sum of the slack variables ξ_i must be. The C parameter controls the number of outliers allowed; it can also be seen as the penalty parameter of the error term. The larger C is, the fewer errors are tolerated and, therefore, when the data is not separable, the smaller the margin. Conversely, the smaller C is, the more errors are tolerated and the larger the margin. As we do not know beforehand which C and γ are the most appropriate for a given problem, some kind of model selection (parameter search) must be done. To discover the best values for these two parameters, we vary them over their respective domains and keep the combination that produces the maximum performance. This method is known as a grid search. The grid search is straightforward but may seem naive; indeed, several more advanced methods can save computational cost, for example by approximating the cross-validation rate.

However, there are two reasons why we prefer the simple grid search approach. One is that, psychologically, we may not feel safe using methods that avoid an exhaustive parameter search through approximations or heuristics. The other is that the computational time needed to find optimal parameters by grid search is not much higher than with advanced methods, since there are only two parameters. Furthermore, the grid search can be easily parallelized because each (C, γ) pair is independent, whereas many of the advanced methods are iterative processes, e.g. walking along a path, which might be difficult to parallelize.

We decided to tune these two parameters first because it is a very resource-consuming process: for every combination of parameters, we have to build and test a new model. Moreover, as we perform cross-validation to evaluate the performance on the training set, these experiments run for days.
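The grid search described above can be sketched as follows; the candidate values for C and γ are illustrative and do not reproduce the exact grid explored in our experiments.

    # Illustrative (C, gamma) grid search with 10-fold cross-validation.
    # Each (C, gamma) pair is evaluated independently, which is why the
    # search parallelizes trivially (n_jobs=-1).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, n_features=50, random_state=0)

    param_grid = {
        "C": [0.1, 1, 3, 10, 100],
        "gamma": [1e-4, 9e-4, 1e-3, 1e-2, 1e-1],
    }

    search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                          cv=10, scoring="accuracy", n_jobs=-1)
    search.fit(X, y)
    print("Best parameters:", search.best_params_)
    print("Best cross-validated accuracy: %.3f" % search.best_score_)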

Figure 10 Accuracy given C and gamma parameter variation on the training set

Figure 10 shows the accuracy obtained while varying the C and gamma parameters. The maximum is obtained with C = 3 and gamma = 0.0009. We notice that if the parameters are not chosen properly, the accuracy obtained falls back to that of the initial class distribution (i.e. the majority-class baseline). Even if it is not certain that these C and gamma values remain optimal when another parameter of the learning process changes, from this point forward we employ these values in our experiments. Moreover, as we are more interested in analyzing the gain or loss of performance brought by a specific method than in searching for the best absolute accuracy, keeping the parameter values stable is the most appropriate way to perform the comparisons.

3.4.1.2 Balancing the data

In a learning process, when the proportion between positive and negative instances is strongly imbalanced, the classifier tends to assign every new unseen instance to the class that is most represented in the training data. Such a situation is not problematic if the class distribution in the test set is similar to that of the training set, but there is no evidence that this is the case. If the class distribution changes in the test data, the "majority rule" learned on the training set is no longer adapted to this new set of data and can lead to misclassification. To avoid this phenomenon, one can balance the data. Balancing the data reduces the influence of the initial distribution on the classifier and produces a model better adapted to classifying previously unseen instances.

In our problem, among the 5,495 available training instances, 3,536 (64%) are positive and 1,959 (36%) are negative. This distribution is fairly well balanced; however, it is still interesting to check whether it has an influence on the performance. As the positive instances are more frequent, the classifier will tend to favor the positive class.

We attempt to balance our distribution using two of the most common approaches: the resampling method and the use of a cost matrix. In the resampling approach, the data is balanced by oversampling the under-represented instances or undersampling the over-represented ones. In particular, we compute the mean between the number of positive and negative instances, increase the number of under-represented instances to this threshold and reduce the number of over-represented instances to the same threshold. The second approach consists of increasing the cost of misclassifying the instances belonging to the under-represented class. By doing so, the initial tendency of the classifier to assign new instances to the more represented class is moderated by the high misclassification cost induced. In our specific problem, we increase the misclassification cost of false positives. By doing this, we make the classifier more careful about classifying an instance as positive and thus avoid favoring the positive class. A simple way to compute the misclassification cost is to look at the ratio between the positive and the negative instances in the training set.

In our case, the ratio is 1.7, as there are 1.7 times more positive instances than negative ones.
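A minimal sketch of the two balancing strategies, assuming a scikit-learn pipeline: the class_weight argument plays the role of the cost matrix, and the resampling step brings both classes to the mean of their counts. The data and class proportions are illustrative.

    # Strategy 1 - cost matrix: errors on the under-represented negative class
    # (label 0) cost 1.7 times more, mirroring the [0, 1; 1.7, 0] matrix.
    # Strategy 2 - resampling: over/undersample both classes to the mean of
    # the two class counts.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC
    from sklearn.utils import resample

    # Roughly 64% positive / 36% negative, as in the training set above.
    X, y = make_classification(n_samples=1000, n_features=20,
                               weights=[0.36, 0.64], random_state=0)

    cost_sensitive_clf = SVC(kernel="rbf", class_weight={0: 1.7, 1: 1.0}).fit(X, y)

    X_pos, X_neg = X[y == 1], X[y == 0]
    target = (len(X_pos) + len(X_neg)) // 2
    X_pos_rs = resample(X_pos, n_samples=target,
                        replace=len(X_pos) < target, random_state=0)
    X_neg_rs = resample(X_neg, n_samples=target,
                        replace=len(X_neg) < target, random_state=0)
    X_bal = np.vstack([X_pos_rs, X_neg_rs])
    y_bal = np.array([1] * target + [0] * target)
    resampled_clf = SVC(kernel="rbf").fit(X_bal, y_bal)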

Balancing strategy            Training set             Test set
                              Accuracy   F-measure     Accuracy   F-measure
None                          93.0       92.2          74.7       74.2
Cost matrix [0, 1; 1.7, 0]    92.3       91.6          73.1       73.1
Resampling                    91.7       91.0          72.1       72.2

Table 6 Performance obtained with and without a balancing strategy (cost matrix or resampling)

We observe in Table 6 a loss of accuracy induced by the different strategies employed to balance the ratio between the over-represented and the under-represented instances. This seems to indicate that the classification algorithm naturally handles this mildly imbalanced distribution.

3.4.1.3 Results of Feature Selection

Performing feature selection is a clever way to reduce the complexity of a classification problem by discarding all the features that do not help to make a decision. Three different feature selection strategies are tested (a code sketch follows the list):

- No feature selection

- Information gain feature selection

- Chi2 feature selection
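The two selection strategies can be sketched as follows with scikit-learn, where mutual_info_classif serves as an approximation of information gain; the toy documents and the number of retained features k are purely illustrative.

    # Information gain (approximated by mutual information) and chi-squared
    # feature selection on a small bag-of-words matrix.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

    docs = ["good great excellent product", "bad poor terrible product",
            "great excellent quality", "poor awful terrible quality"]
    labels = [1, 0, 1, 0]  # toy positive / negative classes

    X = CountVectorizer().fit_transform(docs)

    # Keep the k features with the highest score under each criterion.
    X_ig = SelectKBest(mutual_info_classif, k=3).fit_transform(X, labels)
    X_chi2 = SelectKBest(chi2, k=3).fit_transform(X, labels)

    print(X_ig.shape, X_chi2.shape)  # (4, 3) (4, 3)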

Feature selection algorithm        Remaining features
                                   Training set    Test set
Initial                            41957           11988
None (stop word removal only)      41653           11901
Information gain                   3073            1723
Chi2                               3070            1720

Table 7 Number of selected features on the training and test sets after the feature selection process

Feature selection     Training set              Test set
                      Accuracy   F-measure      Accuracy   F-measure
None                  NOT ENOUGH MEMORY         74.5       74.4
Information gain      93.0       92.2           74.4       74.2
Chi2                  93.0       92.2           70.0       70.0

Table 8 Accuracy and F-measure obtained by classification on the feature spaces resulting from each feature selection process

Table 8 shows that the best results are obtained when no feature selection is performed. This echoes the observations made by (Brank et al., 2002; Rogati & Yang, 2002), who noticed in their experiments that SVM performance was reduced by the use of a feature selection algorithm. However, even if the performance obtained without feature selection is the best, working on the initial feature space also has drawbacks. Indeed, the size of the initial feature space is at least 10 times larger than with a feature selection strategy. Consequently, applying any classification algorithm to such a large initial feature space takes much more time than working with a reduced feature space.

3.4.1.4 Feature Weighting

When working with the bag-of-words representation, knowing the frequency of the words in the documents or in the corpus can bring crucial information for classification. Indeed, such frequency information reveals the importance of a concept in each document: it makes it possible to differentiate documents where a concept is central and appears repeatedly from documents where the same concept appears only once. Many weighting schemas exist; they were initially developed for the information retrieval field, but they can also be employed for classification. As finding the most efficient schema is not the central focus of our work, we only test the two most common weighting schemas, illustrated in the sketch after the list below, to see if they bring a substantial improvement41:

• Binary weighting schema: with this schema, a word that appears in a document takes a weight of 1 and a word that does not appear takes a weight of 0.

• Term frequency: with this schema, a word takes as weight its number of occurrences in the document. A document where the word "X" appears 5 times is given a weight of 5 for the feature "X".
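Both schemas correspond to a single switch in a standard bag-of-words vectorizer; the sketch below uses scikit-learn's CountVectorizer on toy documents.

    # Binary weighting versus raw term frequency on two toy documents.
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["alpha alpha beta", "beta gamma"]

    binary_vec = CountVectorizer(binary=True)   # present -> 1, absent -> 0
    tf_vec = CountVectorizer()                  # weight = number of occurrences

    print(binary_vec.fit_transform(docs).toarray())  # [[1 1 0] [0 1 1]]
    print(tf_vec.fit_transform(docs).toarray())      # [[2 1 0] [0 1 1]]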

41 We also tested TF-IDF schemas but dismissed the results as they were not very conclusive.

Weighting schema     Training set                Test set
                     Accuracy   F-measure        Accuracy   F-measure
Binary               92.8       91.9             70.3       69.9
Term frequency       93.0       92.2 (+0.3%)     74.5       74.4 (+4.5%)

Table 9 Accuracy and F-measure obtained by classification (LibSVM, C=3, gamma=0.009) on the features weighted using the given weighting schema

We observe in Table 9 that the best experimental results are obtained using the term frequency weighting schema. More surprisingly, the small performance variation induced by a specific weighting schema on the training set is much larger on the test set. Indeed, whereas the variation between the best and the worst accuracy on the training set is only 1.3%, the difference on the test set is 5.6%. A reason that can explain this disparity of scale is the absence of some key features in the test set: as the decisions taken on the test set rely on fewer features than on the training set, the weights of the remaining features take on greater importance.

At the end of all these tuning experiments, let us summarize the best parameters:

- Information gain feature selection

- Term frequency weighting

- No balancing of the data required
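To tie the retained choices together, a hedged sketch of the resulting baseline as a single scikit-learn pipeline is given below; train_documents, train_labels and the value of k are placeholders, and mutual_info_classif again stands in for information gain.

    # Baseline configuration summarized above: term-frequency weighting,
    # information-gain-style feature selection, and the tuned RBF SVM
    # (C and gamma follow Figure 10; k is illustrative).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    baseline = Pipeline([
        ("tf", CountVectorizer(stop_words="english")),
        ("ig", SelectKBest(mutual_info_classif, k=3000)),
        ("svm", SVC(kernel="rbf", C=3, gamma=0.0009)),
    ])

    # baseline.fit(train_documents, train_labels) would build the model on the
    # full training set before applying it to the test set with
    # baseline.predict(test_documents).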