

4.4.2 Setting algorithm parameters

After analyzing the impact of the input data set on the performance, we compare two types of algorithms, a sequential and a non-sequential one, to determine which is better suited to the problem. For both algorithms, we test a wide range of features and analyze their impact on the classification performance.

Before testing the different sets of features, however, we first tune each algorithm as well as possible, as interesting gains can sometimes be obtained by adjusting the parameters, especially for this problem. In the following subsections, we present the parameters that have been tuned for the non-sequential and sequential algorithms.

4.4.2.1.1 Tuning of decision tree parameters

The only parameter that we attempt to tune for our decision tree is the confidence factor. This parameter acts as a threshold during the pruning process: it determines the confidence value used when pruning the tree (removing branches that are deemed to provide little or no gain in the statistical accuracy of the model). Since the default of 25% works reasonably well in most cases, it rarely needs to be modified. However, if the error rate on real data (or the error rate on cross-validation) is significantly higher than the error rate on the training data, one may want to decrease the confidence factor to cause more drastic pruning and obtain a more general model. Conversely, to obtain a model more specific to the training data, one can increase the confidence factor, which decreases the amount of pruning that occurs. We observe that varying the confidence factor induces some variation in recall and precision but keeps the F-measure stable, with variations that remain under 0.5%. It is therefore not worth tuning this parameter further, and we keep the default value.
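As an illustration of the mechanism behind the confidence factor, the following is a minimal sketch of C4.5-style pessimistic-error pruning, the scheme this parameter controls in C4.5-family learners. The function names and leaf statistics are hypothetical, and the normal approximation stands in for the exact binomial bound used by real implementations.

```python
# Sketch of C4.5-style pessimistic-error pruning (the mechanism controlled
# by the confidence factor); names and figures are illustrative only.
from statistics import NormalDist

def pessimistic_error(errors, n, cf):
    """Upper confidence bound on the true error rate of a node that
    misclassified `errors` of its `n` training instances (normal approx.)."""
    e = errors / n
    z = NormalDist().inv_cdf(1 - cf)   # smaller cf -> larger z -> more pruning
    return e + z * ((e * (1 - e)) / n) ** 0.5

def should_prune(leaf_stats, collapsed_errors, n, cf=0.25):
    """Prune the subtree if the collapsed node's pessimistic error does not
    exceed the weighted pessimistic error of its leaves.
    leaf_stats: list of (errors, count) pairs, one per leaf."""
    subtree = sum(pessimistic_error(err, cnt, cf) * cnt
                  for err, cnt in leaf_stats) / n
    return pessimistic_error(collapsed_errors, n, cf) <= subtree
```

Lowering `cf` inflates the error bound, which makes collapsing a subtree into a single node look comparatively cheaper and thus prunes more aggressively, matching the behaviour described above.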

4.4.2.1.2 Tuning CRF parameters

The CRF parameters are divided into two groups: the parameters of the template and the parameters of the algorithm. As the number of possible combinations of all these parameters is quite high, we evaluate the two groups separately.

4.4.2.1.3 CRF Templates

CRF models take decisions on the instances to be classified based on contextual information. The context is defined using two parameters: the context depth45 (CD) and the context sequence. The first parameter indicates the depth up to which the context must be taken into consideration for each feature. A depth of 1 indicates that we only take the features of the current position into consideration, while a depth of n>1 indicates that we also consider preceding and/or following features within a limit of +/-n positions.

The other parameter that can be tuned relates to sequences of features. It concerns situations where some features are not relevant when analyzed independently. Using this parameter, we can define which sequence length to consider (CpLS) and the maximal depth of the context at which we should look (CpLD).

The way the context is taken into account for the classification is defined in a template file. Thanks to such templates, it is thus possible to inject external knowledge into the model. For our experiment, we decide to automatically generate a limited number of templates that cover the main interesting variations. In order to evaluate the effect of the three template parameters on the performance, we vary them within fixed intervals (from 1 to 3) and summarize the results.

45 Equivalent to the order in HMMs.
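The automatic template generation can be sketched as follows, assuming a CRF++-style template syntax in which `%x[row,column]` references a feature at a relative position. The helper `make_template` is hypothetical; it is not the generator actually used in the experiment.

```python
# Hypothetical generator of CRF++-style templates from the three parameters
# (CD, CpLD, CpLS); the real generator used in the thesis is not shown.
def make_template(cd, cpld, cpls, column=0):
    lines, uid = [], 0
    # context depth (CD): one unigram feature per position in [-cd, +cd]
    for off in range(-cd, cd + 1):
        lines.append(f"U{uid:02d}:%x[{off},{column}]")
        uid += 1
    # feature sequences: windows of length CpLS within depth CpLD
    for start in range(-cpld, cpld - cpls + 2):
        if cpls > 1:
            refs = "/".join(f"%x[{start + k},{column}]" for k in range(cpls))
            lines.append(f"U{uid:02d}:{refs}")
            uid += 1
    return "\n".join(lines)
```

With CD=1, CpLD=3, CpLS=1 (the 1-3-1 setting retained below), this sketch produces only the three unigram features around the current position, since a sequence length of 1 adds no combined features.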

Figure 20 Performance variation given different template parameters

Looking at Figure 20, we see that, whereas the parameters have a limited impact on the precision, they can have a significant impact on the recall. Following this exploration of the parameter space, the 1-3-1 template was chosen, as it achieves the best F-measure.

4.4.2.1.4 CRF classification parameters

In addition to searching for the best template, we can adjust specific algorithm parameters to generate different outputs and attempt to improve the performance. Among the possible parameters, we select three that can influence the effectiveness of the learner.

• The cost parameter C: this parameter changes the hyper-parameter of the CRF that trades off overfitting against underfitting.

• The termination criterion: the algorithm estimates the error rate at every step of the learning process. When the estimated error level goes below the threshold defined by this parameter, the model is considered sufficiently trained and the iterative process stops.

• The maximum iteration parameter: this parameter indicates the maximum number of iterations before the learning process is stopped. It can happen that the minimum error rate remains above the fixed threshold, so that the algorithm cannot stop given the termination criterion alone; in that case, it stops after this defined number of iterations.
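The interplay between the last two parameters can be sketched as a simple stopping loop. The names `train` and `error_at` are hypothetical, and the error values are simulated rather than produced by a real CRF learner.

```python
# Minimal sketch of the stopping logic controlled by the termination
# threshold (eta) and the maximum-iteration parameter.
def train(error_at, eta=1e-4, max_iter=100):
    """Iterate until the estimated error falls below eta, or until
    max_iter is reached; error_at(it) simulates the learner's estimate."""
    for it in range(1, max_iter + 1):
        err = error_at(it)          # estimated error at this step
        if err < eta:               # termination criterion satisfied
            return it, "converged"
    return max_iter, "max_iter reached"
```

For example, an error curve that never drops below the threshold triggers the second stopping condition, which is exactly the role of the maximum-iteration parameter.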


Figure 21 F-measure evolution given the variation of Cost, Eta and max iter

Figure 21 shows that modifying these parameters does not induce significant variations in the performance on the training set; the default parameters are therefore well adapted.

4.4.3 Sequential vs. Non-sequential model

Now that the two algorithms are tuned, we can start the core of our experiment and compare their performance using specific sets of features. We perform our experiment with three different types of features: the orthographic, the syntactical and the lexical features.

The training set is composed of 15,000 sentences and the test set of 5,000 additional sentences. For the evaluation on the training set, we perform a 10-fold cross-validation. For each experiment, we compare the difference between the training and test sets as well as the difference between the sequential and non-sequential algorithms.

4.4.3.1 Use of orthographic features

As seen earlier, the orthographic features relate to the structure of the words themselves. These features rely on the specific spelling inherent to protein/gene names; in particular, they capture the presence of uppercase letters, numbers and special characters.
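Such features can be extracted with simple character tests, as in the sketch below. The exact feature set of the experiment is not reproduced here; the feature names are illustrative.

```python
import re

# Illustrative orthographic features for a single token; the thesis's
# actual feature set is richer than this sketch.
def orthographic_features(token):
    return {
        "has_upper":   any(c.isupper() for c in token),
        "all_upper":   token.isupper(),
        "has_digit":   any(c.isdigit() for c in token),
        "has_special": bool(re.search(r"[^A-Za-z0-9]", token)),
        "mixed_case":  bool(re.search(r"[a-z][A-Z]", token)),
    }
```

A gene-like token such as "NF-kappaB" fires several of these tests at once, whereas most ordinary English words fire none, which is what gives these features their precision.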


                 Precision   Recall   F-Measure
Decision Tree
   Training         73.7      54.2       62.5
   Test             73.0      54.5       62.4
CRF
   Training         86.4      71.4       78.2
   Test             86.5      74.7       80.2

Table 38 Classification results using orthographic features

Orthographic features seem to produce results that favor precision over recall (Table 38). As many gene names have a particular morphology, including uppercase letters and sometimes figures, it seems logical that such features detect the terms of interest with precision. However, although many gene names have a singular spelling and morphology, a significant part of the terms do not possess any special orthographic clue and cannot be retrieved using such features. Regarding the algorithms, we clearly see that the sequential one outperforms the non-sequential one.

4.4.3.2 Syntactical features

The syntactical features are mainly responsible for providing part-of-speech (POS) and word-phrase information. The interest of representing the instances with such features lies in the increased generalization power. A model built using the words themselves as features has a high dimension and a weak generalization ability. On the contrary, when we represent the document with POS information, we produce a model at a higher level of abstraction that relies on fewer features. With such a model, it is more likely to find similarities between the training and test sets and to take decisions based on examples that have already been seen.
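The generalization argument can be made concrete with a toy example: word features produce one dimension per distinct word, while POS features collapse them onto a small tag set. The tag lookup below is a toy stand-in, not a real tagger.

```python
# Toy illustration: two sentences yield six distinct word features but
# only three distinct POS features (tags assigned by a toy lookup).
TOY_POS = {"the": "DT", "kinase": "NN", "phosphorylates": "VBZ",
           "a": "DT", "receptor": "NN", "binds": "VBZ"}

def feature_space(tokens, use_pos):
    """Size of the feature space: distinct tags if use_pos, else words."""
    return len({TOY_POS[t] if use_pos else t for t in tokens})

sents = ["the kinase phosphorylates a receptor".split(),
         "a receptor binds the kinase".split()]
tokens = [t for s in sents for t in s]
words = feature_space(tokens, use_pos=False)   # 6 distinct words
tags  = feature_space(tokens, use_pos=True)    # 3 distinct tags
```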

                 Precision   Recall   F-Measure
Decision Tree
   Training         64.9      62.7       63.8
   Test             64.8      60.6       62.6
CRF
   Training         83.8      68.1       75.1
   Test             85.7      70.4       77.3

Table 39 Performance measurement with the use of syntactical features

In comparison with the orthographic features, we see that the model built by the decision tree offers a lower precision and a better recall (Table 39). This seems to indicate that words related to genes share a common syntax, but that many irrelevant words share this syntax as well. Regarding the CRF algorithm, we can make the same observation as with the orthographic features, as the results are similar.

4.4.3.3 Use of Lexical features

We defined earlier three different lexicons from which to generate the lexical features: GPSDB, the Gene Mention lexicon and the WSJ. Each lexicon has its own properties. The Gene Mention lexicon is very specific to the task but, given its reduced size, is unlikely to possess a strong generalization ability.

The GPSDB lexicon has a much better coverage, as its purpose is to cover the maximum number of protein/gene names, synonyms and other variants, but it may not contain the specific variations found in the corpus. Finally, the WSJ lexicon works on a principle opposite to the other lexicons: instead of attempting to identify the words that could be part of protein/gene names, it aims to dismiss the words that are likely to belong to usual English.
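Lexical features then reduce to membership tests against these vocabularies, as in the sketch below. The lexicons here are tiny toy stand-ins for the real resources, and the feature names are hypothetical.

```python
# Toy lexicons standing in for GPSDB, the Gene Mention lexicon and the
# WSJ word list; the real resources are far larger.
GENE_MENTION = {"p53", "brca1"}
GPSDB        = {"p53", "brca1", "insulin", "hemoglobin"}
WSJ          = {"the", "market", "protein", "is"}

def lexical_features(token):
    t = token.lower()
    return {
        "in_gene_mention": t in GENE_MENTION,
        "in_gpsdb":        t in GPSDB,
        # WSJ works in the opposite direction: membership suggests the
        # token is ordinary English rather than a gene name
        "in_wsj":          t in WSJ,
    }
```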

                Lexicon         Precision   Recall   F-Measure
Decision Tree   All                81.1      63.6       71.3
                Gene Mention       80.5      62.0       70.0
                GPSDB              68.5      67.7       68.1
                Wall Street J      59.6      72.9       65.6
CRF             All                89.1      58.1       70.3
                Gene Mention       88.8      56.2       68.8
                GPSDB              86.5      67.1       75.6
                Wall Street J      86.0      67.1       75.4

Table 40 Classification performance on the test set using different controlled vocabularies

Working with the Gene Mention lexicon, we obtain the results with the highest precision and the lowest recall (Table 40). This phenomenon can be explained by the nature of the lexicon: because it is very specific, it does not contain as much noise as the other lexicons. Unfortunately, whereas this specificity is a strength regarding precision, it is a weakness regarding recall. Given its narrow coverage, its good performance on the training set is unlikely to carry over to the test set. In other words, the features derived from this lexicon have a different distribution in the test set, which prevents us from reaching an optimal recall. It is interesting to see that, with the CRF, the GPSDB and WSJ lexicons reach similar results. This is not surprising, as both lexicons ultimately tend to identify the same terms as relevant, but in opposite manners. This similarity of performance between the GPSDB and WSJ lexicons is not observed with the decision tree: with this algorithm, we obtain a lower precision using the WSJ lexicon than using the GPSDB one.

This can be explained by the fact that, whereas GPSDB attaches information to the words we are interested in, the WSJ attaches information to the words that should not be selected. Consequently, it is simpler to select the positive candidates with confidence (leading to a better precision) with the GPSDB lexicon, while it is easier to avoid wrongly dismissing candidates (leading to a better recall) with the WSJ lexicon.

                 Precision   Recall   F-Measure
Decision Tree
   Training         85.3      83.4       84.3
   Test             81.1      63.6       71.3
CRF
   Training         91.2      83.0       86.9
   Test             89.1      58.1       70.3

Table 41 Classification performance obtained by mixing the features of all corpora

In the case of the decision tree, the use of corpus features generates results with a balanced level of recall and precision when cross-validation is performed on the training set (Table 41). This demonstrates that the words of a controlled vocabulary can be valuable indicators for detecting gene names in text when the training and test documents are homogeneous. Unfortunately, these encouraging results obtained on the training set are not reproduced when the model is applied to the test set. This illustrates that such features are strongly dependent on the nature of the data set: if the nature of the test set is significantly different from that of the training set, the performance falls. Regarding the decision tree results, we can say that the corpus features cannot be generalized to every data set. With the conditional random field, we also reach a significantly better performance on the training set than on the test set.

4.4.3.4 Summary of results

In this subsection, we present a table that summarizes the results obtained using the different features. We clearly see that the sequential model strongly outperforms the non-sequential model.

Method          Feature        Precision        Recall           F-Measure
Decision Tree   All            80.5             75.5             77.9
                Syntactical    64.8 (-15.7%)    60.6 (-14.9%)    62.6 (-15.3%)
                Orthographic   73.0 (-7.5%)     54.5 (-21.0%)    62.4 (-15.5%)
                Lexical        81.1 (+0.6%)     63.6 (-11.9%)    71.3 (-6.6%)
CRF             All            87.5             72.8             79.5
                Syntactical    85.7 (-1.8%)     70.4 (-2.4%)     77.3 (-2.2%)
                Orthographic   86.5 (-1.0%)     74.7 (+1.9%)     80.2 (+0.7%)
                Lexical        89.1 (+1.2%)     58.1 (-14.4%)    70.3 (-7.0%)

Table 42 Summary of the performance obtained on the test set given the different approaches

Overall, we remark that, regarding precision, the CRF algorithm outperforms the decision tree (Figure 22). This confirms the idea that the context brings important information for determining the relevance of ambiguous words. Indeed, we saw in the previous experiment that many words are common both to the terminology related to gene names and to "common" English. Such words produce many false candidates (reducing the precision) when not selected carefully.

Figure 22 Comparison of the different approaches

Regarding the different sets of features employed, we see distinctly that the CRF algorithm makes the best use of the syntactical and orthographic features. For the lexical features, the combination of all the features is clearly the most efficient, certainly because the decision tree algorithm can make better decisions when many features are provided, as it has more possibilities to choose the most discriminative ones.
