
4.4.1 Influence of the quality of the initial documents set

Looking at the sentences contained in the training data, we notice that many of them do not contain any gene name. Indeed, among the 15,000 sentences of the training set, only 7,676 contain at least one gene name. This means that almost 50% of the sentences can be considered noise.

Based on this observation, we hypothesize that it is possible to improve the precision significantly by removing all the “empty” sentences. Indeed, since “empty” sentences can only generate false positives, removing them from the initial sentence set should have a positive effect on the precision.

In order to study the influence of the quality of the initial sentence set on the performance, we generate three different sentence sets as input for the NER process. The first sentence set is composed of all the sentences provided for the task. The second sentence set is cleaned using a binary classifier that aims to select only the sentences containing at least one gene name. Finally, the last set is free of all “empty” sentences, as it is built from the expert curation of the training data.

4.4.1.1 Selecting the sentence set using classification

In order to remove the “empty” sentences, we can use a machine-learning algorithm to classify the sentences of the initial set as likely or unlikely to contain a gene name. Since, in the previous chapter, we developed a model that classifies documents as likely or unlikely to contain a gene interaction, we can use the same strategy to build a specific model that distinguishes relevant from irrelevant sentences. A minimal sketch of such a sentence classifier is given below.
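The fragment below is only an illustrative sketch of such a sentence classifier, not the exact pipeline used in our experiments: it assumes scikit-learn, uses LinearSVC as a stand-in for LibSVM and mutual information as a stand-in for information gain, and relies on a hypothetical four-sentence corpus; the actual parameters are detailed in the next subsection.

# Illustrative sketch of the sentence-level classifier (assumptions: scikit-learn,
# LinearSVC instead of LibSVM, mutual information instead of information gain).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data: label 1 if the sentence contains at least one gene name.
sentences = [
    "The p53 protein regulates apoptosis.",
    "BRCA1 mutations increase the risk of breast cancer.",
    "The experiment was repeated three times.",
    "Samples were stored at minus eighty degrees.",
]
labels = [1, 1, 0, 0]

classifier = Pipeline([
    ("bow", CountVectorizer(binary=True)),                  # single words, binary term weighting
    ("select", SelectKBest(mutual_info_classif, k="all")),  # stand-in for information gain selection
    ("svm", LinearSVC()),                                    # stand-in for LibSVM with default parameters
])
classifier.fit(sentences, labels)
print(classifier.predict(["The MDM2 gene product binds p53."]))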

4.4.1.1.1 Developing a task-specific model

We build the new model based on the 15,000 sentences provided in the training set. In this data set, the two classes are almost equally distributed: 7,676 (51.2%) positive instances and 7,324 (48.8%) negative instances. As the distribution on the test set is similar to that of the training set, we do not need to balance the data. To build our model, we employ the same parameters as those used in the previous chapter:

• single words as features

• information gain feature selection

• binary term weighting schema

• LibSVM classifier (default parameters)

4.4.1.1.1.1 Cross-validation on the training set

                TP Rate   FP Rate   Precision   Recall   F-Measure
True             51.3%      5.6%      90.6%      51.3%     65.5%
False            94.4%     48.7%      64.9%      94.4%     76.9%
Macro-average    69.5%     30.2%      76.4%      69.5%     71.1%

Table 31 Classification of the sentences of the training set using cross-validation

As the training instances are fairly well balanced, we expect well-balanced results (Table 31). Surprisingly, the results show that the “false” class is favored: whereas the recall obtained on the false instances is close to 95% (94.4%), the recall on the positive instances barely exceeds 50% (51.3%). This means that many sentences containing gene names have been classified as “empty” sentences. Such behavior is problematic: every false negative definitively removes a chance of finding the gene names it contains, which is incompatible with our goal of keeping the recall on the true class as high as possible. Consequently, in order to rebalance these results, we use a cost-sensitive classifier that increases the cost of producing a false negative. A rough sketch of this idea, expressed with class weights, is given below.
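As an illustration only, the following lines show one way of expressing such a cost-sensitive setup through class weights in scikit-learn; our experiments rely on a cost matrix applied to the classifier described above, so this is an approximation, and the names X_train and y_train are placeholders.

from sklearn.svm import LinearSVC

# One of the misclassification costs explored in Table 32 (illustrative value).
false_negative_cost = 1.3

# Giving the positive class a higher weight penalizes false negatives more heavily
# than false positives, which raises the recall on the true class at the expense
# of precision, as observed in Tables 32 and 34.
weighted_svm = LinearSVC(class_weight={1: false_negative_cost, 0: 1.0})
# weighted_svm.fit(X_train, y_train)  # X_train, y_train are placeholders for the real data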

Cost   TP Rate   FP Rate   Precision   Recall   F-Measure   Overall Accuracy
1       51.3       5.6       90.6       51.3      65.5        72.3
1.2     87.2      12.8       86.0       74.4      79.9        80.8 (+8.5)
1.3     91.8      39.9       70.7       91.8      79.9        76.3 (+4.0)
1.4     98.5      89.6       53.6       98.5      69.4        55.5 (-16.8)

Table 32 Performance on the true instances for different misclassification costs, using cross-validation

Table 32 shows that it is possible to improve the recall on the true instances by increasing the misclassification cost of false negatives.

4.4.1.1.1.2 Classifying test set instances

We saw previously that we have to be cautious about drawing general conclusions from results obtained by cross-validation on the training set. If the nature of the training set differs from that of the test set, the results obtained with cross-validation can be very different from those obtained by applying the generated model to the test set.

                TP Rate   FP Rate   Precision   Recall   F-Measure
True              52.4       6.4      90.3       52.4      66.3
False             93.6      47.6      63.4       93.6      75.6
Macro-average     72.5      26.5      77.2       72.5      70.8

Table 33 Classification of the sentences of the test set using the model built on the training set

In this subsection, we check whether the conclusions drawn from the training set still hold on the test set. A first look at the results is reassuring, as they are very similar to those obtained on the training set (Table 33). This similarity between the performance on the training and test sets is confirmed when we look at the results for different misclassification costs.

Cost   TP Rate   FP Rate   Precision   Recall   F-Measure   Overall Accuracy
1        52.4       6.4      90.3       52.4      66.3        71.7
1.2      71.6      10.6      87.5       71.6      78.8        80.3 (+8.6%)
1.3      84.7      21.7      81.6       84.7      83.1        81.7 (+10.0%)
1.4      95.5      76.6      58.5       95.5      72.6        61.7 (-10.0%)

Table 34 Performance on the true instances for different misclassification costs on the test set

We observe in Table 34 that the evolution of the performance as a function of the misclassification cost follows the same tendency as the one observed on the training set. The more costly it is to classify a positive instance as negative (false negative), the more the algorithm favors the true class. Consequently, the higher the misclassification cost, the higher the recall on the true class. Unfortunately, this has the side effect of reducing the precision, as more negative instances are classified as positive (false positives).

4.4.1.2 Influence of the input data set on the performance

Now that we have produced different input sets of sentences, we want to study the influence of their quality on the overall performance. For this purpose, we provide these three distinct sets of sentences as input to the NER process and observe the influence on the results. The first set of sentences is the one initially provided for the task: it is composed of 15,000 sentences, of which 7,324 are “empty” sentences. The second set of sentences is the one generated using the binary classification performed in the previous subsection; it contains 8,291 sentences, of which 7,271 contain at least one gene. The last set of sentences is composed exclusively of the 7,676 sentences containing at least one gene (Table 35).

Sentence set          Nb sentences   Relevant (%)   Irrelevant (%)   Recall   Precision
Initial                  15,000          51.2            48.8          100       51.2
Positive classified       9,967          70.7            29.3           91.8     87.7
Perfect                   7,676         100               0            100      100

Table 35 Composition of the different sentence sets

In order to evaluate the influence of these different input sets of sentences, we do not employ a complex model involving classification. Instead, we perform simple experiments that focus on the influence of the input data set without interference from other parameters. The first set of experiments focuses on the recall: it attempts to select the largest possible number of candidates by considering every “uncommon” English term as a candidate. The second experiment focuses more on the precision, as the candidates are selected based on whether they belong to the GPSDB terminology.
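For reference, the scores reported in the following tables are standard precision, recall, and F-measure computed over the extracted mentions; the small sketch below, which uses purely hypothetical gold and predicted mention sets, only illustrates how such scores are obtained.

# Illustrative computation of precision, recall and F-measure over extracted mentions.
def evaluate(predicted, gold):
    tp = len(predicted & gold)                       # correctly extracted mentions
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Hypothetical example: three gold mentions, four extracted candidates.
gold = {("s1", "p53"), ("s1", "mdm2"), ("s2", "BRCA1")}
predicted = {("s1", "p53"), ("s2", "BRCA1"), ("s2", "protein"), ("s3", "HeLa")}
print(evaluate(predicted, gold))  # (0.5, 0.667, 0.571)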

4.4.1.2.1 Recall-focused evaluation

In this first set of experiments, we evaluate the influence of the initial set of sentences on the recognition process when the focus is on the recall. This experiment helps us find out whether the positive sentences missed during the binary classification (false negatives) have an influence on the overall recall.

In order to extract the candidate genes, we select all the words of the sentences that do not occur in the “common” English lexicon (see section 3 for more details). This method allows us to reach an almost perfect recall, as many words are selected as candidates. Unfortunately, because many words that do not belong to the “common” English lexicon are not necessarily related to a gene name, a lot of false positives are also produced. A minimal sketch of this selection is shown below.
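The following sketch illustrates this recall-oriented selection under simplifying assumptions: the tokenization is naive and the “common” English lexicon is reduced to a handful of placeholder words, whereas the real lexicon is the one described in section 3.

import re

# Placeholder standing in for the "common" English lexicon described in section 3.
common_english = {"the", "kinase", "protein", "phosphorylates", "in", "human", "cells"}

def candidate_mentions(sentence):
    # Naive tokenization; every token absent from the lexicon becomes a candidate.
    tokens = re.findall(r"[A-Za-z0-9\-]+", sentence)
    return [t for t in tokens if t.lower() not in common_english]

print(candidate_mentions("The MAPK1 kinase phosphorylates Elk-1 in human cells"))
# ['MAPK1', 'Elk-1'] -> high recall, but any unusual token is also selected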

Document set          Nb words   Recall   Precision       F-Measure
Initial                92,528     95.9     17.2            29.2
Positive classified    70,941     95.9     20.6 (+3.4%)    33.9 (+4.7%)
Perfect                67,882     95.9     23.5 (+6.3%)    37.8 (+8.6%)

Table 36 Performance measures obtained by extracting the non-"common" English words

We observe in Table 36 that the positive sentences missed during the classification step have no effect on the recall of the mention extraction. This constant recall seems to indicate that the sentences classified as false negatives during the classification step contain very ambiguous gene names (i.e., names that look like “common” English words), which this method cannot recognize anyway.

We also see that there is a significant increase in precision when the number of irrelevant sentences in the input set is reduced. However, the improvement in precision is not proportional to the reduction of unsuitable sentences: as the irrelevant sentences are not likely to contain many candidates, removing almost 50% of the input sentences only increases the precision by 6.3 points.

4.4.1.2.2 Precision-focused evaluation

In contrast to the previous experiments, where the aim was to keep the recall as high as possible, here we are more interested in maximizing the precision. For this purpose, we identify the terms of interest by matching them against a controlled lexicon; the lexicon used as reference in this experiment is GPSDB. By matching the words of the documents against those of the GPSDB lexicon, we expect to extract high-quality candidates, but also to miss a significant number of terms, as no controlled lexicon can cover all the term variations that appear in text. Indeed, as soon as new terms appear, or when authors employ an unusual synonym (a common practice), these unexpected terms are not recognized since they do not occur in the controlled vocabulary. A minimal sketch of this dictionary matching is shown below.
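The sketch below illustrates this precision-oriented selection under simplifying assumptions: the GPSDB lexicon is reduced to a few placeholder entries and the matching is a naive exact, case-insensitive lookup, whereas the real lexicon contains a large number of gene and protein names and synonyms.

import re

# Placeholder subset standing in for the GPSDB lexicon.
gpsdb_lexicon = {"mapk1", "erk2", "p53", "tp53"}

def dictionary_mentions(sentence):
    # A candidate is kept only if it matches an entry of the lexicon.
    tokens = re.findall(r"[A-Za-z0-9\-]+", sentence)
    return [t for t in tokens if t.lower() in gpsdb_lexicon]

print(dictionary_mentions("The MAPK1 kinase phosphorylates Elk-1 in human cells"))
# ['MAPK1'] -> precise, but Elk-1 is missed because it does not occur in the lexicon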

Document set          Nb terms   Recall   Precision        F-Measure
Initial                27,099     53.9     34.4             42.0
Positive classified    20,332     53.9     42.7 (+8.3%)     47.7 (+5.7%)
Perfect                20,043     53.9     46.5 (+12.1%)    49.9 (+7.9%)

Table 37 Performance measures obtained by matching the extracted terms against the GPSDB lexicon

As in the previous experiment, we see that the positive sentences missed during the classification (false negatives) have no negative effect on the overall recall (Table 37). We also see that reducing the number of “empty” sentences in the input data set has a beneficial effect on the overall performance by reducing the number of false positives.