
4 Gene Mention

4.5 Comparison with BioCreAtIvE II results

As explained at the beginning of this chapter, our experiments use the same data as the BioCreAtIvE II contest. We can therefore compare our best results with those obtained during the contest.

Comparing our results with those of the other groups highlights the strengths and weaknesses of our approach. Whereas our precision is similar to the best levels obtained by other groups, our recall is somewhat below the highest level (Figure 23). This lower recall can be explained by the evaluation approach. In the BioCreAtIvE II contest, the evaluation procedure was less strict than the one we performed, because it included alternative versions of gene names. As the boundaries of gene mentions are difficult to define, the organizers also provided a significant number of alternative versions for the existing genes. Including these variants in the evaluation increases the chance of matching a correct gene mention and, as a result, increases recall. In contrast, our evaluation framework accepted only a single form of each gene. We can therefore reasonably expect that some gene mentions that we count as false positives would have been classified as true positives had the alternative forms been included. The importance of these alternative forms is clearly visible in the contest results: on average, 22% (standard deviation 8%) of the mentions were matched against alternative forms of the provided reference gene names. Consequently, we can assume with confidence that our recall would have risen to a level similar to that of the other approaches if the variants had been taken into consideration.
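The difference between the two evaluation procedures can be illustrated with a minimal Python sketch. It is not the official contest scorer: the function name `evaluate`, the representation of mentions as character offsets, and the toy data are assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # hypothetical (start, end) character offsets of a mention


def evaluate(predicted: Set[Span],
             gold: List[Span],
             alternatives: Dict[Span, Set[Span]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure).

    A gold mention counts as found if the predictions contain its primary
    span or any of its accepted alternative spans. Passing an empty
    `alternatives` dictionary gives the strict, single-form evaluation.
    """
    true_positives = 0
    for primary in gold:
        accepted = {primary} | alternatives.get(primary, set())
        if accepted & predicted:
            true_positives += 1
    false_positives = len(predicted) - true_positives
    false_negatives = len(gold) - true_positives
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy data (invented): the second gold mention has an alternative boundary.
gold = [(0, 5), (10, 20)]
alternatives = {(10, 20): {(10, 15)}}
predicted = {(0, 5), (10, 15)}

strict = evaluate(predicted, gold, {})             # single reference form only
lenient = evaluate(predicted, gold, alternatives)  # alternative forms accepted
print(strict, lenient)
```

In this toy case the prediction `(10, 15)` is a false positive under the strict procedure and a true positive once the alternative form is accepted, so both precision and recall rise.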

Figure 23 Recall and precision obtained for the runs of the different groups

Figure 24 shows that, in terms of F-measure, we rank in the middle of the participating groups. As already discussed, an important reason for our lower F-measure is the lower recall. Consequently, taking into account the way we performed the evaluation, our method is likely to be competitive with the best systems proposed during the contest.

[Figure 23 plot: recall and precision of each run; axes: Recall, Precision; legend: Best, Others]

Figure 24 F-measure obtained by the different groups during BioCreAtIvE II

4.6 Discussion

In this section, we discuss the main observations drawn from our experiments.

4.6.1 The influence of selecting a relevant set of documents

The experiment performed to evaluate the influence of the quality of the input data set on NER performance clearly shows that filtering the input document set before the NER process is an efficient way to improve precision. Indeed, by removing the documents that contain no gene names, we also reduce the risk of selecting inappropriate candidates from these sentences. However, we have to be aware that filtering the input document set removes irrelevant documents but can also dismiss some pertinent ones. Therefore, if we aim to optimize recall, it can be more appropriate to keep the initial data set.
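The trade-off described above can be reproduced on a toy example. The following Python sketch is only an illustration: the tagger (any token containing a digit), the keyword-based filter, and the four sentences are invented and do not correspond to the classifiers or data used in our experiments.

```python
from typing import Callable, List, Optional, Set, Tuple

Sentence = Tuple[str, Set[str]]  # (text, gold gene mentions in that sentence)


def toy_tagger(text: str) -> Set[str]:
    """Hypothetical tagger: marks every token that contains a digit."""
    return {tok for tok in text.split() if any(c.isdigit() for c in tok)}


def run_pipeline(sentences: List[Sentence],
                 tagger: Callable[[str], Set[str]],
                 keep: Optional[Callable[[str], bool]] = None
                 ) -> Tuple[float, float]:
    """Return (precision, recall), optionally filtering sentences first."""
    tp = fp = fn = 0
    for text, gold in sentences:
        if keep is not None and not keep(text):
            fn += len(gold)  # mentions in dropped sentences are lost
            continue
        predicted = tagger(text)
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


sentences = [
    ("BRCA1 regulates p53 expression", {"BRCA1", "p53"}),
    ("We enrolled 120 patients", set()),           # "empty" sentence
    ("The insulin gene was silenced", {"insulin"}),
    ("Mutations in TP53 are frequent", {"TP53"}),  # relevant, but filtered out
]


def imperfect_filter(text: str) -> bool:
    """Hypothetical sentence filter based on two cue words."""
    return "gene" in text or "expression" in text


unfiltered = run_pipeline(sentences, toy_tagger)
filtered = run_pipeline(sentences, toy_tagger, keep=imperfect_filter)
print(unfiltered, filtered)
```

Without filtering, the spurious candidate "120" lowers precision. With the filter, that error disappears, but the mention "TP53" is lost together with its sentence, so recall drops.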

4.6.2 Overall comparison of the algorithms

The performance obtained with the conditional random field (CRF) is generally better than that obtained with the decision tree. This is not surprising, as the sequential method can make use of more information than the non-sequential one. However, we notice two things. First, the decision tree algorithm seems to benefit more from the combination of features than the CRF algorithm does. We can therefore hypothesize that providing additional relevant features to the decision tree algorithm may further increase its performance. The second observation is a direct consequence of the first: when the features are combined, the decision tree algorithm can reach a competitive recall. Therefore, depending on the purpose, using such a non-sequential algorithm can make sense.

[Figure 24 plot: F-measure (40% to 90%) by group (1 to 22)]

4.6.3 Different features have different advantages

The suitability of the features depends on the selected algorithm. For the CRF algorithm, structural information such as morphology and syntax yields higher recall than lexicons. The reason is that lexicons limit recall through their coverage (even GPSDB covers no more than 71% of the terms of the training set), whereas structural information can cover a much larger number of cases and consequently increase recall. On the other hand, for the decision tree algorithm, the lexical features provide the best precision. This seems to indicate that selecting candidates based on a list of words leads to fewer mistakes than using fuzzier rules based on structure. Indeed, once the ambiguous words are dismissed, the algorithm can decide with very high confidence that a word is relevant if it occurs in a controlled lexicon.
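The coverage limit mentioned above can be measured directly. The sketch below is a minimal Python illustration: the function name `lexicon_coverage`, the case-insensitive matching, and the toy term lists are assumptions, and the toy lexicon does not reflect the actual content of GPSDB.

```python
from typing import Iterable


def lexicon_coverage(gold_terms: Iterable[str], lexicon: Iterable[str]) -> float:
    """Fraction of distinct gold terms that also occur in the lexicon.

    Matching is case-insensitive here, which is an assumption of this sketch.
    For a tagger based purely on exact lexicon lookup, this fraction is an
    upper bound on the share of distinct gold terms it can find.
    """
    gold = {term.lower() for term in gold_terms}
    known = {term.lower() for term in lexicon}
    return len(gold & known) / len(gold) if gold else 0.0


# Toy data (invented): three of the four gold terms are in the lexicon.
toy_gold = ["BRCA1", "p53", "insulin", "TP53"]
toy_lexicon = ["brca1", "insulin", "tp53", "actin"]
coverage = lexicon_coverage(toy_gold, toy_lexicon)
print(coverage)
```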

4.7 Conclusion

In this chapter, we addressed an important task for most information extraction processes: the recognition of named entities. In molecular biology, this task is challenging because no regular rules define the morphology of gene names. We performed a large set of experiments with two objectives. First, to test our general hypothesis that there are strong links between the subtasks of a complex process, we studied the influence of the quality of the input document set on recognition accuracy. Second, as we assumed that a non-sequential algorithm is likely to behave differently from a sequential algorithm with respect to effectiveness metrics, we performed a series of experiments comparing the performance of a conditional random field algorithm and a decision tree, each combined with specific sets of features.

While analyzing the training set, we observed that half of the input sentences contained no gene names. As such “empty” sentences could only reduce the overall accuracy of the extraction, it seemed logical to filter the initial sentence set to remove them. Our experiments evaluated the variations in performance induced by three data sets containing different amounts of noise. The first data set contained as many relevant sentences as irrelevant ones, the second was restricted to the sentences of the initial data set classified as pertinent, and the last was optimal, as it contained only relevant sentences. To test the influence of these three data sets, we performed two simple experiments. The first focused on recall and considered as positive every word that does not occur in a “common” English lexicon. The other focused on precision and selected as positive every term that could be found in GPSDB. Both experiments showed that performing an initial sentence selection before the NER process increases the F-measure by 5%; this improvement resulted from the precision gained by dismissing irrelevant sentences.
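The two simple experiments can be sketched as two lookup-based taggers. This Python sketch is only an approximation under stated assumptions: the function names, the whitespace tokenization, the case-insensitive matching, and both toy word lists are invented, and the toy gene lexicon stands in for GPSDB without reproducing its content.

```python
from typing import List, Set


def recall_oriented(tokens: List[str], common_english: Set[str]) -> List[str]:
    """Mark as positive every token absent from the common English lexicon."""
    return [tok for tok in tokens if tok.lower() not in common_english]


def precision_oriented(tokens: List[str], gene_lexicon: Set[str]) -> List[str]:
    """Mark as positive every token found in the gene lexicon."""
    return [tok for tok in tokens if tok.lower() in gene_lexicon]


# Toy resources (invented for illustration).
common_english = {"the", "protein", "binds"}
gene_lexicon = {"brca1"}

tokens = "The BRCA1 protein binds RAD51 tightly".split()
broad = recall_oriented(tokens, common_english)
narrow = precision_oriented(tokens, gene_lexicon)
print(broad)   # ['BRCA1', 'RAD51', 'tightly']
print(narrow)  # ['BRCA1']
```

The first tagger finds both gene names but also keeps the ordinary word "tightly", which the toy English lexicon happens to lack. The second makes no error but misses "RAD51", which the toy gene lexicon does not cover.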

The second part of the experiment was dedicated to exploring the methods and features that can be employed to perform the NER task. More specifically, we tested three sets of features: the orthographic features, which concern the morphology of the words that we classify; the syntactical features, which give information about the role of the words in the sentences; and the lexical features, which transfer the information gathered from a specific lexicon to the words. All results showed that the sequential algorithm tends to balance recall and precision more effectively and globally outperforms the non-sequential algorithm. This advantage was obvious with the orthographic and syntactical features, where the CRF outperformed the decision tree by approximately 15% in F-measure. However, with the lexical features and with the combination of features, the difference was not significant (from -1.6% to 1.6% in F-measure). If we look at the features independently, the best recall (74.7%) was achieved with the Markovian approach using the orthographic features. This seems to confirm that the particular spelling and morphology of gene names are important characteristics for recognizing them. Regarding precision, the best results (88.7%) were obtained using the controlled vocabularies. The presence of a word in a specific vocabulary seems to be a very high-quality predictor that the word belongs to a gene name. Unfortunately, the good precision obtained with lexical features is not reflected in the recall (57.4%). This suggests once again the coverage limitation inherent to most lexicons in the biomedical field.
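To give a concrete idea of what orthographic features can look like, the following Python sketch computes a few typical ones for a single token. The selection of features, their names, and the word-shape encoding are assumptions chosen for illustration; they are not the exact feature set used in our experiments.

```python
import re
from typing import Dict, Union


def word_shape(token: str) -> str:
    """Map uppercase letters to 'A', lowercase letters to 'a', digits to '0'."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    return re.sub(r"\d", "0", shape)


def orthographic_features(token: str) -> Dict[str, Union[bool, str]]:
    """Hypothetical orthographic feature set for one token."""
    return {
        "has_digit": any(c.isdigit() for c in token),
        "has_upper": any(c.isupper() for c in token),
        "all_upper": token.isupper(),
        "has_hyphen": "-" in token,
        "shape": word_shape(token),
    }


for example in ["BRCA1", "p53", "IL-2", "protein"]:
    print(example, orthographic_features(example))
```

Gene-like tokens such as "BRCA1" or "IL-2" receive shapes like `AAAA0` or `AA-0`, which differ from the purely lowercase shape of an ordinary word such as "protein".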

Finding a gene mention in the text is of crucial importance. However, given the strong ambiguity of biomedical terminology, a string found in a document is not sufficient to know precisely which gene the author refers to. To identify the gene unambiguously, we need to link the discovered mentions to a database in order to obtain a unique identifier. This task is known as normalization and is tackled in the next chapter.