

5 Gene Normalization


5.3.1 Extracting candidate terms for comparison

Before normalizing the entities mentioned in the documents, the first question that has to be answered is: "What are the entities that we want to normalize?" We start with a collection of documents and must select from these documents a set of candidate terms that will then be compared with the terms of the controlled lexicon. The quality of the initial set of candidates to normalize has a strong influence on the outcome of the normalization. If we provide a broad set of candidates in order to avoid limiting the recall, the mapping process is likely to produce a large quantity of false positives and, consequently, to generate results of low precision. On the other hand, by providing too restrictive a set of candidates to favor precision, many relevant candidates are likely to be missed and, as a result, the recall cannot reach its maximal value.

In the following subsections, we present four different ways of generating the candidates. These generated candidates are then used in the experimental section to explore the influence of the quality of the candidate sets on the overall performance.

5.3.1.1 Generating the candidates

Before discussing the different methods employed to extract the candidates from the documents, it is worth having a look at the properties of the terms of the controlled lexicon. We especially need to be aware that protein names are not only single words but also word sequences. Indeed, as seen in Figure 26, whereas 65% of the protein names contained in EntrezGene are single words, the remaining 35% are composed of more than one word.

Figure 26: Number of words per term contained in EntrezGene

As all the techniques we employ for the generation of candidates extract only single words, we have to find a strategy to combine these words into terms that can be matched against those of the controlled vocabulary. In order to combine the words selected as relevant candidates, we draw inspiration from a technique known as sliding windows (Krallinger, Padron et al. 2005) and generate all the possible combinations of words found in sequence. For instance, if three words A B C are found in sequence, we generate six candidates [A, B, C, AB, BC, ABC]. This procedure has the advantage of being simple, but the drawback of generating many overlapping candidates, which have a negative effect on the precision: at most one of them is correctly mapped during the normalization, and all the others count as false positives (see the sketch below).
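As an illustration, the following minimal Python sketch implements this combination step (the function name and the word-list representation are ours, not part of the original system):

    def generate_candidates(words):
        # Generate every contiguous combination of the selected words.
        # For n words found in sequence this yields n * (n + 1) / 2
        # candidates, e.g. [A, B, C] -> A, B, C, AB, BC, ABC.
        candidates = []
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                candidates.append(" ".join(words[start:end]))
        return candidates

    print(generate_candidates(["A", "B", "C"]))
    # ['A', 'A B', 'A B C', 'B', 'B C', 'C']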

This side effect is especially problematic when many words are found in sequence, as many false positives are produced.

[Figure 26, pie chart: 65% of the EntrezGene terms contain a single word, 14% two words, 8% three words, 7% four words, and 6% more than four words.]

Figure 27: Number of word combinations produced given the number of selected words found in sequence

For example, we observe in Figure 27 that, for 10 words found in sequence, 55 candidates are produced. If only one of them is relevant, this leads to a recall of 100% but a catastrophic precision of 1.8% (1/55).
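This count follows directly from the sliding-window generation: n words found in sequence produce the n-th triangular number of candidates,

\sum_{i=1}^{n} i = \frac{n(n+1)}{2}, \qquad \text{so for } n = 10: \; \frac{10 \cdot 11}{2} = 55.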

5.3.1.2 No selection for candidate extraction

If we do not constrain the selection of the candidates in the input set of documents to be normalized, every word sequence extracted from the text can be considered as a candidate. This simple assumption implies searching for the best mapping between all the word sequences from the text and all the possible names contained in the reference vocabulary. This approach seems straightforward and attractive. However, given its complexity, it can hardly be applied to a real problem. Indeed, comparing each term of the source documents directly to all protein names contained in the reference vocabulary is extremely resource consuming. This complexity is due to the dimensionality of the two sets that have to be compared. If we consider every word of the document as a possible candidate and generate candidate terms from all word sequences, the number of candidates we generate is

P = \sum_{d=1}^{D} \sum_{s=1}^{|d|} \sum_{i=1}^{|s|} i \qquad (4)

For a sentence s of length |s|, we produce 1 + 2 + ... + |s| = |s|(|s|+1)/2 candidates; we then have to sum the candidates of every sentence of every document in the corpus of D documents.
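Equation (4) is cheap to evaluate; here is a minimal Python sketch (the corpus representation is an assumption made for the example):

    def count_candidates(corpus):
        # corpus: a list of documents, each document a list of
        # sentences, each sentence a list of words. A sentence of
        # length n contributes n * (n + 1) / 2 candidates, which is
        # exactly the inner sum of equation (4).
        return sum(len(sentence) * (len(sentence) + 1) // 2
                   for document in corpus
                   for sentence in document)

    # Toy document with sentences of 3 and 2 words: 6 + 3 = 9 candidates.
    print(count_candidates([[["A", "B", "C"], ["D", "E"]]]))  # 9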

If we compare all these P candidates extracted from the text with the T target terms of the controlled lexicon, we perform P * T comparisons. This large number of comparisons is not problematic when the terms are compared using perfect match, as such comparisons can be done very efficiently. However, when using a more costly metric, able to manage term variations, the comparison becomes a much more expensive process. For this reason, we do not attempt to perform all the comparisons and only mention this strategy as a reference against which to measure the advantages brought by the other techniques.
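To make the cost difference concrete, here is a hedged Python sketch: perfect matching amortizes to one hash lookup per candidate, whereas a variation-tolerant metric forces all P * T pairwise comparisons (difflib's similarity ratio is used here only as a stand-in for whichever variation-aware metric is chosen):

    import difflib

    def exact_matches(candidates, lexicon):
        # Perfect match: hash the T target terms once, then one O(1)
        # lookup per candidate, i.e. roughly O(P + T) overall.
        lexicon_set = set(lexicon)
        return [c for c in candidates if c in lexicon_set]

    def fuzzy_matches(candidates, lexicon, threshold=0.9):
        # Variation-tolerant match: every candidate is compared with
        # every target term (P * T comparisons), and each comparison
        # is itself far more expensive than a hash lookup.
        matches = []
        for c in candidates:
            for t in lexicon:
                ratio = difflib.SequenceMatcher(None, c.lower(), t.lower()).ratio()
                if ratio >= threshold:
                    matches.append((c, t))
        return matches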


5.3.1.3 “Uncommon” English words candidate set

Selecting every word as a candidate has the advantage of avoiding any limitation on the recall. Unfortunately, taking a large number of candidates into consideration brings an extreme overhead to the comparison process and produces many false positives. A good compromise between not limiting the recall and reducing the complexity of the task consists in selecting for comparison only the words that do not appear in a "common" English terminology. Indeed, we have seen previously that the terms related to protein and gene names belong to a quite specific lexicon, which is not shared with "common" English literature. Consequently, we can limit the number of candidates extracted from the text by dismissing "common" words. What we call the "common" English terminology is a terminology built from the words found in three years of documents from the Wall Street Journal.
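A minimal sketch of this filtering step, assuming the WSJ-derived terminology has been stored one word per line (the file name is hypothetical):

    def load_common_vocabulary(path="wsj_vocabulary.txt"):
        # Hypothetical file holding the "common" English terminology
        # built from three years of the Wall Street Journal.
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    def uncommon_words(tokens, common_vocabulary):
        # Dismiss every token found in the "common" English lexicon;
        # only the remaining tokens become normalization candidates.
        return [t for t in tokens if t.lower() not in common_vocabulary]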

5.3.1.3.1 The importance of "shared" words

When we take a closer look at the "common" English lexicon, we realize that a small part of the words contained in this lexicon is also employed to describe protein names. We have to be careful when managing this subgroup of words, as they can have a significant influence on the performance. There is no simple way to decide whether they should be considered as relevant or not. If these words are considered as part of the "common" English lexicon, the proteins containing them are missed, which induces a loss in recall. On the other hand, if we consider these words as part of the biomedical lexicon, we can correctly retrieve the proteins containing such words; however, we create a lot of false positives. A middle way consists in considering some of the shared words as part of the English lexicon and the others as part of protein names. In order to decide to which lexicon each word belongs, we base our decision on a frequency measurement. We define a relative frequency measure that simply relates the frequency of a word in the EntrezGene lexicon to its frequency in the Wall Street Journal (WSJ) lexicon.

\text{Relative frequency} = \frac{\text{EntrezGene frequency}}{\text{WSJ frequency}} \qquad (5)

All the words that have a high relative frequency (frequency in EntrezGene >> frequency in WSJ) are likely to be part of a protein name. Similarly, the words with a low relative frequency (frequency in EntrezGene << frequency in WSJ) can be considered as common words. For the words with a relative frequency close to 100, the decision is trickier. In this specific case, we base our decision on the absolute frequency of the term in the two lexicons. As we expect low-frequency words in the WSJ to be artifacts rather than real "common" words, we consider them as part of protein names.
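The resulting decision procedure can be sketched as follows; the thresholds are illustrative assumptions, as the exact cut-off values are not given here:

    def classify_shared_word(entrez_count, wsj_count,
                             high=1000.0, low=0.1, ambiguous_wsj_min=5):
        # Shared words occur at least once in the WSJ, so the division
        # in equation (5) is safe. The three thresholds are examples only.
        relative_frequency = entrez_count / wsj_count
        if relative_frequency >= high:
            return "protein name"      # far more frequent in EntrezGene
        if relative_frequency <= low:
            return "common English"    # far more frequent in the WSJ
        # Ambiguous zone: fall back on the absolute frequency. Words that
        # are rare in the WSJ are treated as artifacts rather than real
        # "common" words, and therefore kept as potential protein names.
        return "protein name" if wsj_count < ambiguous_wsj_min else "common English"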

Let us have a look at some interesting figures: there are 5,759 words shared between the "common" English lexicon and the EntrezGene lexicon. Among these words, 74% are more frequent in the WSJ corpus than in the EntrezGene lexicon. In 15% of the cases, the frequency is the same in both corpora and, in 11% of the cases, the words are more frequently used in the EntrezGene lexicon than in the WSJ corpus.

Figure 28: Relative frequencies of the terms shared between the EntrezGene and Wall Street Journal corpora

5.3.1.4 NER-produced candidates

As the named entity recognition takes place before normalization, it is natural to use the recognized entities as candidates to be mapped to the controlled vocabulary. In order to start with the best chance of reaching an optimal accuracy, it is appropriate to employ the method returning the most candidates. Indeed, all the terms excluded at the earlier stage are lost and, as a result, the recall obtained at the previous stage acts as an upper bound on the recall obtained at the normalization step. Therefore, we employ the model based on the morphological features.

Feature          Precision        Recall           F-Measure
Morphological    86.5 (-1.0 %)    74.7 (+1.9 %)    80.2 (+0.7 %)

Table 43: Performance obtained in protein/gene recognition using a CRF with morphological features

Table 43 recalls the performance obtained in the previous task. As more than 25% of the terms are missed during the selection of candidates, the maximum recall of the normalization is limited accordingly. Concerning the precision, few false positives are created. We have to be careful when drawing conclusions from these figures: they were obtained on the previous task, and we do not know exactly how well the strategy handles the differences of the current data set.

5.3.1.5 Perfect candidate set

Strategy                    Number of produced candidates
All words                   55,466 (100 %)
"Uncommon" English words    24,524 (44.2 %)
Named entity mention         6,929 (12.5 %)
Perfect candidate            2,281 (4.1 %)

Table 44: Number of candidate words given the different candidate generation approaches

Table 44 presents a summary of the number of words selected for further comparison with the terms of the controlled lexicon. We observe that selecting only "uncommon" English words reduces the number of candidates by more than half. Moreover, it barely limits the achievable recall, as


almost no potentially relevant words are dismissed. With a more aggressive method, such as using the named entity mentions, the reduction of candidates is much more important: less than 15% of the initial words are retained. This suggests that such an approach increases the precision, since there are fewer false positive candidates. However, as opposed to the "uncommon" English approach, the preservation of an optimal recall is not guaranteed. Finally, the low number of words (4%) conserved in the perfect candidate set shows that a very limited number of words is sufficient to perform the normalization process.