5.3.2 Alternative vocabulary building

Because normalization relies mainly on term matching, its success depends strongly on the ability to handle spelling variations in gene and protein names. Researchers are fond of using their own name or synonym when referring to a protein or a gene, so a single gene or protein can appear in the literature under a large number of different spellings. If we cannot deal with these variations, a substantial share of the normalizations becomes unreachable.

One possible solution for handling term variations consists in comparing the terms using a fuzzy metric. Instead of testing for exact equality, a fuzzy metric returns a degree of likeness between the compared terms: the more alike two terms are, the higher their similarity score. Such an approach handles the small spelling variations that are frequent in biological terminology, but it is useless for retrieving synonyms that differ too much from the initial version of a term. The drawback of relaxing the matching constraints is the creation of many false positives. Therefore, we must be very cautious when setting the similarity threshold: it should not be too permissive, yet it must still recognize the existing variants.
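The trade-off around the threshold can be illustrated with a minimal sketch. The text does not say which fuzzy metric is used, so the similarity ratio from Python's standard `difflib` module stands in for it here. The function names, the toy lexicon and both threshold values are assumptions made for this illustration only.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means the two strings are identical."""
    # Assumption: case-insensitive comparison with difflib's ratio,
    # used as a stand-in for the (unspecified) fuzzy metric of the text.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_matches(term: str, lexicon: List[str], threshold: float) -> List[str]:
    """Return the lexicon entries whose similarity to `term` reaches the threshold."""
    return [entry for entry in lexicon if similarity(term, entry) >= threshold]


# Hypothetical three-entry lexicon, for illustration only.
lexicon = ["AMPK-alpha-2 chain", "ATP synthase", "insulin receptor"]

# A strict threshold still accepts a small spelling variation...
print(fuzzy_matches("AMPK alpha 2 chain", lexicon, threshold=0.85))

# ...while a permissive threshold also admits unrelated entries (false positives).
print(fuzzy_matches("AMPK alpha 2 chain", lexicon, threshold=0.2))
```

With this stand-in metric, the strict call keeps only the hyphenated form of the same name, while the permissive call also returns entries that have nothing to do with the query.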

Another solution consists in using simple rules to generate a large number of variations. This technique has the advantage of increasing the number of possible matches, but the drawback of increasing the number of comparisons required. In the following subsections, we present the four techniques that have been employed to extend the lexicon. They generate four particular types of variants that are quite common in the literature.

5.3.2.1 Non-English variants

Some words do not carry the essence of a protein name but act as modifiers that add precision about the protein. As these words are not always present next to the protein name, they can be considered optional. For instance, we consider ‘protein’, ‘receptor’, ‘member’, ‘family’, and ‘factor’ as part of this category. The characteristics of these words are very similar to those of the words shared between the WSJ and the reference lexicons: most of them can be found in numerous contexts. This observation motivated us to create a new type of variant, built by removing from the initial protein names the words that also appear in the “common” English lexicon. Building every combination of a protein name, with and without each “common” English word, is counterproductive, as it creates countless combinations. Consequently, we decided to produce only variants from which all the “common” English words are removed.
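The removal step described above can be sketched as follows. The set of common words is reduced here to the five examples quoted in the text, whereas the actual list is derived from the “common” English lexicon. The function name, the whitespace tokenization and the sample protein names are assumptions made for the illustration.

```python
from typing import Optional, Set

# Assumption: only the five example words from the text; the real list
# comes from the "common" English lexicon.
COMMON_WORDS = {"protein", "receptor", "member", "family", "factor"}


def non_english_variant(term: str, common_words: Set[str]) -> Optional[str]:
    """Return the term with every common English word removed.

    Returns None when nothing is removed or when nothing would remain,
    since neither case yields a useful new variant.
    """
    words = term.split()
    kept = [w for w in words if w.lower() not in common_words]
    if not kept or len(kept) == len(words):
        return None
    return " ".join(kept)


# Hypothetical input names, for illustration only.
print(non_english_variant("insulin receptor", COMMON_WORDS))
print(non_english_variant("heat shock protein 70", COMMON_WORDS))
```

Only one variant is produced per term, in line with the decision to remove all the common words at once rather than enumerate every combination.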

5.3.2.2 Hyphen variants

For some multiword protein names, the use of special characters as separators, such as the hyphen, does not follow strict rules. Two words of a protein name that appear separated by a hyphen can also appear separated by a space or even concatenated. In order to cover most of the variations that can be encountered, we automatically build all the possible variants of hyphenated names using the following rules.

Each time a hyphen is encountered between two words, it is replaced by a space to create a variant.

Whenever a number follows the hyphen, we create an additional variant in which the hyphen is removed. For instance, a protein such as “AMPK-alpha-2 chain” will create the following variants:

- AMPK alpha 2 chain

- AMPK-alpha 2 chain

- AMPK-alpha2 chain

- AMPK alpha 2 chain

- AMPK alpha2 chain

5.3.2.3 Splitting variants
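Before turning to splitting, the two hyphen rules of Section 5.3.2.2 can be sketched in a few lines. The sketch assumes one particular reading of those rules: every hyphen may be kept or replaced by a space, a hyphen directly followed by a digit may also be dropped, and all combinations of these choices are generated. The function name is hypothetical.

```python
import itertools
import re
from typing import List


def hyphen_variants(term: str) -> List[str]:
    """Generate hyphen variants of `term`, excluding the term itself."""
    # Splitting with a capturing group keeps the hyphens: the odd positions
    # of `parts` are hyphens, the even positions are the text around them.
    parts = re.split(r"(-)", term)
    options = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            options.append([part])      # plain text: a single choice
            continue
        choices = ["-", " "]            # keep the hyphen, or use a space
        if parts[i + 1][:1].isdigit():
            choices.append("")          # a number follows: may concatenate
        options.append(choices)
    # Assumption: all combinations of the per-hyphen choices are produced.
    variants = {"".join(combo) for combo in itertools.product(*options)}
    variants.discard(term)
    return sorted(variants)


for variant in hyphen_variants("AMPK-alpha-2 chain"):
    print(variant)
```

Under this reading, “AMPK-alpha-2 chain” yields five variants: the four distinct strings of the list in Section 5.3.2.2 plus “AMPK alpha-2 chain”. A term with n hyphens produces at least 2^n - 1 variants, which is why this method generates so many terms.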

Some protein names contain punctuation marks and/or parentheses. Among the punctuation marks that can be encountered, the colon often separates the principal concept from its particularities. Similarly, the content of the parentheses frequently adds precision to the main term. As this additional information is not always required to identify a term, it can be separated from the key term. For instance, a term such as “ATP synthase, H+ transporting, mitochondrial F0 complex, subunit c (subunit 9), isoform 1” will create the following variants:

- ATP synthase

- H+ transporting, mitochondrial F0 complex

- subunit c

- subunit 9

- isoform 1
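A rough sketch of the splitting step is given below. The exact set of separators is an assumption: the text mentions the colon and parentheses, while the example is split at commas, so the sketch cuts on all of these. The function name is hypothetical.

```python
import re
from typing import List


def split_variants(term: str) -> List[str]:
    """Cut a term at commas, colons and parentheses; return the distinct pieces."""
    # Assumption: these four characters are the only separators.
    pieces = re.split(r"[,:()]", term)
    variants = []
    for piece in pieces:
        piece = piece.strip()
        # Skip empty fragments, the unchanged term, and repeated pieces.
        if piece and piece != term and piece not in variants:
            variants.append(piece)
    return variants


term = ("ATP synthase, H+ transporting, mitochondrial F0 complex, "
        "subunit c (subunit 9), isoform 1")
for variant in split_variants(term):
    print(variant)
```

This sketch separates “H+ transporting” from “mitochondrial F0 complex”, whereas the list above keeps them as a single variant, so the original procedure evidently does not cut at every comma. The sketch should be read as an approximation of the idea, not as a reproduction of the rule actually used.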

5.3.2.4 “Expert” generated variants

Although the three previous methods cover many variants missing from the initial vocabulary, some special cases still need to be taken into consideration. Looking at the remaining unreachable normalizations, we realize that some differences remain between the terms found in the text and those of the reference lexicon. Such missing variants can be created using simple rules. By empirically analyzing the causes of the unreachable normalizations, we derived a few specific rules in order to increase recall. In this way, we use a kind of “expert” knowledge to construct very specific rules that fit our data.

The principal rules are the following:

• hXXX → XXXX

• XXX-beta → XXXb

• XXX+ → XXX

• XXXXsX → XXXX

Pattern     Nb occurrences
hXXX        367
XXX+        10
XXX-beta    414
XXXXsX      133

Table 45 Frequency of the different patterns in the EntrezGene lexicon

The number of terms matched by each rule shows that these rules concern very few terms (Table 45). However, we will see later that even such a low number can influence the outcome of the normalization.
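Three of the four rules listed above can be approximated with regular expressions, as in the sketch below. The interpretation of each pattern is an assumption: “hXXX” is read as a lowercase “h” prefix in front of an uppercase symbol, “XXX-beta” as a trailing “-beta”, and “XXX+” as a trailing plus sign. The fourth pattern (XXXXsX) is left out because its exact meaning cannot be recovered from the notation. The function name and the sample inputs are hypothetical.

```python
import re
from typing import List

# (pattern, replacement) pairs; each one is an assumed reading of a rule.
EXPERT_RULES = [
    (re.compile(r"^h(?=[A-Z])"), ""),   # hXXX: drop a leading "h" prefix
    (re.compile(r"-beta$"), "b"),       # XXX-beta: abbreviate to XXXb
    (re.compile(r"\+$"), ""),           # XXX+: drop a trailing "+"
]


def expert_variants(term: str) -> List[str]:
    """Apply each rule independently and collect the resulting variants."""
    variants = []
    for pattern, replacement in EXPERT_RULES:
        new_term, count = pattern.subn(replacement, term)
        if count and new_term != term and new_term not in variants:
            variants.append(new_term)
    return variants


# Hypothetical inputs, for illustration only.
for name in ["hTERT", "TGF-beta", "Na+", "insulin"]:
    print(name, expert_variants(name))
```

Each rule fires on very few terms, which is consistent with the low counts in Table 45.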

Let us now look at some figures about the variants produced by the different generation techniques.

Vocabulary variant   Terms     Words     Common words with initial vocabulary
Initial              177,200   129,881   100%
Hyphen               62,948    24,395    13,348 (54.7%)
Split                18,777    10,713    9,148 (85.4%)
Expert               18,725    9,642     7,755 (80.4%)
Non-English          41,037    20,075    20,075 (100%)

Table 46 Number of generated terms and words for each vocabulary variation

Figure 29 Produced words and terms by the different variant generation techniques

The data gathered in Table 46 provide some interesting figures about the produced variants. Let us first explain the meaning of these numbers. The column “Terms” contains the number of terms produced by the variant generation method. The column “Words” gives the number of distinct words that compose these terms. The last column, “Common words”, indicates how many of the produced words are already present in the initial vocabulary.
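The three columns can be computed as in the following sketch. Whitespace tokenization and exact, case-sensitive word comparison are assumptions, as the text does not describe how words are delimited. The function name and the toy data are hypothetical.

```python
from typing import Dict, List


def vocabulary_stats(variant_terms: List[str], initial_terms: List[str]) -> Dict[str, float]:
    """Compute the Terms, Words and Common words columns for one variant type."""
    # Assumption: a "word" is a whitespace-delimited token.
    initial_words = {w for term in initial_terms for w in term.split()}
    variant_words = {w for term in variant_terms for w in term.split()}
    common = variant_words & initial_words
    share = 100 * len(common) / len(variant_words) if variant_words else 0.0
    return {
        "terms": len(set(variant_terms)),    # distinct generated terms
        "words": len(variant_words),         # distinct words in those terms
        "common_words": len(common),         # words already in the initial vocabulary
        "common_pct": round(share, 1),
    }


# Toy data, for illustration only.
initial = ["AMPK-alpha-2 chain", "ATP synthase"]
variants = ["AMPK alpha 2 chain", "AMPK-alpha2 chain"]
print(vocabulary_stats(variants, initial))
```

The percentages in Table 46 follow the same ratio: for the hyphen variants, 13,348 common words out of 24,395 words gives 54.7%.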

[Figure 29 chart: categories Hyphen, Split, Expert, Without English term; axes “number of terms” and “number of words”; series New Words, Similar Words, nb Terms]

We observe that the hyphen variants contain a large number of additional terms. This is not surprising: if a term contains several hyphens, the number of combinations with or without a hyphen can be large. The other method that produces a large number of new terms is the “Non-English” one. This reveals that many terms in the initial vocabulary contain a word that is not specific to the genomic field and can also be found in the “common” literature. Finally, the “Split” and “Expert” methods produce almost the same number of terms.

The last statistic is the percentage of produced words that also occur in the original lexicon. It shows whether a method produces new words, reorganizes existing terms, and/or produces variants that already exist in the initial lexicon. We observe that about 45% of the words produced by the hyphen variant method did not exist before (54.7% are shared with the initial vocabulary). For the split variants, few new words are created (about 15%), because this method mostly reorganizes long terms. Concerning the terms produced by the “expert” rules, we also find a small proportion of unknown words (about 20%). The fact that most of the produced words already exist in the reference lexicon is an indication of the relevance of the rules employed. Indeed, if a produced term already exists in the initial controlled vocabulary, it means that the rules accurately reflect the behavior of researchers when they create novel variants.

Finally, the non-English variants do not contain any new words. This is expected, as these variants are produced only by removing the words that we consider ambiguous.