4.3.3 Feature space

In every classification problem, regardless of whether the classification algorithm is sequential, the choice of an appropriate feature set has a central influence on performance. We present here the feature sets that we generated to obtain the best classification performance. To design these feature sets, we drew inspiration from the other methods that have been employed to solve the NER task. We divide the features into three categories. The first two categories are inspired by the rules-based approaches.

The strength of the rules-based approaches lies in rules that select the relevant words based on the structure of the words and on the general structure of the sentences. We attempt to capture these characteristics, which are so useful in the rules-based approach, using two sets of features. On the one hand, to reflect the rules that describe the inner structure of the words, we define a set of orthographic features. On the other hand, to reflect the rules that describe a term in its context, we define a syntactical feature set. We also try to benefit from the strength of the dictionary-based approach by defining a set of lexical features. This lexical feature set transfers information from specific lexicons to the words contained in the documents and indicates whether these words belong to the lexicons.

| # | Feature Name | Feature Type | Description |
|---|---|---|---|
| 1 | Number of digits | Integer | Number of digits in the word |
| 2 | All digits | Boolean | True if all characters are digits |
| 3 | Digit percentage | Double | Percentage of digits in the word |
| 4 | Special symbol | Boolean | True if the word contains a non-alphanumeric character |
| 5 | Roman number | Boolean | True if the word is a Roman number |
| 6 | Number of uppercase | Integer | Number of uppercase letters |
| 7 | First letter uppercase | Boolean | True if the first letter of the word is uppercase |
| 8 | Uppercase percentage | Double | Percentage of uppercase letters in the word |
| 9 | Length | Integer | Length of the string |
| 10 | Hyphen | Boolean | True if the word contains a hyphen |

Table 28 Orthographic features

4.3.3.1 The orthographic features

We know from the literature (Fukuda, Tamura et al. 1998; Eriksson, Franzén et al. 2002; Tanabe and Wilbur 2002; Seki and Mostafa 2003; Zhou, Zhang et al. 2004; Zhou, Shen et al. 2005) related to the rules-based approach that gene and protein names can have specific spellings that can be identified using an appropriate set of rules. Among the characteristics commonly used in the rules-based approach, the principal one concerns the presence of digits and uppercase letters. Based on this observation, we define the feature set listed in Table 28.

Figure 18 Proportion of words in the lexicon given the percentage of uppercase letters

We think that such orthographic features are relevant for detecting gene names, given the orthographic specificity of many gene names. Indeed, unlike “common” English words, gene names are likely to contain one or more uppercase letters as well as digits.
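The ten features of Table 28 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the function name, the dictionary keys and the Roman-numeral pattern are hypothetical, and the source does not specify how Roman numbers or special symbols were detected.

```python
import re
from typing import Dict, Union

# Strict Roman-numeral pattern (1 to 3999). This is an assumption: the source
# does not say how Roman numbers are recognised.
_ROMAN = re.compile(r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")


def orthographic_features(word: str) -> Dict[str, Union[int, float, bool]]:
    """Compute the ten orthographic features of Table 28 for a single word."""
    length = len(word)
    n_digits = sum(ch.isdigit() for ch in word)
    n_upper = sum(ch.isupper() for ch in word)
    return {
        "number_of_digits": n_digits,
        "all_digits": length > 0 and n_digits == length,
        "digit_percentage": 100.0 * n_digits / length if length else 0.0,
        # Any character that is neither a letter nor a digit counts as special.
        "special_symbol": any(not ch.isalnum() for ch in word),
        # The empty string matches the pattern, hence the explicit guard.
        "roman_number": bool(word) and _ROMAN.match(word) is not None,
        "number_of_uppercase": n_upper,
        "first_letter_uppercase": word[:1].isupper(),
        "uppercase_percentage": 100.0 * n_upper / length if length else 0.0,
        "length": length,
        "hyphen": "-" in word,
    }


if __name__ == "__main__":
    for token in ["IL-2", "p53", "IV", "protein"]:
        print(token, orthographic_features(token))
```

Percentages are expressed on a 0-100 scale here; a 0-1 scale would serve equally well as long as it is used consistently.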

Figure 19 Proportion of words in the lexicon given the percentage of digit

The orthographic specificities of gene-related words appear quite clearly in the two histograms (Figure 18 and Figure 19). For instance, Figure 19 shows that the words composing gene names are much more likely to contain one or more digits. Indeed, whereas only 5% of the words extracted from the Wall Street Journal (WSJ) contain a digit, more than 40% of the words that are part of gene names contain at least one digit.

[Figure 18 plot. x-axis: percentage of uppercase letters in the words (bins 0, 1-25, 25-50, 50-75, 75-99, 100); y-axis: percentage of words; series: EC lexicon, WSJ]

[Figure 19 plot. x-axis: percentage of digits in the words (bins 0, 1-25, 25-50, 50-75, 75-99, 100); y-axis: percentage of words; series: EC lexicon, WSJ]

In the other histogram (Figure 18), which compares the percentage of uppercase letters in the words of the two data sets, we observe that only ~32% of the words related to gene names contain no uppercase letter. Moreover, 10% of these words are composed exclusively of uppercase letters. Words with uppercase letters are also common in the WSJ: almost 42% of its words contain an uppercase letter. This is easily explained by the nature of the documents in the WSJ corpus, as many of its articles are related to finance and therefore contain names of companies, people, and countries. Even though uppercase letters are not rare in the WSJ corpus, we can still use them with high confidence as a clue to detect gene mentions. Indeed, as biomedical articles are unlikely to contain such proper names, the presence of uppercase letters probably indicates a word related to genes.
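Distributions such as those of Figure 18 and Figure 19 could be computed along the following lines. This is a sketch under assumptions: the function names are hypothetical, the treatment of bin boundaries (upper bound inclusive) is a guess since the source does not state it, and the word list in the example is a toy sample rather than the EC or WSJ lexicon.

```python
from collections import Counter
from typing import Callable, Dict, Iterable

# Bin labels as they appear on the x-axis of the two histograms.
BINS = ["0", "1-25", "25-50", "50-75", "75-99", "100"]


def percentage_bin(word: str, char_test: Callable[[str], bool]) -> str:
    """Return the histogram bin for the share of characters passing char_test."""
    hits = sum(char_test(ch) for ch in word)
    if hits == 0:
        return "0"
    if hits == len(word):
        return "100"
    pct = 100.0 * hits / len(word)
    # Assumption: each intermediate bin includes its upper bound.
    if pct <= 25:
        return "1-25"
    if pct <= 50:
        return "25-50"
    if pct <= 75:
        return "50-75"
    return "75-99"


def bin_distribution(words: Iterable[str],
                     char_test: Callable[[str], bool]) -> Dict[str, float]:
    """Percentage of words falling in each bin (empty words are skipped)."""
    counts = Counter(percentage_bin(w, char_test) for w in words if w)
    total = sum(counts.values())
    return {b: (100.0 * counts[b] / total if total else 0.0) for b in BINS}


if __name__ == "__main__":
    toy_words = ["IL", "p53", "kinase", "Abc"]  # toy sample, not real data
    print("uppercase:", bin_distribution(toy_words, str.isupper))
    print("digits:   ", bin_distribution(toy_words, str.isdigit))
```

Running the same function over the words of each lexicon, once with `str.isupper` and once with `str.isdigit`, gives one series per lexicon for each of the two histograms.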

4.3.3.2 The syntactical features

Syntax is the study of the rules that govern the structure of sentences and determine their grammaticality. Part of the syntactical information of a sentence is given by the part of speech (POS) (Ruch, Baud et al. 2000). The POS can help to identify gene names. As most gene names are nouns, a word tagged as a noun is more likely to be part of a gene name than a word tagged as a conjunction. However, we have seen that gene names are not composed only of single words. In multiword gene names, the words that make up the name are not exclusively nouns but can also be adjectives or adverbs. Therefore, it is difficult to make a decision based on POS information without taking the context into account. Moreover, even if we presume that most gene names are nouns, we cannot invert the assumption and deduce that most nouns are likely to be gene names.

The other main piece of syntactical information that can be extracted is the word phrase information. In grammar, a phrase is a group of words that functions as a single unit in the syntax of a sentence. It is likely that a gene name forms a single unit in the sentence. Therefore, word phrase information can be very useful to determine the beginning and the end of a gene name.

To perform our POS tagging, we use the MedPost tagger⁴⁴. MedPost is a part-of-speech tagger that has been specially trained on biomedical text and can therefore deal with the specific terminology employed in this domain.

Six features are dedicated to analyzing the POS of the words. In addition to the POS of the candidate word itself, we are also interested in its context. Looking at the parts of speech surrounding the candidate word is important because the conclusion drawn from the POS of a word can differ depending on the POS of the previous and/or following words.
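The POS context of a candidate word, from position -2 to position +2, can be collected from an already tagged sentence as sketched below. The function name, the feature keys and the `NONE` padding value are hypothetical, and the tags in the example are illustrative Penn-Treebank-style labels, not actual MedPost output.

```python
from typing import Dict, List

# Hypothetical placeholder used when the window runs past a sentence boundary.
PAD = "NONE"


def pos_context_features(pos_tags: List[str], i: int,
                         window: int = 2) -> Dict[str, str]:
    """Return the POS tags at offsets -window..+window around position i."""
    features = {}
    for offset in range(-window, window + 1):
        j = i + offset
        inside = 0 <= j < len(pos_tags)
        features[f"pos[{offset:+d}]"] = pos_tags[j] if inside else PAD
    return features


if __name__ == "__main__":
    # Illustrative tags for a four-token sentence; not produced by MedPost.
    tags = ["DT", "NN", "VBZ", "NN"]
    print(pos_context_features(tags, 1))
```

Padding the window explicitly keeps the feature vector the same size for every word, including the first and last words of a sentence.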

| # | Feature Name | Feature Type | Description |
|---|---|---|---|
| 1 | Preceding word -2 POS | POS Type | POS type of the word in position -2 |
| 2 | Preceding word -1 POS | POS Type | POS type of the word preceding the current word |
| 3 | Word POS | POS Type | POS type of the word |
| 4 | Following word +1 POS | POS Type | POS type of the word following the current word |
| 5 | Following word +2 POS | POS Type | POS type of the word in position +2 |
| 6 | Word phrase Type | Phrase Type | Type of the phrase containing the current word |
| 7 | Word Phrase size | Integer | Number of tokens in the phrase containing the current word |
| 8 | Word Phrase all selected | Boolean | Indicates whether the candidate covers a full phrase or not |
| 9 | Word Phrase POS | Phrase POS | Indicates the part of speech of the word phrase |
| 10 | Inside Bracket | Boolean | True if the current word is inside a sequence between brackets |
| 11 | Inside Quotation | Boolean | True if the current word is inside a sequence between quotes |
| 12 | Inside Parenthesis | Boolean | True if the current word is inside a sequence between parentheses |

Table 29 Syntactical features
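Features 10 to 12 of Table 29 can be derived in a single pass over the tokens of a sentence. The sketch below is only one possible reading: the names are hypothetical, and it assumes that delimiters are separate tokens, that only the straight double quote marks a quotation, and that a delimiter token is not itself counted as being inside the sequence it opens or closes.

```python
from typing import Dict, List

# Opening and closing tokens for the two paired delimiters (assumption:
# delimiters have been split into their own tokens by the tokenizer).
PAIRS = {
    "inside_bracket": ("[", "]"),
    "inside_parenthesis": ("(", ")"),
}
QUOTE = '"'


def delimiter_features(tokens: List[str]) -> List[Dict[str, bool]]:
    """Flag, for every token, whether it lies inside brackets, quotes or parentheses."""
    depth = {name: 0 for name in PAIRS}
    in_quote = False
    features = []
    for tok in tokens:
        closes_quote = tok == QUOTE and in_quote
        # A closing delimiter ends the enclosed sequence before it is scored.
        for name, (_, closer) in PAIRS.items():
            if tok == closer and depth[name] > 0:
                depth[name] -= 1
        if closes_quote:
            in_quote = False
        row = {name: depth[name] > 0 for name in PAIRS}
        row["inside_quotation"] = in_quote
        features.append(row)
        # An opening delimiter starts the enclosed sequence after it is scored.
        for name, (opener, _) in PAIRS.items():
            if tok == opener:
                depth[name] += 1
        if tok == QUOTE and not closes_quote:
            in_quote = True
    return features


if __name__ == "__main__":
    sentence = ["the", "(", "IL-2", ")", "gene"]
    for token, row in zip(sentence, delimiter_features(sentence)):
        print(token, row)
```

Using a depth counter rather than a single flag lets nested parentheses or brackets be handled without special cases.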

4.3.3.3 Lexical features

Another method that has been widely explored to extract named entities relies on dictionaries. Such a method employs a reference lexicon to identify the entities of interest using string matching. In our approach, we also attempt to benefit from the discriminative power of specialized lexicons by defining a set of lexical features. These features describe the links that exist between the words extracted from the text and a specific vocabulary.

For our task, we use three particular lexicons: the EC lexicon, the GPSDB lexicon and the Wall Street Journal lexicon. The first lexicon is very specific to this task, as it is built by gathering the words composing the gene names selected by experts in the training data set of the task. We named this lexicon the Expert Curators (EC) lexicon. The value of such a lexicon rests on a strong hypothesis: that the gene names in the training set are likely to appear in the test set as well.

The second lexicon used to construct our feature set is GPSDB. As GPSDB is a lexicon that aims to describe genes and gene names as well as their synonyms, it is very likely that the words composing GPSDB are relevant indicators for our task. Unfortunately, as seen in the description of the lexicon, only 70% of the target terms of the training set are included in GPSDB.

The idea behind the third lexicon relies on the opposite principle to the one that motivates the use of the other two. With the first two lexicons, we try to increase the likelihood that a word is part of a gene name by linking it to a lexicon related to genomics, whereas with the third we attempt to identify the words that do not belong to a gene name. In order to discover which words are unrelated to gene names, we use a lexicon composed of all the words found in a corpus of documents written in “common” English. Such “common” English words act as negative markers for gene names.

| # | Feature Name | Feature Type | Description |
|---|---|---|---|
| 1 | Word found alone | Boolean | Indicates whether the word can be found alone in the lexicon |
| 2 | First word probability | Double | Probability that the word occurs at the beginning of a concept of the lexicon |
| 3 | Middle word probability | Double | Probability that the word occurs in the middle of a concept of the lexicon |
| 4 | End word probability | Double | Probability that the word occurs at the end of a concept of the lexicon |

Table 30 Lexical features

Before taking a deeper look at these features, we have to be aware that their significance can vary depending on the lexicon under consideration. Because we generate a set of features that covers as many situations as possible, some of the features do not make sense with every lexicon. More specifically, the features related to the probability of occurring at the beginning, middle and end of a concept of the controlled lexicon do not make sense with the “common” English lexicon, as it contains only atomic concepts. When a feature is meaningless for a given lexicon, we do not expect the model to take it into consideration, as it has no discriminative power.

“Lexicon word percentage” indicates the percentage of the words of the candidate term that are also contained in the controlled lexicon. “Lexicon word found alone” indicates whether the candidate term can be found exactly in the reference lexicon. “Lexicon first/middle/end word probability” indicates the probability that a candidate term occurs at the beginning/middle/end of a term of the lexicon. This information is useful because some terms appear in a slightly modified form, and sometimes having only the first word of a gene term is sufficient.
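One way to derive these lexical features from a list of lexicon entries is sketched below. The class and method names are hypothetical, the lexicon entries are a toy sample, and the probability estimate is an assumption: each position probability is taken as the number of times a word occurs at that position in a multiword entry, divided by the total number of occurrences of the word in the lexicon. The source does not give the exact formula.

```python
from collections import Counter
from typing import Dict, Iterable, List, Union


class Lexicon:
    """Toy lexicon index supporting the lexical features described above."""

    def __init__(self, entries: Iterable[str]):
        self.alone = set()      # words that form a complete entry by themselves
        self.total = Counter()  # occurrences of each word over all entries
        self.position = {"first": Counter(), "middle": Counter(), "end": Counter()}
        for entry in entries:
            words = entry.split()
            if len(words) == 1:
                self.alone.add(words[0])
            for k, word in enumerate(words):
                self.total[word] += 1
                if len(words) == 1:
                    continue  # atomic entries carry no position information
                if k == 0:
                    slot = "first"
                elif k == len(words) - 1:
                    slot = "end"
                else:
                    slot = "middle"
                self.position[slot][word] += 1

    def word_features(self, word: str) -> Dict[str, Union[bool, float]]:
        """Found-alone flag and first/middle/end probabilities for one word."""
        n = self.total[word]
        features: Dict[str, Union[bool, float]] = {"found_alone": word in self.alone}
        for slot, counts in self.position.items():
            features[f"{slot}_word_probability"] = counts[word] / n if n else 0.0
        return features

    def word_percentage(self, candidate_words: List[str]) -> float:
        """Percentage of the candidate's words that occur in the lexicon."""
        if not candidate_words:
            return 0.0
        known = sum(w in self.total for w in candidate_words)
        return 100.0 * known / len(candidate_words)


if __name__ == "__main__":
    # Toy entries for illustration only; not taken from EC, GPSDB or WSJ.
    lexicon = Lexicon(["interleukin 2", "interleukin 2 receptor", "p53"])
    print(lexicon.word_features("2"))
    print(lexicon.word_percentage(["interleukin", "2", "gene"]))
```

With a lexicon made only of single-word entries, such as the “common” English lexicon, all three position probabilities are zero under this definition, which is consistent with the remark that these features carry no discriminative power for that lexicon.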

From the document: Modular text mining for protein-protein interactions extraction (pages 107-111)