
4 Gene Mention

4.2 Data set and terminology

4.2.2 The corpus

As we frame gene mention identification as a classification problem, it is important to link every word to as much information as possible. Based on the information associated with the words, the classifier can construct an efficient model capable of differentiating common words from those that are part of gene names.

One approach to associating additional information with the words of the documents consists of linking them to the terms of a controlled lexicon. Establishing these links allows us to project the properties of the controlled lexical entries onto the words and increase the semantic content of the documents. For instance, linking the words of a document to the terms contained in a lexicon of biological entities identifies those words as potential biological entities. As a result, more accurate decisions become possible. All the features designed to describe the relation between the words and a specific lexicon are grouped under the label of lexical features. The resources employed to build these features are well-known existing biomedical lexicons, but also task-specific ones. In the following subsections we present both the generated and the publicly available terminologies used in our experiments.
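As a minimal sketch of this idea, each token can be marked with a binary lexical feature indicating whether it belongs to a lexicon of biological entities. The lexicon entries below are illustrative toy data, not terms from our actual resources:

```python
# Hypothetical lexicon, loaded as a set of lowercased terms.
biological_lexicon = {"p53", "kinase", "brca1"}

def lexical_features(tokens, lexicon):
    """Return 1 for tokens found in the lexicon, 0 otherwise."""
    return [1 if tok.lower() in lexicon else 0 for tok in tokens]

sentence = "The BRCA1 kinase pathway was studied".split()
print(lexical_features(sentence, biological_lexicon))
# → [0, 1, 1, 0, 0, 0]
```

In practice such indicators are combined with many other features; by itself, lexicon membership only flags a word as a potential biological entity.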

4.2.2.1 Expert Curators Lexicon

What we call the Expert Curators (EC) lexicon is a vocabulary regrouping all the gene names extracted by the experts from the documents of the training set during the curation process. Recall that the training set contains 15,000 sentences, from which the curators retrieved 18,256 gene names composed of 11,627 different words. Using the EC lexicon to produce lexical features is an attempt to check whether a very specific lexicon, strongly related to gene names, can improve the performance.

[Figure: histogram of the proportion of protein names (0-60%) against the number of words in the name (1-12).]

Figure 15 Common and exclusive words in the training and test set of the Gene Mention task

We can see in Figure 15 that only 50% of the words composing the gene names of the test set of the Gene Mention task can also be found in the training set. This means that if we decide which words of the test set are part of gene names based solely on their membership in this lexicon, we can retrieve at most 50% of the mentions. Such a recall is definitely below the one we are aiming for and argues against using a restricted vocabulary to construct the feature space. However, even if this lexicon seems inappropriate as the unique source of features, it can be an economical way to increase the recall: used in combination with other features, it could become valuable.
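The recall ceiling imposed by such a closed lexicon follows from simple set arithmetic. A sketch with toy word sets (the real sets come from the curated corpus; these values are chosen only to mirror the 50% observation):

```python
# Sketch: the maximum recall reachable when tagging only words
# already seen in the training lexicon (toy sets for illustration).
train_gene_words = {"p53", "ras", "kinase", "receptor"}
test_gene_words = {"p53", "kinase", "myc", "homolog"}

covered = train_gene_words & test_gene_words
max_recall = len(covered) / len(test_gene_words)
print(f"maximum recall: {max_recall:.0%}")
# → maximum recall: 50%
```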

4.2.2.2 Expert Curators Lexicon vs. GPSDB

Looking at the properties of the EC lexicon, we realize that its content is highly specialized and of limited coverage, and thus unlikely to offer satisfactory generalization power. A way to address this issue is to take a lexicon that is still strongly related to gene and protein terminology but that possesses a broader coverage. The terminology selected to fulfill this requirement is the Gene and Protein Synonym DataBase (GPSDB). This lexicon contains many synonyms and term variants and is updated regularly. More precisely, the resource contains 532,970 different synonyms describing 319,386 gene and protein entities, and the total number of species/subspecies taken into account exceeds 7,000.

[Figure: pie chart with segments of 62%, 19% and 19%, labeled "Words only in EC train", "GM train common", and "Common words in train and test".]

Figure 16 Number of terms per unique ID in GPSDB

It is interesting to contrast the coverage of the EC and GPSDB resources. Among the 9,574 words existing in the EC lexicon (gene names of the training set), 6,789 (70.9%) are also contained in the GPSDB lexicon. On the test set side the proportion is slightly better: 78% of the words composing the gene names of the test set can be found in the GPSDB lexicon.

These two percentages are interesting as they reveal that a lexicon aiming to cover most existing gene and protein names, as well as their synonyms and variants, still misses about 30% of the names employed in the literature. Even a terminology such as GPSDB, which is updated frequently and aims for large coverage, fails to cover all the potential gene name variations. This underlines that the terminology employed by scientists evolves too quickly to be fully recorded in any lexicon.

4.2.2.3 Wall Street Journal terminology

We make a strong assumption about the nature of the terminology related to gene and protein names: we assume that the words belonging to this terminology are specific to the biomedical domain and are unlikely to be employed in general English. Based on this hypothesis, we claim that an English lexicon can be used to identify the words related to genes and proteins.

Applying such a strategy first requires building a “common” English lexicon. To build it, we parse a collection of documents unrelated to the biomedical domain and record the frequency of the words. The quality of the lexicon built with this procedure depends strongly on the nature of the input corpus. We have to find a corpus that covers the largest possible variety of topics while remaining as unrelated as possible to the medical or biological field. Given these constraints, we decide to use the Wall Street Journal corpus as the source of articles. As the articles of the newspaper primarily cover U.S. and international business as well as financial news and issues, we make the reasonable assumption that specific genomic terms will not appear frequently in them.

The Wall Street Journal (WSJ) is an American English-language international daily newspaper that has a worldwide daily circulation of more than 2 million as of 2006.

[Figure: histogram of the percentage of IDs (0-30%) against the number of terms per ID (1-9, >9).]

Figure 17 Frequencies of the words in the expert curators and Wall Street Journal lexicon

In Figure 17, we clearly see that the words contained in the WSJ lexicon and those in the Gene Mention lexicon have distinct frequency distributions. Even if this comparison must be taken with caution, as the sizes of the two corpora are completely different, it still indicates that the gene mention lexicon possesses many more distinct words than the WSJ lexicon. It may be surprising that as many as 45% of the words in the WSJ are unique, but this is easily explained by the nature of the corpus.

Indeed, this corpus contains many articles regarding finance that include company names and proper names that are likely to appear only once.

WSJ articles are provided in XML format. In addition to the content of the article itself, the XML includes metadata such as the main subjects covered and the date of publication. The first operation required to extract all the words contained in the corpus is tokenization. A simple tokenization using only the space character as separator returns around 390,000 different words. This first list needs to be cleaned, as it still contains a lot of noise: words representing numbers must be deleted, and words ending with specific suffixes such as “s”, which we consider useless for our task, must be normalized. Moreover, as our process is case-independent, we can lowercase all the words to further reduce the vocabulary size. Once the numbers are removed, the suffixes cleaned, and all the words lowercased, the number of distinct word types is reduced to ~240,000. Figure 17 shows that there is a real difference in structure between a common English corpus and a biology-related corpus, as the latter contains far more unique words than the former (65% vs. 45%).
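The cleaning steps described above (space tokenization, dropping numeric tokens, stripping a trailing plural “s”, and lowercasing) can be sketched as follows; the actual suffix handling we used is richer than this deliberately crude illustration:

```python
def clean_vocabulary(text):
    """Space-tokenize, drop tokens representing numbers, strip a final
    plural 's', and lowercase, returning the set of remaining word types."""
    vocab = set()
    for token in text.split():
        token = token.lower()
        # Drop tokens representing numbers (allowing , and . separators).
        if token.replace(",", "").replace(".", "").isdigit():
            continue
        # Crude plural stripping (illustrative; real rules are richer).
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]
        vocab.add(token)
    return vocab

sample = "The bank reported 1,200 dollars in new profits"
print(sorted(clean_vocabulary(sample)))
# → ['bank', 'dollar', 'in', 'new', 'profit', 'reported', 'the']
```

Applied to the full WSJ corpus, this kind of normalization is what reduces the ~390,000 raw word types to roughly 240,000.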

4.2.2.3.1 Overlapping words

Attempting to retrieve all the words composing gene names by selecting every “uncommon” English word assumes that there is no overlap between the words occurring in gene names and those belonging to the “common” English vocabulary. In the opposite case, the overlapping words would not be considered as elements of a gene name and would consequently reduce the recall.

Unfortunately, this expectation is too optimistic: a comparison between the words that are part of gene names and the words of the “common” English lexicon reveals that many words are shared between the two lexicons. Indeed, many words occurring in gene names are not specific to the biological field and are also found in common English literature (words such as gene, sequence or activity, for example). Some words that are part of gene names indicate the species from which the gene has been studied; others indicate the body part where the gene is located. For instance, words such as yeast, human, mouse, heart, lungs or liver are obviously not specific to biological terminology. In addition, gene names may contain adjectives that, of course, can also be found in “common” English.

[Figure: histogram of the percentage of words (0-70%) against the words' absolute frequency (1-10, >10), for the EC lexicon and the WSJ lexicon.]

A comparison of the two lexicons reveals that 1,800 words are shared between the gene names contained in the EC lexicon and the “common” English vocabulary. Deciding how these shared words should be handled is a tricky question, as every choice has its advantages and drawbacks. On the one hand, keeping these words inside the “common” English lexicon will prevent us from reaching an optimal recall, as some terms will never be selected as components of gene names. On the other hand, considering all these ambiguous words as constituents of gene names can be expected to strongly reduce the precision: as the shared words are likely to appear with relatively high frequency in contexts unrelated to biology, they often act as false positives. A middle way can be found by keeping the most “common” terms in the English lexicon and removing those that are more likely to be part of gene names.
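One way to implement such a middle way is to compare each shared word's relative frequency in the two corpora and keep it in the “common” English lexicon only when it is clearly more frequent there. The threshold and the frequency values below are illustrative assumptions, not figures from our corpora:

```python
def keep_in_english_lexicon(word, freq_english, freq_gene, ratio=2.0):
    """Keep a shared word in the 'common' English lexicon only if its
    relative frequency in general English is at least `ratio` times its
    relative frequency inside gene names (threshold is illustrative)."""
    return freq_english[word] >= ratio * freq_gene[word]

# Toy relative frequencies (per million words, hypothetical values).
freq_english = {"activity": 120.0, "receptor": 1.5}
freq_gene = {"activity": 30.0, "receptor": 80.0}

print(keep_in_english_lexicon("activity", freq_english, freq_gene))  # True
print(keep_in_english_lexicon("receptor", freq_english, freq_gene))  # False
```

Under this heuristic, a word like “activity” stays in the English lexicon, while a word like “receptor”, far more frequent inside gene names, is removed from it.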

4.3 Methods

We saw in the state of the art that many approaches are suitable for a NER task. It is possible to employ a dictionary method, which consists in searching the documents for terms contained in a reference vocabulary. We can also use a rule-based approach that applies a specific set of rules to detect the words likely to be part of gene names. We decide to employ a machine-learning approach, as it is the most flexible and allows combining the strengths of the other approaches through a clever choice of features. The machine-learning approach consists of using a classification algorithm to learn a set of rules (a model) indicating how to combine the relevant features in order to best classify the training instances. The model can then be applied to previously unseen instances of the test set.

In order to perform the classification using machine learning, we have to choose a classification algorithm and represent our data set as a set of instances described by a set of features. The choice of the algorithm and of the features has a crucial influence on the performance level. The following subsections present the choices made for these parameters.
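To make the instance representation concrete, each token can be described by a small feature dictionary combining surface shape, lexicon membership, and context. The features and the toy lexicon below are illustrative, not our final feature set:

```python
def token_features(tokens, i, lexicon):
    """Describe token i by a few illustrative features: surface shape,
    lexicon membership, and the neighbouring words."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[0].isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "in_lexicon": tok.lower() in lexicon,
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

sentence = "The p53 gene was mutated".split()
print(token_features(sentence, 1, {"p53", "brca1"}))
```

Such dictionaries are then converted into vectors and fed, together with the gene/non-gene labels of the training instances, to the chosen classification algorithm.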