6 Protein-protein Interaction Extraction

6.2 Data Description

6.2.1 Training set

The training data was derived from the content of the IntAct and MINT databases. The data files of both databases are freely available for download and are compliant with the HUPO PSI Molecular Interaction Format.

In principle, any of the data files from the IntAct and MINT FTP servers could be used to derive protein interaction pairs (with their corresponding UniProt IDs or accession numbers), but for this sub-task only the files released in 2005 and 2006 are used. Experts reviewed all the articles contained in these databases to check whether they contain interaction annotation information relevant to the database. They then manually extracted the protein interactions mentioned, linking each interacting protein to its unique UniProt ID or accession number.

Figure 35 Number of interactions per document in the training set

6.2.2 Test set

The interaction databases MINT and IntAct are holding back a set of curated records to produce the test set for BioCreAtIvE. Both are making a considerable annotation effort to produce the test and training data collections. Around 300 publications are expected to be part of this test set.

[Figure 35: x-axis, number of interactions (1 to 22, then >22); y-axis, percentage of documents (0% to 16%).]

6.2.3 Terminology

In this section, we present the terminologies required to carry out the whole protein-protein interaction extraction process. The first resource we look at is the SwissProt database, which is used to normalize the protein mentions unambiguously. We then present the terminologies related to the different pieces of evidence needed to construct the interactions, namely GPSDB for the protein mentions, a specific handcrafted lexicon for the interactors, and NEWT for the species.

6.2.4 SwissProt

SwissProt (Bairoch and Apweiler 2000) is a protein sequence and knowledge database maintained by the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI). This database has been constructed in order to offer a reliable source of information to researchers in the life sciences. It gives access to all the available information regarding a given protein in a given species.

The annotation is done manually by a team of biologists. It is drawn mainly from the journal articles related to the actual sequencing of the genome, which sometimes also give information about the characterization of the proteins (Junker and Contrino 2000). This data set is of great value for the quality of its annotations, its use of standardized nomenclature, its minimal level of redundancy, and its high level of integration with other specialized databases (Gasteiger, Jung et al. 2001). The information available about a protein includes its cellular location, its domain structure, its posttranslational modifications (PTM), how the protein evolves with time, and most importantly its role in the organism, that is, its function. Each protein entry provides an interdisciplinary overview of relevant information by bringing together experimental results, computed features, and sometimes even contradictory conclusions (Boeckmann, Bairoch et al. 2003). If we have no idea about the role of a protein, we can infer its function by looking at other proteins with a similar sequence. In this respect, SwissProt is a resource that helps to characterize newly identified proteins for which the amino acid sequence has been determined.

Unfortunately, this condensed information does not replace scientific articles, in the same way that information found in an encyclopedia does not replace the original texts.

6.2.5 GPSDB vs. EntrezGene

Whereas in the previous chapter the purpose was to normalize the proteins with respect to EntrezGene, here we have to assign identifiers from GPSDB. Comparing the two lexicons reveals significant differences between these two terminologies. First, the resources provided for the previous task contained fewer proteins than GPSDB: whereas the provided EntrezGene lexicon contained 170,537 unique protein identifiers, GPSDB contains 1,169,022 identifiers.

If we look at the coverage of EntrezGene by GPSDB, it may seem surprising that 40% of the EntrezGene terms are not contained in GPSDB (Figure 36). However, a deeper analysis of EntrezGene reveals that many of the protein and gene names provided are, in fact, noise that has been filtered out of GPSDB. For instance, many of the terms provided in EntrezGene are linked to other databases or concern locus information, and are not taken into account in GPSDB.
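The coverage figure discussed above can be illustrated with a minimal sketch. We assume here that terms are compared after lowercasing and trimming, which is our own simplification and not necessarily the matching rule used to produce Figure 36. The function name `coverage` and the toy entries are hypothetical.

```python
def coverage(source_terms, target_terms):
    """Return the fraction of source_terms that also occur in target_terms."""
    # Assumption: two terms match if they are equal after lowercasing/trimming.
    source = {t.strip().lower() for t in source_terms}
    target = {t.strip().lower() for t in target_terms}
    if not source:
        return 0.0
    return len(source & target) / len(source)


# Toy lexicons (hypothetical entries, not taken from either resource).
entrez_terms = ["BRCA1", "p53", "TP53", "LOC12345", "MIM:191170"]
gpsdb_terms = ["brca1", "p53", "tp53", "cdc42"]

print(f"{coverage(entrez_terms, gpsdb_terms):.0%} of the toy terms are covered")
```

In this toy example, the two uncovered terms are a locus label and a cross-database reference, which mirrors the kind of noise described above.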

Figure 36 Proportion of EntrezGene terms found in GPSDB

[Figure 36: found also in GPSDB, 60%; only in Entrez Gene, 40%.]

Another difference between the two lexicons concerns the distribution of the number of terms related to a single ID. This distribution reveals how prone the proteins recorded in each lexicon are to having many synonyms.

Figure 37 Number of terms per ID given the GPSDB and EntrezGene Lexicon

[Figure 37: x-axis, number of terms per ID (1 to 20, then >20); y-axis, 0% to 30%; series: GPSDB and ENTREZ GENE.]

If we look at the number of terms attached to each unique ID, we see that the distributions in the two lexicons are different (Figure 37). On average, the number of terms related to a GPSDB identifier is greater than the number related to an EntrezGene identifier. As GPSDB is built by merging several databases, a single identifier is likely to carry many different names. In contrast, EntrezGene is a more centralized resource that more often contains only one name per protein.
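A distribution such as the one in Figure 37 can be computed along the following lines. This is only a sketch: the input format (a list of identifier and term pairs) and the function name are our assumptions, and the toy entries are hypothetical.

```python
from collections import Counter, defaultdict


def terms_per_id_distribution(entries):
    """entries: iterable of (identifier, term) pairs.

    Returns {number_of_terms: share_of_identifiers}.
    """
    terms_by_id = defaultdict(set)
    for identifier, term in entries:
        terms_by_id[identifier].add(term)
    # Count how many identifiers carry 1 term, 2 terms, and so on.
    counts = Counter(len(terms) for terms in terms_by_id.values())
    total = sum(counts.values())
    return {n: counts[n] / total for n in sorted(counts)}


# Hypothetical toy lexicon.
toy_lexicon = [
    ("ID1", "cdc42"), ("ID1", "G25K"), ("ID1", "cell division control protein 42"),
    ("ID2", "p53"),
    ("ID3", "BRCA1"), ("ID3", "RNF53"),
    ("ID4", "actin"),
]
print(terms_per_id_distribution(toy_lexicon))
```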

The last and most important difference between these two lexicons concerns the species. Whereas the provided EntrezGene entries covered only human proteins, GPSDB covers the proteins studied in every species. This makes the normalization much harder because the species has to be taken into account during the process. If the wrong species is chosen for the document, the proteins identified in this document do not receive the right identifier, even though they have been properly recognized in the text.
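The dependence of the identifier on the species can be made concrete with a small sketch. We assume a lexicon keyed by the pair (protein name, species); the identifiers and the name `SPECIES_LEXICON` are hypothetical placeholders and do not reflect the actual GPSDB format.

```python
# Hypothetical lexicon: (protein name, species) -> identifier.
SPECIES_LEXICON = {
    ("cdc42", "homo sapiens"): "ID_HUMAN_CDC42",
    ("cdc42", "mus musculus"): "ID_MOUSE_CDC42",
}


def normalize(name, species):
    """Return the identifier for a recognized name, given the chosen species."""
    return SPECIES_LEXICON.get((name.lower(), species.lower()))


# The same, correctly recognized mention receives different identifiers
# depending on the species selected for the document.
print(normalize("Cdc42", "Homo sapiens"))
print(normalize("Cdc42", "Mus musculus"))
```

A correctly recognized mention paired with the wrong species therefore yields a wrong identifier, or none at all.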

6.2.6 Interactors

Finding protein names in a sentence is not sufficient to infer the presence of an interaction. In order to confirm the existence of an interaction in a sentence, we also need to find a word that indicates the action of one protein on another. This word is usually a transitive verb, but some nouns can give useful information as well.

We already discussed interactors in Chapter 3, but let us recall what we mean by “interactors”. In most cases, an interaction appears in the text under one of the following three patterns: “X interacts with Y”, “X and Y interact” or “the interaction between X and Y”. In all these situations, the word “interact” can be seen as a trigger indicating the presence of the interaction. Obviously, “interact” is not the only word that can act as a trigger; many others, such as “bind” or “catalyse” (cf. the complete list in Table 51), have the same property and therefore have to be considered as “interactor” words.
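The three surface patterns above can be sketched as regular expressions. This is an illustration only: we assume that a protein mention is a single token, which is a strong simplification, and the names `PROTEIN`, `PATTERNS` and `find_candidate_pairs` are hypothetical.

```python
import re

# Assumption: a protein mention is a single alphanumeric token.
PROTEIN = r"[A-Za-z0-9-]+"

PATTERNS = [
    # "X interacts with Y"
    re.compile(rf"(?P<a>{PROTEIN}) interacts? with (?P<b>{PROTEIN})"),
    # "X and Y interact"
    re.compile(rf"(?P<a>{PROTEIN}) and (?P<b>{PROTEIN}) interact\b"),
    # "the interaction between X and Y"
    re.compile(rf"interaction between (?P<a>{PROTEIN}) and (?P<b>{PROTEIN})"),
]


def find_candidate_pairs(sentence):
    """Return the (X, Y) pairs matched by any of the three patterns."""
    pairs = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            pairs.append((match.group("a"), match.group("b")))
    return pairs


print(find_candidate_pairs("Cdc42 interacts with WASP."))
print(find_candidate_pairs("Cdc42 and WASP interact in vivo."))
print(find_candidate_pairs("We studied the interaction between Cdc42 and WASP."))
```

Replacing “interact” with the other trigger words would extend the same three templates to the rest of the vocabulary.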

6.2.6.1 Verbs

An important step in the detection of protein-protein interactions is the localization of transitive verbs. Under the premise that an interaction is a transitive relation in which one protein (the subject) acts on another protein (the object), transitive verbs play a central role in the identification of such a structure. However, not all transitive verbs must be taken into consideration; only a very specific group of transitive verbs acts as strong clues to the presence of an interaction. In order to decide which ones are likely to reveal an interaction, we manually parsed the training data and selected all the verbs that are used to describe the interactions. As a result of this manual parsing, we gathered a list of 27 verbs that we group in the vocabulary.

Table 51 Selected verbs required to trigger an interaction

We insert the verbs not only in the third person singular of the simple present but also in the plural and simple past forms. Although we consider this set of verbs to be very strong indicators, we have to be careful not to use them blindly. Indeed, some of these words have several meanings depending on their context. For example, the word “complex” can be used as a verb that describes an interaction, or as a noun, in which case it does not indicate the presence of an interaction.

Based on the verbs indicating the probable presence of interactions, we can also derive some nouns that likewise act as clues to the presence of interactions. For example, “activation” can be derived from “activates” and, in the same manner, “interaction” can be derived from “interacts”.

In order to have an idea of the quality of these transitive verbs, we perform a simple evaluation based on the data provided in the former chapter. In the first chapter, the purpose was to distinguish the documents likely to contain protein-protein interactions from the others. To perform this classification, we employed different kinds of features. Among the possible representations of the documents, we used the transitive verbs. Based on this representation and on the data of the first experimental chapter, we perform a Chi2 feature selection on the transitive verbs.

Such a procedure gives us an idea of the discriminative power of these verbs and, indirectly, a measure of their quality.
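As an illustration of how such a score can be obtained, the sketch below computes the standard Chi2 statistic of a 2x2 contingency table for one term. We assume this common formulation; the text does not specify which exact variant was used, and the counts in the example are invented.

```python
def chi2_score(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table.

    a: relevant documents containing the term
    b: non-relevant documents containing the term
    c: relevant documents without the term
    d: non-relevant documents without the term
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: the term or the class never varies
    return n * (a * d - b * c) ** 2 / denominator


# Invented counts: a term occurring in 30 of 50 relevant documents
# and in 10 of 50 non-relevant documents.
print(round(chi2_score(30, 10, 20, 40), 2))
```

A term that is distributed identically in both classes obtains a score of zero, while a term concentrated in the relevant documents obtains a high score.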

Figure 38 Chi2 Feature selection score of the transitive verbs based on the data of the binary classification task

In Figure 38, we observe that a feature selection on the transitive verbs, based on the data provided for the binary classification task, selects very few verbs with strong confidence.

6.2.6.2 Interactors location

The discovery of an interaction verb in a sentence gives an indication about the presence of a possible interaction. However, this is not the only information that the verb can provide. Depending on the verb, we can also infer the location of the interacting proteins in the sentence. For instance, for the transitive verbs, we know that the interacting proteins are located before and after the verbs.

[Figure 38: Chi2 score (axis from 0 to 12) for the following terms, in the order in which they appear in the figure: interaction, binding, interacts, binds, interact, complex, interactions, bind, association, complexes, activation, associates, interacting, phosphorylation, associate, mediate, regulate, phosphorylated, regulates, activate, form, activated, forms, regulator, link, regulation, activates, modulate, transfected, complexed, inhibit, promotes, target, enhances, promote, localize, regulated, modulates, block, abolish, accelerate, assemble, suppress, stimulate, ligate.]

For the nouns, the situation is a bit more complex. Recall that we have three different situations:

• Protein 1, protein 2, interaction (XXX and YYY bind together)

• Protein 1, interaction, protein 2 (XXX interacts with YYY)

• Interaction, protein 1, protein 2 (the regulation of XXX by YYY)

As mentioned before, a transitive verb corresponds to situation 2. With nouns, we cannot easily decide which situation is the most likely, as all the cases are possible. For instance, for “interaction” we can have the three following sentences:

• Proteins A and B form an interaction

• The interaction between proteins A and B

• Protein A participates in an interaction with protein B

In order to discover the most likely configuration, it is useful to check whether a preposition follows the noun. Depending on this preposition, one can deduce the most likely location of the proteins in the sentence and possibly improve performance.
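The preposition heuristic can be sketched as a simple lookup. The mapping below is our own guess, derived only from the example sentences given above; it is not a list taken from the experiments, and the layout labels are hypothetical names.

```python
# Hypothetical mapping from the preposition that follows a trigger noun
# to the expected position of the two proteins relative to that noun.
PREPOSITION_TO_LAYOUT = {
    "between": "both_after",       # "the interaction between A and B"
    "with": "before_and_after",    # "A participates in an interaction with B"
    "of": "both_after",            # "the regulation of A by B"
}


def expected_layout(tokens, trigger_index):
    """Guess where the proteins lie relative to the noun at trigger_index."""
    if trigger_index + 1 < len(tokens):
        following = tokens[trigger_index + 1].lower()
        if following in PREPOSITION_TO_LAYOUT:
            return PREPOSITION_TO_LAYOUT[following]
    # Assumption: no preposition means the proteins precede the noun,
    # as in "A and B form an interaction".
    return "both_before"


sentence = "The interaction between A and B".split()
print(expected_layout(sentence, sentence.index("interaction")))
```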

6.2.7 Species

It is not rare for proteins and genes to be common to several species. Indeed, we share an important part of our genome with other animals; therefore, a single gene or protein can be found in numerous organisms. By analyzing the content of GPSDB, we find that 88% of the proteins listed in the lexicon are studied in only one species. The remaining 12% are referenced for at least two organisms and, in extreme cases, have been studied in more than a thousand species.

Figure 39 Cumulative frequency of the most common species in GPSDB

As protein names are ambiguous with regard to the species, the database identifiers of proteins and genes depend on the species. Consequently, in order to link a protein name to a unique ID, we must know not only the name of the protein but also the species to which the protein belongs.

[Figure 39: percentage of proteins covered (axis from 0% to 90%) plotted against percentage of species covered (axis from 0.01% to 1.83%).]

In order to identify the species from which the proteins/genes are extracted, we can, for instance, search for a direct mention of any species in the document. This requires a lexicon containing the names of all existing species. NEWT is a terminology containing such information: it is a taxonomy that hierarchically organizes information about more than 6,500,000 known species. Fortunately, we do not need to be able to recognize all these species, because we are only interested in the species that are related to at least one protein contained in GPSDB.

Among the 525,647 species contained in NEWT, only 8,261 are related to at least one of the proteins/genes contained in GPSDB. Moreover, among these 8,261 species, 2% (165) cover 80% of the proteins that appear in GPSDB (Figure 39). Although this is a very strong reduction with respect to the initial number of existing species, it can still cause many problems during the normalization.
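A cumulative coverage such as the one summarized above can be computed with the following sketch. We assume that the input gives, for each species, the number of protein entries attached to it; the function name and the toy counts are hypothetical.

```python
def species_needed(proteins_per_species, target=0.80):
    """Return how many of the most frequent species are needed to cover
    at least `target` of all protein entries."""
    counts = sorted(proteins_per_species.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0
    covered = 0
    for rank, count in enumerate(counts, start=1):
        covered += count
        if covered / total >= target:
            return rank
    return len(counts)


# Hypothetical counts of protein entries per species.
toy_counts = {"human": 50, "mouse": 30, "yeast": 10, "fly": 6, "worm": 4}
print(species_needed(toy_counts))
```

Note that a protein referenced for several species is counted once per species in this simplified version.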

6.2.7.1 Cross species interactions problem

If we look at the species as they appear in the interactions of the training set, we observe an interesting phenomenon. Surprisingly, the proteins participating in an interaction do not necessarily belong to a single species. The reason for this unexpected fact lies in the practical needs of experimentation. If researchers are interested in studying a protein that is common to several organisms, they can choose to extract the protein from any of these organisms, as it behaves in the same way in all the species. However, they usually have preferences among the species, as it is sometimes more convenient to extract the proteins from one species than from another. For instance, it is generally simpler, for technical reasons, to extract proteins from mice than from humans. Consequently, a protein shared by mice and humans is extracted from the mouse, although the other interacting protein comes from humans.

Figure 40 Number of species involved in one interaction

In the case of an experiment concerning an interaction between two human proteins, if one of them is common with the mouse, the researcher will tend to extract the shared protein from the mouse as it is more convenient and the other one from the human if there is no other choice.

[Figure 40: one species, 82% of the interactions; two species, 18%.]

We see in Figure 40 that two species are involved in 18% of the interactions. These interactions will be particularly difficult to extract, since doing so requires relating several species to several proteins using the information contained in the text.

6.3 Methods

Extracting protein-protein interactions from textual data is a complicated task. Two main families of methods are usually employed for this task: co-occurrence methods and methods relying on pattern matching. As we rely on a relatively small number of documents (fewer than 1,000), a method based on co-occurrences is not suited to our problem. Indeed, co-occurrence methods work on the idea that interacting proteins co-occur more frequently than randomly chosen proteins. Such a co-occurrence statistic is reliable only if it is computed on a large number of documents.
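To make the co-occurrence idea concrete, the sketch below compares the observed number of documents in which two proteins co-occur with the number expected if they were independent. This observed-to-expected ratio is only one possible co-occurrence statistic, chosen here for illustration; it is not the method used in this work, and the toy documents are invented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_ratio(documents):
    """documents: list of sets of protein names found in each document.

    Returns {(p, q): observed / expected}, where the expected count
    assumes that p and q occur independently of each other.
    """
    n_docs = len(documents)
    single = Counter()
    pair = Counter()
    for proteins in documents:
        single.update(proteins)
        pair.update(combinations(sorted(proteins), 2))
    ratios = {}
    for (p, q), observed in pair.items():
        expected = single[p] * single[q] / n_docs
        ratios[(p, q)] = observed / expected
    return ratios


# Invented toy corpus of four documents.
toy_documents = [{"A", "B"}, {"A", "B"}, {"A", "C"}, {"C"}]
print(cooccurrence_ratio(toy_documents))
```

With so few documents, a single additional co-occurrence changes the ratio considerably, which illustrates why the statistic is unreliable on a small collection.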

Our approach is closer to the pattern matching technique. We attempt to discover the interactions by identifying several pieces of evidence in the text and combining them to form the interactions. The three main pieces of evidence required to generate an interaction are the protein names, the interaction verbs, and the species. In this section, we present the methods used to extract this evidence and how they are combined to construct the interactions.

Let us enumerate here the different steps that lead to the extraction of the interactions:

• Recognition of the interactors: The fundamental element that triggers an interaction is a verb or a noun of a specific type indicating a probable interaction.

• Recognition of the proteins using NER or a dictionary method: All the potential proteins occurring in the documents are recorded, but no normalization is done at this point, since it first requires knowing which species are linked to these proteins.

• First selection of the proteins: As we are only interested in the proteins participating in an interaction, all those that are not found in a sentence containing an “interactor” are dismissed.

• Searching for the species in the documents: The documents are parsed to retrieve species names as they exist in the NEWT terminology.

• Searching for the species related to the selected proteins: As not all documents contain a textual mention of the species from which the proteins are extracted, we infer the possible species given the selected proteins.

• Normalization: Once the species have been discovered, the protein names can finally be normalized.

• Generating interactions: For each sentence containing at least one word indicating the presence of an interaction and two proteins, we generate an interaction.
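The steps listed above can be sketched as a toy pipeline. This is a deliberately simplified illustration under several assumptions of ours: all lexicon entries are hypothetical, protein and species names are single tokens, both proteins of a pair are normalized with the same species, and the step that infers species from the selected proteins is omitted.

```python
import re
from itertools import combinations

# All lexicon entries below are hypothetical placeholders.
TRIGGERS = {"interacts", "interact", "binds", "interaction"}
PROTEIN_LEXICON = {            # (name, species) -> identifier
    ("cdc42", "human"): "ID_H1",
    ("cdc42", "mouse"): "ID_M1",
    ("wasp", "human"): "ID_H2",
}
SPECIES_NAMES = {"human", "mouse"}
PROTEIN_NAMES = {name for name, _ in PROTEIN_LEXICON}


def tokenize(sentence):
    return re.findall(r"[A-Za-z0-9-]+", sentence.lower())


def extract_interactions(document):
    sentences = [tokenize(s) for s in re.split(r"(?<=[.!?])\s+", document) if s]
    # Species mentioned anywhere in the document.
    species = {t for tokens in sentences for t in tokens if t in SPECIES_NAMES}
    interactions = set()
    for tokens in sentences:
        # Keep only the sentences that contain an interactor.
        if not TRIGGERS.intersection(tokens):
            continue
        # Protein mentions of the sentence, not yet normalized.
        mentions = sorted({t for t in tokens if t in PROTEIN_NAMES})
        for first, second in combinations(mentions, 2):
            # Normalize with every species found in the document.
            for sp in species:
                id_a = PROTEIN_LEXICON.get((first, sp))
                id_b = PROTEIN_LEXICON.get((second, sp))
                if id_a and id_b:
                    interactions.add((id_a, id_b))
    return interactions


print(extract_interactions("We used human cells. Cdc42 interacts with WASP."))
```

Because the same species is applied to both proteins, this sketch cannot produce the cross-species interactions discussed earlier.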

6.3.1 Extracting interactors

Extracting interactors is a straightforward process. As interactors are only weakly ambiguous, we simply search the documents for the target words using an exact match method.

6.3.2 Extracting protein names

One of the essential pieces of evidence that must be found to construct the protein-protein interactions is of course the proteins themselves. In order to study the importance of the quality of the extracted named entities, we compare two different techniques. First, we test a simple dictionary method that consists of searching the text for exact occurrences of the terms of the GPSDB lexicon. Then we use a more recall-oriented technique that was already used in the previous chapter.

6.3.2.1 Protein names extraction using dictionary method

The principle of a dictionary method is to search documents for occurrences of the terms of a controlled lexicon. This technique has the advantage of finding high-quality matches but cannot efficiently handle the numerous variations present in the literature. Despite this limitation, researchers often prefer a dictionary for NER over more complicated techniques, such as machine learning, because it gives direct access to identifiers. Unfortunately, in our problem, the recognized protein names cannot be unambiguously identified.

To normalize the terms completely, it is not sufficient to discover the presence of a term; we also need to identify the species to which the protein is related. Nevertheless, although the full disambiguation problem is not solved, the dictionary method is adequate for simply extracting the protein names.
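A dictionary method of this kind can be sketched as an exact, longest-match lookup over the tokens of a sentence. The lexicon below is a hypothetical stand-in for GPSDB, and the longest-match strategy is our own choice for the illustration. Each match returns all candidate identifiers, one per species, which shows why the species is still needed for a complete normalization.

```python
import re

# Hypothetical lexicon: surface form -> identifiers, one per species.
NAME_LEXICON = {
    "cdc42": {"human": "ID_H1", "mouse": "ID_M1"},
    "wiskott-aldrich syndrome protein": {"human": "ID_H2"},
}
MAX_TERM_LENGTH = max(len(term.split()) for term in NAME_LEXICON)


def dictionary_match(sentence):
    """Return (surface form, candidate identifiers) for each exact lexicon
    match, preferring the longest match at each position."""
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence.lower())
    matches, i = [], 0
    while i < len(tokens):
        for length in range(min(MAX_TERM_LENGTH, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + length])
            if candidate in NAME_LEXICON:
                matches.append((candidate, NAME_LEXICON[candidate]))
                i += length
                break
        else:
            i += 1  # no term starts at this token
    return matches


for name, candidates in dictionary_match(
        "Cdc42 binds the Wiskott-Aldrich syndrome protein."):
    print(name, candidates)
```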

6.3.2.2 Protein names extraction using fuzzy metric

We saw in the previous chapter that recognizing the protein mentions with CRF produces the best balance between recall and precision. However, despite the quality of the results obtained with the CRF, we decided to use another method to extract the protein mentions. In the context of our task, we assume that recall is of greater importance than precision. There are two main reasons for this assumption. First, recognizing an interaction requires the identification of several proteins. If we are able to retrieve only 50% of them, it can lead to a situation where we miss all interactions. Indeed, if, for each potential sentence, we succeed in identifying only
