7.3 Gene normalization

Recognizing gene/protein mentions in the text is not sufficient to store information in a structured way in a database. Indeed, as mentioned before, there are usually many different spellings referring to a single gene or protein. Such ambiguities obscure the relations between the mentioned genes/proteins and the normalized entries of molecular biology databases. Our approach to normalization relied principally on string matching methods. Since matching all words and word variants in a document against a sizeable dictionary such as GPSDB would be computationally too costly, the first decision to be made before performing the comparison was to choose which document features are worth matching against those contained in the reference lexicon. Such a preliminary step provided an opportunity to assess our general assumption concerning the relative role played by each module when performing complex tasks. Recall that we formulated the hypothesis that the consistency between the different modules of a process is as important as the performance of the modules taken independently. This raised interesting questions about the influence of the quality of the candidates extracted from the documents on the normalization. There was a delicate balance to strike in the way we selected the candidates: if we were too restrictive in choosing the candidates in the documents, we could not reach a competitive level of recall; on the other hand, if we used a liberal method, many more correct normalizations were generated, but many irrelevant ones as well, which induced a low precision.

The basic candidate set that we considered for this task consists of the words that are employed exclusively in a biomedical context. Using this candidate set highlighted the importance of the words shared between the "common" English vocabulary and the terminology related to gene and protein names. Indeed, whether or not this small subset of words was considered as valid candidates had a strong impact on precision as well as on recall. By removing these ambiguous words from the candidate set, we obtained an increase in precision of over 25%. As the loss in recall was limited to 4% to 12% (depending on the training and test sets), the overall operation was beneficial. It made us aware of how important it is to treat these words differently depending on the context. As taking the context into consideration is exactly what Conditional Random Fields allow, we applied the method that we developed in the previous chapter.
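As an illustration, the sketch below shows what such a candidate filter could look like in Python; the two word sets are tiny hypothetical stand-ins for a full English vocabulary and a GPSDB-derived lexicon, not the actual resources used in this work.

```python
# Minimal sketch of candidate selection before dictionary lookup.
# The two word sets are hypothetical stand-ins for a full English
# vocabulary and a GPSDB-derived gene/protein lexicon.
COMMON_ENGLISH = {"for", "not", "was", "the", "with"}   # "for", "not" can also be gene symbols
LEXICON_TERMS = {"for", "not", "brca1", "tp53", "mdm2"}

def select_candidates(tokens, keep_ambiguous=False):
    """Keep tokens found in the reference lexicon.

    With keep_ambiguous=False, tokens shared with common English are
    dropped, which raises precision at a limited cost in recall.
    """
    candidates = []
    for token in tokens:
        t = token.lower()
        if t not in LEXICON_TERMS:
            continue
        if not keep_ambiguous and t in COMMON_ENGLISH:
            continue
        candidates.append(token)
    return candidates

tokens = "TP53 was not mutated , for BRCA1 see below".split()
print(select_candidates(tokens))                       # ['TP53', 'BRCA1']
print(select_candidates(tokens, keep_ambiguous=True))  # ['TP53', 'not', 'for', 'BRCA1']
```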

We observed that employing this method to select the candidates allowed us to reach a good balance between recall and precision. Indeed, compared to the simple exclusion of ambiguous words, the loss in precision induced by the use of Conditional Random Fields was slightly smaller (1% to 10% depending on the training and test sets), while the recall increased more strongly, rising above 35%.
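As a minimal sketch of such context-dependent treatment, the example below uses the sklearn-crfsuite package as a stand-in for the CRF implementation of the previous chapter; the feature set and toy training data are simplified assumptions, not the actual configuration used in this work.

```python
import sklearn_crfsuite

def features(sent, i):
    """Simplified context features; the actual feature set of the
    previous chapter is richer (orthography, affixes, etc.)."""
    w = sent[i]
    return {
        "lower": w.lower(),
        "has_digit": any(c.isdigit() for c in w),
        "is_title": w.istitle(),
        "prev": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

def sent2features(sent):
    return [features(sent, i) for i in range(len(sent))]

# Toy training data (hypothetical): "for" is a gene symbol in one
# context and a plain English word in the other.
train_sents = [["the", "for", "gene", "is", "expressed"],
               ["we", "waited", "for", "the", "results"]]
train_tags = [["O", "GENE", "O", "O", "O"],
              ["O", "O", "O", "O", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit([sent2features(s) for s in train_sents], train_tags)
print(crf.predict([sent2features(["the", "for", "gene"])]))
```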

The second experiment we performed attempted to validate the hypothesis that it is possible to improve the overall F-measure of the normalization task by augmenting the coverage of the reference lexicons with automatically generated term variants. By creating such variants, we aimed to generate more highly confident matches between the controlled lexicon and the terms found in the text. We generated four types of variants. The first was based on the presence or absence of hyphens. The second split multi-word terms at punctuation marks or parentheses. The third removed all the ambiguous words, and the last set of variants was generated using specific rules developed empirically by observing the literature. When analyzing the results, we realized that the only method that did not harm precision (a 1% increase) while increasing recall (a 1.9% increase) was the use of expert-generated rules. All the other variants (hyphen, split, unambiguous), despite their beneficial effect on recall (increases of 1.5% to 5.8%), reduced precision quite strongly (by 4.9% to 12.1%). This underlines the importance of including expert knowledge to extend the reference lexicon in a clever way.
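The first three variant types can be sketched as follows; the expert rules are not reproduced here, and the example term and ambiguous-word list are hypothetical.

```python
import re

def hyphen_variants(term):
    """Variants obtained by toggling hyphens, e.g. "IL-2" <-> "IL 2" <-> "IL2"."""
    variants = {term}
    if "-" in term:
        variants.add(term.replace("-", " "))
        variants.add(term.replace("-", ""))
    return variants

def split_variants(term):
    """Variants obtained by splitting a multi-word term at punctuation
    marks or parentheses and keeping the fragments."""
    fragments = re.split(r"[,;:()\[\]]", term)
    return {f.strip() for f in fragments if f.strip()}

def expand_lexicon(lexicon, ambiguous_words):
    """Expanded lexicon: hyphen and split variants, minus ambiguous words."""
    expanded = set()
    for term in lexicon:
        for v in hyphen_variants(term):
            expanded |= split_variants(v)
    return {t for t in expanded if t.lower() not in ambiguous_words}

print(expand_lexicon({"interleukin-2 (IL-2)"}, {"protein"}))
# e.g. {'interleukin-2', 'IL-2', 'interleukin 2', 'IL 2', 'interleukin2', 'IL2'}
```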

7.4 PPI extraction

The last chapter reports on our attempt to tackle the core of the problem, the generation of the interactions. We do not restate here the importance of extracting such interactions, as it was discussed earlier. The difficulties met in this chapter confirmed that extracting protein-protein interactions is a tough task. Indeed, it requires not only answering the challenges that have been covered until now, but it also adds a new level of complexity. The most salient example concerns normalization: whereas in the previous chapter we were limited to proteins related to humans, here we are interested in every organism. This additional dimension made the normalization much harder, as it required not only finding the protein mentions accurately but also identifying the specific species linked to each protein. To construct the interactions, we relied on the idea that they can be detected using simple patterns such as "X interacts with Y". Such approaches require identifying the transitive verbs that we considered as triggers for the presence of an interaction, the proteins in the document, and the species.
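As an illustration, here is a minimal, hypothetical sketch of such pattern-based extraction in Python; the trigger verb list and the token shapes are simplified assumptions, not the actual resources used in this work.

```python
import re

# Hypothetical trigger verbs; the thesis relied on a curated list of
# transitive verbs considered as interaction triggers.
TRIGGERS = r"(?:interacts? with|binds?|phosphorylates?|activates?)"

# "X <trigger> Y", where X and Y must be previously recognized
# protein mentions.
PATTERN = re.compile(
    r"(?P<x>\b[A-Za-z0-9-]+\b)\s+" + TRIGGERS + r"\s+(?P<y>\b[A-Za-z0-9-]+\b)"
)

def extract_interactions(sentence, protein_mentions):
    pairs = []
    for m in PATTERN.finditer(sentence):
        x, y = m.group("x"), m.group("y")
        if x in protein_mentions and y in protein_mentions:
            pairs.append((x, y))
    return pairs

print(extract_interactions("p53 interacts with MDM2 in vivo.", {"p53", "MDM2"}))
# [('p53', 'MDM2')]
```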

The complexity inherent in the modular structure of the protein-protein interaction extraction task was well suited to validate or invalidate our main assumption, namely that the overall results depend not only on the performance of the modules involved in the process taken independently, but also, strongly, on the consistency of the relationships between these modules. To validate this hypothesis, we attempted to analyze independently how the accurate identification of each piece of evidence required in the whole process influences the quality of the overall results. The observations made throughout our experiments revealed that the performance obtained in extracting each piece of evidence is strongly influenced by the success in extracting the others, which tends to corroborate our hypothesis that all the subtasks of a process are interdependent. Indeed, once the protein mentions and the transitive verbs are identified, the initial set of proteins is filtered by dismissing the proteins that do not occur in any of the generated interaction patterns. This first selection influences the retrieval of the species, because the choice of the possible species is based on the proteins that have been selected in the text. The selected species in turn influence the normalization of the proteins and, finally, the generation of the interactions. Understanding this tight relationship made us conscious that each time an error is made at one step, it affects all the steps downstream and ultimately strongly reduces the precision and recall of the generated interactions. In this regard, this chapter revealed the importance of seeing the tasks as parts of a strongly interconnected process rather than independently. Given the strong interconnection between the subtasks, it is very difficult to determine the real source of the errors; the toy computation below illustrates how quickly per-step errors compound.
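Under the simplifying assumption that the steps fail independently (an illustrative assumption, not a claim about our actual error distribution), the end-to-end precision is roughly the product of the per-step accuracies. All numbers below are hypothetical except the 63% species accuracy reported further down.

```python
# Toy illustration of how per-step errors compound along the cascade.
steps = {
    "protein mention detection": 0.85,
    "interaction pattern matching": 0.80,
    "species assignment": 0.63,
    "protein normalization": 0.75,
}

overall = 1.0
for name, accuracy in steps.items():
    overall *= accuracy
    print(f"after {name:<30s}: {overall:.2f}")
# Even with reasonable per-step accuracies, only about 32% of the
# generated interactions survive every step unscathed.
```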

Nevertheless, we can still get a broad idea by looking separately at each evidence extraction process. At the level of the detection of the transitive verbs, we noticed that many of these verbs lead to incorrect interactions. Only a small percentage of the verbs, such as "forms" or "modify", trigger exclusively positive interactions. Unfortunately, most of the other verbs led to the production of a large proportion of erroneous interactions (on average, more than 80% of the produced interactions are incorrect).

However, as noted above, we do not know whether this is due to incorrect patterns or to incorrect normalization. At the level of the species, we observed that by returning only the most likely species, we retrieve only around 40% of them and give a correct answer 63% of the time. By taking more species into consideration, we increase the recall up to 73%, but this strongly sacrifices precision, which drops to 18%.
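A minimal sketch of this trade-off, assuming species likelihood is approximated by mention frequency (a simplification of the actual method):

```python
from collections import Counter

def rank_species(species_mentions, k=1):
    """Return the k most frequently mentioned species in a document.

    k=1 mirrors the "most likely species" strategy (higher precision,
    lower recall); larger k trades precision for recall, as observed
    in the experiments above.
    """
    counts = Counter(species_mentions)
    return [s for s, _ in counts.most_common(k)]

mentions = ["human", "human", "mouse", "yeast", "human", "mouse"]
print(rank_species(mentions, k=1))  # ['human']
print(rank_species(mentions, k=3))  # ['human', 'mouse', 'yeast']
```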

On the side of the protein extraction method, we compared two approaches: one searching for protein mentions based on a perfect match between the text and the reference lexicon, the other using a fuzzy distance. We observed that so many errors were introduced earlier in the other steps of the process that, in the end, the use of a more efficient method to identify the protein mentions did not make a big difference.
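To make the contrast concrete, here is a minimal sketch of exact versus fuzzy lookup; difflib's similarity ratio (Python standard library) stands in for the fuzzy distance used in this work, whose exact definition is not reproduced here, and the lexicon and threshold are hypothetical.

```python
import difflib

def match_mention(mention, lexicon, threshold=0.85):
    """Exact match first; otherwise fall back to fuzzy matching."""
    if mention in lexicon:
        return mention
    best, best_score = None, 0.0
    for term in lexicon:
        score = difflib.SequenceMatcher(None, mention.lower(), term.lower()).ratio()
        if score > best_score:
            best, best_score = term, score
    return best if best_score >= threshold else None

lexicon = {"interleukin-2", "tumor protein p53", "MDM2"}
print(match_mention("interleukin 2", lexicon))  # 'interleukin-2'
```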