This guide to chapters provides the reader with an overview of the structure of the thesis and a summary of its content.

1.3.1 State of the art

Delivering a complete state of the art describing all concepts and techniques related to the protein-protein interaction extraction process is a challenging task. Indeed, the different steps that compose this process touch on a large number of important principles. Given this extensive scope, we divided the state of the art into three broad sections, each covering a specific topic related to one of the experimental chapters of the work.

Independently of the topics covered, all the sections of the state of the art chapter share a similar structure. First, we introduce the topic, its context and its importance as a standalone task. Then we focus on the specific problems linked to the subject, and finally we present the different methods that exist to solve these problems.

• The first topic we address in this state of the art is automatic text classification. This subject is directly linked to our work, since the ability to distinguish between relevant and irrelevant facts is crucial to the document classification task. In this section, we cover issues such as the uses of automatic text classification, the vector space model, feature selection techniques, the discretization issue and the class imbalance problem. The section ends with a chronological review of the techniques employed in the field.

• The second topic covered in the state of the art relates to the named entity recognition and named entity normalization tasks. We start by describing the major difficulties associated with the recognition and normalization of biological named entities. These difficulties include the widespread use of homonyms and synonyms, as well as the difficulty of detecting the boundaries of multi-word protein/gene names. Then, we present the most common methods used to solve such problems.

• The last part of the state of the art describes the challenges and techniques related to the extraction of protein-protein interactions. We explain the main experiments performed by biologists to detect these interactions. We also cover the problems related to the extraction of these interactions from the literature, and finally end with a presentation of the main techniques in the field.

1.3.2 Experimental chapters

The four experimental chapters cover the four different steps designed as components of the PPI extraction process. The overall procedure starts from an initial collection of documents and ends with a list of interacting proteins. We attempt to reflect the different stages of this process in the structure of the thesis.

The first experimental chapter focuses on the identification of the documents that are likely to contain information relative to PPIs. We treat this problem as a binary classification task in which the documents must be classified as relevant or not, depending on the information they contain concerning protein interactions. It is well known that a crucial element influencing the success or failure of classification is the choice of a good feature set. We conduct our experiments to validate the hypothesis that we can find a new set of features that offers better generalization power than words. On the one hand, we test a set of lexical features that represent the documents using high-level concepts extracted from controlled lexicons. On the other hand, we extract task-specific features that only make sense for discriminating the documents with regard to the particular target of the classification. The experimental section presents an exhaustive analysis of the performance obtained using the different feature sets.
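To make the binary document classification setting concrete, the following is a minimal sketch using a plain bag-of-words Naive Bayes classifier over a hypothetical toy corpus. The corpus, labels and smoothing choice are illustrative assumptions, not the thesis's actual feature sets or learning algorithm.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Train a multinomial Naive Bayes model on bag-of-words counts."""
    classes = set(labels)
    priors, word_counts, totals, vocab = {}, {}, {}, set()
    for c in classes:
        cdocs = [d for d, l in zip(docs, labels) if l == c]
        priors[c] = len(cdocs) / len(docs)
        counts = Counter(w for d in cdocs for w in d.lower().split())
        word_counts[c], totals[c] = counts, sum(counts.values())
        vocab |= set(counts)
    return priors, word_counts, totals, vocab

def classify(model, doc):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    priors, word_counts, totals, vocab = model
    best, best_score = None, -math.inf
    for c in priors:
        score = math.log(priors[c])
        for w in doc.lower().split():
            # Laplace smoothing: unseen words fall back to 1 / (N + |V|)
            score += math.log((word_counts[c][w] + 1) / (totals[c] + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical toy corpus standing in for the relevant/irrelevant abstracts.
docs = ["protein binds receptor", "kinase interacts with protein",
        "weather is sunny", "stock market rises"]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]
model = train_nb(docs, labels)
```

Replacing the raw words in `train_nb` with lexical or task-specific features, as the chapter proposes, only changes how each document is tokenized; the classifier itself is unchanged.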

The next chapter, entitled “Gene Mentions”, concerns a typical Named Entity Recognition (NER) task. As no PPI can be extracted without first identifying the mentions of genes and proteins in the documents, this step is crucial in the interaction identification process. In our approach, we treat the identification of entities in the text as a classification task in which each word has to be classified as a constituent of a gene name or not. In the context of this task, we raise several interesting questions about the possible strategies for identifying the mentions. Most existing approaches handle the problem using a sequential learning algorithm. However, there is no strong reason preventing the use of a non-sequential algorithm. We therefore assume that competitive performance can be reached with a non-sequential algorithm, depending on the set of features adopted to represent the instances. Recall that, whereas a non-sequential algorithm represents every document as a bag of independent words, a sequential algorithm represents the documents as a sequence of words and therefore adds relevant information to the model. As explained earlier, comparing these algorithms requires the generation of several feature sets to describe all possible candidates. We conduct a broad search for possible discriminative features and group them into three categories:

• The orthographic features, carrying information related to the internal structure of the words;

• The syntactic features, describing the words according to their role in a sentence;

• The lexical features, grouping all the features that describe the relationships that can exist between the words of the documents and those of a controlled lexicon.
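The first of these categories can be illustrated with a short sketch. The function below computes a handful of orthographic features for a single token; the exact feature names and the Greek-letter list are illustrative assumptions, not the feature set actually used in the chapter.

```python
import re

def orthographic_features(token):
    """Toy orthographic features describing a token's internal structure."""
    return {
        "init_cap": token[:1].isupper(),          # e.g. "Ras"
        "all_caps": token.isupper(),              # e.g. "BRCA1"
        "has_digit": any(c.isdigit() for c in token),
        "has_hyphen": "-" in token,
        "mixed_case": bool(re.search(r"[a-z][A-Z]", token)),
        # Hypothetical check for embedded Greek-letter names, common in gene names.
        "has_greek": bool(re.search(r"alpha|beta|gamma|delta|kappa", token.lower())),
        "length": len(token),
    }
```

Each token of a document would be mapped to such a feature vector and then classified as part of a gene name or not, by either a sequential or a non-sequential learner.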

In addition to the search for an efficient strategy to perform the task, we study the influence of the quality of the document set on the NER task. More specifically, we compare the performance depending on whether or not the input documents were previously classified.

The task performed in the third experimental chapter is a direct continuation of the second, as we attempt to unambiguously identify the genes mentioned in the text. The easiest way to assign an identifier to a gene is to perform a mapping to a controlled vocabulary. Unfortunately, given the lack of conventions in the naming of genes and proteins, performing this mapping is not as simple as it seems. As numerous variants are created constantly, no lexicon can ever be completely up to date.

Given this reality, we assume that normalization performance can be improved by increasing the coverage of the controlled lexicon using automatically generated term variants. Performing the normalization obviously requires a controlled lexicon to gain access to unique identifiers, but it also requires a careful choice of terms from the documents as likely candidates. Resolving the normalization involves analyzing the influence of the NER task on the normalization task and therefore allows exploring the relationship between these two stages. To study this influence, we compare the performance obtained when the mentions selected from the documents are identified using a simple technique with the performance obtained when they are located using conditional random fields.
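The normalization step described above can be sketched as a dictionary lookup with a fuzzy fallback. The lexicon entries and identifiers below are placeholders, and the string-similarity fallback uses the standard library's `difflib` rather than the variant-generation strategy actually studied in the chapter.

```python
import difflib

# Placeholder controlled lexicon: normalized gene names -> hypothetical identifiers.
LEXICON = {"brca1": "ID:0001", "tp53": "ID:0002", "esr1": "ID:0003"}

def normalize(mention, cutoff=0.8):
    """Map a textual mention to an identifier: exact match first, fuzzy fallback."""
    # Normalize surface variants such as hyphens and internal spaces.
    key = mention.lower().replace("-", "").replace(" ", "")
    if key in LEXICON:
        return LEXICON[key]
    # Fuzzy fallback: closest lexicon entry above the similarity cutoff, if any.
    close = difflib.get_close_matches(key, LEXICON, n=1, cutoff=cutoff)
    return LEXICON[close[0]] if close else None
```

The normalization in `key` already resolves common orthographic variants ("BRCA-1", "tp 53"); the fuzzy fallback stands in for the automatically generated term variants that the chapter proposes to widen the lexicon's coverage.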

Finally, the last experimental chapter covers all issues related to protein-protein interaction (PPI) extraction. To execute this task, we were inspired by information extraction techniques and defined simple patterns to identify the interactions. These patterns rely not only on the presence of several proteins but also on the location of a transitive verb and the identification of the species from which the proteins originate. The tight relationships between the numerous pieces of evidence involved in the resolution of the problem offer a suitable framework to validate the assumption that the overall performance of a complex process does not depend only on the individual success of each sub-task but is also strongly influenced by the links between these sub-tasks.

We present the different strategies that we use to extract this information. For the protein name mentions, we rely on two different dictionary methods: a simple method that searches the text for exact occurrences of terms from the controlled lexicon, and another that first selects the most likely candidates and then matches them against the controlled lexicon using a fuzzy metric. To identify the species from which the proteins originate, we either identify the species directly in the abstracts using textual evidence or, when no evidence is found, infer the most likely species given the proteins that have been identified in the documents. Finally, we simply identify the transitive verb based on the occurrence of terms contained in a predefined list. All these pieces of evidence can then be combined to construct the interactions. In the experimental part, we compare the influence of the protein recognition strategy on the overall performance. We are also interested in independently analyzing the influence of the other extracted information on the process.
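A minimal sketch of the pattern-based combination step is shown below, assuming a toy protein dictionary and transitive-verb list (both hypothetical stand-ins for the controlled resources described above), and omitting the species identification step for brevity.

```python
# Hypothetical resources: a tiny protein dictionary and interaction-verb list.
PROTEINS = {"BRCA1", "TP53", "RAD51", "MDM2"}
VERBS = {"binds", "activates", "inhibits", "phosphorylates"}

def extract_interactions(sentence):
    """Apply a simple 'PROTEIN verb PROTEIN' pattern to one sentence."""
    tokens = sentence.rstrip(".").split()
    pairs = []
    for i, tok in enumerate(tokens):
        if tok in VERBS:
            # Nearest known protein before and after the transitive verb.
            left = next((t for t in reversed(tokens[:i]) if t in PROTEINS), None)
            right = next((t for t in tokens[i + 1:] if t in PROTEINS), None)
            if left and right:
                pairs.append((left, tok, right))
    return pairs
```

Because each extracted triple depends on the protein recognition and verb detection stages simultaneously, even this toy version makes visible how an error in one sub-task propagates to the final interaction list, which is the dependence the chapter sets out to measure.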

We conclude our thesis by summarizing the main achievements. The hypotheses we formulate throughout the chapters trigger interesting discussions. As one of our main assumptions concerns the links between the sub-tasks of a complex process, we naturally discuss the importance of context and minimal commitment. Moreover, the use of many innovative feature sets leads us to argue for the importance of controlled lexicons and expert knowledge in the PPI extraction process. Finally, a prospective discussion provides some clues about future research.