• Aucun résultat trouvé

3 Binary Classification

3.2 Data Description

3.2.1 Documents collection

3.2.2.3 UMLS Semantic types

over 4,600 international biomedical journals.

3.2.2.2 MeSH terms

Medical Subject Headings (MeSH) is a large sized controlled vocabulary created and updated by the United States National Library of Medicine (NLM). The purpose of this vocabulary is to index and search the journal articles, books, journal titles, and non-print materials in NLM's catalogue stored in MEDLINE. Understanding MeSH requires an understanding of its structure. There are three major components composing the Medical Subject Headings: the Headings themselves, the Subheadings (also known as Qualifiers), and the Supplementary Concept Records. Every MEDLINE article indexed with MeSH terms encompasses usually 10 to 15 headings or subheadings. The terms describing the main topics of the articles are designed as major terms and are marked with an asterisk. Most entries come with a short description. These descriptions are written by a team of specialist based most of the time on information extracted from encyclopedia and standard textbooks of the subject area.

References for specific statements in the descriptions are not given; instead, readers are referred to the bibliography. In addition to the main descriptor, MeSH contains a small number of standard qualifiers (also known as subheadings), which can be added to descriptors to narrow down the topic.

For example, "Measles" is a descriptor and "epidemiology" is a qualifier; "Measles/epidemiology"

describes the subheading of epidemiological articles about Measles. The "epidemiology" qualifier can be added to all other disease descriptors. Not all descriptor/qualifier combinations are allowed since some of them may be meaningless. In all, there are 83 different qualifiers.

The hierarchical organization of the terms is a fundamental characteristic of the MeSH vocabulary and can offer interesting advantages for retrieval. The hierarchy is organized as a tree structure possessing actually nine levels of increasing specificity. This structure makes it possible to browse efficiently the vocabulary to search the most appropriate term. It also offers the basis for query extension and the possibility to search for more specific topics. Indeed, by default all the descriptors below the given one in the hierarchy are included in the search. Using a thesaurus such as MeSH or UMLS, descriptors not only organize the document space in terms of semantically related groups, but also enhance search procedures by expanding queries by synonyms, more or less general terms, related terms, etc (Brank, Grobelnik et al. 2003). The 2009 version of MeSH contains 25,186 subject headings, also known as descriptors27

3.2.2.3 UMLS Semantic types .

Semantic types are broad categories used in the Unified Medical Language System (UMLS) meta-thesaurus. Each concept in the Meta-thesaurus is assigned to at least one "Semantic type" (a category), and certain "Semantic relationships" maybe obtained between members of the various Semantic types. The information attached with a Semantic type includes an identifier, definition and examples. The Semantic relationships are mostly hierarchical and associational. All the 135 semantic types (Table 1) and their 54 relationships are regrouped in a catalogue called a Semantic network.

The major semantic types are organisms, anatomical structures, biologic function, chemicals, events, physical objects, and concepts or ideas28

27

. The links among semantic types provide the structure for

28

the network and show important relationships between the groupings and concepts. The primary link between semantic types is the "is a" link, establishing a hierarchy of types and allowing locating the most specific semantic type to assign to a given Metathesaurus concept. The network also has 5 major categories of non-hierarchical (or "associational") relationships. These are "physically related to", "spatially related to", "temporally related to", "functionally related to" and "conceptually related to".

Activity

GPSDB is a database based on the main existing biological resources of gene/protein names. It has been created in the context of BioMinT29, a European project that aimed to develop a generic text-mining tool to assist Swiss-Prot (Boeckmann, Bairoch et al. 2003; Bairoch, Boeckmann et al. 2004 ) and PRINTS30

15 existing resources were used to populate the database, (

(Attwood, Bradley et al. 2003) curators in their protein annotation activity in a semi-automatic way. A similar project called GENA constructed a gene/protein name resource for a dictionary-based name recognition tool (Koike and Takagi 2004). However, this resource is limited to eukaryotic model organisms. On the contrary, the GPSDB database collects gene/protein names from 15 main biological resources covering many different species. It is possible to make queries in GPSDB to obtain information about all genes and proteins contained in the database, through a web interface. By entering a gene/protein name it not only retrieves a list of synonyms but looks also into MEDLINE with a set of user-selected terms to enable the recovery of a maximum of valid publications about a specific gene/protein (Pillet, Zehnder et al. 2005).

Table 2): EntrezGene and Swiss-Prot for multispecies; HUGO and OMIM31 for Human; MGD for Mouse; RGD32 and Ratmap33 for Rat; Flybase34 for Drosophila; SGD35 for Saccharomyces cerevisiae; TAIR for Arabidopsis thaliana; WormBase36 for Caenorhabditis elegans; SubtiList37 for Bacillus subtilis; EcoGene38 for Escherichia coli; dictyBase39

Dictyostelium discoideum and ZFIN for Dario rerio. From each database, specific fields were extracted (official name, symbol name, synonyms, database cross-reference links, species name, entry ID, etc.).

The resulting database contains 2,885,697 different synonyms describing 531,541 protein entities.

The total number of species/subspecies taken into account exceeds 11,000. GPSDB is updated every three months.

Database Covered Species 1 EntrezGene multispecies 2 Swiss-Prot multispecies

3 HUGO Human

4 OMIM Human

5 MGD Mouse

6 RGD Rat

7 Ratmap Rat

8 Flybase Drosophila

9 SGD Saccharomyces cerevisiae

10 TAIR Arabidopsis thaliana 11 WormBase Caenorhabditis elegans 12 SubtiList Bacillus subtilis

13 EcoGene Escherichia coli

14 dictyBase Dictyostelium discoideum

15 ZFIN rerio

Table 2 Databases employed to populate GPSDB

3.3 Methods

The aim of the experiments presented in this chapter is to select the documents that are likely to contain relevant information about protein-protein interactions. In order to identify these documents, a common approach consists to perform a binary classification of the whole collection using a machine-learning algorithm. Performing text classification using machine learning requires to choose the most suitable learning algorithm. Such algorithm learns a classification model using a training set, and then uses this model to classify unseen examples of the test set.

Another key for a successful learning process is the choice of an adapted document representation.

Indeed, most of the machine-learning algorithms learn from a specific representation of the document. Depending on the representation, an algorithm can be more or less efficient to separate the relevant from the non-relevant documents. Usually the documents are represented using the

“bag of words” model. In the following section, we present other representations hoping they can improve the classification performance.

38

39