
Spotlight on GO and UMLS

From the document Data Mining in Biomedicine Using Ontologies (pages 27-30)

Within the domain of biology and medicine, the two resources of GO and UMLS have arguably had the greatest impact [31]. As an introduction to where they are referenced in the other chapters in this book, we put a spotlight on the key aspects of these resources.


1.5.1 The Gene Ontology

At the time of its conception, the need for GO was powerful and straightforward: different molecular-biology databases were using different terms to describe important information about gene products. This heterogeneity was a barrier to the integration of the data held in these databases. The desire for such integration was driven by the advent of the first model organism genome sequences, which provided the possibility of performing large-scale comparative genomic studies. GO was revolutionary within bioinformatics because it provided a controlled vocabulary that could be used to annotate database entries. After a significant amount of investment and success, GO is now widely used. The usage of GO has expanded since its use for the three original genome database members of the consortium, and it has now been adopted by over 40 species-specific databases. Of particular note is the Gene Ontology Annotation (GOA) Project, which aims to ensure widespread annotation of UniProtKB entries with GO annotations [32]. This resource currently contains over 32 million GO annotations to more than 4.3 million proteins, through a combination of manual and automatic annotation methods.

GO is actually a set of three distinct vocabularies containing terms that describe three important aspects of gene products. The molecular function vocabulary includes terms that are used to describe the various elemental activities of a gene product. The biological process vocabulary includes terms that are used to describe the broader biological processes in which gene products can be involved and that are usually achieved through a combination of molecular functions. Finally, the cellular component vocabulary contains terms that describe the various locations in a cell with which gene products may be associated. For example, a gene product that acts as a transcription factor involved in cell cycle regulation may be annotated with the molecular functions of DNA binding and transcription factor activity, the biological processes of transcription and G1/S transition of mitotic cell cycle, and the cellular location of nucleus. In this case, these terms are independent of species, and so gene products annotated with these terms could be extracted from many different species-specific databases to facilitate comparative analysis in an investigation into cell cycle regulation. GO does contain terms that are not applicable to all species, but these are derived from the need for terms that describe aspects that are particular to some organisms; for example, no human gene products would be annotated with cell wall.
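The transcription factor example can be made concrete with a small sketch. It assumes a simplified in-memory record for an annotation; the `Annotation` class, the gene product identifiers, and the species labels are hypothetical and are not part of GO or of any GO tool. The sketch shows how species-independent terms allow gene products to be pulled from several sources by a single term.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical, simplified GO annotation record."""
    gene_product: str  # made-up identifier
    species: str
    aspect: str        # one of the three GO vocabularies
    term: str


# Made-up records modelled on the transcription factor example in the text.
annotations = [
    Annotation("tf1_human", "human", "molecular_function", "DNA binding"),
    Annotation("tf1_human", "human", "biological_process",
               "G1/S transition of mitotic cell cycle"),
    Annotation("tf1_human", "human", "cellular_component", "nucleus"),
    Annotation("tf1_yeast", "yeast", "biological_process",
               "G1/S transition of mitotic cell cycle"),
    Annotation("wall1_yeast", "yeast", "cellular_component", "cell wall"),
]


def products_with_term(records, term):
    """Group gene products annotated with `term` by species."""
    by_species = {}
    for record in records:
        if record.term == term:
            by_species.setdefault(record.species, set()).add(record.gene_product)
    # Sort for a stable, readable result.
    return {species: sorted(products) for species, products in by_species.items()}


cell_cycle = products_with_term(annotations, "G1/S transition of mitotic cell cycle")
```

Here `cell_cycle` groups the matching gene products under both species, whereas a query for cell wall would return only the yeast entry.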

The process of annotating a gene product is the specification of an assertion about that gene product. Because of this, GO annotations cannot be made without some sort of evidence as to the source of the assertion. For this, GO also has evidence codes that can be associated with any annotation. There are two broad categories of evidence codes that distinguish between whether the annotation was made based on evidence that was derived from direct experimentation, such as a laboratory assay, or whether it was from indirect evidence, such as a computational experiment or a statement by an author in which the evidence is unclear. Annotations should always include citations of their sources. When annotations are being used for data mining, the type of evidence can be an important discriminatory factor.
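As a minimal sketch of how evidence type might be used as a discriminatory factor, the following partitions annotations into the two broad categories described above. The tuple layout, the example records, and the placeholder references are hypothetical. The set of experimental codes is an assumption based on the commonly documented GO evidence codes and should be checked against the current GO documentation before use.

```python
# Assumed set of experimental evidence codes; verify against the GO documentation.
EXPERIMENTAL_CODES = frozenset({"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"})


def split_by_evidence(annotations):
    """Partition (gene_product, term, evidence_code, reference) tuples.

    Returns (direct, indirect): annotations backed by experimental evidence,
    and all others (e.g. computational or author-statement evidence).
    """
    direct, indirect = [], []
    for annotation in annotations:
        code = annotation[2]
        if code in EXPERIMENTAL_CODES:
            direct.append(annotation)
        else:
            indirect.append(annotation)
    return direct, indirect


# Hypothetical records; the references are placeholders, not real citations.
example = [
    ("geneA", "DNA binding", "IDA", "ref-placeholder-1"),
    ("geneA", "nucleus", "IEA", "ref-placeholder-2"),
    ("geneB", "transcription", "NAS", "ref-placeholder-3"),
]
direct, indirect = split_by_evidence(example)
```

A mining pipeline could then run its analysis on `direct` alone, or weight the two groups differently.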

As of March 2009, the GO Web site states that GO includes more than 16,000 biological process terms, over 2,300 cellular component terms, and over 8,500 molecular function terms. The curation (i.e., term validation) process means that almost all of these terms have a human-readable definition, which is important for getting more accurate annotations from the process. These terms may also have other relevant information, such as synonymous terms and cross references to other databases.

As the number of databases and data from different species and biological domains increases, so does the demand for more specific terms with which gene products can be annotated. The GO consortium organizes interest groups for specific domains that are intended to extend and improve the terms in the ontology.

The terms in the ontologies are curated by a dedicated team, but anybody can request modifications and improvements, and so there is a strong sense of community development. The style of terms in the Gene Ontology is highly consistent [33]. Nearly all of the terms in the GO biological process vocabulary that concern the metabolism of chemicals follow the structure “<chemical> metabolism | biosynthesis | catabolism.” Such a structure aids both the readability and the computational manipulation of the set of labels in the ontology [33, 34].
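To illustrate why such a regular label structure helps computational manipulation, here is a minimal sketch that splits a label into its chemical and process parts with a regular expression. The function name and the example labels are illustrative assumptions; the pattern follows the structure quoted above and may not match the exact wording of labels in a given GO release.

```python
import re

# Pattern for labels of the form "<chemical> metabolism | biosynthesis | catabolism".
METABOLIC_LABEL = re.compile(
    r"^(?P<chemical>.+) (?P<process>metabolism|biosynthesis|catabolism)$"
)


def parse_metabolic_label(label):
    """Return (chemical, process) if the label fits the pattern, else None."""
    match = METABOLIC_LABEL.match(label)
    if match is None:
        return None
    return match.group("chemical"), match.group("process")


parsed = [
    parse_metabolic_label(label)
    for label in ("glucose metabolism", "fatty acid biosynthesis", "transcription")
]
```

Labels that do not fit the pattern, such as "transcription", yield `None`, so the same function can also be used to flag terms that depart from the convention.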

In data mining, GO is now widely used in a variety of ways to provide a functional perspective on the analysis of molecular biological data. The analysis of microarray results through analyzing the over-representation of GO terms within the differentially represented genes (e.g., [35, 36]) is a common usage. Other important examples include the functional interpretation of gene expression data and the prediction of gene function through similarity. The controlled vocabulary specified by GO also has useful applications in text mining. Specific examples of these and other uses are detailed in Chapters 5, 6, and 7.
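Over-representation of a GO term is often assessed with a one-sided hypergeometric test, although the text does not say which statistic the cited studies use, so the choice here is an assumption. The sketch below computes the probability of seeing at least `k` genes annotated with a term in a selected set of `n` genes, given that `K` of the `N` genes on the array carry that term. The function name and parameter names are illustrative.

```python
from math import comb


def enrichment_p_value(N, K, n, k):
    """One-sided hypergeometric p-value P(X >= k).

    N: total genes considered (e.g. on the array)
    K: genes among the N annotated with the GO term
    n: genes in the selected (e.g. differentially represented) set
    k: genes in the selected set annotated with the GO term
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(K, n)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)
    # Sum the hypergeometric probabilities for k, k+1, ..., min(K, n).
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


# Hypothetical counts: 10 genes, 5 annotated, 5 selected, all 5 annotated.
p = enrichment_p_value(10, 5, 5, 5)
```

In practice many terms are tested at once, so a multiple-testing correction would normally be applied to the resulting p-values; that step is omitted here.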

1.5.2 The Unified Medical Language System

As mentioned previously, many biomedical vocabularies have evolved independently and have had virtually no coordinated development. This has led to much overlap and incompatibility between them, and integrating them is a significant challenge.

The Unified Medical Language System (UMLS) addresses this challenge, and has been a repository of biomedical vocabularies developed by the U.S. National Library of Medicine for over 20 years [37, 38]. UMLS comprises three knowledge sources: the UMLS Metathesaurus, the Semantic Network, and the SPECIALIST Lexicon. Together, they seek to provide a set of resources that can aid in the analysis of text in the biomedical domain, from health records to research papers. By coordinating a wide range of vocabularies with lexical information, UMLS seeks to provide a language-oriented knowledge resource.

The Metathesaurus integrates over 100 vocabularies from a diverse set of biomedical fields, including diagnoses, procedures, clinical observations, signs and symptoms, drugs, diseases, anatomy, and genes. Notable resources include SNOMED CT, GO, MeSH, NCI Thesaurus, OMIM, HL7, and ICD. The Metathesaurus is a set of biomedical and health-related concepts that are referred to by this diverse set of vocabularies, using different terms. A UMLS concept is something in biomedicine that has common meaning [39]. UMLS does not seek to develop its own
