IV. INTEGRATION FOR TRANSCRIPTOME ANALYSIS

2. IMPLEMENTATION AND DISCUSSION

I. BIOMEKE FOR THE BIOMEDICAL ANNOTATION OF GENES

ARTICLE 1

BioMeKE: a UMLS-based system useful for biomedical annotation of genes

G. Marquet, E. Guérin, O. Loréal and A. Burgun

[Article under revision for publication in the journal Bioinformatics]

Vol. 00 no. 0 2005, pages 1–5 doi:10.1093/bioinformatics/bti283

BIOINFORMATICS

Databases and Ontologies

BioMeKE: a UMLS-based system useful for biomedical annotation of genes

Gwenaëlle Marquet¹*, Emilie Guérin², Olivier Loréal², Anita Burgun¹

1 EA 3888, IFR 140, Université de Rennes 1, Faculté de Médecine - 35043 Rennes Cedex - France

2 INSERM U522, IFR 140, Université de Rennes 1, CHRU Pontchaillou - 35033 Rennes Cedex - France

ABSTRACT

Summary: The Unified Medical Language System (UMLS) is a potential resource for providing associations between genes and medical knowledge, which may complement Gene Ontology (GO) annotation. We present BioMeKE (BioMedical Knowledge Extraction system), a UMLS-based annotation system that exploits the relations present in the UMLS. An evaluation of the system on a set of 43 genes, known to be involved or not in iron metabolism, showed the value of this method for providing associations between genes and medical conditions. BioMeKE is therefore useful for studying biomedical information related to large lists of genes, such as those obtained using high-throughput technologies.

Availability: BioMeKE is freely available via Java Web Start at http://www.med.univ-rennes1.fr/biomeke/

Contact: gwenaelle.marquet@univ-rennes1.fr

Supplementary information: http://www.med.univ-rennes1.fr/biomeke/suppinfo.php

1 INTRODUCTION

Functional annotations of genes and gene-disorder relations play a major role in the analysis of data obtained using high-throughput technologies. Gene Ontology™ (GO) annotation (The Gene Ontology Consortium 2000) represents the molecular functions, biological processes, and cellular components associated with genes and gene products. However, GO annotation does not provide information on the pathologic conditions and disorders that have been associated with genes. The Unified Medical Language System® (UMLS) is a biomedical “ontology” whose coverage includes signs, symptoms and diseases (Bodenreider 2004). Cross-annotation between GO and the UMLS could therefore improve biomedical knowledge. We present BioMeKE (Biological and Medical Knowledge Extractor), a new Java-based application that relies on the UMLS to annotate sets of genes with biomedical concepts.

2 METHODS AND IMPLEMENTATION

The UMLS is made of two major components: the Metathesaurus® (MTH), a repository of 1,179,177 concepts (2005AA release), and the Semantic Network, a limited network of 135 Semantic Types (ST). Each MTH concept is assigned to one or more STs. The MTH is built by merging more than 100 vocabularies, including MeSH¹, GO and Genew terms² (Wain et al. 2004). MTH concepts are related by a set of 22,623,179 relations, including hierarchical relations, associative relations (‘other relations’) and co-occurrences in MEDLINE, with their frequencies.

The UMLS annotation in BioMeKE is performed in two steps.

Mapping gene or gene product names to MTH. The objective is to extract the MTH concepts corresponding to the genes. For each gene, the approved name and symbol, aliases, previous names and symbols of the gene, provided by Genew are successively searched for in the MTH. Filtering relying on five UMLS STs (Gene or Genome; Amino Acid, Peptide or Protein; Nucleic Acid, Nucleoside or Nucleotide; Molecular Function; Disease or Syndrome) is performed to select only the MTH concepts that correspond to genes or gene products.

Searching for MTH concepts to annotate the gene. This step exploits the MTH relations. For a given MTH concept, the annotation process selects the concepts that are related to it through one of the following relations (parent, other relations, and co-occurrence) and that are assigned to at least one of the 22 relevant STs (see supplementary information) that may be of interest for the interpretation of post-genomic data.
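The two steps can be sketched in a few lines of Python. This is only an illustration on toy data, not the BioMeKE code (which is written in Java): the concept identifiers, the `CONCEPTS` and `RELATIONS` tables and the function names are hypothetical, and `TARGET_STS` is a small sample of the 22 semantic types listed in the supplementary information.

```python
# Toy illustration of the two annotation steps; all data below is made up.

# Step 1 filter: the five semantic types (STs) named in the text.
GENE_STS = {
    "Gene or Genome",
    "Amino Acid, Peptide, or Protein",
    "Nucleic Acid, Nucleoside, or Nucleotide",
    "Molecular Function",
    "Disease or Syndrome",
}
# Step 2 filter: assumption, only a sample of the 22 STs of interest.
TARGET_STS = {"Disease or Syndrome", "Neoplastic Process", "Pathologic Function"}
RELATION_TYPES = {"parent", "other", "co-occurrence"}

# Hypothetical miniature Metathesaurus: id -> (names, semantic types).
CONCEPTS = {
    "C1": ({"HFE", "hemochromatosis gene"}, {"Gene or Genome"}),
    "C2": ({"Hemochromatosis"}, {"Disease or Syndrome"}),
    "C3": ({"Liver neoplasms"}, {"Neoplastic Process"}),
    "C4": ({"HFE"}, {"Intellectual Product"}),  # homonym that is not a gene
}
# Hypothetical relations: (source id, relation type, target id).
RELATIONS = [
    ("C1", "other", "C2"),
    ("C1", "co-occurrence", "C3"),
    ("C1", "sibling", "C2"),  # relation type that is not exploited
]


def map_gene(names):
    """Step 1: concepts whose name matches a gene name and whose ST is gene-like."""
    wanted = {n.lower() for n in names}
    return sorted(
        cui
        for cui, (concept_names, sts) in CONCEPTS.items()
        if wanted & {n.lower() for n in concept_names} and sts & GENE_STS
    )


def annotate(cui):
    """Step 2: related concepts reached by an accepted relation and target ST."""
    return sorted(
        {
            target
            for source, rel, target in RELATIONS
            if source == cui
            and rel in RELATION_TYPES
            and CONCEPTS[target][1] & TARGET_STS
        }
    )
```

In this toy setting, `map_gene(["HFE"])` keeps only the gene concept and discards the homonym, and `annotate("C1")` returns the two related concepts whose semantic types are of interest.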

BioMeKE is implemented as a Java Swing application that relies on JTree, JTable and other GUI components. We have wrapped BioMeKE as a Java Web Start application. This technology provides several advantages over standard Java applets or applications: the Java Web Start software is launched automatically the first time the user downloads a Java application that uses this technology, and each time the user starts the application, Java Web Start checks whether a new version of BioMeKE is available on the Web site and, if so, downloads it.

As BioMeKE uses the UMLS for the medical annotation, it requires a UMLS license. This license can be obtained on the UMLS site³. It is free for academic researchers.

1 MeSH is the National Library of Medicine's thesaurus used in MEDLINE.

2 Genew is the HUGO Gene Nomenclature Committee database. It proposes nomenclature conventions for genes and now provides approved gene names and symbols.

3 http://www.nlm.nih.gov/research/umls/license.html


Fig 1: Screenshot of the BioMeKE output showing the UMLS annotation (displayed by semantic types) and the official nomenclature for HFE.

BioMeKE takes as input a list of gene or gene product identifiers. These identifiers may be of different kinds, e.g. LocusLink ID or UniProt ID. The result of the annotation is displayed as a tree structure. Moreover, the UMLS annotation can be grouped according to the UMLS semantic types or to the relationships (Fig 1). For each annotated gene, an XML file is created.
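As a rough illustration of a per-gene XML output, the snippet below serializes one annotation record with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`gene`, `annotation`, `semanticType`, `relation`) are assumptions made for this sketch, since the article does not describe the actual BioMeKE schema, and the relation value is a placeholder.

```python
import xml.etree.ElementTree as ET


def gene_to_xml(symbol, locuslink_id, annotations):
    """Build an XML string for one gene; the element names are hypothetical."""
    gene = ET.Element("gene", {"symbol": symbol, "locuslink": str(locuslink_id)})
    for concept, semantic_type, relation in annotations:
        node = ET.SubElement(
            gene, "annotation", {"semanticType": semantic_type, "relation": relation}
        )
        node.text = concept
    return ET.tostring(gene, encoding="unicode")


# HFE and its LocusLink ID come from the article; the relation is a placeholder.
xml_text = gene_to_xml(
    "HFE",
    3077,
    [("Intestinal Absorption", "Organ or Tissue Function", "placeholder-relation")],
)
```

Keeping the semantic type and the relation as attributes would let a viewer regroup the same file either by semantic type or by relationship, as the tree display does.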

3 ILLUSTRATION AND EVALUATION

Consider the gene HFE (LocusLink: 3077), for which a biomedical annotation was provided by BioMeKE. The UMLS annotations provide biological information complementary to the GO annotations (Table 1), including disorders associated with HFE (Fig 1).

GO annotations:
• MHC class I receptor activity
• protein complex assembly
• transport
• iron ion transport
• iron ion homeostasis
• receptor mediated endocytosis
• immune response
• antigen presentation, endogenous antigen
• antigen processing, endogenous antigen via MHC class I
• cytoplasm
• integral to plasma membrane

UMLS annotations (grouped by semantic type):
• Genetic Function: Genetic Markers; Multifactorial Inheritance
• Neoplastic Process: Bile Duct Neoplasms; Cholangiocarcinoma; Liver neoplasms; Primary carcinoma of the liver cells
• Organ or Tissue Function: Intestinal Absorption
• Pathologic Function: Hyperpigmentation; Insulin Resistance; Tachycardia, Ventricular; Hypertrophy, Right Ventricular

Table 1: GO annotation and examples of complementary UMLS annotation for HFE.

An evaluation was done on a set of 43 genes known to be involved or not in iron metabolism (see supplementary information). All 43 genes were mapped successfully to the MTH. We obtained annotations for 19 genes. The strict overlap between the UMLS annotation provided by BioMeKE and the GO annotation based on SOURCE (Diehn et al. 2003) represents 0.1% of the UMLS annotation and 3.2% of the GO annotation. To evaluate the accuracy of the medical annotations provided by BioMeKE, a manual review of the UMLS annotation was done by an expert involved in research on iron metabolism and iron-related diseases (OL). It showed that the hierarchical and associative relations provide a large amount of information that is complementary to GO and “expected,” i.e. that corresponds to current expert domain knowledge. The UMLS co-occurrences also provide a large percentage of annotation complementary to GO; among those with a frequency ≥ 10, 60.3% gave information that was expected by the expert.
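The "strict overlap" figures can be read as the share of each annotation set that also appears, term for term, in the other set. The sketch below assumes exact, case-insensitive string matching (the article does not specify how terms were compared) and uses made-up term lists; the function names are illustrative.

```python
def strict_overlap(umls_terms, go_terms):
    """Return (% of UMLS terms also in GO, % of GO terms also in UMLS)."""
    umls = {t.strip().lower() for t in umls_terms}
    go = {t.strip().lower() for t in go_terms}
    shared = umls & go
    pct_umls = 100.0 * len(shared) / len(umls) if umls else 0.0
    pct_go = 100.0 * len(shared) / len(go) if go else 0.0
    return pct_umls, pct_go


def frequent_cooccurrences(cooccurrences, min_frequency=10):
    """Keep co-occurring concepts whose frequency reaches the threshold."""
    return [concept for concept, freq in cooccurrences if freq >= min_frequency]


# Made-up example: one shared term out of four UMLS terms and two GO terms.
toy_umls = ["Iron ion transport", "Hemochromatosis", "Liver neoplasms", "Insulin Resistance"]
toy_go = ["iron ion transport", "transport"]
toy_overlap = strict_overlap(toy_umls, toy_go)
```

Because the two sets differ in size, the same shared terms give two different percentages, which is why the article reports one figure relative to the UMLS annotation and another relative to the GO annotation.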

Our approach has been generalized to the Genew database. 79% (18,504) of the 23,398 HGNC identifiers in the March 2005 version of Genew were found in the MTH. Only 3,158 (13%) have annotations in the UMLS. A possible explanation is that we used the 2005AA version of the UMLS, which is the first one containing Genew terms; therefore, not all the Genew concepts have relations with other MTH concepts. 632 genes were provided with annotations corresponding to disorders and/or physiology.
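The reported percentages can be checked directly from the counts. The short calculation below uses only the figures given in the text; the last ratio (annotated genes relative to mapped genes) is not stated in the article and is added here only as a derived quantity.

```python
TOTAL_HGNC = 23_398  # HGNC identifiers in the March 2005 version of Genew
MAPPED = 18_504      # identifiers found in the MTH
ANNOTATED = 3_158    # identifiers with annotations in the UMLS


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to an integer."""
    return round(100 * part / whole)


mapped_pct = percent(MAPPED, TOTAL_HGNC)            # share of Genew found in the MTH
annotated_pct = percent(ANNOTATED, TOTAL_HGNC)      # share of Genew with annotations
annotated_of_mapped_pct = percent(ANNOTATED, MAPPED)  # derived, not in the article
```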

In conclusion, BioMeKE exploits the relations in the MTH and provides concepts that are related to a gene through hierarchical and associative relations, in particular diseases and medical conditions associated with genes. BioMeKE is useful to study biomedical information related to large lists of genes such as those obtained using high throughput technologies.

ACKNOWLEDGEMENTS

This work was supported by grants from the Région Bretagne (20046805, PRIR 139)

REFERENCES

The Gene Ontology Consortium (2000) Gene ontology: tool for the unification of biology. Nature Genet, 25, 25-9.

Bodenreider, O. (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res, 32 Database issue, 267-70.

Diehn, M. et al. (2003) SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Res, 31, 219-223.

Wain, HM. et al. (2004) Genew: The Human Gene Nomenclature Database, 2004 updates. Nucleic Acids Res, 32 Database issue, 255-7.

SUPPLEMENTARY INFORMATION FOR ARTICLE 1

1. List of semantic types
2. UMLS license
3. Evaluation

Excerpt from the website:

http://www.med.univ-rennes1.fr/biomeke/suppinfo.php

BioMeKE

Supplementary information


1 - List of Semantic Types:

The 22 Semantic Types that may be of interest for the interpretation of post-genomic data.

Semantic type: Definition

Acquired Abnormality: An abnormal structure, or one that is abnormal in size or location, found in or deriving from a previously normal structure. Acquired abnormalities are distinguished from diseases even though they may result in pathological functioning (e.g., "hernias incarcerate").

Amino Acid, Peptide, or Protein: Amino acids and chains of amino acids connected by peptide linkages.

Anatomical Structure: A normal or pathological part of the anatomy or structural organization of an organism.

Biologic Function: A state, activity or process of the body or one of its systems or parts.

Cell Function: A physiologic function inherent to cells or cell components.

Cell or Molecular Dysfunction: A pathologic function inherent to cells, parts of cells, or molecules.

Congenital Abnormality: An abnormal structure, or one that is abnormal in size or location, present at birth or evolving over time as a result of a defect in embryogenesis.

Disease or Syndrome: A condition which alters or interferes with a normal process, state, or activity of an organism. It is usually characterized by the abnormal functioning of one or more of the host's systems, parts, or organs. Included here is a complex of symptoms descriptive of a disorder.

Embryonic Structure: An anatomical structure that exists only before the organism is fully formed; in mammals, for example, a structure that exists only prior to the birth of the organism. This structure may be normal or abnormal.

Experimental Model of Disease: A representation in a non-human organism of a human disease for the purpose of research into its mechanism or treatment.

Finding: That which is discovered by direct observation or measurement of an organism attribute or condition, including the clinical history of the patient. The history of the presence of a disease is a 'Finding' and is distinguished from the disease itself.

Gene or Genome: A specific sequence, or in the case of the genome the complete sequence, of nucleotides along a molecule of DNA or RNA (in the case of some viruses) which represent the functional units of heredity.

Genetic Function: Functions of or related to the maintenance, translation or expression of the genetic material.

Injury or Poisoning: A traumatic wound, injury, or poisoning caused by an external agent or force.

Mental or Behavioral Dysfunction: A clinically significant dysfunction whose major manifestation is behavioral or psychological. These dysfunctions may have identified or presumed biological etiologies or manifestations.

Molecular Function: A physiologic function occurring at the molecular level.

Neoplastic Process: A new and abnormal growth of tissue in which the growth is uncontrolled and progressive. The growths may be malignant or benign.

Organ or Tissue Function: A physiologic function of a particular organ, organ system, or tissue.

Pathologic Function: A disordered process, activity, or state of the organism as a whole, of a body system or systems, or of multiple organs or tissues. Included here are normal responses to a negative stimulus as well as pathologic conditions or states that are less specific than a disease. Pathologic functions frequently have systemic effects.

Phenomenon or Process: A process or state which occurs naturally or as a result of an activity.

Population Group: An individual or individuals classified according to their sex, racial origin, religion, common place of living, financial or social status, or some other cultural or behavioral attribute.

Tissue: An aggregation of similarly specialized cells and the associated intercellular substance. Tissues are relatively non-localized in comparison to body parts, organs or organ components.

2 - UMLS license:

BioMeKE uses the UMLS for the medical annotation.

The UMLS license is free for academic researchers.

UMLS license extract:

" This Agreement is made by and between the National Library of Medicine, Department of Health and Human Services (hereinafter referred to as "NLM") and the LICENSEE.

WHEREAS, the NLM was established by statute in order to assist the advancement of medical and related sciences, and to aid the dissemination and exchange of scientific and other information important to the progress of medicine and to the public health, (section 465 of the Public Health Service Act, as amended (42 U.S.C. section 286) and to carry out this purpose has been authorized to develop the Unified Medical Language System® (UMLS) to facilitate the retrieval and integration of machine-readable biomedical information from disparate sources; WHEREAS, the NLM's UMLS project has produced the UMLS Metathesaurus, a machine-readable vocabulary knowledge source, that is useful in a variety of settings; WHEREAS, the LICENSEE is willing to use the UMLS Metathesaurus at its sole risk and at no expense to NLM, which will result in information useful to NLM, may provide immediate improvements in biomedical information transfer to segments of the biomedical community, and is consistent with NLM's statutory functions, NOW THEREFORE, it is mutually agreed as follows:

1. The NLM hereby grants a nonexclusive, non-transferable right to LICENSEE to use the UMLS Metathesaurus and incorporate its content in any computer applications or systems designed to improve access to biomedical information of any type subject to the restrictions in other provisions of this Agreement. The names and addresses of licensees authorized to use the UMLS products are public information.

2. No charges, usage fees or royalties will be paid to NLM."

...UMLS web site

3 - Evaluation:

This evaluation has shown the value of BioMeKE from a biomedical standpoint, especially for the biologist who studies a broad list of genes obtained by a high-throughput technology.

Two types of evaluation were done: a quantitative evaluation and a qualitative evaluation.

The evaluation was done on a set of 43 genes known to be involved or not in iron metabolism.

Each gene has a LocusLink ID that was retrieved via the LocusLink interface.

Mapping

* CUI: Each concept in the Metathesaurus (UMLS) has a unique and permanent concept identifier (CUI).

* Semantic Types:
GG --> Gene or Genome
AAAP --> Amino Acid, Peptide or Protein
MF --> Molecular Function
DS --> Disease or Syndrome

Annotation

In order to evaluate the accuracy of the medical annotations provided by BioMeKE, a manual review of the UMLS annotation was done by an expert involved in research on iron metabolism and iron-related diseases (Olivier Loréal, INSERM U522). Two criteria were used:

• Complementary information: used to determine whether a UMLS annotation was redundant with the GO annotation or complementary to it. A UMLS annotation is regarded as complementary to GO when the expert considers that it provides new information. For example, the GO annotations for EPOR are "erythropoietin receptor activity", "signal transduction" and "integral to plasma membrane", and among the UMLS annotations we find "Hematopoiesis". This annotation is judged not complementary to GO.

• Expected information: used to determine whether a UMLS annotation was expected or not expected. This criterion was evaluated only on the annotations judged complementary under the first criterion. An expected annotation corresponds to a relation between the gene and the UMLS concept that is valid from the expert's standpoint. For example, for the gene EPOR, 'Kidney Failure, Chronic' is judged expected by the expert and 'Epilepsy, Temporal lobe' is judged not expected.

Examples of UMLS annotations reviewed by the expert:

Gene EPOR (LocusLink ID 2057)

GO annotation: erythropoietin receptor activity; signal transduction; integral to plasma membrane

UMLS annotation                    Complementary to GO    Expected
Erythropoietin receptor            no                     yes
Anemia, Sickle cell                yes                    yes
Kidney Failure, Chronic            yes                    yes
Endometriosis, site unspecified    yes                    no
Epilepsy, Temporal lobe            yes                    no
Cytokine Receptor Gene             yes                    no
Leukemia, Erythroblastic, Acute    yes                    yes
Dysmyelopoietic Syndromes          yes                    yes
Hematopoiesis                      no                     yes
Bone Marrow                        yes                    yes
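The two criteria can be tallied mechanically from the EPOR review. The sketch below encodes the ten EPOR rows of the expert review as tuples and counts, first, the annotations judged complementary to GO and, second, the expected annotations among the complementary ones; the function name and data layout are illustrative choices, not part of BioMeKE.

```python
# (UMLS annotation, complementary to GO, expected) for EPOR, from the expert review.
EPOR_REVIEW = [
    ("Erythropoietin receptor", False, True),
    ("Anemia, Sickle cell", True, True),
    ("Kidney Failure, Chronic", True, True),
    ("Endometriosis, site unspecified", True, False),
    ("Epilepsy, Temporal lobe", True, False),
    ("Cytokine Receptor Gene", True, False),
    ("Leukemia, Erythroblastic, Acute", True, True),
    ("Dysmyelopoietic Syndromes", True, True),
    ("Hematopoiesis", False, True),
    ("Bone Marrow", True, True),
]


def summarize(review):
    """Count complementary annotations, then expected ones among them."""
    complementary = [row for row in review if row[1]]
    expected = [row for row in complementary if row[2]]
    return {
        "total": len(review),
        "complementary": len(complementary),
        "expected_among_complementary": len(expected),
    }


epor_summary = summarize(EPOR_REVIEW)
```

For this single gene, 8 of the 10 annotations are complementary to GO and 5 of those 8 are expected; the percentages reported in the article aggregate such counts over all reviewed genes.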

Gene TF (LocusLink ID 7018)

GO annotation: ferric iron binding; transport; iron ion transport; iron ion homeostasis

UMLS annotation                            Complementary to GO    Expected
Serum, Urine and Miscellaneous Proteins    yes                    yes
Oxidative Stress                           yes                    no
Hemochromatosis                            yes                    yes
Alzheimer's Disease                        yes                    yes
Staphylococcal Infections                  yes                    no
Major Histocompatibility Complex           yes                    yes
Alternative Splicing                       yes                    yes
Alcohol-Related Disorders                  yes                    yes
iron metabolism                            no                     yes
Sertoli cell Tumor                         yes                    no
Primary carcinoma of the liver cells       yes                    yes
Liver neoplasms                            yes                    yes

The annotation files can be downloaded from the supplementary information web page.

Graphical representation of the results of the manual evaluation

For each relation type, the figure shows the percentage of UMLS annotations that were complementary or not to the GO annotation (disk) and, within the complementary annotations, those that were expected or not expected by the expert (bar). The purple part of the disk represents the UMLS annotations that are complementary to the GO annotation, whereas the yellow part indicates UMLS annotations that give no complementary information. The expected annotations were calculated on the complementary annotations only. The hatched part of the bar represents expected annotations and the white part represents annotations that were not expected.

THE GEDAW WAREHOUSE

II. DATA INTEGRATION IN THE GEDAW WAREHOUSE

1. INTRODUCTION

Arguing that the biological interpretation of data generated by DNA microarrays requires enriching expression data through information integration, and that the data warehouse approach is suited to the large-scale analysis of expression data, we developed GEDAW.

GEDAW is an object-oriented data warehouse dedicated to the analysis of data produced by the study of the liver transcriptome. It integrates expression data enriched from sources and standards in the fields of genomics, biology and medicine.

We focused on structured and semi-structured sources and standards, to achieve strong and systematic integration within a global schema that gathers the instances coming from the various integrated sources.

2. IMPLEMENTATION AND DISCUSSION

Architecture

The GEDAW data schema is divided into three parts corresponding to the types of data integrated: 1) experimental data, that is, gene expression measurements as a function of experimental conditions; 2) annotations of the genes studied (sequence of the gene, of the mRNA and of the protein, together with their annotations); and 3) biomedical annotations.

Data sources

The data sources used to populate the warehouse are either local or distributed over the Web, each with its own representation system. They were chosen for their content and structure, so that the entities of interest can be extracted efficiently. The data sources are the following:

A relational database as the source of experimental data. A database was developed in the laboratory to manage the data produced by DNA microarray technology. It complies with the MIAME standards. This database was designed outside the GEDAW warehouse so as not to overload it with experimental details. Only the normalized ratios and the experiment labels are exported to GEDAW for further analysis.

GenBank as the source of genomic data. Records from the GenBank database in XML format are used to integrate genomic data into GEDAW.

The GO and UMLS ontologies as sources of biomedical data. GO and the UMLS provide, respectively, the functional annotation of the genes studied and the biomedical knowledge about them. The BioMeKE application, presented above, delivers this dual annotation. It outputs, in XML format, the GO terms and UMLS concepts associated with a list of genes.

Schema and integration process

A single object-oriented schema brings together all the experimental, genomic and biomedical information around three central elements: the gene, the mRNA and the protein. Java is used to describe and instantiate the classes, and the object database management system (ODBMS) FastObjects is used to persist them.

Because the selected data sources are structured or semi-structured, we were able to define, during the integration process, mapping rules that ensure both the correspondence between the source schemas and the GEDAW schema and the reconciliation of the data. Structural rules, acting at the schema level, select, extract and integrate the elements or concepts of GenBank, GO and the UMLS. Semantic rules, acting at the instance level, reconcile gene nomenclature: the GeneID identifier and the gene name synonyms provided by BioMeKE are used to group in GEDAW the data associated with the same gene.
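A semantic rule of this kind can be pictured as a lookup that brings every incoming record back to a single gene identifier. The Python sketch below is a simplified illustration, not the GEDAW implementation (which is written in Java): the record layout, the `SYNONYM_TO_GENE_ID` table and the function name are hypothetical, and the synonym entries are illustrative values only.

```python
from collections import defaultdict

# Hypothetical synonym table of the kind BioMeKE could supply: name -> GeneID.
# The entries are illustrative, not taken from GEDAW.
SYNONYM_TO_GENE_ID = {"hfe": 3077, "hla-h": 3077, "tf": 7018, "transferrin": 7018}


def reconcile(records):
    """Group source records under one GeneID; return (groups, unresolved)."""
    grouped = defaultdict(list)
    unresolved = []
    for record in records:
        # Prefer an explicit identifier, then fall back on the gene name.
        gene_id = record.get("gene_id")
        if gene_id is None:
            gene_id = SYNONYM_TO_GENE_ID.get(record.get("name", "").lower())
        if gene_id is None:
            unresolved.append(record)
        else:
            grouped[gene_id].append(record)
    return dict(grouped), unresolved


sample_records = [
    {"source": "GenBank", "gene_id": 3077},
    {"source": "UMLS", "name": "HLA-H"},
    {"source": "GO", "name": "not a known synonym"},
]
groups, leftovers = reconcile(sample_records)
```

Records that cannot be resolved are returned separately rather than dropped, so that they can be inspected instead of being silently attached to the wrong gene.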

Integration into GEDAW begins with loading the identifiers of the genes represented on the microarray. The expression measurements and the genomic, biological and medical data are then selected, transformed and integrated into GEDAW.

Finally, the user accesses the integrated and reconciled information through a Java interface.

The interface makes it possible to compose multi-criteria OQL queries that relate diverse data not previously compared, thus opening the way to new hypotheses.

Source: thesis presented to the Université de Rennes 1 for the degree of Doctor of the Université de Rennes 1, by Emilie GUÉRIN (pages 90-186).