IV. INTEGRATION FOR TRANSCRIPTOME ANALYSIS
2. IMPLEMENTATION AND DISCUSSION
I. BIOMEKE FOR THE BIOMEDICAL ANNOTATION OF GENES
ARTICLE 1
BioMeKE: a UMLS-based system useful for biomedical annotation of genes
G. Marquet, E. Guérin, O. Loréal and A. Burgun
[Article under revision for publication in the journal Bioinformatics]
Vol. 00 no. 0 2005, pages 1–5 doi:10.1093/bioinformatics/bti283
BIOINFORMATICS
Databases and Ontologies
BioMeKE: a UMLS-based system useful for biomedical annotation of genes
Gwenaëlle Marquet 1*, Emilie Guérin 2, Olivier Loréal 2, Anita Burgun 1
1 EA 3888, IFR 140, Université de Rennes 1, Faculté de Médecine - 35043 Rennes Cedex - France
2 INSERM U522, IFR 140, Université de Rennes 1, CHRU Pontchaillou - 35033 Rennes Cedex - France
ABSTRACT
Summary: The Unified Medical Language System (UMLS) is a potential resource for associating genes with medical knowledge, and may thus complement Gene Ontology (GO) annotation. We present BioMeKE (BioMedical Knowledge Extraction system), a UMLS-based annotation system that exploits the relations present in the UMLS. An evaluation on a set of 43 genes, known either to be involved or not to be involved in iron metabolism, showed the value of this method for providing associations between genes and medical conditions. BioMeKE is useful for studying biomedical information related to large lists of genes, such as those obtained using high-throughput technologies.
Availability: BioMeKE is freely available via Java Web Start at http://www.med.univ-rennes1.fr/biomeke/
Contact: gwenaelle.marquet@univ-rennes1.fr
Supplementary information: http://www.med.univ-rennes1.fr/biomeke/suppinfo.php
1 INTRODUCTION
Functional annotations of genes, as well as gene-disorder relations, play a major role in analyzing data obtained using high-throughput technologies. Gene Ontology™ (GO) annotation (The Gene Ontology Consortium 2000) represents the molecular functions, biological processes, and cellular components associated with genes and gene products. GO annotation does not, however, provide information on the pathologic conditions and disorders that have been associated with genes. The Unified Medical Language System® (UMLS) is a biomedical “ontology” whose coverage includes signs, symptoms and diseases (Bodenreider 2004). Cross-annotation between GO and the UMLS could therefore improve biomedical knowledge. We present BioMeKE (Biological and Medical Knowledge Extractor), a new Java-based application that relies on the UMLS to annotate sets of genes with biomedical concepts.
2 METHODS AND IMPLEMENTATION
The UMLS is made up of two major components: the Metathesaurus® (MTH), a repository of 1,179,177 concepts (2005AA release), and the Semantic Network, a limited network of 135 Semantic Types (STs). Each MTH concept is assigned to one or more STs. The MTH is built by merging more than 100 vocabularies, including MeSH¹, GO and Genew² terms (Wain et al. 2004). MTH concepts are linked by a set of 22,623,179 relations, including hierarchical relations, associative relations (‘other relations’) and co-occurrences in MEDLINE with their frequencies.
The UMLS annotation in BioMeKE is performed in two steps.
Mapping gene or gene product names to the MTH. The objective is to extract the MTH concepts corresponding to the genes. For each gene, the approved name and symbol, the aliases, and the previous names and symbols provided by Genew are searched for in turn in the MTH. The results are then filtered on five UMLS STs (Gene or Genome; Amino Acid, Peptide, or Protein; Nucleic Acid, Nucleoside, or Nucleotide; Molecular Function; Disease or Syndrome) so as to retain only the MTH concepts that correspond to genes or gene products.
Searching for MTH concepts to annotate the gene. This step exploits the MTH relations. For a given MTH concept, the annotation process selects the concepts that are related to it through one of the following relations (parent, other relations, co-occurrence) and that are assigned to at least one of the 22 STs (see supplementary information) judged relevant to the interpretation of post-genomic data.
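The two steps can be pictured with a minimal Python sketch. It is not BioMeKE's code (the application itself is written in Java): the `Concept` class, the in-memory dictionary standing in for the MTH, the relation labels and the `TOY…` identifiers are all assumptions made for illustration. Only the five filtering STs and the three relation kinds come from the text.

```python
from dataclasses import dataclass, field

# Semantic types used in step 1 to keep gene / gene-product concepts (from the text).
GENE_STS = {
    "Gene or Genome",
    "Amino Acid, Peptide, or Protein",
    "Nucleic Acid, Nucleoside, or Nucleotide",
    "Molecular Function",
    "Disease or Syndrome",
}
# Relation kinds followed in step 2 (from the text; the labels are hypothetical).
RELATIONS = {"parent", "other", "co-occurrence"}


@dataclass
class Concept:
    """Hypothetical in-memory stand-in for one Metathesaurus concept."""
    cui: str
    names: set
    semantic_types: set
    relations: list = field(default_factory=list)  # (relation kind, target CUI)


def map_gene(gene_names, concepts):
    """Step 1: concepts matching any gene name and passing the ST filter."""
    wanted = {name.lower() for name in gene_names}
    return [
        c for c in concepts.values()
        if wanted & {n.lower() for n in c.names} and c.semantic_types & GENE_STS
    ]


def annotate(concept, concepts, relevant_sts):
    """Step 2: related concepts reached by an allowed relation and a relevant ST."""
    hits = []
    for kind, target in concept.relations:
        other = concepts.get(target)
        if kind in RELATIONS and other is not None and other.semantic_types & relevant_sts:
            hits.append((kind, other.cui))
    return hits


# Toy data: identifiers and relations are invented, not taken from the UMLS.
TOY_CONCEPTS = {
    "TOY1": Concept("TOY1", {"HFE"}, {"Gene or Genome"},
                    [("co-occurrence", "TOY2"), ("sibling", "TOY2"), ("parent", "TOY3")]),
    "TOY2": Concept("TOY2", {"Insulin Resistance"}, {"Pathologic Function"}),
    "TOY3": Concept("TOY3", {"Some substance"}, {"Pharmacologic Substance"}),
}
```

With this toy data, `map_gene(["hfe"], TOY_CONCEPTS)` returns the `TOY1` concept, and `annotate` called on it with `{"Pathologic Function"}` as the relevant STs keeps only the co-occurrence link to `TOY2`.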
BioMeKE is implemented as a Java Swing application that relies on JTree, JTable and other GUI components. We have packaged BioMeKE as a Java Web Start application. This technology offers several advantages over standard Java applets or applications: the Java Web Start software is launched automatically the first time the user downloads a Java application that uses it, and each time the user starts the application, Java Web Start checks whether a new version of BioMeKE is available on the Web site and, if so, downloads it.
Because BioMeKE uses the UMLS for medical annotation, it requires a UMLS license. This license can be obtained from the UMLS site³ and is free for academic researchers.
1 MeSH is the National Library of Medicine's thesaurus used in MEDLINE.
2 Genew is the HUGO Gene Nomenclature Committee database. It proposes nomenclature conventions for genes and now provides approved gene names and symbols.
3 http://www.nlm.nih.gov/research/umls/license.html
Fig 1: Screenshot of the BioMeKE output showing the UMLS annotation (displayed by semantic type) and the official nomenclature for HFE.
BioMeKE takes as input a list of gene or gene product identifiers. These identifiers may be of different kinds, e.g. LocusLink ID or UniProt ID. The result of the annotation is displayed as a tree structure, and the UMLS annotation can be grouped either by UMLS semantic type or by relationship (Fig 1). For each annotated gene, an XML file is created.
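The layout of the per-gene XML file is not described here, so the following Python sketch only illustrates the idea of serializing one gene's annotation grouped by semantic type. The element and attribute names (`gene`, `semanticType`, `concept`) are hypothetical and should not be read as BioMeKE's actual format.

```python
import xml.etree.ElementTree as ET


def gene_annotation_xml(gene_id, symbol, annotations):
    """Serialize one gene's annotations; element names are hypothetical."""
    root = ET.Element("gene", {"id": str(gene_id), "symbol": symbol})
    for semantic_type, concept_names in sorted(annotations.items()):
        group = ET.SubElement(root, "semanticType", {"name": semantic_type})
        for name in concept_names:
            ET.SubElement(group, "concept").text = name
    return ET.tostring(root, encoding="unicode")


# One HFE annotation taken from Table 1, grouped under its semantic type.
xml_text = gene_annotation_xml(
    3077, "HFE", {"Organ or Tissue Function": ["Intestinal Absorption"]}
)
```

Grouping by semantic type mirrors one of the two display modes mentioned above; grouping by relationship would only change the key of the `annotations` dictionary.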
3 ILLUSTRATION AND EVALUATION
Consider the gene HFE (LocusLink: 3077), for which BioMeKE provided a biomedical annotation. The UMLS annotations supply biological information complementary to the GO annotations (Table 1), including disorders associated with HFE (Fig 1).
Table 1: GO annotation and examples of complementary UMLS annotation for HFE.
GO annotations:
• MHC class I receptor activity
• protein complex assembly
• transport
• iron ion transport
• iron ion homeostasis
• receptor mediated endocytosis
• immune response
• antigen presentation, endogenous antigen
• antigen processing, endogenous antigen via MHC class I
• cytoplasm
• integral to plasma membrane
UMLS annotations (by semantic type):
• Genetic Function: Genetic Markers; Multifactorial Inheritance
• Neoplastic Process: Bile Duct Neoplasms; Cholangiocarcinoma; Liver neoplasms; Primary carcinoma of the liver cells
• Organ or Tissue Function: Intestinal Absorption
• Pathologic Function: Hyperpigmentation; Insulin Resistance; Tachycardia, Ventricular; Hypertrophy, Right Ventricular
An evaluation was carried out on a set of 43 genes known either to be involved or not to be involved in iron metabolism (see supplementary information). All 43 genes were successfully mapped to the MTH, and annotations were obtained for 19 of them. The strict overlap between the UMLS annotation provided by BioMeKE and the GO annotation based on SOURCE (Diehn et al. 2003) represents 0.1% of the UMLS annotation and 3.2% of the GO annotation. To evaluate the accuracy of the medical annotations provided by BioMeKE, the UMLS annotation was manually reviewed by an expert in iron metabolism and iron-related diseases (OL). The review showed that the hierarchical and associative relations provide a large amount of information that is complementary to GO and “expected”, i.e. consistent with current expert domain knowledge. The UMLS co-occurrences also yield a large percentage of annotation complementary to GO; among those with a frequency ≥ 10, 60.3% gave information that the expert expected.
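How the strict overlap was computed is not detailed. One plausible reading, sketched below in Python, treats it as the set of terms that are identical in both annotations after case and whitespace normalization, expressed as a percentage of each side. The function name and the toy term lists are assumptions; the 0.1% and 3.2% figures above come from the real evaluation, not from this sketch.

```python
def normalize(term):
    """Lower-case a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def strict_overlap(umls_terms, go_terms):
    """Return (shared terms, % of the UMLS annotation, % of the GO annotation)."""
    umls = {normalize(t) for t in umls_terms}
    go = {normalize(t) for t in go_terms}
    shared = umls & go
    pct_umls = 100.0 * len(shared) / len(umls) if umls else 0.0
    pct_go = 100.0 * len(shared) / len(go) if go else 0.0
    return shared, pct_umls, pct_go


# Toy input, invented for illustration only.
toy_umls = ["Transport", "Hyperpigmentation", "Insulin Resistance", "Intestinal Absorption"]
toy_go = ["transport", "iron ion transport"]
shared, pct_umls, pct_go = strict_overlap(toy_umls, toy_go)
```

On the toy input one term is shared, which is a quarter of the UMLS side and half of the GO side. The asymmetry between the two percentages is the same effect that separates the 0.1% and 3.2% figures reported in the evaluation.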
Our approach has been generalized to the Genew database. Of the 23,398 HGNC identifiers in the March 2005 version of Genew, 18,504 (79%) were found in the MTH, but only 3,158 (13%) have annotations in the UMLS. A possible explanation is that we used the 2005AA version of the UMLS, which is the first to contain Genew terms; not all Genew concepts yet have relations with other MTH concepts. Annotations corresponding to disorders and/or physiology were obtained for 632 genes.
In conclusion, BioMeKE exploits the relations in the MTH and provides concepts that are related to a gene through hierarchical and associative relations, in particular the diseases and medical conditions associated with genes. BioMeKE is useful for studying biomedical information related to large lists of genes, such as those obtained using high-throughput technologies.
ACKNOWLEDGEMENTS
This work was supported by grants from the Région Bretagne (20046805, PRIR 139).
REFERENCES
The Gene Ontology Consortium (2000) Gene ontology: tool for the unification of biology. Nature Genet, 25, 25-9.
Bodenreider, O. (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res, 32 Database issue, 267-70.
Diehn, M. et al. (2003) SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Res, 31, 219-223.
Wain, H.M. et al. (2004) Genew: The Human Gene Nomenclature Database, 2004 updates. Nucleic Acids Res, 32 Database issue, 255-7.
SUPPLEMENTARY INFORMATION ON ARTICLE 1
1. List of semantic types
2. UMLS license
3. Evaluation
Excerpt from the Web site:
http://www.med.univ-rennes1.fr/biomeke/suppinfo.php
BioMeKE
Supplementary information
1 - List of Semantic Types:
The 22 Semantic Types that may be of interest for the interpretation of post-genomic data.
Semantic type: Definition
Acquired Abnormality: An abnormal structure, or one that is abnormal in size or location, found in or deriving from a previously normal structure. Acquired abnormalities are distinguished from diseases even though they may result in pathological functioning (e.g., "hernias incarcerate").
Amino Acid, Peptide, or Protein: Amino acids and chains of amino acids connected by peptide linkages.
Anatomical Structure: A normal or pathological part of the anatomy or structural organization of an organism.
Biologic Function: A state, activity or process of the body or one of its systems or parts.
Cell Function: A physiologic function inherent to cells or cell components.
Cell or Molecular Dysfunction: A pathologic function inherent to cells, parts of cells, or molecules.
Congenital Abnormality: An abnormal structure, or one that is abnormal in size or location, present at birth or evolving over time as a result of a defect in embryogenesis.
Disease or Syndrome: A condition which alters or interferes with a normal process, state, or activity of an organism. It is usually characterized by the abnormal functioning of one or more of the host's systems, parts, or organs. Included here is a complex of symptoms descriptive of a disorder.
Embryonic Structure: An anatomical structure that exists only before the organism is fully formed; in mammals, for example, a structure that exists only prior to the birth of the organism. This structure may be normal or abnormal.
Experimental Model of Disease: A representation in a non-human organism of a human disease for the purpose of research into its mechanism or treatment.
Finding: That which is discovered by direct observation or measurement of an organism attribute or condition, including the clinical history of the patient. The history of the presence of a disease is a 'Finding' and is distinguished from the disease itself.
Gene or Genome: A specific sequence, or in the case of the genome the complete sequence, of nucleotides along a molecule of DNA or RNA (in the case of some viruses) which represent the functional units of heredity.
Genetic Function: Functions of or related to the maintenance, translation or expression of the genetic material.
Injury or Poisoning: A traumatic wound, injury, or poisoning caused by an external agent or force.
Mental or Behavioral Dysfunction: A clinically significant dysfunction whose major manifestation is behavioral or psychological. These dysfunctions may have identified or presumed biological etiologies or manifestations.
Molecular Function: A physiologic function occurring at the molecular level.
Neoplastic Process: A new and abnormal growth of tissue in which the growth is uncontrolled and progressive. The growths may be malignant or benign.
Organ or Tissue Function: A physiologic function of a particular organ, organ system, or tissue.
Pathologic Function: A disordered process, activity, or state of the organism as a whole, of a body system or systems, or of multiple organs or tissues. Included here are normal responses to a negative stimulus as well as pathologic conditions or states that are less specific than a disease. Pathologic functions frequently have systemic effects.
Phenomenon or Process: A process or state which occurs naturally or as a result of an activity.
Population Group: An individual or individuals classified according to their sex, racial origin, religion, common place of living, financial or social status, or some other cultural or behavioral attribute.
Tissue: An aggregation of similarly specialized cells and the associated intercellular substance. Tissues are relatively non-localized in comparison to body parts, organs or organ components.
2 - UMLS license:
BioMeKE uses the UMLS for medical annotation.
The UMLS license is free for academic researchers.
UMLS license extract:
"This Agreement is made by and between the National Library of Medicine, Department of Health and Human Services (hereinafter referred to as "NLM") and the LICENSEE.
WHEREAS, the NLM was established by statute in order to assist the advancement of medical and related sciences, and to aid the dissemination and exchange of scientific and other information important to the progress of medicine and to the public health, (section 465 of the Public Health Service Act, as amended (42 U.S.C. section 286) and to carry out this purpose has been authorized to develop the Unified Medical Language System® (UMLS) to facilitate the retrieval and integration of machine-readable biomedical information from disparate sources; WHEREAS, the NLM's UMLS project has produced the UMLS Metathesaurus, a machine-readable vocabulary knowledge source, that is useful in a variety of settings; WHEREAS, the LICENSEE is willing to use the UMLS Metathesaurus at its sole risk and at no expense to NLM, which will result in information useful to NLM, may provide immediate improvements in biomedical information transfer to segments of the biomedical community, and is consistent with NLM's statutory functions, NOW THEREFORE, it is mutually agreed as follows:
1. The NLM hereby grants a nonexclusive, non-transferable right to LICENSEE to use the UMLS Metathesaurus and incorporate its content in any computer applications or systems designed to improve access to biomedical information of any type subject to the restrictions in other provisions of this Agreement. The names and addresses of licensees authorized to use the UMLS products are public information.
2. No charges, usage fees or royalties will be paid to NLM."
... (UMLS web site)
3 - Evaluation:
This evaluation showed the value of BioMeKE from a biomedical standpoint, especially for biologists studying a large list of genes obtained with a high-throughput technology.
Two types of evaluation were carried out: a quantitative evaluation and a qualitative evaluation.
The evaluation was done on a set of 43 genes known either to be involved or not to be involved in iron metabolism. Each gene has a LocusLink ID, retrieved via the LocusLink interface.
Mapping
LocusLink ID  CUI*                          Semantic types*  UMLS annotation  Evaluation
538           C1412688                      GG               no
5621          C1418941                      GG               no
57817         C1423607                      GG               no
6647          C1420306                      GG               no
3162          C1415619                      GG               no
3163          C1415620                      GG               no
4241          C1417130                      GG               no
4500          C1417400                      GG               no
79901         C1427130                      GG               no
80025         C1423814                      GG               no
9843          C1415510                      GG               no
9973          C1413192                      GG               no
6648          C1420307                      GG               no
6649          C1420308                      GG               no
7390          C1421375                      GG               no
7037          C1420708                      GG               no
1356          C1439306                      GG               yes              yes
2420          C1414813                      GG               no
2495          C1414833                      GG               no
2512          C1414852                      GG               no
205           C1412307                      GG               no
2235          C1414580                      GG               no
2395          C0387678                      AAPP             yes              yes
2941          C1415331                      GG               no
3240          C0018595,C1415692             AAPP/GG          yes              yes
7018          C0040679,C1442762             AAPP/GG          yes              yes
7036          C0908063,C1420707             AAPP/GG          yes              yes
30061         C0915115,C1456396             AAPP/GG          yes              yes
210           C1439270                      GG               no
1371          C0009985,C1413681             AAPP/GG          yes              yes
3091          C1333897                      GG               yes              yes
3077          C0018995,C1384665             DS/GG            yes              yes
2597          C0017857,C1414968             AAPP/GG          yes              yes
4057          C0022942,C1416933             AAPP/GG          yes              yes
540           C0296649,C1412689             AAPP/GG          yes              yes
4891          C1420089                      GG               no
2057          C0059570,C1333342             AAPP/GG          yes              yes
3263          C0019067,C1415712             AAPP/GG          yes              yes
567           C0005149,C1412709             AAPP/GG          yes              yes
48            C0378502,C1412126             AAPP/GG          yes              yes
3658          C1442498                      GG               yes              yes
7422          C0078058,C1336934,C1323364    AAPP/MF/GG       yes              yes
7428          C0299505,C0019562,C0694897    AAPP/DS/GG       yes              yes
* CUI: Each concept in the Metathesaurus (UMLS) has a unique and permanent concept identifier (CUI).
* Semantic Types:
GG --> Gene or Genome
AAPP --> Amino Acid, Peptide, or Protein
MF --> Molecular Function
DS --> Disease or Syndrome
Annotation
To evaluate the accuracy of the medical annotations provided by BioMeKE, the UMLS annotation was manually reviewed by an expert in iron metabolism and iron-related diseases (Olivier Loréal, INSERM U522). Two criteria were used:
• Complementary information: this criterion was used to determine whether a UMLS annotation was redundant with the GO annotation or complementary to it. A UMLS annotation is regarded as complementary to GO when the expert considers that it provides new information. For example, the GO annotations for EPOR are "erythropoietin receptor activity", "signal transduction" and "integral to plasma membrane"; among the UMLS annotations we find "Hematopoiesis", which is judged not complementary to GO.
• Expected information: this criterion was used to determine whether a UMLS annotation was expected or not. It was evaluated only on annotations judged complementary under the first criterion. An expected annotation corresponds to a relation between the gene and the UMLS concept that is valid from the expert's standpoint. For example, for the gene EPOR, 'Kidney Failure, Chronic' is judged expected by the expert and 'Epilepsy, Temporal lobe' is judged not expected.
Examples of UMLS annotations reviewed by the expert:
Gene EPOR (LocusLink ID 2057)
GO annotation:
• erythropoietin receptor activity
• signal transduction
• integral to plasma membrane
UMLS annotation                          Complementary to GO  Expected
Erythropoietin receptor                  no                   yes
Anemia, Sickle cell                      yes                  yes
Kidney Failure, Chronic                  yes                  yes
Endometriosis, site unspecified          yes                  no
Epilepsy, Temporal lobe                  yes                  no
Cytokine Receptor Gene                   yes                  no
Leukemia, Erythroblastic, Acute          yes                  yes
Dysmyelopoietic Syndromes                yes                  yes
Hematopoiesis                            no                   yes
Bone Marrow                              yes                  yes
Gene TF (LocusLink ID 7018)
GO annotation:
• ferric iron binding
• transport
• iron ion transport
• iron ion homeostasis
UMLS annotation                          Complementary to GO  Expected
Serum, Urine and Miscellaneous Proteins  yes                  yes
Oxidative Stress                         yes                  no
Hemochromatosis                          yes                  yes
Alzheimer's Disease                      yes                  yes
Staphylococcal Infections                yes                  no
Major Histocompatibility Complex         yes                  yes
Alternative Splicing                     yes                  yes
Alcohol-Related Disorders                yes                  yes
iron metabolism                          no                   yes
Sertoli cell Tumor                       yes                  no
Primary carcinoma of the liver cells     yes                  yes
Liver neoplasms                          yes                  yes
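As a worked illustration of the two criteria, the Python sketch below tallies the EPOR rows listed above. The data are copied from the EPOR table; the function name is hypothetical, and the resulting percentages apply to this single gene, not to the whole evaluation set.

```python
# (UMLS annotation, complementary to GO, expected), copied from the EPOR table.
EPOR_REVIEW = [
    ("Erythropoietin receptor", False, True),
    ("Anemia, Sickle cell", True, True),
    ("Kidney Failure, Chronic", True, True),
    ("Endometriosis, site unspecified", True, False),
    ("Epilepsy, Temporal lobe", True, False),
    ("Cytokine Receptor Gene", True, False),
    ("Leukemia, Erythroblastic, Acute", True, True),
    ("Dysmyelopoietic Syndromes", True, True),
    ("Hematopoiesis", False, True),
    ("Bone Marrow", True, True),
]


def review_rates(rows):
    """Return (% complementary to GO, % expected among the complementary ones)."""
    complementary = [row for row in rows if row[1]]
    expected = [row for row in complementary if row[2]]
    pct_complementary = 100.0 * len(complementary) / len(rows) if rows else 0.0
    pct_expected = 100.0 * len(expected) / len(complementary) if complementary else 0.0
    return pct_complementary, pct_expected


pct_complementary, pct_expected = review_rates(EPOR_REVIEW)
# 8 of the 10 annotations are complementary (80.0 %),
# and 5 of those 8 are expected (62.5 %).
```

The second percentage is computed on the complementary annotations only, following the rule stated for the "expected" criterion.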
The annotation files can be downloaded from the supplementary information Web page.
Graphical representation of the results of the manual evaluation
For each relation type, the figure shows the percentage of UMLS annotations that were or were not complementary to the GO annotation (disk) and, within the complementary annotations, the percentage that the expert did or did not expect (bar). The purple part of the disk represents the UMLS annotations that are complementary to the GO annotation, whereas the yellow part represents those that give no complementary information. The expected annotations were computed on the complementary annotations only: the hatched part of the bar represents expected annotations and the white part represents annotations that were not expected.
THE GEDAW WAREHOUSE
II. DATA INTEGRATION IN THE GEDAW WAREHOUSE
1. INTRODUCTION
Arguing that the biological interpretation of data generated by DNA microarrays requires enriching expression data through information integration, and that the data warehouse approach is suited to the large-scale analysis of expression data, we developed GEDAW.
GEDAW is an object-oriented data warehouse dedicated to the analysis of data generated by the study of the liver transcriptome. It integrates expression data enriched from sources and standards in the fields of genomics, biology and medicine.
We focused on structured and semi-structured sources and standards in order to achieve a tight, systematic integration within a global schema that gathers the instances coming from the various integrated sources.
2. IMPLEMENTATION AND DISCUSSION
Architecture
The GEDAW data schema is divided into three parts corresponding to the different types of integrated data: 1) experimental data, i.e. gene expression measurements as a function of experimental conditions; 2) annotations of the genes studied (gene, mRNA and protein sequences, together with their annotations); and 3) biomedical annotations.
Data sources
The data sources used to populate the warehouse are either local or distributed over the Web, each with its own representation system. They were chosen for their content and structure, so that the entities of interest can be extracted efficiently. The data sources are the following:
A relational database as the source of experimental data. A database was developed in the laboratory to manage data produced by DNA microarray technology. It complies with the MIAME standards. This database was designed outside the GEDAW warehouse so as not to overload it with experimental details. Only the normalized ratios and the experiment labels are exported to GEDAW for later analyses.
GenBank as the source of genomic data. Records in XML format from the GenBank database are used to integrate genomic data into GEDAW.
The GO and UMLS ontologies as sources of biomedical data. GO and the UMLS are used to provide, respectively, functional annotation and biomedical knowledge about the genes studied. This dual annotation is delivered by the BioMeKE application presented above, which outputs, in XML format, the GO terms and UMLS concepts associated with a list of genes.
Schema and integration process
A single object-oriented schema brings together all the experimental, genomic and biomedical information around the central elements: the gene, the mRNA and the protein. Java is used to describe and instantiate the classes, and the FastObjects object database management system (ODBMS) is used to persist them.
Because the selected data sources are structured or semi-structured, we were able to define, during the integration process, mapping rules that ensure both the correspondence between the source schemas and the GEDAW schema and the reconciliation of the data. Structural rules, which act at the schema level, select, extract and integrate the elements or concepts of GenBank, GO and the UMLS. Semantic rules, which act at the instance level, reconcile gene nomenclature: the GeneID identifier and the gene-name synonyms supplied by BioMeKE are used to group in GEDAW the data associated with the same gene.
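The semantic reconciliation rule can be sketched as follows. This is a Python illustration only, not GEDAW's Java code: the record layout (dictionaries with `source`, `gene_id` and `name` keys), the function name and the toy records are assumptions. Only the principle, grouping records by GeneID and falling back on the synonyms supplied by BioMeKE, comes from the text.

```python
from collections import defaultdict


def reconcile(records, synonyms):
    """Group source records under one GeneID.

    records:  iterable of dicts with a 'source' key and a 'gene_id' or a 'name'
    synonyms: dict mapping GeneID -> set of names and symbols for that gene
    """
    # Build a case-insensitive lookup from every known name to its GeneID.
    name_to_id = {}
    for gene_id, names in synonyms.items():
        for name in names:
            name_to_id[name.lower()] = gene_id

    grouped = defaultdict(list)
    unresolved = []
    for record in records:
        gene_id = record.get("gene_id") or name_to_id.get(record.get("name", "").lower())
        if gene_id is None:
            unresolved.append(record)  # kept aside rather than silently dropped
        else:
            grouped[gene_id].append(record)
    return dict(grouped), unresolved


# Toy input: the record layout and the unknown name are invented for illustration.
toy_synonyms = {3077: {"HFE", "HLA-H"}}
toy_records = [
    {"source": "expression", "gene_id": 3077},
    {"source": "GenBank", "name": "hla-h"},
    {"source": "UMLS", "name": "UNKNOWN1"},
]
grouped, unresolved = reconcile(toy_records, toy_synonyms)
```

Here the expression record and the GenBank record end up under GeneID 3077, while the record whose name matches no synonym is returned separately for manual inspection.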
Integration into GEDAW begins with loading the identifiers of the genes represented on the microarray. The expression measurements and the genomic, biological and medical data are then selected, transformed and integrated into GEDAW.
Finally, the user accesses the integrated and reconciled information through a Java interface.
The interface is used to compose multi-criteria OQL queries that relate diverse data not previously compared with one another, thereby opening the way to new hypotheses.
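Neither the GEDAW classes nor the OQL syntax are given here, so the following Python sketch is only a stand-in for the kind of multi-criteria query described: it combines an expression criterion with optional GO and UMLS criteria over a simplified, hypothetical `Gene` class. All attribute names and the ratio values in the toy data are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    """Hypothetical, much simplified view of a gene stored in the warehouse."""
    gene_id: int
    symbol: str
    ratios: dict = field(default_factory=dict)        # experiment label -> normalized ratio
    go_terms: set = field(default_factory=set)
    umls_concepts: set = field(default_factory=set)


def query(genes, experiment, min_ratio, go_term=None, umls_concept=None):
    """Symbols of genes meeting the expression criterion and the optional annotation criteria."""
    result = []
    for gene in genes:
        if gene.ratios.get(experiment, float("-inf")) < min_ratio:
            continue
        if go_term is not None and go_term not in gene.go_terms:
            continue
        if umls_concept is not None and umls_concept not in gene.umls_concepts:
            continue
        result.append(gene.symbol)
    return result


# Toy warehouse content: the ratios are invented, the annotations echo the tables above.
toy_genes = [
    Gene(3077, "HFE", {"exp1": 2.4}, {"iron ion homeostasis"}, {"Insulin Resistance"}),
    Gene(7018, "TF", {"exp1": 0.8}, {"iron ion homeostasis"}, {"Oxidative Stress"}),
]
selected = query(toy_genes, "exp1", 2.0, go_term="iron ion homeostasis")
```

With the toy data, only HFE passes both the ratio threshold and the GO criterion. Adding a `umls_concept` argument narrows the result further, which is the sense in which expression, functional and medical data are brought together in one query.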