Evaluation of the Ontology Classification Method

From the document Data Mining in Biomedicine Using Ontologies (pages 94-100)

Analyzing and Classifying Protein Family Data Using OWL Reasoning

4.4 Ontology Classification in the Comparative Analysis of Three Protozoan Parasites: A Case Study

4.4.5 Evaluation of the Ontology Classification Method

In total, the ontology extracted and classified 142 protein phosphatases from the three parasite species. For the majority of sequences (approximately 78%), the ontology classification provided either more information than the original annotations or the same level of detail. The ontology system therefore again matched or surpassed the automated sequence analysis method in the majority of cases. The remainder of the sequences either suffered a loss of information, being classified as members of the PTP, PPM, or PPP parent classes instead of individual subfamily groups, or were false positives (i.e., proteins that did not belong to the protein phosphatase family).

There were 10 false-positive sequences, which had been placed in the general tyrosine phosphatase family (7 sequences) and among the low molecular weight PTPs (LMW-PTPs) (3 sequences). A multiple alignment had to be performed to determine the lack of conserved motifs in these sequences. For the LMW-PTPs, the InterPro domain is not specific enough to distinguish between LMW-PTPs and other very similar types of enzymes, such as arsenate reductases, in the lower eukaryotes [50]. If the prescreening method had been employed before this analysis, however, these false positives would not have been detected. Again, this problem and the loss of information from a small number of sequences can be largely attributed to the higher eukaryotic bias in some of the InterPro domains.

The ontology classification method was as good as or better than large-scale automated sequence annotation methods, supplying a fast analysis of the phosphatase gene products and providing a good deal of information that was not previously known about kinetoplastid parasite phosphatases. To increase the efficiency of the system, however, we must address the bias towards higher eukaryotes in InterPro for some sequences. One way to do this is to combine this analysis with other bioinformatics tools before the classification stage. The workflow for extracting and initially analyzing data could be extended to accomplish this task.

4.5 Conclusion

Postgenomic bioinformatics presents new problems for the bioinformatician. The scale of data production has increased dramatically, while data analysis, annotation, and curation have not kept pace. Often, compromises on the quality of annotation have to be made in order to interpret large datasets quickly. By designing a system that allows rapid, automated classification to the fine-grained subfamily level, the need to make such a compromise is avoided. This study demonstrates the advantages of combining community knowledge, in an ontology, with automated annotation methods.

Standard automated methods of annotation provide evidence for similarity to other known proteins, or provide lists of functional domains within a protein, but they do not allow the interpretation of this information. The strength of human-expert annotation is in this interpretation step. In a novel approach, the interpretation step was replaced with further automation. Using the technologies of formal description logics and ontological reasoning, community knowledge can be captured and utilized for data analysis. The methodology does, however, hinge on the expert knowledge of a domain. If the data do not fit the current knowledge of a particular area, this is also a useful outcome: it informs the scientists that their model needs revision or expansion.

The ontology system classified the human protein phosphatases as competently as human experts, enabling confidence to be placed in similar studies of the protein phosphatases of uncharacterized genomes. This was demonstrated by the results from the parasite genome analysis. It was also discovered that the ontology system was efficient at uncovering novel, unexpected functional domains.

As the ontology classified proteins according to what was already known, proteins exhibiting a different composition of domains were easily highlighted, identifying new targets for further scientific research.

This work focused on classifying proteins into family groups using domain-architecture analysis. There are many tools that can be employed in such a task, either instead of InterProScan or in addition to it. In order to use more data sources and analysis tools in this investigation, we would simply have to extend the workflow that extracts and analyzes the data before the ontology classification. In fact, this methodology is applicable in any area of biology in which class membership can be defined according to a set of properties that can be derived using automated analysis tools. To date, the use of ontological technology in biology has been largely restricted to enhancing browsing and querying over existing data. Harnessing the reasoning capabilities of DL ontologies in this way to enable automated classification could potentially have a great impact on bioinformatics analyses and approaches to automation in the future.
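The idea of defining class membership by a set of automatically derived properties can be illustrated with a minimal sketch. It is plain Python rather than an OWL reasoner, and it is not the authors' implementation. The class names and domain labels below are hypothetical placeholders, not InterPro entries or classes from the ontology described here. Each class is defined by the exact number of each p-domain a member must contain, and proteins that match no definition are set aside for inspection.

```python
from collections import Counter

# Hypothetical class definitions: each class maps to the exact number of
# each p-domain a member must contain. All names are illustrative
# placeholders, not real InterPro entries or ontology classes.
CLASS_DEFINITIONS = {
    "ReceptorLikeExample": {"catalytic": 2, "transmembrane": 1},
    "CytosolicExample": {"catalytic": 1},
}


def classify(domains):
    """Return the sorted names of classes whose domain counts match exactly."""
    counts = Counter(domains)
    return sorted(
        name
        for name, required in CLASS_DEFINITIONS.items()
        if counts == Counter(required)
    )


def partition(proteins):
    """Split proteins into those matching a class and those matching none."""
    classified, unmatched = {}, []
    for protein_id, domains in proteins.items():
        classes = classify(domains)
        if classes:
            classified[protein_id] = classes
        else:
            unmatched.append(protein_id)
    return classified, sorted(unmatched)


if __name__ == "__main__":
    # Hypothetical input: protein identifier -> list of detected p-domains.
    proteins = {
        "p1": ["catalytic"],
        "p2": ["catalytic", "transmembrane", "catalytic"],
        "p3": ["catalytic", "unknown_domain"],
    }
    classified, unmatched = partition(proteins)
    print(classified)
    print(unmatched)
```

In this sketch the unmatched list plays the role of the proteins with an unexpected domain composition, which would be candidates for closer study or a prompt to revise the class definitions.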

As well as extending the data-collection part of the ontology classification process, we can also increase the expressivity of the protein class descriptions. For example, for the protein phosphatases, the order of p-domains was not important, but simply counting the number of each was sufficient to distinguish between proteins from different subfamilies. In other protein families, however, the order of the p-domains would also need to be specified. If we take the ABC transporters as an example, the ABCD and ABCG subfamilies have exactly the same p-domain architecture. The only difference is the orientation of their two p-domains, an ATP-binding domain and a transmembrane region. ABCG proteins are referred to as reverse transporters, as the ATP-binding domain is N-terminal to the transmembrane domain, which is the opposite orientation to the ABCD proteins.
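The ABC transporter example can be sketched in a few lines of Python to show why counting alone is not enough. This is an illustration under stated assumptions, not the authors' ontology: the labels "ATP_binding" and "transmembrane" are placeholder names, and the two orderings simply encode the orientations described in the text, listed from N-terminus to C-terminus.

```python
from collections import Counter

# Domain order from N-terminus to C-terminus, following the description in
# the text. The domain labels are illustrative placeholders.
ORDERED_DEFINITIONS = {
    "ABCG": ("ATP_binding", "transmembrane"),
    "ABCD": ("transmembrane", "ATP_binding"),
}


def classify_by_count(domains):
    """Match on domain counts only, ignoring order."""
    counts = Counter(domains)
    return sorted(
        name
        for name, order in ORDERED_DEFINITIONS.items()
        if Counter(order) == counts
    )


def classify_by_order(domains):
    """Match on the exact N-to-C order of domains."""
    observed = tuple(domains)
    return sorted(
        name for name, order in ORDERED_DEFINITIONS.items() if observed == order
    )


if __name__ == "__main__":
    reverse_transporter = ["ATP_binding", "transmembrane"]
    print(classify_by_count(reverse_transporter))  # ambiguous: both subfamilies
    print(classify_by_order(reverse_transporter))  # resolved by domain order
```

Counting treats the two architectures as identical, so both subfamilies are returned; comparing the ordered sequence separates them. In an ontology, the equivalent step would be to add ordering information to the class descriptions.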

Ontology use in the bioinformatics community is continuing to grow, providing data-management solutions and the ability to define concepts and terms across large, disparate research communities. Despite these advances, however, the reasoning capabilities of formal DL ontologies are in many cases not fully exploited. The automated protein classification using ontology reasoning presented here demonstrates the extra advantages of using these capabilities. It is hoped that this system can be employed and exploited in future work, for example, in drug-target identification and new genome annotation.

References

[1] Wolstencroft, K., et al., “A Little Semantic Web Goes a Long Way in Biology,” 4th Int. Semantic Web Conf., Vol. 3792, Galway, Ireland, November 6–10, 2005, pp. 786–800.

[2] Wolstencroft, K., et al., “Protein Classification Using Ontology Classification,” Bioinformatics, Vol. 22, 2006, pp. e530–e538.

[3] Brenchley, R., et al., “The TriTryp Phosphatome: Analysis of the Protein Phosphatase Catalytic Domains,” BMC Genomics, Vol. 8, No. 434, 2007.

[4] Altschul, S. F., et al., “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs,” Nucleic Acids Res., Vol. 25, 1997, pp. 3389–3402.

[5] Hulo, N., et al., “Recent Improvements to the PROSITE Database,” Nucleic Acids Res., Vol. 32, 2004, pp. D134–D137.

[6] Letunic, I., et al., “SMART 4.0: Towards Genomic Data Integration,” Nucleic Acids Res., Vol. 32, 2004.

[7] Finn, R. D., et al., “The Pfam Protein Families Database,” Nucleic Acids Res., Vol. 36 (Database Issue), January 2008, pp. D281–D288, Epub. November 26, 2007.

[8] Hunter, S., et al., “InterPro: The Integrative Protein Signature Database,” Nucleic Acids Res., Vol. 37 (Database Issue), January 2009, pp. D211–D215, Epub. October 21, 2008.

[9] Barford, D., A. K. Das, and M. P. Egloff, “The Structure and Mechanism of Protein Phosphatases: Insights into Catalysis and Regulation,” Annu. Rev. Biophys. Biomol. Struct., Vol. 27, 1998, pp. 133–164, Review.

[10] Tonks, N. K., and B. G. Neel, “Combinatorial Control of the Specificity of Protein Tyrosine Phosphatases,” Curr Opin Cell Biol, Vol. 13, 2004, pp. 182–195.

[11] Depaoli-Roach, A. A., et al., “Serine/Threonine Protein Phosphatases in the Control of Cell Function,” Adv Enzyme Regul, Vol. 34, 1994, pp. 199–224.

[12] Zolnierowicz, S., and M. Bollen, “Protein Phosphorylation and Protein Phosphatases,” De Panne, Belgium, September 19–24, 1999, Embo J, Vol. 19, 2000, pp. 483–488.

[13] Schonthal, A. H., “Role of Serine/Threonine Protein Phosphatase 2A in Cancer,” Cancer Lett., Vol. 170, No. 1, 2001, pp. 1–13.

[14] Zhang, Z. Y., “Protein Tyrosine Phosphatases: Prospects for Therapeutics,” Curr. Opin. Chem. Biol., Vol. 5, No. 4, 2001, pp. 416–423.

[15] Tian, Q., and J. Wang, “Role of Serine/Threonine Protein Phosphatase in Alzheimer’s Disease,” Neurosignals, Vol. 11, No. 5, 2002, pp. 262–269.

[16] Barford, D., A. K. Das, and M. P. Egloff, “The Structure and Mechanism of Protein Phosphatases: Insights into Catalysis and Regulation,” Annu. Rev. Biophys. Biomol. Struct., Vol. 27, 1998, pp. 133–164, Review.

[17] Andersen, J. N., et al., “A Genomic Perspective on Protein Tyrosine Phosphatases: Gene Structure, Pseudogenes, and Genetic Disease Linkage,” FASEB J., Vol. 18, No. 1, 2004, pp. 8–30, Review.

[18] Alonso, A., et al., “Protein Tyrosine Phosphatases in the Human Genome,” Cell, Vol. 117, No. 6, 2004, pp. 699–711, Review.

[19] Klumpp, S., et al., “Protein Histidine Phosphatase: A Novel Enzyme with Potency for Neuronal Signaling,” J Cereb Blood Flow Metab, Vol. 22, 2002, pp. 1420–1424.

[20] Nandurkar, H. H., and R. Huysmans, “The Myotubularin Family: Novel Phosphoinositide Regulators,” IUBMB Life, Vol. 53, 2002, pp. 37–43.

[21] Wishart, M. J., and J. E. Dixon, “PTEN and Myotubularin Phosphatases: From 3-Phosphoinositide Dephosphorylation to Disease,” Trends Cell Biol, Vol. 12, 2002, pp. 579–585.

[22] Andersen, J. N., et al., “A Genomic Perspective on Protein Tyrosine Phosphatases: Gene Structure, Pseudogenes, and Genetic Disease Linkage,” FASEB J., Vol. 18, No. 1, 2004, pp. 8–30, Review.

[23] Cohen, P., “Novel Protein Serine/Threonine Phosphatases: Variety is the Spice of Life,” Trends Biochem. Sci., Vol. 22, No. 7, 1997, pp. 245–251, Review.

[24] Bollen, M., and W. Stalmans, “The Structure, Role, and Regulation of Type 1 Protein Phosphatases,” Crit Rev Biochem Mol Biol, Vol. 27, 1992, pp. 227–281.

[25] Oinn, T., et al., “Taverna: A Tool for the Composition and Enactment of Bioinformatics Workflows,” Bioinformatics, Vol. 20, No. 17, 2004, pp. 3045–3054.

[26] Rice, P., I. Longden, and A. Bleasby, “EMBOSS: The European Molecular Biology Open Software Suite,” Trends in Genetics, Vol. 16, No. 6, 2000, pp. 276–277.

[27] Bechhofer, S., I. Horrocks, and D. Turi, “Implementing the Instance Store,” Computer Science, preprint CSPP-29, University of Manchester, 2004.

[28] Wang, J., et al., “A Unique Carbohydrate Binding Domain Targets the Lafora Disease Phosphatase to Glycogen,” J. Biol. Chem., Vol. 277, No. 4, 2002, pp. 2377–2380.

[29] International Human Genome Sequencing Consortium, “Finishing the Euchromatic Sequence of the Human Genome,” Nature, Vol. 431, No. 7011, 2004, pp. 931–945.

[30] Kumar, R., et al., “A Zinc-Binding Dual-Specificity YVH1 Phosphatase in the Malaria Parasite, Plasmodium falciparum, and Its Interaction with the Nuclear Protein, Pescadillo,” Mol Biochem Parasitol., Vol. 133, No. 2, 2004, pp. 297–310.

[31] Bhaduri, A., and R. Sowdhamini, “A Genome-Wide Survey of Human Tyrosine Phosphatases,” Protein Eng, Vol. 16, No. 12, 2003, pp. 881–888.

[32] Chagnon, M. J., N. Uetani, and M. L. Tremblay, “Functional Significance of the LAR Receptor Protein Tyrosine Phosphatase Family in Development and Diseases,” Biochem. Cell Biol., Vol. 82, No. 6, 2004, pp. 664–675.

[33] Kraus, P. R., and J. Heitman, “Coping with Stress: Calmodulin and Calcineurin in Model and Pathogenic Fungi,” Biochemical and Biophysical Research Communications, Vol. 311, 2003, pp. 1151–1157.

[34] Shin, D., et al., “Athb-12, a Homeobox-Leucine Zipper Domain Protein from Arabidopsis Thaliana, Increases Salt Tolerance in Yeast by Regulating Sodium Exclusion,” Biochem Biophys Res Commun, Vol. 323, 2004, pp. 534–540.

[35] Parsons, M., et al., “Comparative Analysis of the Kinomes of Three Pathogenic Trypanosomatids: Leishmania Major, Trypanosoma Brucei and Trypanosoma Cruzi,” BMC Genomics, Vol. 6, 2005, p. 127.

[36] Berriman, M., et al., “The Genome of the African Trypanosome Trypanosoma Brucei,” Science, Vol. 309, No. 5733, 2005, pp. 416–422.

[37] El-Sayed, N. M., et al., “The Genome Sequence of Trypanosoma Cruzi, Etiologic Agent of Chagas Disease,” Science, Vol. 309, No. 5733, 2005, pp. 409–415.

[38] Ivens, A. C., et al., “The Genome of the Kinetoplastid Parasite, Leishmania Major,” Science, Vol. 309, No. 5733, 2005, pp. 436–442.

[39] Peacock, C. S., et al., “Comparative Genomic Analysis of Three Leishmania Species That Cause Diverse Human Disease,” Nat Genet, Vol. 39, No. 7, 2007, pp. 839–847.

[40] Rose, N. R., “Infection, Mimics, and Autoimmune Disease,” J. Clin. Invest, Vol. 107, No. 8, 2001, pp. 943–944.

[41] Reithinger, R., et al., “Cutaneous Leishmaniasis,” Lancet. Infect Dis., Vol. 7, No. 9, 2007, pp. 581–596.

[42] Szoor, B., et al., “Protein Tyrosine Phosphatase TbPTP1: A Molecular Switch Controlling Life Cycle Differentiation in Trypanosomes,” J Cell Biol, Vol. 175, No. 2, 2006, pp. 293–303.

[43] Erondu, N. E., and J. E. Donelson, “Characterization of Trypanosome Protein Phosphatase 1 and 2A Catalytic Subunits,” Mol Biochem Parasitol, Vol. 49, No. 2, 1991, pp. 303–314.

[44] Chaudhuri, M., “Cloning and Characterization of a Novel Serine/Threonine Protein Phosphatase Type 5 from Trypanosoma Brucei,” Gene, Vol. 266, Nos. 1–2, 2004, pp. 1–13.

[45] Orr, G. A., et al., “Identification of Novel Serine/Threonine Protein Phosphatases in Trypanosoma Cruzi: A Potential Role in Control of Cytokinesis and Morphology,” Infect Immun, Vol. 68, No. 3, 2000, pp. 1350–1358.

[46] Mills, E., et al., “Kinetoplastid PPEF Phosphatases: Dual Acylated Proteins Expressed in the Endomembrane System of Leishmania,” Mol Biochem Parasitol, Vol. 152, No. 1, 2007, pp. 22–34.

[47] Attwood, T. K., “The PRINTS Database: A Resource for Identification of Protein Families,” Brief Bioinform, Vol. 3, No. 3, 2002, pp. 252–263.

[48] Boudeau, J., et al., “Emerging Roles of Pseudokinases,” Trends Cell Biol, Vol. 16, No. 9, 2006, pp. 443–452.

[49] Lee, J. O., et al., “Crystal Structure of the PTEN Tumor Suppressor: Implications for Its Phosphoinositide Phosphatase Activity and Membrane Association,” Cell, Vol. 99, No. 3, 1999, pp. 323–334.

[50] Mukhopadhyay, R., and B. P. Rosen, “Arsenate Reductases in Prokaryotes and Eukaryotes,” Environ Health Perspect, Vol. 110, Suppl 5, 2002, pp. 745–748.
