The Phosphatase Ontology - Analyzing and Classifying Protein Family Data Using OWL Reasoning


4.2.3 The Phosphatase Ontology

The ontology is a representation of expert knowledge in the area of protein phosphatases. It is a model that describes the properties of each family and subfamily and the physical p-domain features required for inclusion in each family and subfamily. It contains descriptions of each major protein phosphatase family (the PTP, PPM, PPP, and lipid phosphatases), with a set of necessary and sufficient properties (conditions) for membership in each. These major families are further divided into subfamilies, and additional properties for each subsequent subfamily are also described. This creates a hierarchy with any subfamily group inheriting properties from the more general family classes. The open-world assumption, which underpins the OWL language, is essential for the classification to work correctly. The same is true for the use of disjoint axioms and qualified cardinality.

Figure 4.1 Relationships between the different family and subfamily groups in the protein phosphatase family.

For example, a tyrosine phosphatase must contain a tyrosine phosphatase catalytic domain, and a serine/threonine phosphatase must contain a serine/threonine catalytic domain. Not only must the domains be present, but their presence is enough to recognize any given protein as a member of the class in question. These are the necessary and sufficient conditions for membership in one of the main family groups. A protein is an instance of the tyrosine phosphatase family if it contains at least one PTP catalytic domain. This is the diagnostic property for this protein family.
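The necessary-and-sufficient pattern described here can be sketched outside OWL as a simple predicate over a protein's list of domains. This is only a minimal Python illustration, not the ontology itself: the domain labels and function names are hypothetical, and a real description-logic reasoner does considerably more than count labels.

```python
from collections import Counter

# Hypothetical domain labels; the real ontology uses its own p-domain classes.
PTP_CATALYTIC = "PTP_catalytic_domain"
ST_CATALYTIC = "ser_thr_catalytic_domain"


def is_tyrosine_phosphatase(domains):
    """Defined-class test: at least one PTP catalytic domain.

    The condition is necessary (every member satisfies it) and
    sufficient (anything that satisfies it is a member).
    """
    return Counter(domains)[PTP_CATALYTIC] >= 1


def is_ser_thr_phosphatase(domains):
    """Defined-class test: at least one serine/threonine catalytic domain."""
    return Counter(domains)[ST_CATALYTIC] >= 1


# Extra domains do not block membership; only the diagnostic domain is tested.
protein = [PTP_CATALYTIC, "transmembrane_region", "fibronectin_domain"]
print(is_tyrosine_phosphatase(protein))
print(is_ser_thr_phosphatase(protein))
```

Because each test looks only at the diagnostic domain, additional domains in the description neither help nor hinder membership in the main family.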

The OWL axioms say what must be present, but do not indicate that these domains are the only ones present; there may be others whose presence has simply not yet been stated. If there are other domains present, they may enable further classification into subfamilies, or they may simply be other properties of a protein sequence that do not feature in the classification. The open-world assumption allows for this style of description. A class definition is constructed from a collection of properties and axioms that are necessary and sufficient to distinguish it from other classes, but an individual may have more properties besides these, provided it has these as a minimum.

Disjoint axioms are equally useful for classification and distinguishing closely related individuals. For example, the defined classes tyrosine phosphatase and serine/threonine phosphatase are sibling classes and are disjoint. This means that protein sequences can satisfy only one of these sets of properties and be placed as an individual in only one of these classes. Disjointness does not need to be asserted between defined classes, as the necessary and sufficient criteria enable the reasoner to work out to which classes any instance belongs, or which defined class might subsume another. Disjointness can also be asserted between primitive classes. Such axioms help describe to which classes an instance belongs.
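In OWL, an individual that is inferred to belong to two classes declared disjoint makes the ontology inconsistent, and the reasoner reports this. The Python sketch below imitates that behavior by raising an error. It is a toy illustration under stated assumptions: the class names, domain labels, and helper functions are hypothetical and are not taken from the phosphatase ontology.

```python
# Hypothetical class names and domain labels, used only for illustration.
DEFINITIONS = {
    "tyrosine_phosphatase": lambda d: "PTP_catalytic_domain" in d,
    "ser_thr_phosphatase": lambda d: "ser_thr_catalytic_domain" in d,
}
DISJOINT_PAIRS = [("tyrosine_phosphatase", "ser_thr_phosphatase")]


def classify(domains):
    """Return the set of defined classes whose condition the protein meets."""
    return {name for name, test in DEFINITIONS.items() if test(domains)}


def check_disjoint(memberships):
    """Raise ValueError if the memberships violate a disjointness axiom."""
    for a, b in DISJOINT_PAIRS:
        if a in memberships and b in memberships:
            raise ValueError(f"member of disjoint classes {a} and {b}")
    return memberships


# A protein with only a PTP catalytic domain passes the check.
print(check_disjoint(classify(["PTP_catalytic_domain"])))

# A (hypothetical) protein carrying both catalytic domains is flagged.
try:
    check_disjoint(classify(["PTP_catalytic_domain", "ser_thr_catalytic_domain"]))
except ValueError as err:
    print("inconsistent:", err)
```

A flagged protein of this kind would be exactly the sort of data that signals the model needs refining.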

The segregation of the protein phosphatases at the level of tyrosine versus serine/threonine reflects the current knowledge of the domain. There are currently no examples of phosphatases that contain both catalytic domains. If we were to find data to suggest otherwise, we would have to refine the ontology model.

The same disjoint axiom pattern is used throughout the ontology, allowing the reasoner to differentiate and classify instances between sibling class assignments.

Qualified cardinality restrictions are applied in areas in which two sibling class definitions contain the same combination of p-domains, but in different quantities.
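The role of qualified cardinality can be sketched as follows: two sibling definitions mention the same domain types and differ only in how many of each are required. The class names and counts below are hypothetical, chosen only to show the pattern, and are not the definitions in the ontology. Note also that counting a plain list assumes the domain description of a protein is complete, something that an OWL reasoner working under the open-world assumption would need to be told explicitly.

```python
from collections import Counter

# Each restriction is (domain, minimum count, maximum count or None for no limit).
# Class names and counts are hypothetical, chosen only to show the pattern.
DEFINITIONS = {
    "single_catalytic_receptor_PTP": [
        ("PTP_catalytic_domain", 1, 1),
        ("transmembrane_region", 1, None),
    ],
    "tandem_catalytic_receptor_PTP": [
        ("PTP_catalytic_domain", 2, 2),
        ("transmembrane_region", 1, None),
    ],
}


def satisfies(domains, restrictions):
    """Check every (domain, min, max) restriction against the domain counts."""
    counts = Counter(domains)
    for domain, minimum, maximum in restrictions:
        n = counts[domain]
        if n < minimum or (maximum is not None and n > maximum):
            return False
    return True


def matching_classes(domains):
    """Return the sorted names of all definitions the protein satisfies."""
    return sorted(
        name for name, restrictions in DEFINITIONS.items()
        if satisfies(domains, restrictions)
    )


one = ["PTP_catalytic_domain", "transmembrane_region"]
two = ["PTP_catalytic_domain", "PTP_catalytic_domain", "transmembrane_region"]
print(matching_classes(one))
print(matching_classes(two))
```

Without the counts, both proteins would satisfy both definitions; the cardinalities are what separate the siblings.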

Figure 4.2 illustrates the p-domain architectures of the PTP family, demonstrating that each can be differentiated by cataloging and counting the presence and absence of p-domains. Figures 4.3(a) and 4.3(b) provide examples of class definitions from the ontology.

The full protein phosphatase ontology is available at: http://www.bioinf.manchester.ac.uk/phosphabase/links.html. The ontology describes classes of phosphatases, but not individual proteins. An Instance Store [27] was used in order to reason over the descriptions of individual proteins and to enable the storage of those descriptions. Instances can be asserted and stored in OWL ontologies (i.e., inside Protégé²), but at the start of this work there was a potential problem with scalability. Adding more than approximately 1,000 instances affected the performance of reasoning over the data. Individual genomes were not expected to contain more than this number of phosphatases, but scalability was a consideration for using such a technology in comparative studies.

2. http://www.co-ode.org/downloads/protege-x/

Figure 4.2 The differences in domain architecture of the receptor tyrosine phosphatase subfamily.

White oval = phosphatase catalytic domain, vertical bar = transmembrane region, white triangle = immunoglobulin domain, black hexagon = fibronectin domain, black diamond = MAM domain, black rectangle = carbonic anhydrase domain, grey oval = adhesion recognition site, white square = glycosylation, and black oval = cadherin-like domain.

Figure 4.3 (a) Necessary and sufficient properties for membership in the protein tyrosine phosphatase family and (b) necessary and sufficient properties for membership in the receptor tyrosine phosphatase 2A subfamily. R2A is a subclass of tyrosine phosphatase, so the R2A definition also fulfills the conditions for membership in the tyrosine phosphatase family.

These problems have now been overcome with a combination of improvements in description-logic reasoners, such as Pellet,³ and improvements in ontology-editing tools, such as Protégé. Protégé 4 is more scalable, so the use of the Instance Store may no longer be required for this methodology.

The Instance Store combines a description-logic reasoner with a relational database. The OWL ontology is loaded into the Instance Store, and the reasoner uses this to perform the task of classification; that is, from the OWL instance descriptions given, it determines the appropriate ontology class for an instance description.

The relational database provides the stability, scalability, and persistence necessary for this work. The Instance Store itself provides a relatively simple programmatic interface, allowing the assertion of descriptions and queries against the set of instances. It uses highly optimized algorithms to denormalize datasets as they are asserted and later determine whether the information in the database is sufficient to answer queries, or whether reasoning is required.
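The general idea of pairing a reasoner with a relational database can be sketched in a few lines of Python using the standard sqlite3 module. This is not the Instance Store's API or its algorithms. The class `TinyInstanceStore`, its methods, and the `toy_classifier` function are all hypothetical, and a plain function stands in for the description-logic reasoner. The sketch simplifies the behavior to classifying once at assertion time and answering later queries from the database alone.

```python
import sqlite3


def toy_classifier(domains):
    """Stand-in for the reasoner: map a domain list to a class name."""
    if "PTP_catalytic_domain" in domains:
        return "tyrosine_phosphatase"
    return "unclassified"


class TinyInstanceStore:
    """Toy store: classify on assertion, answer queries from the database."""

    def __init__(self, classifier):
        self.classifier = classifier
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE instances ("
            "name TEXT PRIMARY KEY, domains TEXT, assigned_class TEXT)"
        )

    def assert_instance(self, name, domains):
        # Classify once, at assertion time, and persist the result.
        assigned = self.classifier(domains)
        self.db.execute(
            "INSERT OR REPLACE INTO instances VALUES (?, ?, ?)",
            (name, ",".join(sorted(domains)), assigned),
        )
        self.db.commit()

    def instances_of(self, class_name):
        # Answered from the stored rows; the classifier is not called again.
        rows = self.db.execute(
            "SELECT name FROM instances WHERE assigned_class = ? ORDER BY name",
            (class_name,),
        )
        return [row[0] for row in rows]


store = TinyInstanceStore(toy_classifier)
store.assert_instance("protein_A", ["PTP_catalytic_domain", "transmembrane_region"])
store.assert_instance("protein_B", ["fibronectin_domain"])
print(store.instances_of("tyrosine_phosphatase"))
```

Storing the assigned class alongside each description is a form of denormalization: it trades some redundancy for queries that need no further reasoning.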

4.3 Results

From the document Data Mining in Biomedicine Using Ontologies (pages 84-87)