

In the document Data Mining in Biomedicine Using Ontologies (pages 138-144)

Mapping Genes to Biological Pathways Using Ontological Fuzzy Rule Systems

6.4 Application of OFRSs: Mapping Genes to Biological Pathways

6.4.1 Mapping Genes to Pathways Using a Disjunctive OFRS

6.4.1.1 Disjunctive OFRS

The disjunctive rules of the OFRS used in [9] have the form

$$\text{Rule } i:\ \text{IF } x \text{ is } G_{i1} \text{ OR} \ldots \text{OR } x \text{ is } G_{in} \text{ THEN } P_i, \quad i \in [1, m] \tag{6.6}$$

The activation, $a_i$, of each rule for an input $x$ is computed as

$$a_i = \mathop{\mathrm{OR}}_{j = 1, \ldots, n} \{ w_{ij} \} \tag{6.7}$$

where OR is a disjunctive operator, and $w_{ij}$ is calculated using the ontological similarity between $x$ and $G_{ij}$, $s(x, G_{ij})$. In this application, we do not use any rule aggregation, as in (6.4) and (6.5). Instead, we use the activation of each rule, $a_i$, to associate each input gene to an m-dimensional output vector $A = (a_1, \ldots, a_m)$, $A \in \mathbb{R}^m$.

For the case of mapping genes to pathways, the input variable of the OFRS is a gene annotated with terms from the Gene Ontology (GO), and the output variable is a KEGG pathway (http://www.genome.ad.jp/kegg). The concrete form of the above OFRS rule is

$$\text{IF gene is } GENE_{i1} \text{ OR} \ldots \text{OR gene is } GENE_{in} \text{ THEN pathway is } PATH_i \tag{6.8}$$

where $GENE_{ij}$ are genes identified by KEGG as being present in pathway $PATH_i$. In fact, the OFRS consists of the KEGG pathway database itself. The OFRS has just one gene as input variable. The output of the OFRS is the membership of the input gene in a pathway $PATH_i$. As mentioned above, the OFRS (6.8) maps each gene into an m-dimensional feature vector that represents the membership in each pathway. Next, we present an example of computing the activation of a rule (6.8) for a given input gene.

Example 6.4 Given the rule “IF gene is BCL2 OR gene is APAF1 THEN pathway is APOPTOSIS” we compute the rule activation for CASP9. Using the GO Web site, http://www.geneontology.org, we obtain the following annotations (only two shown) for each of the three H. sapiens genes mentioned above:

CASP9 = {GO:0008632: apoptotic program, GO:0008635: caspase activation via cytochrome c}, BCL2 = {GO:0006916: antiapoptosis, GO:0006959: humoral immune response}, and APAF1 = {GO:0008635: caspase activation via cytochrome c, GO:0042981: regulation of apoptosis}. Using the term-similarity method from (6.1) and the GO snippet from Figure 6.5, we obtain the term-similarity matrix in Table 6.1. For example, the similarity between GO terms GO:0008635 and GO:0008632 is computed as 0.94 × 0.4 = 0.26.

The “relatedness” of CASP9 to BCL2, $w_1$, is given by their GO similarity, computed using the normalized pairwise similarity (6.2):

$$w_1 = s(\text{CASP9}, \text{BCL2}) = s(\{\text{GO:0008632}, \text{GO:0008635}\}, \{\text{GO:0006916}, \text{GO:0006959}\}) = \frac{(0.11 + 0.32 + 0.02 + 0.001)/4}{\max\{(1 + 1 + 0.04 + 0.02)/4,\ (1 + 1 + 0.03 + 0.26)/4\}} = 0.2$$

Similarly, the relatedness of CASP9 to APAF1 is $w_2 = 0.72$. The rule activation (6.7) is $a = \max\{w_1, w_2\} = 0.72$. The rule activation is high, as it should be, since CASP9 is part of the apoptosis pathway.
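The arithmetic of Example 6.4 can be checked with a short sketch. The exact form of (6.2) is not restated here, so the normalization below (mean cross-pair similarity divided by the larger of the two mean self-similarities) is an assumption inferred from the worked numbers, as is the choice to read Table 6.1 with the rule gene's term as the row and the input gene's term as the column. The names `TERM_SIM`, `mean_pairwise`, and `gene_similarity` are hypothetical.

```python
from itertools import product

# Table 6.1, stored as TERM_SIM[row_term][column_term].
TERMS = ["GO:0008632", "GO:0008635", "GO:0006916", "GO:0006959", "GO:0042981"]
ROWS = [
    [1, 0.03, 0.15, 0.04, 0.36],
    [0.26, 1, 0.11, 0.03, 0.26],
    [0.32, 0.11, 1, 0.04, 0.32],
    [0.02, 0.001, 0.01, 1, 0.02],
    [0.36, 0.03, 0.15, 0.04, 1],
]
TERM_SIM = {term: dict(zip(TERMS, row)) for term, row in zip(TERMS, ROWS)}


def mean_pairwise(row_terms, col_terms):
    """Average table entry over all (row term, column term) pairs."""
    pairs = list(product(row_terms, col_terms))
    return sum(TERM_SIM[r][c] for r, c in pairs) / len(pairs)


def gene_similarity(x_terms, g_terms):
    """Assumed reading of (6.2): mean cross similarity between the two
    annotation sets, normalized by the larger mean self-similarity."""
    cross = mean_pairwise(g_terms, x_terms)  # rule gene = row, input gene = column
    norm = max(mean_pairwise(x_terms, x_terms), mean_pairwise(g_terms, g_terms))
    return cross / norm


CASP9 = ["GO:0008632", "GO:0008635"]
BCL2 = ["GO:0006916", "GO:0006959"]
APAF1 = ["GO:0008635", "GO:0042981"]

w1 = gene_similarity(CASP9, BCL2)
w2 = gene_similarity(CASP9, APAF1)
activation = max(w1, w2)  # OR taken as max, as in the example
print(round(w1, 2), round(w2, 2), round(activation, 2))  # prints: 0.2 0.72 0.72
```

Taking the maximum as the disjunctive operator follows the example; other disjunctive operators would give different activations.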

6.4.1.2 Gene-Mapping Algorithm

The input of the mapping algorithm is a set of GO annotated genes $Q = \{q_i\}$, $i \in [1, N]$. The goal of the algorithm is to find the KEGG pathways (their numbers and identities) that are involved in the expression of genes from the set Q. The pathway-prediction algorithm has the following steps:

1. Compute the activation $a_{ij}$ of each gene $q_i$, $i \in [1, N]$, in pathway $j$, $j \in [1, m]$, using (6.7). As a result, each gene $i$ is described by a pathway activation (feature) vector $A_i = (a_{i1}, \ldots, a_{im}) \in \mathbb{R}^m$.

2. Compute the gene-similarity matrix, $S = \{s_{ij}\}$, $i, j \in [1, N]$, as

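Table 6.1 GO Term-Similarity Matrix Computed with (6.1) and the GO Snippet from Figure 6.5

|            | GO:0008632 | GO:0008635 | GO:0006916 | GO:0006959 | GO:0042981 |
|------------|------------|------------|------------|------------|------------|
| GO:0008632 | 1          | 0.03       | 0.15       | 0.04       | 0.36       |
| GO:0008635 | 0.26       | 1          | 0.11       | 0.03       | 0.26       |
| GO:0006916 | 0.32       | 0.11       | 1          | 0.04       | 0.32       |
| GO:0006959 | 0.02       | 0.001      | 0.01       | 1          | 0.02       |
| GO:0042981 | 0.36       | 0.03       | 0.15       | 0.04       | 1          |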

Figure 6.5 Gene Ontology snippet for the terms used in Example 6.4.


where $A^T$ denotes that the vector A was thresholded with a threshold T (that is, if $a_{ij} < T$, then $a_{ij} = 0$). The thresholding operation was performed in order to remove the noise (pathways with residual activation). The best threshold was determined experimentally [9] to be T = 0.5.

3. Use a clustering algorithm, together with a cluster validity measure, to assess the most likely number C of pathways (clusters) present in Q. We used the fuzzy C-means algorithm [10] to cluster the genes represented by the feature vectors $\{S_i\}$, $i = 1, \ldots, N$, into C clusters, where $S_i = (s_{i1}, \ldots, s_{iN})$, and the partition coefficient [10] to estimate the number of clusters. We found that it is more reliable to cluster the similarity matrix S using fuzzy C-means, rather than the feature vectors $\{A_i\}$ directly. Another possible approach to clustering a similarity matrix is to use a relational clustering algorithm, such as non-Euclidean relational fuzzy C-means [11], together with a relational clustering validity measure, such as the correlation cluster validity [12] (as shown in Chapter 3).
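A minimal sketch of this step, assuming the textbook fuzzy C-means update rules with Euclidean distance and fuzzifier 2, and the usual partition coefficient (mean of squared memberships). The function names, the toy similarity matrix, the candidate range for C, and the random initialization are illustrative assumptions, not the settings used in [9] or [10].

```python
import numpy as np


def fuzzy_c_means(X, C, fuzzifier=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy C-means on the rows of X; returns (memberships, centers)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], C))
    U /= U.sum(axis=1, keepdims=True)  # each row of memberships sums to 1
    for _ in range(n_iter):
        W = U ** fuzzifier
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)  # guard against division by zero
        inv = dist ** (-2.0 / (fuzzifier - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, centers


def partition_coefficient(U):
    """Mean squared membership: 1/C for a maximally fuzzy partition, 1 for a crisp one."""
    return float((U ** 2).sum() / U.shape[0])


# Toy gene-similarity matrix with two visible blocks (hypothetical values).
S = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.0, 0.1],
    [0.9, 1.0, 0.85, 0.0, 0.1, 0.0],
    [0.8, 0.85, 1.0, 0.1, 0.1, 0.0],
    [0.1, 0.0, 0.1, 1.0, 0.9, 0.8],
    [0.0, 0.1, 0.1, 0.9, 1.0, 0.9],
    [0.1, 0.0, 0.0, 0.8, 0.9, 1.0],
])

# Cluster the rows S_i for several candidate C and keep the C with the highest coefficient.
scores = {C: partition_coefficient(fuzzy_c_means(S, C)[0]) for C in (2, 3, 4)}
best_C = max(scores, key=scores.get)
print(scores, best_C)
```

Because the partition coefficient tends to favor small C, in practice it is usually inspected over a range of C rather than trusted blindly.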

and |I| denotes the cardinality of I. The pathway in which the genes of cluster c are most likely to be active is the one for which the sum of the activations in cluster c is maximum. If we denote this pathway by $P_k$, $k \in [1, m]$, then k is obtained using

$$k = \arg\max_{j \in [1, m]} \sum_{i:\ q_i \in \text{cluster } c} a_{ij}$$

To produce more than one candidate pathway for a cluster, we can consider the pathway that has the second highest sum activation in the cluster, and so on.
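The selection rule described above (largest summed activation per cluster, then the second largest, and so on) might be sketched as follows. It assumes each gene has already been given a crisp cluster label, for instance the cluster with its highest fuzzy membership. The function name `candidate_pathways` and the toy activations are hypothetical, and the returned values are 0-based column indices of the activation matrix, not KEGG IDs.

```python
import numpy as np


def candidate_pathways(A, labels, n_candidates=1):
    """For each cluster, rank pathways (columns of A) by the sum of the
    activations of the genes (rows of A) assigned to that cluster."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    result = {}
    for c in np.unique(labels):
        sums = A[labels == c].sum(axis=0)
        order = np.argsort(-sums, kind="stable")  # descending by summed activation
        result[int(c)] = [int(j) for j in order[:n_candidates]]
    return result


# Four genes, three pathways, two clusters (toy values).
A = [[0.9, 0.1, 0.0],
     [0.8, 0.6, 0.0],
     [0.0, 0.7, 0.6],
     [0.1, 0.2, 0.9]]
labels = [0, 0, 1, 1]

print(candidate_pathways(A, labels))                  # {0: [0], 1: [2]}
print(candidate_pathways(A, labels, n_candidates=2))  # {0: [0, 1], 1: [2, 1]}
```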

5. The mapping was evaluated using the detection rate (DR, sensitivity), computed as

$$DR = \frac{\text{no. of pathways correctly predicted}}{\text{total no. of correct pathways}} \tag{6.11}$$

The false prediction rate (FPR) is computed as

$$FPR = \frac{\text{no. of pathways erroneously predicted}}{\text{total no. of pathways predicted}} \tag{6.12}$$

For example, if the KEGG IDs of the correct pathways are {10, 940, 3050}, and our prediction is {10, 940, 3030, 4070}, then DR = 2/3 ≈ 0.67 and FPR = 2/4 = 0.5. We note that, since we ignore that the pathways 3050 and 3030 are strongly related, our DR estimate is conservative.
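The two measures (6.11) and (6.12) reduce to set operations on pathway IDs. The sketch below applies them to the KEGG IDs from the example; the function names are hypothetical.

```python
def detection_rate(correct, predicted):
    """(6.11): fraction of the correct pathways that were predicted."""
    correct, predicted = set(correct), set(predicted)
    return len(correct & predicted) / len(correct)


def false_prediction_rate(correct, predicted):
    """(6.12): fraction of the predicted pathways that are not correct."""
    correct, predicted = set(correct), set(predicted)
    return len(predicted - correct) / len(predicted)


correct = {10, 940, 3050}
predicted = {10, 940, 3030, 4070}

dr = detection_rate(correct, predicted)          # 2 of 3 correct pathways found
fpr = false_prediction_rate(correct, predicted)  # 2 of 4 predictions are wrong
print(dr, fpr)
```

Exact set matching is what makes the estimate conservative: a prediction of 3030 earns no credit for being related to 3050.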

We also estimate the p-value of our DR prediction by randomly assigning the membership of the N genes in C clusters and recomputing the detection rate, DR*. We perform the random assignment 1,000 times, resulting in a set of 1,000 random detection rates, $\{DR^*_j\}$, $j = 1, \ldots, 1000$. Then, the p-value is calculated as

$$p = \frac{\left| \{ j : DR^*_j > DR \} \right|}{1000}$$

that is, the number of random detection rates higher than our DR (obtained by clustering the $S_i$'s) divided by 1,000. The p-value is a measure of the reliability of our classifier. If the p-value is low (e.g., lower than 0.05), a low detection rate might be due to a gene set that is hard to predict and not to a bad prediction method.
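The randomization test can be sketched as below. It assumes that, for each random assignment, one pathway per cluster is predicted by the maximum summed activation and that pathways are identified by column index of the activation matrix. Whether the thresholded or the raw activations are used at this point is not stated, so the input matrix `A` is left to the caller. All function names and the toy data are hypothetical.

```python
import numpy as np


def predicted_pathways(A, labels):
    """One predicted pathway (column index) per non-empty cluster."""
    return {int(A[labels == c].sum(axis=0).argmax()) for c in np.unique(labels)}


def detection_rate(correct, predicted):
    return len(set(correct) & set(predicted)) / len(set(correct))


def permutation_p_value(A, correct, observed_dr, C, n_random=1000, seed=0):
    """Fraction of random cluster assignments whose DR* exceeds the observed DR."""
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)
    exceed = 0
    for _ in range(n_random):
        labels = rng.integers(0, C, size=A.shape[0])  # random membership in C clusters
        dr_star = detection_rate(correct, predicted_pathways(A, labels))
        exceed += int(dr_star > observed_dr)
    return exceed / n_random


# Toy data: 15 genes, 6 pathways, random activations (illustrative only).
A = np.random.default_rng(1).random((15, 6))
p = permutation_p_value(A, correct={0, 1, 2}, observed_dr=2 / 3, C=3)
print(p)
```

With the strict inequality used here, an observed DR of 1 always yields p = 0, since no random assignment can exceed it.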

6.4.1.3 Testing the Mapping Algorithm on 10 H. Sapiens Gene Sets

The algorithm described in Section 6.4.1.2 was used with KEGG pathways for Homo sapiens and Arabidopsis thaliana as fuzzy rule system databases. Usually, the fuzzy rules are set up by domain experts. In our case, the memberships of genes in pathways (the rule base) were determined by biologists and stored in the KEGG database. An alternative way of building the OFRS is to employ an item set (association rules) mining method for finding the rules.

For testing, we used the July 2006 version of the KEGG pathway database for H. sapiens. We tested the algorithm using 10 sets of 15 genes each, randomly selected (without replacement) from KEGG pathways that have more than 50 genes. We imposed this condition to minimize the impact that extracting 5 genes has on the pathway as a whole. We found 23 such pathways out of the m = 181 H. sapiens pathways considered. Each set of genes was extracted from three pathways (5 genes per pathway).
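The construction of one test set might look like the following sketch, which assumes the pathway database is available as a mapping from pathway ID to its set of genes. The function name `sample_test_set`, the dictionary layout, and the toy pathway and gene names are hypothetical. The sketch also does not prevent the same gene from being drawn from two overlapping pathways, which the description above leaves open.

```python
import random


def sample_test_set(pathway_genes, n_pathways=3, genes_per_pathway=5,
                    min_size=50, seed=None):
    """Pick n_pathways pathways with more than min_size genes, then draw
    genes_per_pathway genes from each, without replacement."""
    rng = random.Random(seed)
    large = sorted(p for p, genes in pathway_genes.items() if len(genes) > min_size)
    chosen = rng.sample(large, n_pathways)
    return {p: rng.sample(sorted(pathway_genes[p]), genes_per_pathway) for p in chosen}


# Toy database: five pathways of different sizes.
sizes = [60, 55, 80, 20, 70]
pathway_genes = {
    f"path{k}": {f"gene{k}_{i}" for i in range(size)} for k, size in enumerate(sizes)
}

test_set = sample_test_set(pathway_genes, seed=0)
print(test_set)
```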

The results obtained on the H. sapiens test set are presented in Table 6.2. The prediction was made by considering one candidate pathway (the one that had the maximum activation sum) per cluster and using a feature threshold of T = 0.5.

As we can see from Table 6.2, over-clustering (as in the sets numbered 2, 3, 8, and 10) leads to an increase in false predictions. Clusters that predict the same pathway could sometimes be merged. However, we leave pruning strategies for further research.

We note that predicting the right pathways (as for set 6) does not necessarily mean that we assigned all the genes to the correct pathways in the process. For example, in set 6, we assigned only 13 out of 15 genes (87%) to the correct pathways. In Figure 6.6, we show the gene-similarity matrix computed using (6.9) and the pathway features for set 6.

We see that genes 4 and 12 (circled) exhibit more similarity to the genes from pathway 2 (gene index 6–10) than to their own pathways (gene index 1–5, and gene index 11–15, respectively). On average, we predicted about 45% of the genes in the right pathway.

Table 6.2 Pathway Prediction Results for 10 H. Sapiens Test Gene Sets Using One Candidate Pathway per Cluster

| Set # | # Pathways predicted, C (out of 3) | DR   | FPR  | # Genes in the correct pathway (out of 15) |
|-------|------------------------------------|------|------|--------------------------------------------|
| 1     | 3                                  | 0.67 | 0.33 | 9                                          |
| 2     | 5                                  | 0.67 | 0.60 | 4                                          |
| 3     | 5                                  | 1.00 | 0.40 | 7                                          |
| 4     | 3                                  | 0.67 | 0.33 | 10                                         |
| 5     | 3                                  | 0.67 | 0.33 | 5                                          |
| 6     | 3                                  | 1.00 | 0    | 13                                         |
| 7     | 3                                  | 0.33 | 0.67 | 3                                          |
| 8     | 4                                  | 0.33 | 0.75 | 2                                          |
| 9     | 3                                  | 0.67 | 0.33 | 9                                          |
| 10    | 4                                  | 0.67 | 0.50 | 5                                          |
| Mean  |                                    | 0.66 | 0.43 | 6.7                                        |

Figure 6.6 Similarity matrix for the 15 genes selected in case 6 from Table 6.2. Genes 4 and 12 (circled) will be erroneously grouped by fuzzy C-means in pathway 2 (indices 6–10), instead of pathways 1 (indices 1–5) and 3 (indices 11–15), respectively.

6.4.1.4 Predicting the Pathways Involved in an Arabidopsis Thaliana Microarray Dataset

The pilot dataset used for further testing of our method consisted of 526 A. thaliana genes selected in a microarray experiment. In this experiment, we considered m = 115 pathways from the July 2006 KEGG version. Out of 526 genes in the input set, we found only 438 to be annotated using a GO term. Since we did not use any automated annotation software in this work, we removed the 88 unannotated genes from the experiment. To determine the most probable number of clusters, we used the partition coefficient [10], which resulted in C = 8 groups of genes. In Table 6.3, we show the KEGG IDs for the three representative pathways found for each of the 8 clusters.

We see that most of the clusters are coherent; that is, the pathway candidates for a cluster are very similar. For example, cluster 1 has 7 genes, and the candidate pathways are oxidative phosphorylation (190), ATP synthesis (193), and photosynthesis (195) (which are obviously related, since 193 is included in 190, and 195 and 190 are both related to the energy metabolism). Similarly, cluster 5 has 25 genes, and the candidate pathways are DNA polymerase (3030), transcription factor (3022), and ribosome (3010), which are all involved in the DNA replication process. Finally, cluster 8 has 69 genes involved in valine, leucine, and isoleucine degradation (280) and biosynthesis (290).

The similarity matrix for the 438 genes is shown in Figure 6.7.

In Figure 6.7, we can distinguish the 8 clusters described in Table 6.3. Furthermore, by inspecting Figure 6.7 more carefully, we observe that the genes (circled) from cluster 4 (around index 200) and from cluster 7 (around index 350) seem to be highly similar. Table 6.3 confirms this observation, since they share the second pathway candidate: sphingolipid metabolism (KEGG ID 600).

Although this method gave encouraging results for our pilot dataset, it has two potential problems that derive from the fact that it maps one gene at a time. First, by mapping one gene at a time, it does not consider the dependencies between the genes in a pathway. Second, mapping one gene at a time results in a low signal-to-noise ratio, due to the noise produced by the gene's similarity to various genes other than itself. Consequently, a better approach would be to map groups of genes at a time.

Since it is impossible to know a priori the grouping of the genes, this approach relies on an evolutionary strategy for estimating the number of pathways and their gene memberships. We describe an evolutionary approach for pathway estimation, based on an ontological fuzzy rule system, in Section 6.4.2.

Figure 6.7 The pathway similarity matrix between the 438 A. thaliana genes. The matrix has been rearranged, using the clusters obtained by applying fuzzy C-means on the initial similarity matrix.

6.4.2 Mapping Genes to Pathways Using an OFRS in an Evolutionary
