GPD 194 Analysis Using OSOM - Examples of NERFCM, CCV, and OSOM Applications

Clustering with Ontologies

3.5 Examples of NERFCM, CCV, and OSOM Applications

3.5.4 GPD 194 Analysis Using OSOM

We apply our ontological self-organizing map (OSOM) to produce a cluster visualization and a functional summarization of the GPD₁₉₄ dataset.

3.5.4.1 GPD194 Visualization Using OSOM

We applied the OSOM algorithm described in Section 3.4 using a toroidal grid-based network with P = 400 neurons (a 20 × 20 matrix). The learning rates are {ε0 = 0.5, εf = 0.005}, the radii of the lateral influence function in (3.10) are {σ0 = 3.0, σf = 0.1}, and the maximum number of iterations is t_max = 10,000.
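To make the role of these settings concrete, the following minimal sketch shows how the learning rate and the radius might shrink from their initial to their final values over t_max iterations on a 20 × 20 toroidal grid. The exponential decay law and the Gaussian form of the lateral influence are assumptions made for illustration only; the actual schedule and the function in (3.10) are those of Section 3.4. The names decay, toroidal_distance, and lateral_influence are hypothetical.

```python
import math

def decay(start, end, t, t_max):
    """Exponential interpolation from `start` (t = 0) to `end` (t = t_max).
    Assumed schedule; the chapter's own schedule is given in Section 3.4."""
    return start * (end / start) ** (t / t_max)

def toroidal_distance(a, b, n_rows=20, n_cols=20):
    """Euclidean grid distance between nodes a and b with wrap-around edges."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    dr = min(dr, n_rows - dr)  # going "around" the torus may be shorter
    dc = min(dc, n_cols - dc)
    return math.hypot(dr, dc)

def lateral_influence(a, b, sigma):
    """Gaussian neighborhood weight (assumed form, not copied from (3.10))."""
    d = toroidal_distance(a, b)
    return math.exp(-(d * d) / (2.0 * sigma * sigma))

T_MAX = 10_000
eps_mid = decay(0.5, 0.005, T_MAX // 2, T_MAX)    # learning rate halfway through
sigma_mid = decay(3.0, 0.1, T_MAX // 2, T_MAX)    # radius halfway through
```

Under this assumed schedule, nodes on opposite edges of the grid, such as (1, 1) and (20, 20), are treated as close neighbors because the distance wraps around the torus.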

The visualization method maps the gene-product profiles (the OSOM prototypes) of the OSOM network to the nodes of the two-dimensional toroidal grid (see Figure 3.5).

Figure 3.4 The five clusters identified in the GPD₁₉₄ dataset by NERFCM.


To show the cluster tendency of gene products, the relations between neighboring gene-product profiles on the grid are displayed as gray levels, with black representing no relation and white representing a high degree of relatedness.

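The visualization method we propose is composed of two distinct steps. (1) The gene products are mapped to the trained OSOM network by the nearest prototype rule: for each gene product x, find the best-matching prototype

$$\mathbf{w}_p = \arg\min_{i \in [1, P]} \{ d(\mathbf{w}_i, \mathbf{x}) \},$$

where d denotes the dissimilarity between a prototype and a gene product used by the OSOM of Section 3.4.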

In this fashion, the node p of the network is associated with the gene product x.

As a result, similar gene products are mapped to groups of similar nodes in the network. (2) The similarity between neighboring OSOM nodes is mapped into a grayscale image, with white showing high similarity and black showing very low similarity [16]. Figure 3.6(a) illustrates this mapping using the AVG dissimilarity operator (3.11) and the MAX update operator (3.13). The white regions correspond to groups of similar gene products, while the black regions show the boundaries between groups that are dissimilar. Note that, due to the toroidal topology of the OSOM network, the top and bottom edges, as well as the left and right sides, wrap around.
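Step (1), the nearest prototype rule, can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used for the GPD₁₉₄ experiments: the squared Euclidean distance is only a placeholder for the ontological dissimilarity of Section 3.4, the toy vectors and the names gp_a, gp_b, and gp_c are invented, and node indices are 0-based here.

```python
def best_matching_node(x, prototypes, dissimilarity):
    """Index p of the prototype with the smallest dissimilarity to x."""
    return min(range(len(prototypes)),
               key=lambda i: dissimilarity(prototypes[i], x))

def map_products_to_nodes(products, prototypes, dissimilarity):
    """Associate every gene product with its best-matching node."""
    node_members = {}
    for name, x in products.items():
        p = best_matching_node(x, prototypes, dissimilarity)
        node_members.setdefault(p, []).append(name)
    return node_members

def squared_euclidean(w, x):
    """Placeholder dissimilarity; NOT the ontological measure of the chapter."""
    return sum((wi - xi) ** 2 for wi, xi in zip(w, x))

# Hypothetical toy data: three prototypes and three gene-product vectors.
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
products = {
    "gp_a": [0.9, 0.1, 0.0],
    "gp_b": [0.8, 0.0, 0.2],
    "gp_c": [0.0, 0.1, 0.9],
}
node_members = map_products_to_nodes(products, prototypes, squared_euclidean)
```

In this toy case gp_a and gp_b land on the same node, which is how similar gene products end up grouped on the map.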

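The similarity between nodes is then calculated by an average operator,

$$S^{(\mathrm{OSOM})}(\mathbf{w}_i, \mathbf{w}_j) = \frac{\mathbf{w}_i^{t}\, \mathbf{M}\, \mathbf{w}_j}{\lVert \mathbf{w}_i \rVert \, \lVert \mathbf{w}_j \rVert} \qquad (3.14)$$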

This similarity is calculated between each node of the OSOM network and its neighbors in the up-down, left-right, and four diagonal directions. Thus, each prototype node has eight surrounding pixels that correspond to its similarity to the neighboring nodes. The grayscale color map is set such that white corresponds to $\max_{\forall i, j}[S^{(\mathrm{OSOM})}(\mathbf{w}_i, \mathbf{w}_j)]$ and black corresponds to $\min_{\forall i, j}[S^{(\mathrm{OSOM})}(\mathbf{w}_i, \mathbf{w}_j)]$ for a given network, where i ∈ [1, N_H], j ∈ [1, N_V], and N_H, N_V are the horizontal and vertical dimensions of the grid, respectively (in our case, N_H = 20, N_V = 20). The color at the node location is interpolated from the eight surrounding pixels.

Figure 3.5 The toroidal grid used in the GPD₁₉₄ OSOM representation.
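The eight-neighbor comparison and the gray-level scaling described above can be sketched as follows. The sketch assumes that white is assigned to the largest and black to the smallest value of the node-to-node measure over the whole network, as stated in the text. The measure itself is passed in as a function, the toy measure and the 3 × 3 grid of scalar "prototypes" are invented for illustration, and the node color is taken as the plain mean of the eight surrounding pixels, which is only one possible interpolation.

```python
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]

def neighbor_values(grid, measure):
    """For every node, the measure to each of its eight toroidal neighbors."""
    n_rows, n_cols = len(grid), len(grid[0])
    result = {}
    for r in range(n_rows):
        for c in range(n_cols):
            result[(r, c)] = [
                measure(grid[r][c],
                        grid[(r + dr) % n_rows][(c + dc) % n_cols])
                for dr, dc in NEIGHBOR_OFFSETS  # modulo gives the wrap-around
            ]
    return result

def to_gray(values_by_node):
    """Rescale so the global minimum is 0.0 (black) and the maximum 1.0 (white)."""
    flat = [v for vals in values_by_node.values() for v in vals]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a uniform map
    return {node: [(v - lo) / span for v in vals]
            for node, vals in values_by_node.items()}

def node_gray(gray_by_node):
    """Node color as the mean of its eight pixels (assumed interpolation)."""
    return {node: sum(vals) / len(vals) for node, vals in gray_by_node.items()}

# Hypothetical toy map: two similar columns and one dissimilar column.
toy_grid = [[0.0, 0.0, 5.0],
            [0.0, 0.0, 5.0],
            [0.0, 0.0, 5.0]]
toy_measure = lambda a, b: 1.0 / (1.0 + abs(a - b))  # invented similarity
pixels = to_gray(neighbor_values(toy_grid, toy_measure))
colors = node_gray(pixels)
```

On the toy map the nodes in the dissimilar column come out darker than the others, which mirrors how boundaries between groups appear as dark regions.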

As a result of this coloring method, regions that are lightly colored represent groups of similar gene products, while darker regions signify outliers or gene products that are dissimilar to the surrounding groups. In addition, the degree of similarity can be seen in the intensity of the regions. For example, in Figure 3.6(a), the light region on the right is a highly similar group, while the grayer regions signify similarity to a lesser degree, and the black regions denote boundaries between dissimilar groups of gene products. For comparison, Figure 3.6(b) shows the same map obtained using the regular SOM, that is, the SOM where no ontological similarity was used.

The three GPD₁₉₄ families can be seen in Figure 3.6(a) as light-colored islands. The collagen alpha chains are located in the top-left and bottom-left (recall that the grid is toroidal; hence, these two regions are actually connected). The myotubularins are located at the top-right and bottom-right. Lastly, the receptor precursors, which are the most tightly grouped gene products (they are mapped to a bright region), are located at the right-middle of the image. We note that the TEK gene was mapped to two nodes, (10, 3) and (19, 10). This occurred because, in this version of the GO annotations, the gene product mapped to node (10, 3) had the wrong annotation. In contrast, each family is broken into two to four pieces in the SOM map, as shown in Figure 3.6(b).

3.5.4.2 Functional Summarization of Gene Product Clusters

Functional summarization of the gene-product profiles is achieved by examining the OSOM prototype weight vectors. The ontological content of each OSOM prototype is represented by a vector, as discussed in Section 3.4. Each element of the prototype vector can be viewed as the influence of a specific GO annotation in defining the profile of its associated OSOM node. Thus, high values in a prototype vector signify a high likelihood that the gene products mapped to that location in the OSOM are annotated by that specific term or by a term that is very similar, according to the specified term-based dissimilarity measure. We define the most representative term (MRT) of a gene-product profile as the term that has the highest associated weight in the OSOM prototype vector.
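The MRT definition amounts to an arg max over the prototype vector, as the short sketch below illustrates. The GO identifiers are taken from Table 3.2, but the weights are invented for the example and the function name most_representative_term is hypothetical.

```python
def most_representative_term(prototype, go_terms):
    """Return the GO term with the highest weight in the prototype vector,
    together with that weight."""
    best = max(range(len(prototype)), key=lambda k: prototype[k])
    return go_terms[best], prototype[best]

# One GO term per prototype element; the weights below are made up.
go_terms = ["GO:0006470", "GO:0005201", "GO:0016301"]
prototype = [0.12, 0.81, 0.07]
mrt, weight = most_representative_term(prototype, go_terms)
```

Applying the same function to the prototype of every occupied node yields a table of the kind shown in Table 3.2.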

The strength of the OSOM visualization method is that it shows the overall dissimilarity of the genes as seen by the three distinct islands, which represent the three families. However, groups are mapped to different locations due to minor differences in their ontological data. In Table 3.2, we present the MRTs for the entire trained OSOM network, as shown in Figure 3.6(a).

The terms in Table 3.2 represent a functional summarization of all the gene-product groups present in the GPD₁₉₄ dataset. The dataset has been summarized using the following eight GO terms: protein amino acid dephosphorylation, extracellular matrix structural constituent, kinase activity, receptor activity, protein-tyrosine kinase activity, ATP binding, cell adhesion, and collagen type IV. The gene summarization was thus performed using only 8 of the 64 GO terms used in the annotation of the GPD₁₉₄ dataset.


3.6 Conclusion

In this chapter, we presented several algorithms that use ontologies. NERFCM, a fuzzy relational clustering algorithm, can be used to cluster objects described by ontology terms. The dissimilarity between objects can be computed as in Chapter 2, but also with other distance measures that can deal with multiple variable types (see examples in [8, 38]). The resulting fuzzy cluster memberships can be used in automatic ontology annotation based on the guilt-by-association paradigm or in data summarization (see [27, 29] and Chapter 8 for more examples). Related to NERFCM, we presented CCV, a cluster-validity measure for relational datasets. It, too, can be used in data summarization.

Figure 3.6 The OSOM map (a) and standard SOM map (b) for the GPD₁₉₄ dataset.

Last, we presented OSOM, a version of the well-known self-organizing map (SOM) algorithm that was modified to include Gene Ontology term-dissimilarity information.

We believe that the inclusion of ontological information in existing clustering algorithms can lead to new knowledge-discovery tools that are able to reveal new facets of the represented objects.

References

[1] Altschul, S. F., et al., “Basic Local Alignment Search Tool,” J Mol Biol, Vol. 215, No. 3, 1990, pp. 403–410.

[2] Bellazzi, R., and B. Zupan, “Towards Knowledge-Based Gene Expression Data Mining,” J. of Biomedical Informatics, Vol. 40, 2007, pp. 787–802.

[3] Ben-Dor, A., and Z. Yakhini, “Clustering Gene Expression Patterns,” J. of Computational Biology, Vol. 6, 1999, pp. 281–297.

[4] Bezdek, J. C., Pattern Recognition with Fuzzy Objective Function Algorithms, New York: Plenum, 1981, p. 272.

[5] Bezdek, J. C., and R. J. Hathaway, “VAT: A Tool for Visual Assessment of (Cluster) Tendency,” Proc. IJCNN 2002, HI, May 12–17, 2002, pp. 2225–2230.

Table 3.2 Most Representative Terms of the OSOM Network Shown in Figure 3.6

OSOM Index     GO ID         GO Definition
M:(17, 1)      GO:0006470    Protein amino acid dephosphorylation
C:(2, 3)       GO:0005201    Extracellular matrix structural constituent
R:(10, 3)      GO:0016301    Kinase activity
M:(16, 4)      GO:0006470    Protein amino acid dephosphorylation
C:(1, 5)       GO:0005201    Extracellular matrix structural constituent
R:(17, 7)      GO:0004872    Receptor activity
R:(16, 8)      GO:0004713    Protein-tyrosine kinase activity
R:(19, 10)     GO:0005524    ATP binding
M:(13, 15)     GO:0006470    Protein amino acid dephosphorylation
C:(10, 16)     GO:0005201    Extracellular matrix structural constituent
C:(6, 17)      GO:0005201    Extracellular matrix structural constituent
C:(6, 18)      GO:0005201    Extracellular matrix structural constituent
M:(14, 18)     GO:0006470    Protein amino acid dephosphorylation
C:(6, 19)      GO:0007155    Cell adhesion
C:(6, 20)      GO:0005587    Collagen type IV

Note: (M) myotubularin, (R) receptor precursor, (C) collagen.


[6] Bezdek, J. C., et al., Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, Boston: Springer, 1999, p. 796.

[7] Bolshakova, N., F. Azuaje, and P. Cunningham, “A Knowledge-Driven Approach to Cluster Validity Assessment,” Bioinformatics, Vol. 21, No. 10, 2005, pp. 2546–2547.

[8] Boriah, S., V. Chandola, and V. Kumar, “Similarity Measures for Categorical Data: A Comparative Evaluation,” SIAM 2008, pp. 243–254.

[9] Enright, A. J., S. Van Dongen, and C. A. Ouzounis, “An Efficient Algorithm for Large-Scale Detection of Protein Families,” Nucleic Acids Res, Vol. 30, No. 7, 2002, pp. 1575–1584.

[10] Frey, B. J., and D. Dueck, “Clustering by Passing Messages Between Data Points,” Science, Vol. 315, No. 5814, 2007, pp. 972–976.

[11] Handl, J., J. Knowles, and D. B. Kell, “Computational Cluster Validation in Post-Genomic Data Analysis,” Bioinformatics, Vol. 21, No. 15, 2005, pp. 3201–3212.

[12] Hathaway, R. J., and J. C. Bezdek, “NERF c-Means: Non-Euclidean Relational Fuzzy Clustering,” Pattern Recognition, Vol. 27, No. 3, 1994, pp. 429–437.

[13] Havens, T. C., et al., “Ontological Self-Organizing Maps for Cluster Visualization and Functional Summarization of Gene Products Using Gene Ontology Similarity Measures,” World Congress on Computational Intelligence, WCCI2008, Hong Kong, June 1–6, 2008, pp. 104–109.

[14] Henegar, C., et al., “Clustering Biological Annotations and Gene Expression Data to Identify Putatively Co-Regulated Biological Processes,” J. Bioinf. Comp. Biol., Vol. 4, No. 4, August 2006, pp. 833–852.

[15] Huang, D., and W. Pan, “Incorporating Biological Knowledge into Distance Based Clustering Analysis of Microarray Gene Expression Data,” Bioinformatics, Vol. 22, 2006, pp. 1259–1268.

[16] Kaski, S., and T. Kohonen, “Exploratory Data Analysis by the Self-Organizing Map: Structures of Welfare and Poverty in the World,” Neural Networks in Financial Engineering, Proc. 3rd Int. Conf. on Neural Networks in the Capital Markets, P. N. Refenes, et al., (eds.), London, Singapore: World Scientific, 1996, pp. 498–507.

[17] Klir, G. J., and B. Yuan, Fuzzy Sets and Fuzzy Logic: Theory and Applications, Upper Saddle River, New Jersey: Prentice Hall, 1995, p. 574.

[18] Kohonen, T., “Self-Organized Formation of Topologically Correct Feature Maps,” Biological Cybernetics, Vol. 43, 1982, pp. 59–69.

[19] Kohonen, T., “Self-Organizing Maps,” Proc. IEEE, Vol. 78, No. 9, September 1990, pp. 1464–1480.

[20] Kohonen, T., “Self-Organizing Maps,” Information Sciences, Vol. 30, 2004.

[21] Kustra, R., and A. Zagdanski, “Incorporating Gene Ontology in Clustering Gene Expression Data,” 19th IEEE Symp. on Computer-Based Medical Systems, IEEE Computer Society, 2006, pp. 555–563.

[22] Myllyharju, J., and K. Kivirikko, “Collagens, Modifying Enzymes, and Their Mutation in Humans, Flies, and Worms,” Trends in Genetics, Vol. 20, No. 1, 2004, pp. 33–43.

[23] Martin, D., et al., “GOToolBox: Functional Analysis of Gene Datasets Based on Gene Ontology,” Genome Biol., Vol. 5, No. 12, 2004, p. R101.

[24] Melton, G. B., et al., “Inter-Patient Distance Metrics Using SNOMED CT Defining Relationships,” J. of Biomedical Informatics, Vol. 39, No. 6, December 2006, pp. 697–705.

[25] Pal, N., et al., “Gene Ontology-Based Knowledge Discovery Through Fuzzy Cluster Analysis,” Neural, Parallel and Scientific Computation, Vol. 13, Nos. 3–4, 2005, pp. 337–361.

[26] Pedersen, T., et al., “Measures of Semantic Similarity and Relatedness in the Biomedical Domain,” J. of Biomedical Informatics, Vol. 40, 2007, pp. 288–299.

[27] Popescu, M., et al., “Functional Summarization of Gene Product Clusters Using Gene Ontology Similarity Measures,” in Proc. 2004 ISSNIP, M. Palaniswami, et al., (eds.), Piscataway, New Jersey: IEEE Press, 2004, pp. 553–559.

[28] Popescu, M., and J. Arthur, “OntoQuest: A Physician Decision Support System Based on Ontological Queries of the Hospital Database,” Proc. AMIA Fall Symp., Washington, D.C., November 2006, pp. 639–643.

[29] Popescu, M., and J. Keller, “Summarization of Patient Groups Using the Fuzzy C-Means and ICD-9 Ontology Dissimilarity Measures,” IEEE World Congress on Computational Intelligence, Vancouver, Canada, July 16–21, 2006, pp. 2998–3003.

[30] Popescu, M., J. M. Keller, and J. A. Mitchell, “Fuzzy Measures on the Gene Ontology for Gene Product Similarity,” IEEE Trans. Computational Biology and Bioinformatics, Vol. 3, No. 3, July–September 2006, pp. 1–11.

[31] Popescu, M., et al., “A New Cluster Validity Measure for Bioinformatics Relational Datasets,” World Congress on Computational Intelligence, WCCI2008, Hong Kong, June 1–6, 2008, pp. 726–731.

[32] Runkler, T., “Relational Fuzzy Clustering,” Advances in Fuzzy Clustering and Its Applications, J. V. de Oliveira and W. Pedrycz (eds.), New York: John Wiley & Sons, Ltd., 2007, pp. 31–51.

[33] Schlicker, A., et al., “A New Measure for Functional Similarity of Gene Products Based on Gene Ontology,” BMC Bioinformatics, Vol. 7, No. 302, 2006, pp. 1–16.

[34] Speer, N., C. Spieth, and A. Zell, “Functional Grouping of Genes Using Spectral Clustering and Gene Ontology,” in Proc. of the IEEE Int. Joint Conf. on Neural Networks (IJCNN 2005), Piscataway, New Jersey: IEEE Press, 2005, pp. 298–303.

[35] Speer, N., C. Spieth, and A. Zell, “A Memetic Clustering Algorithm for the Functional Partition of Genes Based on the Gene Ontology,” Proc. of the 2004 IEEE Symp. on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2004), Piscataway, New Jersey: IEEE Press, 2004, pp. 252–259.

[36] Wang, H., et al., “Gene Expression Correlation and Gene Ontology-Based Similarity: An Assessment of Quantitative Relationships,” Proc. of the 2004 IEEE Symp. on Computational Intelligence in Bioinformatics and Computational Biology, La Jolla, CA, October 7–8, 2004, Piscataway, New Jersey: IEEE Press, 2004, pp. 25–31.

[37] Wang, J. Z., et al., “A New Method to Measure the Semantic Similarity of GO Terms,” Bioinformatics, Vol. 23, No. 10, 2007, pp. 1274–1281.

[38] Wilson, R., and T. Martinez, “Improved Heterogeneous Distance Functions,” JAIR, Vol. 6, 1997, pp. 1–34.


In the document Data Mining in Biomedicine Using Ontologies (pages 73-80)