

From Advanced Information and Knowledge Processing, pages 182–190.

Kai Huang and Robert F. Murphy

8.2.5 Unsupervised Clustering for Location Proteomics

Data clustering is an important tool for studying unlabeled data. It typically segments the data into subgroups by introducing a distance or similarity measure so that examples in the same subgroup are more similar to one another than to examples in other subgroups. The most trivial segmentation is to isolate each data point as its own cluster, which gives the best within-group similarity. To avoid such trivial segmentations, clustering algorithms often introduce constraints, such as a minimum cluster size or a maximum number of clusters. Given a distance metric, data clustering can also be regarded as a graph partitioning problem in which each node represents a data point and each edge weight represents the distance between two nodes. Various graph partitioning criteria have been proposed, such as min-cut [435], average-cut [346], and normalized-cut [366].

Table 8.17. Most distinguishable features evaluated by a univariate t-test on each feature in SLF6 from comparing two image sets: giantin and Gpp130. Data from reference [335].

    Feature                                                  Confidence level at which
                                                             feature differs (%)
    Eccentricity of the ellipse equivalent to the
      protein image convex hull                              99.99999
    Convex hull roundness                                    99.9999
    Measure edge direction homogeneity 1                     99.9873
    Average object size                                      99.9873
    Average object distance to the center of fluorescence    99.9873
    Ratio of largest to smallest object-to-image-center-
      of-fluorescence distance                               99.9873

The organization of clusters can be either flat or hierarchical.

Nonhierarchical clustering algorithms such as K-means and Gaussian mixture models partition the data into separate subgroups without any hierarchical structure. The total number of clusters in these algorithms is either fixed by the user or determined by some statistical criterion. In contrast, hierarchical clustering algorithms require only the definition of a metric.

Agglomerative algorithms start from the individual data points, merge the most similar clusters at each level, and finally reach a single root cluster. Divisive algorithms go in the opposite direction, splitting a single cluster until individual points are reached. The advantage of a hierarchical structure is that a different number of clusters can be obtained at each level, so the optimal number of clusters can be chosen intuitively.
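The agglomerative scheme just described can be sketched with standard SciPy routines; the three toy point clouds here are illustrative, not data from the study. Cutting the merge tree at different levels yields different numbers of clusters, as noted above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D points forming three well-separated groups.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.1, size=(5, 2)),
    rng.normal(loc=(5, 0), scale=0.1, size=(5, 2)),
    rng.normal(loc=(0, 5), scale=0.1, size=(5, 2)),
])

# Average-linkage agglomerative clustering: repeatedly merge the two
# closest clusters until a single root remains.
Z = linkage(points, method="average", metric="euclidean")

# Cutting the tree at different heights yields different numbers of
# clusters -- the flexibility of the hierarchical structure.
labels_coarse = fcluster(Z, t=2, criterion="maxclust")
labels_fine = fcluster(Z, t=3, criterion="maxclust")
```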

The subcellular location features have been shown to measure the similarity of protein subcellular location patterns. They define a metric space for clustering protein fluorescence microscope images according to their location similarity. In this section, we describe how these features are used to build a subcellular location tree (SLT), a major goal of location proteomics.

182 Data Mining in Bioinformatics

Clustering the 10-Class 2D HeLa Images

Biologists have long studied tree structures of protein families and the evolution of organisms. Trees provide a clear view of how the elements are related to one another. Just as sequence similarity provides the metric for a phylogenetic tree, the location similarity metric defined by the SLF features can be used to create a subcellular location tree for different proteins.

As a first attempt, we applied an average linkage agglomerative hierarchical clustering algorithm to create a dendrogram (subcellular location tree) for the 10 protein subcellular location patterns in the 2D HeLa image set [289].

The feature set we used was SLF8, containing 32 features. For each class of images, we calculated its mean feature vector and the feature covariance matrix. The distance between each class pair was the Mahalanobis distance between the two mean feature vectors. A dendrogram was then created based on the Mahalanobis distances by an agglomerative clustering method (Figure 8.9). As expected, both giantin and Gpp130 were grouped first followed by the lysosome and endosome patterns. The grouping of tubulin and the lysosome and endosome patterns also agrees with biological knowledge in that both lysosomes and endosomes are thought to be involved in membrane trafficking along microtubules.
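The class-level distance computation can be sketched as follows. The text does not specify how the per-class covariance matrices were combined, so a pooled covariance is assumed here, and the toy feature matrices stand in for the real per-class SLF8 data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def class_mahalanobis_matrix(class_features):
    """Pairwise Mahalanobis distances between class mean feature vectors.

    `class_features` maps a class name to an (n_images, n_features)
    array.  A pooled within-class covariance is an assumption of this
    sketch, not necessarily what the original study used.
    """
    names = sorted(class_features)
    means = {c: class_features[c].mean(axis=0) for c in names}
    pooled = sum((class_features[c] - means[c]).T @ (class_features[c] - means[c])
                 for c in names)
    pooled /= sum(len(class_features[c]) for c in names) - len(names)
    inv = np.linalg.pinv(pooled)
    n = len(names)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = means[names[i]] - means[names[j]]
            D[i, j] = D[j, i] = float(np.sqrt(d @ inv @ d))
    return names, D

# Toy stand-ins: two similar patterns and one distant one.
rng = np.random.default_rng(1)
data = {"giantin": rng.normal(0.0, 1.0, (20, 5)),
        "gpp130": rng.normal(0.2, 1.0, (20, 5)),
        "tubulin": rng.normal(3.0, 1.0, (20, 5))}
names, D = class_mahalanobis_matrix(data)
# Average-linkage dendrogram over the class-to-class distances.
Z = linkage(squareform(D), method="average")
```

As in the text, similar patterns (here the stand-ins for giantin and Gpp130) end up closer in the distance matrix and are therefore merged first by average linkage.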

Clustering the 3D 3T3 Image Set

Instead of clustering only 10 protein subcellular location patterns, we can imagine applying the algorithm used earlier to cluster all proteins expressed in a given cell type. As mentioned before, the 3D 3T3 image set was created by the CD tagging project [200, 201, 396], whose goal is to tag all expressed proteins in this cell type. Since the project is ongoing, we applied our clustering method on an early version of the CD-tagging protein database containing 46 different proteins [73].

The number of 3D images in each of the 46 clones ranges from 16 to 33, which makes the effect of a single outlier noticeable. To obtain a robust estimate of the mean feature vector for each class, we first removed outliers from the 3D 3T3 image set. Either a Q-test (when a clone has fewer than 10 cells) or a univariate t-test (more than 10 cells per clone) was carried out on each feature. A cell was regarded as an outlier if any one of its features failed the test. After outlier removal, we were left with 660 full cell images for the 46 clones, each of which has 9 to 22 cells.
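A simplified sketch of this per-feature outlier screen is shown below. The critical values are illustrative assumptions, not those of the original study, and the large-clone branch uses a z-score threshold as a t-like stand-in.

```python
import numpy as np

def outlier_cells(features, q_crit=0.41, t_crit=3.0):
    """Flag outlier cells in an (n_cells, n_features) matrix.

    Small clones use a Dixon-style Q statistic; larger clones use a
    t-like z-score threshold.  A cell is flagged if any single feature
    fails its test, mirroring the rule described in the text.
    """
    n, _ = features.shape
    flagged = np.zeros(n, dtype=bool)
    for col in features.T:
        order = np.argsort(col)
        span = col[order[-1]] - col[order[0]]
        if span == 0:
            continue
        if n < 10:
            # Dixon's Q: gap between the suspect value and its nearest
            # neighbour, divided by the full range.
            if (col[order[1]] - col[order[0]]) / span > q_crit:
                flagged[order[0]] = True
            if (col[order[-1]] - col[order[-2]]) / span > q_crit:
                flagged[order[-1]] = True
        else:
            z = np.abs(col - col.mean()) / col.std(ddof=1)
            flagged |= z > t_crit
    return flagged

# A clone of 12 cells with one extreme value in the first feature.
cells = np.tile([1.0, 2.0], (12, 1))
cells += np.linspace(-0.1, 0.1, 12)[:, None]
cells[0, 0] = 50.0
mask = outlier_cells(cells)
```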

Fig. 8.9. Subcellular location tree created by an average linkage agglomerative hierarchical clustering algorithm for the 10 protein subcellular location patterns from the 2D HeLa dataset. From reference [289].

We first clustered the 46 clones by using the 14 SLF9 features that do not require a DNA label. All 14 features were z-scored across all clones so that they had zero mean and unit variance. Euclidean distances computed from pairs of class mean feature vectors were employed as the distance metric in an agglomerative clustering method to create the dendrogram shown in Figure 8.10. There are two nuclear protein clusters in the tree: Hmga1-2, Hmga1-1, Unknown-9, Hmgn2-1, and Unknown-8; and Ewsh, Unknown-11, and SimilarToSiahbp1. Two representative images were selected from these two clusters (Figure 8.11). One cluster is localized exclusively in the nucleus, while the other shows some cytoplasmic distribution outside the nucleus, which makes the two clusters distinguishable.

To select the optimal number of clusters from Figure 8.10, we applied a neural network classifier with one hidden layer of 20 hidden nodes to classify the 46 clones using the 14 SLF9 features. The average recall over 20 cross-validation trials was 40%, which indicated that many clones were hardly distinguishable. To choose a cutting threshold for the tree, we examined the confusion matrix of the classifier and found that clones separated by less than 2.8 in z-scored Euclidean distance could hardly be distinguished by the classifier.
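Reading confusable clone pairs off a confusion matrix can be sketched as follows; the 0.25 confusion-rate threshold and the toy labels are illustrative assumptions, not values from the study.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Confusion matrix: rows are true classes, columns predictions."""
    M = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        M[t, p] += 1
    return M

def confusable_pairs(M, threshold=0.25):
    """Class pairs whose mutual confusion rate exceeds `threshold`.

    The rate for pair (i, j) is the fraction of examples of either
    class that were predicted as the other class.
    """
    pairs = []
    n = M.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            off_diag = M[i, j] + M[j, i]
            total = M[i].sum() + M[j].sum()
            if total and off_diag / total > threshold:
                pairs.append((i, j))
    return pairs

# Toy example: classes 0 and 1 are heavily confused, class 2 is not.
y_true = [0] * 10 + [1] * 10 + [2] * 10
y_pred = [0] * 6 + [1] * 4 + [1] * 5 + [0] * 5 + [2] * 10
M = confusion_matrix(y_true, y_pred, 3)
print(confusable_pairs(M))  # → [(0, 1)]
```

In the study, pairs flagged this way were then related back to their separation height in the dendrogram to pick the cutting threshold.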

With a cutting threshold of 2.8, the tree shown in Figure 8.10 reduces to 12 clusters, which is consistent with the result obtained by applying the K-means algorithm and the Akaike information criterion to the same data. After grouping the images from the 46 clones into the new 12 clusters, the same neural network gave an average accuracy of 71% across the 12 classes.
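Cutting a dendrogram at a fixed distance threshold corresponds to SciPy's `fcluster` with the `"distance"` criterion. The clone feature vectors below are synthetic stand-ins (four tight groups in 14 dimensions), not the real 3T3 data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical z-scored mean feature vectors: 4 groups of 3 "clones"
# in a 14-dimensional feature space (the SLF9 subset had 14 features).
rng = np.random.default_rng(2)
centers = np.repeat(rng.normal(0, 5, (4, 14)), 3, axis=0)
clone_means = centers + rng.normal(0, 0.3, centers.shape)

Z = linkage(clone_means, method="average", metric="euclidean")

# Cutting at a fixed height merges clones whose separation falls below
# the threshold; 2.8 was the value chosen in the text from the
# classifier's confusion matrix.
labels = fcluster(Z, t=2.8, criterion="distance")
n_clusters = len(set(labels))
```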


Fig. 8.10. Subcellular location tree created for 46 clones from the 3D 3T3 collection by using 14 of the SLF9 features. The protein name (if known) is shown for each clone, followed by the presumed location pattern from the relevant literature. Independently derived clones in which the same protein was tagged are shown with a hyphen followed by a number (e.g., Hmga1-2 is clone 2 tagged in Hmga1). From reference [73].

Fig. 8.11. Two representative images selected from the two nuclear protein clusters shown in Figure 8.10. (A) Hmga1-1. (B) Unknown-11. From reference [73].

The second feature set we used for clustering the 46 clones was the 42-dimensional SLF11, which contains the 14 SLF9 features that do not require a DNA label, 2 edge features, and 26 3D Haralick texture features. Just as in supervised learning, data clustering can benefit from feature selection.

Therefore, we employed stepwise discriminant analysis (SDA) coupled with the same neural network classifier used earlier to select the features from SLF11 that best distinguish the 46 classes. Figure 8.12 shows the average classification results for sequential inclusion of features ranked by SDA. The first 14 features ranked by SDA give 70% average accuracy, while a comparable 68% can be achieved using only the first 10 features.
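The sequential-inclusion idea can be sketched with a greedy forward search. Note this is only a stand-in: real SDA ranks features with Wilks' lambda F-statistics, whereas this sketch scores candidate subsets with a simple nearest-class-mean classifier, and the toy data are not from the study.

```python
import numpy as np

def greedy_forward_selection(X, y, n_select):
    """Greedy stand-in for stepwise discriminant analysis (SDA).

    At each step, add the feature that most improves a simple
    nearest-class-mean classification accuracy on the training data.
    """
    def accuracy(cols):
        sub = X[:, cols]
        classes = np.unique(y)
        means = np.array([sub[y == c].mean(axis=0) for c in classes])
        d = ((sub[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        return float((classes[d.argmin(axis=1)] == y).mean())

    chosen = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_select):
        best = max(remaining, key=lambda f: accuracy(chosen + [f]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data: only features 0 and 1 carry class information.
rng = np.random.default_rng(3)
y = np.repeat([0, 1, 2], 30)
X = rng.normal(0, 1, (90, 6))
X[:, 0] += (y > 0) * 4.0   # separates class 0 from classes 1 and 2
X[:, 1] += (y == 2) * 4.0  # separates class 2 from the rest
ranked = greedy_forward_selection(X, y, n_select=2)
```

As in Figure 8.12, accuracy rises quickly while informative features are added and plateaus once the remaining features are redundant or noisy.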

Fig. 8.12. Determination of minimum number of features for adequate discrimination of the 3D 3T3 clones. The average performance of a neural network classifier after 20 cross validation trials is shown as a function of the number of features used from those selected by stepwise discriminant analysis (SDA). From reference [73].

Using the top 10 features selected from SLF11 by SDA, we ran the same clustering algorithm on the 46 3T3 clones. Figure 8.13 shows the new tree generated by clustering the same data with the new features. The previous two nuclear protein clusters remained mostly the same. The new clone added to the second cluster, Unknown-7, has a hybrid location pattern spanning both nucleus and cytoplasm, which agrees with our earlier observation of the distinction between the two clusters.

The tree created by clustering 3T3 clones from the CD tagging project provides a systematic representation of the observed subcellular location patterns. By examining the tree, the characteristics of an unknown protein may be deduced from nearby proteins with known functions and similar location patterns. The sensitivity of the subcellular location features ensures that patterns with subtle location differences remain distinct and separated in the tree.

Fig. 8.13. Subcellular location tree created for 46 clones from the 3D 3T3 set by using the top 10 features selected from SLF11 by SDA. The results shown are for the same clones as in Figure 8.10. From reference [73].

8.3 Conclusion

In this chapter, we have described the intensive application of machine learning methods to a novel problem in biology. The successful application of supervised learning, statistical analysis, and unsupervised clustering all depended on informative features that were able to capture the essence of protein subcellular location patterns in fluorescence microscope images.

Our automatic image interpretation system, coupled with high-throughput random-tagging and imaging techniques, provides a promising and feasible capability for decoding the subcellular location patterns of all proteins in a given cell type, an approach we have termed “location proteomics.” The systematic approach we have described is also adaptable to other data mining areas in bioinformatics, in which a successful system should address all aspects of a learning problem, such as feature design and extraction, feature selection, classifier choice, statistical analysis, and unsupervised clustering.

Acknowledgments

The original research described in this chapter was supported in part by research grant RPG-95-099-03-MGO from the American Cancer Society, by NSF grants BIR-9217091, MCB-8920118, and BIR-9256343, by NIH grants R01 GM068845 and R33 CA83219, and by a research grant from the Commonwealth of Pennsylvania Tobacco Settlement Fund. K.H. was supported by a Graduate Fellowship from the Merck Computational Biology and Chemistry Program at Carnegie Mellon University established by the Merck Company Foundation.
