Statistical Analysis for Image Sets - Kai Huang and Robert F. Murphy


8.2.4 Statistical Analysis for Image Sets

The high accuracy achieved in supervised learning of protein subcellular location patterns shows that the subcellular location features (SLF) are good descriptors of protein fluorescence microscope images. This finding lends strong support to applying the features in other tasks, such as hypothesis tests on image sets. Biologists often want statistical analysis of image sets in order to interpret and compare experimental results quantitatively. This section describes two such analyses: objective selection of the most representative microscope image from a set [270], and objective comparison of protein subcellular distributions between two image sets [335].

Objective Selection of Representative Microscope Images

Current fluorescence microscopy techniques allow biologists to routinely take many images in an experiment. However, only a few images can be included in a report, which forces biologists to select representative images to illustrate their experimental results. Prior to the work described here, no objective selection method was available to authors, and readers of an article would typically have little information about the criteria that were used by the authors to select published images.

To address this situation, we have described a method in which each protein fluorescence microscope image is represented by SLF features, and distance metrics are defined on the feature space to quantify image similarity [270]. The most representative image is the one closest to the centroid of the image set in the feature space (the centroid being the mean feature vector of the set), and all other images are ranked by their distance from this centroid.
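The ranking just described can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in [270]: the function name `rank_by_typicality`, the use of NumPy, and the pseudo-inverse used to tolerate a singular covariance matrix are all assumptions made for the example.

```python
import numpy as np

def rank_by_typicality(features, metric="mahalanobis"):
    """Rank images from most to least typical.

    features: (n_images, n_features) array, one feature vector per image.
    Returns (order, distances); order[0] is the most representative image.
    """
    X = np.asarray(features, dtype=float)
    centroid = X.mean(axis=0)          # mean feature vector of the set
    diff = X - centroid
    if metric == "euclidean":
        d2 = np.sum(diff ** 2, axis=1)
    elif metric == "mahalanobis":
        cov = np.atleast_2d(np.cov(X, rowvar=False))
        # Assumption: pinv is used so that a singular covariance
        # (e.g. fewer images than features) does not raise an error.
        cov_inv = np.linalg.pinv(cov)
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    else:
        raise ValueError("metric must be 'euclidean' or 'mahalanobis'")
    distances = np.sqrt(np.maximum(d2, 0.0))
    order = np.argsort(distances, kind="stable")
    return order, distances

if __name__ == "__main__":
    # Synthetic stand-in for a feature matrix (20 images, 3 features).
    demo = np.random.default_rng(0).normal(size=(20, 3))
    order, distances = rank_by_typicality(demo)
    print("most typical image index:", order[0])
```

The Mahalanobis option rescales each feature direction by the covariance of the set, so that features with large numeric ranges do not dominate the ranking.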

178 Data Mining in Bioinformatics

We tested variations on this approach using mixed sets of protein patterns and observed that the best results were obtained using outlier rejection methods so that the centroid can be reliably estimated. Figure 8.8 shows example images for the Golgi protein giantin chosen by some of the typicality methods. The giantin images with scattered structure were ranked as least typical while those with compact structure were ranked as most typical. This ranking is consistent with biological knowledge about the Golgi complex, which decomposes during mitosis or under abnormal cell conditions. (The cell in panel G appears to be compact but on close inspection has a single dim vesicle, which may indicate the onset of Golgi breakdown and which makes the pattern atypical.)
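The text does not say which outlier rejection schemes were tested, so the following is only one hypothetical way to estimate a centroid that is not pulled toward contaminating images: start from the per-feature median, then repeatedly average the fraction of images closest to the current estimate. The function name `robust_centroid` and the parameters `keep_fraction` and `n_iter` are assumptions for illustration.

```python
import numpy as np

def robust_centroid(features, keep_fraction=0.8, n_iter=10):
    """Estimate a centroid while ignoring the most distant images.

    Hypothetical trimming scheme (not taken from the source): recompute the
    mean from the keep_fraction of images closest to the current centroid
    until the retained set stops changing.
    Returns (centroid, indices of the images that were kept).
    """
    X = np.asarray(features, dtype=float)
    n_keep = max(1, int(round(keep_fraction * len(X))))
    centroid = np.median(X, axis=0)     # robust starting point
    kept = np.arange(len(X))
    for _ in range(n_iter):
        d = np.linalg.norm(X - centroid, axis=1)
        new_kept = np.sort(np.argsort(d, kind="stable")[:n_keep])
        centroid = X[new_kept].mean(axis=0)
        converged = np.array_equal(new_kept, kept)
        kept = new_kept
        if converged:
            break
    return centroid, kept
```

Images can then be ranked by their distance from the returned centroid instead of from the plain mean of the whole set.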

Fig. 8.8. Application of typicality ranking to a diverse image set. Giantin images ranking high in typicality (A–D) and ranking low in typicality (E–H) were chosen by the methods described in the text. From reference [270].

Objective selection of representative data points is of general interest in data mining. In our problem, the aim is to find the microscope image that best represents a set acquired from experiments. In information retrieval, similarly, a summary that represents an article can be generated by objectively selecting several sentences from it. Both the distance metric and the features are important for a successful selection. Our results showed that the Mahalanobis distance is a better distance metric than the Euclidean distance. The ability to correctly select the most typical images from contaminated image sets supports the reliability of typicality ranking in uncontaminated sets.

Objective Comparison of Protein Subcellular Distributions

Proteins can change their subcellular location patterns under different environmental conditions. Biologists are often interested in such changes caused by pharmacological treatments or hormones. Traditionally, visual examination was employed to compare the fluorescence microscope image sets from two or more different conditions. This method was not very sensitive and was not suitable for objective and quantitative analysis of protein subcellular location changes. With the development of protein subcellular location features, objective and quantitative analysis of protein subcellular distributions has become possible. Instead of comparing two image sets visually, a subcellular location feature matrix can be calculated for each image set and statistical techniques can be applied to compare the two feature matrices. We used two statistical hypothesis tests for this task, namely the univariate t-test and the Hotelling T²-test [229]. The Hotelling T²-test is the multivariate counterpart of the univariate t-test; after scaling, it yields a statistic that follows an F distribution with two degrees-of-freedom parameters: the number of features, and the total number of images in the two sets minus the number of features minus one. The F value obtained from the Hotelling T²-test of two image sets can then be compared with the critical F value for a given confidence level.
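As a sketch of how such a comparison can be computed, the function below implements the standard two-sample Hotelling T² test with a pooled covariance matrix. With p features and n1 and n2 images, the scaled statistic is referred to an F distribution with p and n1 + n2 - p - 1 degrees of freedom. The function name `hotelling_t2_test` and the use of NumPy and SciPy are assumptions for illustration; this is not the code used in [335].

```python
import numpy as np
from scipy import stats

def hotelling_t2_test(X1, X2, confidence=0.95):
    """Two-sample Hotelling T² test on two feature matrices.

    X1, X2: (n_images, n_features) arrays with the same number of columns.
    Returns (f_value, f_critical, p_value).
    """
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    n1, p = X1.shape
    n2 = X2.shape[0]
    if X2.shape[1] != p:
        raise ValueError("both image sets must use the same features")
    df2 = n1 + n2 - p - 1
    if df2 <= 0:
        raise ValueError("too few images for this number of features")

    diff = X1.mean(axis=0) - X2.mean(axis=0)
    # Pooled covariance of the two sets.
    pooled = ((n1 - 1) * np.cov(X1, rowvar=False)
              + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    pooled = np.atleast_2d(pooled)
    t2 = n1 * n2 / (n1 + n2) * diff @ np.linalg.solve(pooled, diff)

    f_value = t2 * df2 / (p * (n1 + n2 - 2))
    f_critical = stats.f.ppf(confidence, p, df2)
    p_value = stats.f.sf(f_value, p, df2)
    return f_value, f_critical, p_value
```

Two image sets are regarded as distinguishable at the chosen confidence level when `f_value` exceeds `f_critical`.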

To characterize this approach, we used the 2D HeLa image collection. Each image was described by the feature set SLF6 (a combination of Zernike moment features and SLF1). We first compared all pairs of classes in the 2D HeLa set; the results are shown in Table 8.15. All class pairs were regarded as distinguishable since their F values were larger than the critical value. The distribution of the F values corresponded well to the classification results. Giantin and Gpp130, as well as the endosome and lysosome patterns, were the least distinguishable in both classification and image set comparison. The well-classified DNA pattern was also the easiest to distinguish from all other image sets, with an average F value of 180 across nine comparisons.
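The average quoted for the DNA pattern can be checked directly against the DNA column of Table 8.15. The short script below simply averages the nine values read from that column; the variable names are chosen for illustration.

```python
from statistics import mean

# F values for DNA against each of the other nine classes (Table 8.15).
dna_f_values = {
    "ER": 83.2, "Gia": 206.1, "Gpp": 227.4, "Lam": 112.2, "Mit": 152.4,
    "Nuc": 79.8, "Act": 527.2, "TfR": 102.8, "Tub": 138.3,
}

average_f = mean(dna_f_values.values())
print(f"{len(dna_f_values)} comparisons, average F = {average_f:.1f}")
```

The result is about 181, in line with the rounded figure of 180 given in the text.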

To examine whether the Hotelling T²-test as we applied it was able not only to distinguish different patterns but also to correctly recognize indistinguishable patterns, we chose the two largest classes, tfr and phal, and randomly sampled two equal-sized subsets from each class 1000 times. The Hotelling T²-test was conducted to compare the two sets drawn from the same class, and the results are summarized in Table 8.16. The average F value for each class is less than the critical F value, and fewer than 5% of the 1000 comparisons failed (as expected at a 95% confidence level). Therefore, the method we employed to compare two image sets is able to recognize when two sets share the same protein subcellular location distribution.
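The same-class check can be imitated on any feature matrix. The sketch below repeatedly splits one set of images into two random halves, applies the standard two-sample Hotelling T² test to the halves, and counts how often the F value exceeds the critical value. The names `hotelling_f` and `same_class_check` are hypothetical, and the example runs on synthetic Gaussian data standing in for real SLF6 features.

```python
import numpy as np
from scipy import stats

def hotelling_f(X1, X2):
    """Two-sample Hotelling T² test; returns (F value, df1, df2)."""
    n1, p = X1.shape
    n2 = X2.shape[0]
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    pooled = ((n1 - 1) * np.cov(X1, rowvar=False)
              + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    t2 = n1 * n2 / (n1 + n2) * diff @ np.linalg.solve(np.atleast_2d(pooled), diff)
    df2 = n1 + n2 - p - 1
    return t2 * df2 / (p * (n1 + n2 - 2)), p, df2

def same_class_check(X, n_trials=1000, confidence=0.95, seed=0):
    """Split one class into two random halves n_trials times (n_trials >= 1).

    Returns (average F, critical F, number of trials with F > critical F).
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    half = len(X) // 2
    f_values = np.empty(n_trials)
    for i in range(n_trials):
        perm = rng.permutation(len(X))
        f_values[i], p, df2 = hotelling_f(X[perm[:half]], X[perm[half:2 * half]])
    # Subset sizes are fixed, so the critical value is the same for every trial.
    f_critical = stats.f.ppf(confidence, p, df2)
    n_failed = int(np.sum(f_values > f_critical))
    return f_values.mean(), f_critical, n_failed

if __name__ == "__main__":
    # Synthetic stand-in: 80 "images" described by 5 features.
    fake_class = np.random.default_rng(1).normal(size=(80, 5))
    print(same_class_check(fake_class, n_trials=200))
```

When both halves come from the same distribution, the average F value should be close to 1 and roughly 5% of the trials should exceed the 95% critical value.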

One question that might be asked is whether the difference identified by the statistical test in closely related patterns is due to artifactual protocol differences rather than a significant change in subcellular distribution. To address this question, we conducted the same test on two image sets prepared under different experimental conditions. One image set was acquired by tagging giantin with an antibody collected from rabbit antiserum and the other with a mouse antigiantin monoclonal antibody. The F value from these two sets was 1.04, below the critical value of 2.22 at the 95% confidence level.


Table 8.15. Hotelling T²-test comparing all class pairs in the 2D HeLa dataset using SLF6. The critical F values with 95% confidence level range from 1.42 to 1.45 depending on the number of images in each pair. Note that all pairs of classes are considered to be different at that confidence level. Data from reference [335].

Class  No. of images  DNA    ER    Gia    Gpp    Lam    Mit   Nuc    Act   TfR
DNA    87
ER     86             83.2
Gia    87             206.1  34.7
Gpp    85             227.4  44.5  2.4
Lam    84             112.2  13.8  10.7   11.4
Mit    73             152.4  8.9   39.2   44.5   15.9
Nuc    73             79.8   39.8  17.2   15.1   14.5   46.6
Act    98             527.2  63.5  325.3  354.0  109.8  16.0  266.4
TfR    91             102.8  7.4   14.8   15.6   2.8    9.2   20.5   29.1
Tub    91             138.3  10.8  63.0   72.2   18.4   7.0   49.4   22.4  5.5

Table 8.16. Hotelling T²-test comparing 1000 image sets randomly selected from each of the two classes using SLF6. Data from reference [335].

                                    tfr   phal
Average F                           1.05  1.05
Critical F(0.95)                    1.63  1.61
Number of failing sets out of 1000  47    45

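In the comparison of the two giantin image sets prepared with different antibodies, the F value fell below the critical value. Potential minor differences introduced by experimental protocols were therefore appropriately ignored by the method.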

The various features in SLF6 might contribute differently to distinguishing two distributions. We therefore conducted a univariate t-test on each feature and computed the confidence level of the change in each feature. The results on the giantin and Gpp130 image sets are shown in Table 8.17. Features that describe the shape of the pattern and object characteristics account for the major distinction between giantin and Gpp130.
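A per-feature analysis of this kind might look like the sketch below. Two assumptions are made that the text does not confirm: the t-test is the equal-variance Student form (the default of `scipy.stats.ttest_ind`), and the "confidence level" of a feature change is taken to be one minus the two-sided p-value. The function name `per_feature_confidence` and the default feature names are hypothetical.

```python
import numpy as np
from scipy import stats

def per_feature_confidence(X1, X2, feature_names=None):
    """Univariate t-test on every feature of two image sets.

    X1, X2: (n_images, n_features) arrays with the same columns.
    Returns a list of (name, t_statistic, confidence) tuples, sorted from
    the most to the least confidently changed feature.
    """
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    if feature_names is None:
        feature_names = [f"feature_{j}" for j in range(X1.shape[1])]
    t_stat, p_value = stats.ttest_ind(X1, X2, axis=0)
    # Assumption: confidence of a change is defined as 1 - p.
    confidence = 1.0 - p_value
    order = np.argsort(-confidence, kind="stable")
    return [(feature_names[j], float(t_stat[j]), float(confidence[j]))
            for j in order]
```

Sorting by confidence puts the features that contribute most to the difference between the two sets at the top of the list.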

The objective comparison of protein subcellular distributions by subcellular location features and statistical tests enables automatic and reproducible analysis. The sensitivity and reliability of this method were demonstrated by the experiments comparing both different and identical protein subcellular distributions. The method can be used to study the effects of different hormones or pharmacological agents on the subcellular location of target proteins. It also has potential in high-throughput drug screening, where the relationship between candidate chemicals and target genes can be studied quantitatively in terms of subcellular distribution.

From the document Advanced Information and Knowledge Processing (pages 179-182)