
Kai Huang and Robert F. Murphy

8.2.3 Supervised Learning for Protein Subcellular Location

A supervised learning problem usually involves finding the relationship between the predictors X and the dependent variable Y, namely Y = f(X).

A typical supervised learning system takes a training data set and models the predictors as its inputs and the dependent variables as its outputs. The learning process is characterized by modifying the relationship f learned by the system according to the difference between the system prediction f(x_i) and the expected output y_i, so that the system generates predictions close to the desired outputs. The performance of a classifier is often evaluated as the average accuracy on a test set. In our work, we use the average accuracy over all test set instances (images), which, since the number of instances per class is roughly similar, is close to the average performance over all classes.

Given the definitions of various feature sets in the previous section, we can transform a protein fluorescence microscope image to a number of features and train a classifier to learn the relationship between these features and the protein subcellular location patterns. We first introduce classification on 5-class CHO images by using a neural network classifier.

Classification of 5-Class 2D CHO Images

As the first trial of automatic recognition of protein subcellular location patterns in fluorescence microscope images, a back-propagation neural network with one hidden layer and 20 hidden nodes was trained using Zernike moment features computed from the 2D CHO set [44, 45]. The images were divided into three sets (training/stop training/test) as follows: giantin 47/4/26, DNA 39/4/36, LAMP2 37/8/52, NOP4 25/1/7, tubulin 25/3/23.

The neural network was trained on the training set and the training was stopped when the sum of squared error on the stop set reached a minimum.
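A minimal sketch of this training protocol is given below, using scikit-learn's MLPClassifier as the one-hidden-layer back-propagation network: training proceeds epoch by epoch and the weights with the lowest sum of squared error on the stop set are retained. The 20 hidden nodes come from the text; the learning rate, patience, and solver are illustrative assumptions, and the Zernike feature computation is not shown.

import numpy as np
from sklearn.neural_network import MLPClassifier

def train_with_stop_set(X_train, y_train, X_stop, y_stop,
                        hidden_nodes=20, max_epochs=500, patience=20):
    classes = np.unique(y_train)
    net = MLPClassifier(hidden_layer_sizes=(hidden_nodes,),
                        activation="logistic", solver="sgd",
                        learning_rate_init=0.01)
    onehot = (y_stop[:, None] == classes[None, :]).astype(float)
    best_sse, best_weights, wait = np.inf, None, 0
    for epoch in range(max_epochs):
        net.partial_fit(X_train, y_train, classes=classes)
        # sum of squared error on the stop set
        sse = np.sum((net.predict_proba(X_stop) - onehot) ** 2)
        if sse < best_sse:
            best_sse, wait = sse, 0
            best_weights = ([c.copy() for c in net.coefs_],
                            [b.copy() for b in net.intercepts_])
        else:
            wait += 1
            if wait >= patience:   # stop-set error has passed its minimum
                break
    net.coefs_, net.intercepts_ = best_weights
    return net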

The test set was then used to evaluate the network. Table 8.5 shows the confusion matrix averaged over eight trials.

Table 8.5. Confusion matrix for test data using a neural network classifier with Zernike moment features on the 2D CHO images. The average classification accuracy is 87% on eight random trials and the corresponding training accuracy is 94%. Data from reference [45].

True class   Output of the classifier
             Giantin   DNA   LAMP2   NOP4   Tubulin
Giantin        97%      0%     3%     0%      0%
DNA             3%     93%     0%     3%      0%
LAMP2          12%      2%    70%    10%      7%
NOP4            0%      0%     0%    88%     13%
Tubulin         0%      0%    12%     4%     85%


Each element in a confusion matrix measures how much a classifier gets “confused” between two classes. For instance, Table 8.5 shows that the neural network classifier incorrectly considers 12% of the LAMP2 images to represent a Giantin pattern. Each percentage on the diagonal of the matrix is the recall of the classifier for the corresponding class. The precision of the classifier for a specific class can be computed by dividing the number of correctly classified images of that class by the column sum for that class in the matrix. Due to rounding, the sum of each row in a confusion matrix might not equal 100%. The average recall achieved by Zernike moment features and the back-propagation neural network is much higher than that of a random classifier, which would be 20%. A similar neural network classifier trained using Haralick texture features in place of the Zernike moment features gave a similar overall accuracy [45].
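As an illustration of these definitions, the sketch below computes per-class recall, per-class precision, and the average accuracy from a confusion matrix of raw counts (rows are true classes, columns are classifier outputs); the counts shown are made-up examples, not the data behind Table 8.5.

import numpy as np

conf = np.array([[28,  0,  1, 0,  0],   # Giantin (illustrative counts only)
                 [ 1, 33,  0, 1,  0],   # DNA
                 [ 6,  1, 36, 5,  4],   # LAMP2
                 [ 0,  0,  0, 6,  1],   # NOP4
                 [ 0,  0,  3, 1, 19]])  # Tubulin

recall = np.diag(conf) / conf.sum(axis=1)            # diagonal over row sums
precision = np.diag(conf) / conf.sum(axis=0)         # diagonal over column sums
average_accuracy = np.diag(conf).sum() / conf.sum()  # over all test images
print(recall, precision, average_accuracy)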

The recognition of the 5 subcellular location patterns in the 2D CHO image set showed that Zernike moment features and Haralick texture features are able to capture appropriate information from fluorescence microscope images and that a trained classifier can give relatively accurate predictions on previously unseen images.

Classification of 10-Class 2D HeLa Images

To test the applicability of automated classification to protein fluorescence microscope images for other patterns and cell types, we conducted supervised learning on the 10-class 2D HeLa image collection. This collection not only contains location patterns covering all major cellular organelles but also includes patterns that are easily confused by human experts (such as giantin and gpp130). The SLF3 and SLF4 feature sets previously described were developed for these studies, but only modest classifier accuracies were obtained with these whole sets. This result was presumably due to having insufficient training images to determine decision boundaries in the large feature space. Significant improvement in classifier performance can often be obtained by selecting a smaller number of features from a large set. Here we will first review methods for decreasing the size of the feature space, then review various classifiers that can be applied to the features, and finally describe a comparison of these approaches for the 2D HeLa collection.

Feature reduction. There are two basic approaches to feature reduction: feature recombination and feature selection. The former recombines the original features either linearly or nonlinearly according to some criterion to reach a smaller set of features. The latter explicitly selects a small set of features from the original set by using some heuristic search. Following are descriptions of four methods from each category that are widely used for feature reduction.

Feature recombination

1. Principal component analysis (PCA), probably the first feature recombination method adapted for data mining, captures the linear relationships among the original features. It projects the input data into a lower-dimensional space so that most of the data variance is preserved. To do that, the covariance matrix A of the input data is first constructed:

A = \frac{1}{n} \sum_{j=1}^{n} x_j x_j^T,  (8.11)

where x_j (j = 1, ..., n) represents the m-dimensional feature vector of the jth image and n is the total number of images. The basis of the low-dimensional space is formed by choosing the eigenvectors of the covariance matrix A that correspond to the largest k eigenvalues. By projecting the original data onto this new space, we get a k-dimensional (k < m) feature space in which the data are spread as much as possible.
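A minimal sketch of this procedure follows, assuming the data are supplied as an n x m matrix with one row per image; mean-centering is included here as a common preprocessing step even though Equation 8.11 as written omits it.

import numpy as np

def pca(X, k):
    """X: n x m data matrix (one row per image); returns the n x k projection."""
    Xc = X - X.mean(axis=0)
    A = Xc.T @ Xc / Xc.shape[0]            # m x m covariance matrix (Eq. 8.11)
    eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # k leading eigenvectors
    return Xc @ top                        # data projected onto the new basis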

2. Unlike the linear relationships obtained by PCA, nonlinear principal component analysis (NLPCA) is often used to obtain nonlinear combinations of the input features that capture as much of the information in the original set as possible. A five-layer neural network is often used to extract nonlinearly combined features [108]. In this network, the data set serves as both the inputs and the desired outputs. The second layer nonlinearly maps the input features to some space, and the middle layer of k nodes recombines these features linearly. The reverse operation is carried out at the fourth layer to attempt to make the outputs of the network equal its inputs. The training of the neural network stops when the sum of squared error stops decreasing. The first three layers are then separated out as a new neural network. Fed with the original data, this new network generates k nonlinearly recombined features as its outputs.

3. Another way to extract nonlinearly combined features is to use kernel principal component analysis (KPCA). KPCA is similar to PCA except that it first applies a nonlinear kernel function to the original data to map them to a new high-dimensional space in which normal PCA is conducted [354].

The assumption is that nonlinear relationships among the original features can be captured through the nonlinear kernel transformation (represented as Φ below). A dot product matrix K can be constructed by taking dot products between any two data points in the new feature space,

K(i, j) = \Phi(x_i) \cdot \Phi(x_j), \quad i, j = 1, 2, \ldots, n,  (8.12)

where x_i is the m-dimensional feature vector describing the ith image and Φ(x_i) is the new feature vector in the high-dimensional space. This matrix K is similar to the covariance matrix A used in normal PCA. Eigenvalue decomposition is then conducted on K, resulting in a group of eigenvectors that form the new basis of the high-dimensional space. Projecting the original data onto the new space gives the nonlinearly recombined features. The maximum number of new features is determined by the total number of data points n, as can be seen in the definition of K. Therefore, KPCA is not only a feature reduction method but can also work as a feature expansion method.
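A sketch of KPCA along these lines is shown below; the RBF kernel and the centering of K in feature space are standard choices assumed here rather than specified in the text.

import numpy as np

def kpca(X, k, gamma=1.0):
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    # Eq. 8.12 with an RBF kernel standing in for the dot product in Phi-space
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one      # center K in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:k]
    alphas = eigvecs[:, order] / np.sqrt(np.clip(eigvals[order], 1e-12, None))
    return Kc @ alphas      # up to n nonlinearly recombined features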

4. In ideal pattern recognition, the input variables (features) should be statistically independent of each other so that the efficiency of the information representation is maximized. Independent component analysis (ICA) is used to extract statistically independent features from the original data [108]. Given n m-dimensional data points, we can define a source matrix s and a transformation matrix B as

D = sB,  (8.13)

where D is the n × m original data matrix, s is an n × d source matrix containing d independent source signals, and B is a d × m transformation matrix. We can assume that s is formed by a linear transformation of D followed by a nonlinear mapping [108]:

s = f(WD + w_0),  (8.14)

where W and w_0 are the weights involved in the linear transformation and f is often chosen as a sigmoid function. Solving for W and w_0 requires choosing a cost function that measures the independence of the d source signals; nongaussianity is often used as the cost function.
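For illustration, the sketch below extracts d independent source signals with scikit-learn's FastICA, which likewise uses nongaussianity as its objective; this is a substitute for the specific neural implementation of reference [108], not a reproduction of it.

from sklearn.decomposition import FastICA

def ica_features(D, d):
    """D: n x m data matrix; returns an estimate of the n x d source matrix s (Eq. 8.13)."""
    ica = FastICA(n_components=d, random_state=0)
    s = ica.fit_transform(D)   # estimated statistically independent sources
    return s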

Feature selection

A brute-force examination of all possible subsets of some larger feature set is an NP-hard problem. Therefore, either sequential or randomized heuristic search algorithms are used in feature selection. A classifier or some global statistic computed from the data is often employed to evaluate each selected feature subset. The feature selection process can go forward or backward or in both directions.

1. A classical measurement of feature goodness is the information gain ratio, a criterion from decision tree theory. Given a data set D with m features, the information gain ratio of feature X_i is defined as [275]

Gain(D, X_i) = \frac{Entropy(D) - \sum_{v \in V_i} \frac{|D_v|}{|D|} Entropy(D_v)}{- \sum_{v \in V_i} \frac{|D_v|}{|D|} \log \frac{|D_v|}{|D|}}, \quad i = 1, 2, \ldots, m,  (8.15)

where V_i is the set of all possible values that X_i can have and D_v represents the data subset in which X_i has the value v. The gain ratio of a feature measures how much more information will be gained by splitting a decision tree node on this feature. It is more advantageous than normal information gain because it penalizes features that are different at every data point. A simple ranking of features by their gain ratios can help identify the “best” features.
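A sketch of the gain ratio computation for a single discrete-valued feature follows; continuous features would need to be discretized first, a step not shown here.

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(feature_values, labels):
    feature_values, labels = np.asarray(feature_values), np.asarray(labels)
    n = len(labels)
    gain, split_info = entropy(labels), 0.0
    for v in np.unique(feature_values):
        mask = feature_values == v
        frac = mask.sum() / n
        gain -= frac * entropy(labels[mask])     # numerator of Eq. 8.15
        split_info -= frac * np.log2(frac)       # denominator of Eq. 8.15
    return gain / split_info if split_info > 0 else 0.0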

2. Every data set has more or less self-similarity, which can be measured by its intrinsic dimensionality. The intrinsic dimensionality of a self-similar data set should be much less than the actual dimension of its feature space. Therefore, those features that do not contribute to the intrinsic dimensionality are candidates to be dropped. One way to determine the intrinsic dimensionality of a data set is to compute its fractal dimensionality, also known as the correlation fractal dimensionality [407]. A feature selection scheme can be formed by scoring each feature according to how much it contributes to the fractal dimensionality. An algorithm, FDR (fractal dimensionality reduction), implements this idea by employing a backward elimination process in which the feature whose deletion changes the fractal dimensionality the least is dropped at each step, until no remaining feature can decrease the total fractal dimensionality by more than a minimum amount [407]. This algorithm can be used for both labeled and unlabeled data, and it also gives an approximate final number of features to keep, namely the fractal dimensionality of the original data.

3. In a well-configured classification problem, the different classes are far apart from each other in the feature space and each class is also tightly packed. The job of a classifier is made much easier by features that have this property. Stepwise discriminant analysis [203] uses a statistic, Wilks's Λ, to measure this property of a feature set. It is defined as

\Lambda(m) = \frac{|W(X)|}{|T(X)|}, \quad X = [X_1, X_2, \ldots, X_m],  (8.16)

where X represents the m features currently used, and the within-group covariance matrix W and the among-group covariance matrix T are defined as

W_{ij} = \sum_{g=1}^{q} \sum_{t=1}^{n_g} (X_{igt} - \bar{X}_{ig})(X_{jgt} - \bar{X}_{jg}),

T_{ij} = \sum_{g=1}^{q} \sum_{t=1}^{n_g} (X_{igt} - \bar{X}_{i})(X_{jgt} - \bar{X}_{j}),

where i and j represent the ith and jth features, X_{igt} is the ith feature value of the data point t in the class g, \bar{X}_{ig} is the mean value of the ith feature in the class g, \bar{X}_{i} is the mean value of the ith feature in all classes, q is the total number of classes, and n_g is the number of data points in the class g. Since Wilks's Λ is a group statistic, an F statistic is often used to convert it to the confidence of including or removing a feature for the current feature set. A feature selection process can be formed according to the F statistic at each step, starting from the full feature set.
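The sketch below computes Wilks's Λ for a candidate feature subset from the scatter matrices defined above; the F-statistic-driven stepwise search itself is not shown.

import numpy as np

def wilks_lambda(X, y):
    """X: n x m matrix restricted to the candidate features; y: class labels."""
    overall_mean = X.mean(axis=0)
    m = X.shape[1]
    W, T = np.zeros((m, m)), np.zeros((m, m))
    for g in np.unique(y):
        Xg = X[y == g]
        dg = Xg - Xg.mean(axis=0)    # deviations from the class mean
        dt = Xg - overall_mean       # deviations from the overall mean
        W += dg.T @ dg
        T += dt.T @ dt
    return np.linalg.det(W) / np.linalg.det(T)   # smaller is better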

All these methods provide a criterion for adding or removing a feature and follow a sequential, deterministic path through the feature space. The search can be forward (starting with no features and adding one at each step), backward (starting with all features and removing one at each step), or forward-backward, in which we choose whether to add or remove a feature at each step. Both the forward and the backward methods are greedy and therefore limited in the number of possibilities considered (which makes them efficient). The forward-backward method is one order less greedy, so that initial, nonoptimal inclusions of features can later be reversed.
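As an illustration of the forward variant, the sketch below adds at each step the feature whose inclusion most improves cross-validated accuracy; the choice of criterion and of the wrapped classifier are placeholders of this sketch, not prescriptions of the chapter.

import numpy as np
from sklearn.model_selection import cross_val_score

def forward_select(X, y, estimator, n_features):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_features:
        # score each candidate feature when added to the current subset
        scores = [(np.mean(cross_val_score(estimator, X[:, selected + [f]], y, cv=5)), f)
                  for f in remaining]
        best_score, best_f = max(scores)
        selected.append(best_f)
        remaining.remove(best_f)
    return selected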

4. As an alternative to these deterministic methods, we can incorporate random choice into the search strategy. A genetic algorithm is often used for this purpose [440]. It treats each possible feature subset as a bit string, with 1 representing inclusion of the corresponding feature. The initial pool of bit strings is randomly generated, and all strings go through mutation and crossover at each generation. At the end of each generation, a classifier is applied as the fitness function to rank all feature subsets of that generation according to their prediction errors. The feature subsets giving the lowest error are selected, along with some lower-performing subsets that are retained with a predefined probability. The selection process stops when either the maximum number of generations is reached or no more improvement is observed between generations. Figure 8.7 shows an outline of the genetic algorithm approach; an illustrative sketch follows the figure.

Fig. 8.7. Flow chart of feature selection using genetic algorithms.
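In the sketch below, feature subsets are bit strings, fitness is the cross-validated accuracy of a wrapped classifier, and the population evolves by selection, one-point crossover, and bit-flip mutation; the population size, rates, and number of generations are arbitrary choices for the sketch.

import numpy as np
from sklearn.model_selection import cross_val_score

def ga_select(X, y, estimator, pop_size=30, generations=40, p_mut=0.02,
              rng=np.random.default_rng(0)):
    m = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, m))    # random initial bit strings

    def fitness(bits):
        if bits.sum() == 0:
            return 0.0
        return np.mean(cross_val_score(estimator, X[:, bits.astype(bool)], y, cv=3))

    for _ in range(generations):
        scores = np.array([fitness(b) for b in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[: pop_size // 2]]       # keep the best subsets
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, m)                # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(m) < p_mut          # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, children])
    best = max(pop, key=fitness)
    return np.flatnonzero(best)                     # indices of selected features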

Classifiers

Support vector machines (SVMs)

Support vector machines are generalized linear classifiers. A linear classifier looks for a hyperplane, a linear decision boundary, between two classes if one exists. It will perform badly if the optimal decision boundary is far from linear. Moreover, there are often many possible linear classifiers that separate two classes, which makes the choice among them difficult. Classical linear classifiers therefore face two problems: the restriction to a linear decision boundary and the choice among several candidate hyperplanes. Support vector machines were designed to solve these two problems.

The same kernel trick used in KPCA is applied in support vector machines. The original feature space is transformed into a very high-dimensional, sometimes infinite-dimensional, space by a kernel mapping. A decision boundary that is nonlinear in the original feature space can be close to linear in the new feature space obtained by applying a nonlinear kernel function. To choose the optimal linear boundary in the new high-dimensional space, a support vector machine selects the maximum-margin hyperplane, the one that maximizes the minimum distance between the training data and the hyperplane. Intuitively, this hyperplane helps prevent overfitting because its representation reduces to a small number of data points lying on the margin boundary. The maximum-margin criterion was proved to minimize an upper bound on the VC dimension of a classifier, an objective measurement of classifier goodness [410].

Support vector machines can model very complex decision boundaries in the original feature space through the kernel trick. A kernel function K is defined as the inner product of two data points in the new feature space Φ (Equation 8.12) and should satisfy Mercer's conditions [158]:

K(x, x') = \Phi(x) \cdot \Phi(x').

As reviewed before [158], the final discriminant function can be represented as

f(x) = \mathrm{sign}\left( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \right), \quad b = -\frac{1}{2} \sum_{i=1}^{l} \alpha_i y_i \left[ K(x_i, x_r) + K(x_i, x_s) \right],

where x_r and x_s are support vectors located at the boundary of the maximum margin satisfying

\alpha_r, \alpha_s > 0, \quad y_r = -1, \quad y_s = 1.

This system can be solved as a constrained quadratic programming problem, with constraints

C \geq \alpha_i \geq 0, \quad i = 1, \ldots, l,

\sum_{j=1}^{l} \alpha_j y_j = 0.

Different kernel functions are available, such as linear, polynomial, radial basis function (rbf), exponential-rbf, and neural network kernels. The kernel parameters can be selected by cross-validation:

K(x_i, x_j) = \langle x_i, x_j \rangle   (linear kernel)

K(x_i, x_j) = (\langle x_i, x_j \rangle + 1)^d   (polynomial kernel)

K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)   (radial basis kernel)

K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|}{2\sigma^2} \right)   (exponential radial basis kernel)
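Written out explicitly, and assuming σ and d are the kernel parameters to be chosen by cross-validation, these kernels can be computed as follows.

import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, d=3):
    return (xi @ xj + 1.0) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def exp_rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) / (2.0 * sigma ** 2))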

To extend the binary SVM to a K-class SVM, three methods are often used [231, 318, 411]. The max-win strategy creates K binary SVMs, each distinguishing class i from non-i; the class with the highest score is selected as the predicted target. The pairwise strategy creates a total of K(K−1)/2 binary classifiers, one for every pair of classes, and each classifier casts one vote; the predicted target is the class that receives the most votes. The DAG (directed acyclic graph) strategy arranges the K(K−1)/2 binary classifiers in a rooted binary DAG; at each node, a data point is classified as non-i if class i loses, and the predicted target is the class that remains after tracing down the graph from the root.
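A sketch of the max-win strategy is given below, with scikit-learn's SVC as the binary learner; using SVC, and deciding by the largest decision value, are choices of this sketch rather than of the chapter.

import numpy as np
from sklearn.svm import SVC

class MaxWinSVM:
    def __init__(self, **svm_params):
        self.svm_params = svm_params

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # one binary SVM per class: class i versus non-i
        self.models_ = [SVC(**self.svm_params).fit(X, np.where(y == c, 1, -1))
                        for c in self.classes_]
        return self

    def predict(self, X):
        # each binary SVM scores the test points; the highest score wins
        scores = np.column_stack([m.decision_function(X) for m in self.models_])
        return self.classes_[np.argmax(scores, axis=1)]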

AdaBoost

As shown in support vector machine learning, not all training data are equally useful for forming the decision boundary. Some of the training data are easily distinguished and some require a finely tuned boundary. AdaBoost is a classifier that manipulates the weights on the training data during training. It employs a base learner generator that produces a simple classifier, such as a neural network or decision tree, trained with a differently weighted set of training data at each iteration. More weight is put on the wrongly classified data points and less weight on the correctly classified data points. Therefore, each base classifier is trained toward the hard examples from the previous iteration. The final classifier merges all base learners under some weighting scheme. The AdaBoost training process is as follows [349].

Given a binary classification problem with m training data points (x_1, y_1), ..., (x_m, y_m), where x_i ∈ X and y_i ∈ Y = {−1, +1}, start with a uniform weight for each data point, D_1(i) = 1/m.

For t = 1, ..., T: train the rule generator using the distribution D_t, generate the base rule h_t : X → R, choose α_t, and update

D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t},  (8.22)

where Z_t is the sum of all numerators, so that D_{t+1} represents a probability distribution.

Each base classifier tries to minimize the training error ε_t, where

\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i],

and the weight α_t associated with each base rule can be computed as

\alpha_t = \frac{1}{2} \ln\left( \frac{1 - \epsilon_t}{\epsilon_t} \right).

The final discriminant function is

H(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right).  (8.23)
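The sketch below implements this loop with depth-one decision trees (stumps) as the base rule generator; the stump, the cap of T rounds, and the assumption that labels are numpy arrays coded as -1/+1 are illustrative choices, not the chapter's prescription.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    m = len(y)
    D = np.full(m, 1.0 / m)                  # uniform initial weights D_1(i) = 1/m
    rules, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()             # weighted training error epsilon_t
        if eps >= 0.5:                       # base rule no better than chance
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))
        rules.append(h)
        alphas.append(alpha)
        if eps == 0:
            break
        D = D * np.exp(-alpha * y * pred)    # numerator of Eq. 8.22
        D = D / D.sum()                      # division by Z_t

    def H(Xnew):                             # final discriminant, Eq. 8.23
        votes = sum(a * h.predict(Xnew) for a, h in zip(alphas, rules))
        return np.sign(votes)
    return H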

Similar to the SVM, AdaBoost was designed as a binary classifier, and several multiclass variants have been developed [132, 350].

Bagging

Bagging, also called bootstrap aggregation, is a classifier that draws an equally weighted bootstrap sample from the training data at each iteration [104]. Unlike AdaBoost, which pays special attention to previously hard examples, bagging works by averaging the performance of a base classifier over different random samples of the same training data. It has been shown that some classifiers, such as neural networks and decision trees, are easily skewed by small variations in the training data. By averaging out the random variance over repeated bootstrapping, the base classifier becomes more robust and therefore gives more stable prediction results. The outputs of the base classifiers from all iterations are finally averaged to give the prediction H(x):

H(x) = \mathrm{sign}\left( \frac{1}{T} \sum_{t=1}^{T} h_t(x) \right),  (8.24)

where h_t is a binary classifier learned from X_t, a bootstrap sample drawn from the training data, and T is the total number of iterations.
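A sketch of this procedure follows, using decision trees as the base classifier and assuming numpy-array labels coded as -1/+1; the base learner and the number of bootstrap rounds are illustrative choices.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, T=25, rng=np.random.default_rng(0)):
    n = len(y)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)     # bootstrap sample X_t
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

    def H(Xnew):                             # sign-of-average vote, Eq. 8.24
        return np.sign(np.mean([m.predict(Xnew) for m in models], axis=0))
    return H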

Mixtures-of-Experts

A mixtures-of-experts classifier employs a divide-and-conquer strategy to assign individual base classifiers to different partitions of the training data [197, 424]. It models the data generation process as

P(Y \mid X) = \sum_{Z} P(Z \mid X) P(Y \mid X, Z),  (8.25)

where Y stands for the targets, X represents the input variables, and Z is a hidden variable representing the local experts associated with each data partition.
