The PSSP Basics - Advanced Information and Knowledge Processing

A Hybrid Neural Network For Protein Secondary Structure Prediction

8.1 The PSSP Basics

In this chapter, we use hybrid neural networks to address the protein secondary structure prediction (PSSP) task.

8.1.1 Basic Protein Building Unit — Amino Acid

A protein sequence is an array of amino acids, which is called the primary protein structure. Each amino acid is encoded by a triplet of DNA bases drawn from the four bases A, C, T, and G. The amino acids are the basic units of protein sequences and are referred to as residues. The triplet code implies that there are 4³ = 64 possible permutations. However, there are only 20 amino acid types, which results in redundancy in the genetic code. Thus, almost every amino acid (the exceptions being Methionine and Tryptophan) is encoded by several synonymous permutations, which are interchangeable in the sense that they produce the same amino acid. For convenience of presentation, each amino acid type is represented by a letter of the alphabet. For example, the amino acid Alanine is represented by the letter ‘A’. A protein sequence in this alphabetical representation is thus a long sequence of characters, as in the example shown in Fig. 8.1. A protein sequence may be subject to evolutionary changes that induce mutations, including insertions, deletions, or substitutions, in the original sequence and thereby produce different functions.
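The codon count and the three mutation types can be reproduced in a few lines. The following is a minimal sketch, not part of the original chapter: the helper names `delete`, `insert`, and `substitute` and the 0-based indices are illustrative assumptions, chosen so that the results match rows 2 to 4 of Fig. 8.1.

```python
from itertools import product

BASES = "ACTG"

# Every ordered triplet of bases (a codon): 4 ** 3 = 64 in total.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# Row 1 of Fig. 8.1 (PDB ID 1A2X:B), with the display spaces removed.
original = "GDEEKRNRAI TARRQHLKSV MLQIAATELE KEEGRREAEK QNYLAEH".replace(" ", "")


def delete(seq, start, length):
    """Remove `length` residues beginning at 0-based index `start`."""
    return seq[:start] + seq[start + length:]


def insert(seq, start, segment):
    """Insert `segment` before 0-based index `start`."""
    return seq[:start] + segment + seq[start:]


def substitute(seq, start, segment):
    """Overwrite residues from 0-based index `start` with `segment`."""
    return seq[:start] + segment + seq[start + len(segment):]


deleted = delete(original, 18, 2)            # drops "SV" (row 2)
inserted = insert(original, 30, "FFE")       # adds "FFE" (row 3)
substituted = substitute(original, 5, "GF")  # "RN" becomes "GF" (row 4)

print(len(codons), len(original), len(deleted), len(inserted), len(substituted))
```

Deletions and insertions change the length of the chain, whereas a substitution of equal length leaves it unchanged.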

8.1.2 Types of the Protein Secondary Structure

Name: Complex Of Troponin C With A 47 Residue (1-47) Fragment Of Troponin I
PDB ID: 1A2X:B

Sequence:

1) GDEEKRNRAI TARRQHLKSV MLQIAATELE KEEGRREAEK QNYLAEH
2) GDEEKRNRAI TARRQHLK__ MLQIAATELE KEEGRREAEK QNYLAEH
3) GDEEKRNRAI TARRQHLKSV MLQIAATELEFFE KEEGRREAEK QNYLAEH
4) GDEEKGFRAI TARRQHLKSV MLQIAATELE KEEGRREAEK QNYLAEH

Note:

1) Original protein sequence (47 residues)
2) Deletion: the amino acids SV were deleted from the chain
3) Insertion: the amino acids FFE were inserted into the original sequence
4) Substitution: the amino acid segment RN was replaced by GF

Fig. 8.1. Alphabetical representation of the primary protein sequence and protein mutations

Secondary structures are regular structural elements which are formed by hydrogen bonds between relatively small segments of the protein sequence. Often the driving force for the formation of a secondary structure is the saturation of backbone hydrogen donors (NH) and acceptors (CO) with intramolecular hydrogen bonds. This saturation allows the protein to bury hydrophobic side-chains in its interior (the hydrophobic core) without conflicting with the polar backbone. There are three common secondary structures in proteins, namely α-helix, β-strand, and coil.

α-helix

An α-helix is formed from a connected stretch of amino acids. The α-helix is characterized by hydrogen bonds along the chain, which are almost co-axial. The α-helix is the most abundant helical conformation found in globular proteins. The average length of an α-helix is around 10 residues.

β-strand

A β-strand is the principal component of a β-sheet. The β-sheet is characterized by hydrogen bonds crossing between chains. The β-strands participating in a β-sheet need not be contiguous in the primary sequence and do not even have to be close to one another in the sequence. A β-strand consists of 5-10 residues in a very extended conformation.

Coil

Approximately one-third of all residues in globular proteins are contained in coils. The coils in a protein serve to reverse the direction of the polypeptide chain. Coils vary in length.

Fig. 8.2 illustrates the secondary structures of a protein. In the diagram, the dark ribbons represent helices. The gray ribbons are β-strands that form the β-sheet. The spring-like strings between these two secondary structures are the coils that connect them.

Fig. 8.2. Three types of the protein secondary structure: α-helices are the dark ribbons on the boundary of the diagram, the gray ribbons in the center are the β-strands that form the β-sheet, and the coils are the spring-like strings that connect the α-helix and the β-strand.

8.1.3 The Task of the Prediction

The term ‘prediction’ in the protein secondary structure prediction (PSSP) domain carries a meaning similar to that of the data mining term ‘classification’: given a residue of the protein sequence, the predictor should classify it into one of three secondary structure states based on certain characteristics of the residue. Note that the outcome of the prediction is a state of the secondary structure rather than the secondary structure itself. A single residue is only a constituent element of a secondary structure. A protein secondary structure consists of several residues sharing the same secondary structure state. In other words, the secondary structure state is associated with one amino acid, while the secondary structure is associated with an ensemble of amino acid residues. In the literature, the prediction of the protein secondary structure can be conducted in two stages [164][255][260][262]: the sequence-structure (Q2T) prediction and the structure-structure (T2T) prediction.
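The distinction between a per-residue state and a secondary structure can be made concrete with a short sketch. It assumes, purely for illustration, that the three states are written as the letters H (α-helix), E (β-strand), and C (coil); the `segments` helper and the example state string are hypothetical and are not taken from the chapter.

```python
from itertools import groupby


def segments(states):
    """Group a per-residue state string into (state, start, length) runs."""
    result = []
    position = 0
    for state, run in groupby(states):
        length = len(list(run))
        result.append((state, position, length))
        position += length
    return result


# Hypothetical predictor output: one state letter per residue.
predicted = "CCHHHHHCCEEEECC"

for state, start, length in segments(predicted):
    print(state, start, length)
```

Each character of `predicted` is one classification outcome, while each tuple returned by `segments` describes one secondary structure, that is, a run of consecutive residues sharing the same state.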

Sequence-Structure Prediction

The Q2T prediction predicts the protein secondary structure from the primary protein sequence of amino acid residues. Given a protein sequence, the Q2T predictor maps each entry of the sequence to the relevant secondary structure state by using the data representation of the input residue. The data representation attempts to capture information related to the type of the amino acid, the sequence context (that is, the neighboring residues of the input), the evolutionary information, etc. The sequence-structure prediction accounts for the major share of the overall prediction accuracy.

Structure-Structure Prediction

As defined previously, a secondary structure is an ensemble of consecutive amino acid residues sharing the same secondary structure state. Neighboring sequence positions are usually correlated in terms of secondary structure formation. For example, it is usually observed that an α-helix contains at least three consecutive amino acids which are all in the α-helix state. Suppose that the α-helix and β-strand states alternate (i.e., αβαβ...) in the predicted secondary structures; then this prediction must be wrong. This is only a simple example of the correlations that exist among neighboring residues. There are also other correlations, both known and unknown. Therefore, the T2T prediction, which is based on the outputs of the first stage, is needed as the second stage of prediction. This stage attempts to correct unrealistic predictions from the previous stage and thus enhances the overall prediction accuracy. It is complementary to the sequence-structure prediction.
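The helix-length constraint mentioned above can be illustrated with a hand-written rule. This is only a sketch of the kind of correction a second stage might make and is not the T2T predictor described in this chapter. The state letters H, E, and C, the function name `relabel_short_helices`, and the choice of coil as the fallback state are all assumptions made for the example.

```python
from itertools import groupby


def relabel_short_helices(states, min_helix=3, helix="H", fallback="C"):
    """Relabel helix runs shorter than `min_helix` residues as `fallback`."""
    corrected = []
    for state, run in groupby(states):
        length = len(list(run))
        if state == helix and length < min_helix:
            # A helix this short is treated as unrealistic.
            state = fallback
        corrected.append(state * length)
    return "".join(corrected)


# Hypothetical first-stage output with alternating helix/strand states.
first_stage = "HEHECCHHHHCC"
print(relabel_short_helices(first_stage))
```

The rule handles only the single correlation named in the text, and it leaves the isolated strand states untouched. A second-stage predictor that operates on the first-stage outputs can, in principle, capture many such correlations at once, including ones that are not known in advance.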

Fig. 8.3 illustrates secondary structure prediction carried out in these two stages.

It is important to note that amino acids of the same type need not be predicted to belong to the same secondary structure state. For instance, in Fig. 8.3, the 12th and the 20th amino acid residues counted from the left-hand side are both of type ‘F’ yet are assigned to two different secondary structure states. It is rather the distinct characteristics associated with the residue, such as the sequence context, the evolutionary information, biochemical properties, speed of translation, etc., that play a more significant role in the formation of the protein secondary structure.

It is acknowledged that the neighboring residues influence how well the secondary structure state of a residue can be predicted. Therefore, the prediction of the secondary structure at each sequence position should not rely solely on the residue at that position. Rather, a window extending in both directions from the residue should be used to include the sequence context. We will discuss this issue in detail later.
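A window of this kind can be sketched as follows. The half-width of 2, the padding symbol `-` for positions beyond either end of the chain, and the one-hot encoding are illustrative assumptions and are not necessarily the representation adopted later in the chapter.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes
PAD = "-"  # hypothetical symbol for positions beyond either end of the chain
ALPHABET = AMINO_ACIDS + PAD


def windows(sequence, half_width):
    """Yield one window of 2 * half_width + 1 symbols centred on each residue."""
    padded = PAD * half_width + sequence + PAD * half_width
    size = 2 * half_width + 1
    for i in range(len(sequence)):
        yield padded[i:i + size]


def one_hot(window):
    """Flatten a window into a 0/1 vector with one slot per alphabet symbol."""
    vector = []
    for symbol in window:
        vector.extend(1 if symbol == candidate else 0 for candidate in ALPHABET)
    return vector


sequence = "GDEEKRNRAI"  # first ten residues of the sequence in Fig. 8.1
examples = list(windows(sequence, half_width=2))
features = [one_hot(w) for w in examples]

print(examples[0], examples[4], examples[-1])
print(len(features[0]))
```

Every residue yields exactly one window, so the predictor still outputs one state per sequence position, but each input now carries the residue's neighbors on both sides.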
