Chapter 5 Protein mutations and free energy changes


Proteins are the most abundant biological macromolecules. They are essential parts of all living organisms and participate in every process within cells (Creighton, 1993; Lehninger et al., 1993): they provide structure, catalyse cellular reactions, defend organisms against injury, transport specific molecules such as oxygen and carry out a multitude of other tasks. However, this myriad of protein functions depends on the combination of only 20 different building blocks called amino acids. The amino acid sequence specifies the spatial organization of the protein: it determines the three-dimensional structure through which the macromolecule carries out its function. The stability of the three-dimensional protein structure is thermodynamically characterized by a free energy, called the folding free energy.

Proteins are widely used in industry, where their properties are exploited, for instance, in vaccine design and in the agri-food sector. However, it can be useful to tune certain physicochemical or biological properties by substituting some amino acids with others. Such mutations can, for instance, increase the solubility of a protein or maintain its activity under unusual pH or temperature conditions.

Whatever the modified property, one has to check that the considered mutations do not alter the protein structure and stability too much. Indeed, a deterioration of structure and stability can lead to the loss of the main protein function. Of course, the experimental determination of the change in folding free energy between wild-type and mutant proteins gives the most reliable stability information, but it is time consuming and does not allow all possible mutations in a protein to be tested. Among several theoretical methods predicting stability changes caused by mutations, the PoPMuSiC (Prediction Of Proteins Mutations Stability Changes) program has been developed within the Genomic and Structural Bioinformatics Group of the Université Libre de Bruxelles. This software predicts, in silico, the changes in thermodynamic stability of a given protein under all possible single-site mutations, either in the whole sequence or in a region specified by the user. The program returns a list of the most stabilizing or destabilizing substitutions, or of the mutations that do not have any effect on the stability (Gilis and Rooman, 2000; Kwasigroch et al., 2002).

However, PoPMuSiC suffers from limitations and can be improved using recently developed techniques of protein energy evaluation, such as the statistical mean force potentials of Dehouck et al. (2006). Our work aims to enhance the performance of PoPMuSiC by combining the new energy functions of Dehouck et al. (2006) with artificial neural networks.

The first subsections introduce notions about proteins and their structure. Subsection 5.3 deals with energy functions able to evaluate protein stability. Functions of this kind are used in subsection 5.4, which presents different predictive methods estimating the stability changes caused by mutations. Since the improvement of the PoPMuSiC program is the aim of this work, the software is presented in detail together with its limitations. Finally, subsection 5.5 presents the proposed improvement of PoPMuSiC: the mean force potentials defined by Dehouck et al. (2006) are combined with neural networks in order to model the folding free energy changes. Two different kinds of neural networks are considered, as in chapter 3: the MultiLayer Perceptron and the Radial Basis Function network. Some conclusions are drawn in subsection 5.6.

5.1 Protein Structure

As mentioned in the introduction of this chapter, proteins are organic macromolecules that take part in every process within cells (Lehninger et al., 1993). Proteins transport molecules or ions, catalyse chemical reactions, provide structural rigidity to the cell, cause motion and control the flow of material through membranes, the concentrations of metabolites and gene function (Lodish et al., 1995). In spite of this functional diversity, proteins are linear polymers constructed from only 20 basic units called amino acids.

These molecules are monomers that can be represented as in figure 5.1. In this figure, the α carbon atom $C_\alpha$ is bonded to four different chemical groups: the carboxyl group (COOH), the amino group (H₂N), a hydrogen atom and finally the side chain R, which varies in size, shape, charge, hydrophobicity and reactivity among the 20 existing amino acids. Hence, amino acids are characterized by their side chain; the other atoms are identical for every amino acid and form the protein backbone.

Figure 5.1: Amino acid

Figure 5.2 presents the structure of the twenty different amino acids; they are classified according to their solubility in water. Hence, we distinguish hydrophilic and hydrophobic molecules. However, three amino acids (glycine, cysteine and proline) have special characteristics; they have a particular influence on the protein structure as we will see later in this section.


Figure 5.2: structures of the 20 amino acids grouped according to three categories: hydrophilic, hydrophobic and special amino acids (Lodish et al., 1995)

In a protein, amino acids are connected by a covalent linkage called the peptide bond. The linear chain produced by the connection of amino acids is called a polypeptide. The peptide bond is formed by a condensation reaction between the amino group of one amino acid and the carboxyl group of another; a water molecule is released in the process. This principle is illustrated in figure 5.3.

Figure 5.3: formation of a peptide bond

The amino acids joined by peptide bonds are usually called residues. Rotation around a peptide bond is limited compared with rotation around a typical C-N single bond. The reason is the partial double-bond character of the peptide bond, due to electron delocalization between the carbonyl group and the amide nitrogen. As a consequence, the atoms of the peptide bond and its adjacent $C_\alpha$ atoms lie in the same plane. However, adjacent amino acid residues are not necessarily coplanar; they can rotate around the $C_\alpha$-C and N-$C_\alpha$ bonds (figure 5.4). These rotations contribute to the flexibility of the polypeptide chain and lead to the three-dimensional structure of the protein.

Figure 5.4: flexibility of bonds in the polypeptide backbone (Lodish et al., 1995)

X-ray crystallography analysis has shown four different hierarchical levels of protein structure (Lodish et al., 1995). The primary structure is the linear arrangement, or sequence, of amino acid residues that constitutes the polypeptide chain. Protein sequencing determines the number and order of the residues composing the protein. Proteins contain a variable number of residues, ranging from around fifty to several thousand. The primary structure identifies a protein unambiguously and contains the information necessary for the adoption of a specific structure and the execution of its biological function.

Secondary structure refers to the local organization of parts of a polypeptide chain, i.e. to the arrangement in space of adjacent amino acid residues. A single polypeptide may exhibit different kinds of secondary structure. It assumes a random-coil organization when stabilizing interactions between atoms are missing. However, when stabilizing hydrogen bonds are formed between certain residues, the protein backbone folds into one of two geometric arrangements. First, in the α-helix (figure 5.5), the carbonyl oxygen of each peptide bond is hydrogen bonded to the amide hydrogen of the amino acid located 4 residues further along the chain. Hence, the peptide backbone twists into a helix with 3.6 amino acids per turn. Each step of this spiral staircase, the axial rise per residue, represents a distance of 0.15 nm along the axis. Certain amino acid sequences adopt α-helical conformations more readily than others. On the other hand, certain amino acids are rarely found in α-helical regions. For instance, proline generally does not appear in such a structure because of the cyclic ring of its side chain (figure 5.2). This ring makes proline very rigid and leaves no amide hydrogen available to form a stabilizing hydrogen bond.

Figure 5.5: model of the α-helix with, in dotted lines, hydrogen bonds (Lodish et al., 1995)
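The two helix parameters quoted above (3.6 residues per turn and an axial rise of 0.15 nm per residue) fix the remaining geometry of an ideal α-helix. The short Python sketch below works through that arithmetic; the function names are illustrative only, and the helix is assumed to be perfectly regular.

```python
# Geometry of an ideal alpha-helix, derived only from the two
# figures quoted in the text. Function names are illustrative.
RESIDUES_PER_TURN = 3.6      # amino acids per helical turn
RISE_PER_RESIDUE_NM = 0.15   # axial rise per residue, in nm


def helix_pitch_nm() -> float:
    """Axial distance covered by one full turn of the helix."""
    return RESIDUES_PER_TURN * RISE_PER_RESIDUE_NM


def rotation_per_residue_deg() -> float:
    """Angle by which the backbone turns around the axis per residue."""
    return 360.0 / RESIDUES_PER_TURN


def helix_length_nm(n_residues: int) -> float:
    """Approximate axial length of an ideal helix of n_residues residues."""
    if n_residues < 0:
        raise ValueError("n_residues must be non-negative")
    return n_residues * RISE_PER_RESIDUE_NM
```

Under these assumptions one turn spans about 0.54 nm along the axis and each residue rotates the backbone by about 100 degrees, so an ideal 20-residue helix would be roughly 3 nm long.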

Another secondary structure stabilized by hydrogen bonds is the β-sheet. It consists of laterally packed β-strands, which can be oriented antiparallel or parallel with respect to each other. Each strand is a short (5-8 residues), nearly fully extended polypeptide chain. Its backbone atoms form hydrogen bonds with the adjacent strand to produce a β-sheet. Figure 5.6 shows an example of an antiparallel β-sheet.

Figure 5.6: two-stranded β-sheet with antiparallel β-strands (Lodish et al., 1995)

Other secondary structures can be observed within a polypeptide chain. For instance, turns are U-shaped structures composed of three or four residues. Hydrogen bonds between the end residues of a turn stabilize the structure. Turns are located on protein surfaces and form sharp bends that redirect the protein backbone back toward the interior. Larger bends, or loops, are also found in polypeptide chains.

The next higher level of structure, the tertiary structure, refers to the spatial relationship among all amino acids in a polypeptide. While the secondary structure is stabilized by hydrogen bonds, tertiary structure essentially results from hydrophobic interactions between the side chains. These effective interactions hold the helices, strands and random coils in a compact internal arrangement. As for the quaternary structure, it concerns proteins with several polypeptide chains. It refers to the spatial relationship of the different protein subunits.

The three-dimensional structure can be represented in different ways, which convey different types of information. The simplest way is to trace the course of the backbone atoms with solid ribbons (figure 5.7b). This representation shows the overall organization of the polypeptide chains without considering the amino acid side chains. The most complex model shows the location of every atom; it details the interactions among atoms that form the backbone and that stabilize protein conformations (figure 5.7a). The last kind of representation is a space-filling model, which shows the texture of the protein surface and the distribution of charge. This view represents a protein as seen by another molecule (figure 5.7c).

Figure 5.7: representations of triose phosphate isomerase

5.2 Native conformation and folding free energy

As explained in the previous subsection, the biological function of a protein is determined by its amino acid sequence, which specifies the spatial organization. The native conformation of a protein is the specific three-dimensional structure that allows it to carry out its biological function. It usually corresponds to the most stable form of the protein, characterized by a strongly negative free energy. However, reaching this particular structure is not an easy task. When the amino acid sequence is translated from the genetic code, the protein has to find its native conformation among numerous alternative structures. It has to assemble itself and fold appropriately within a crowded cell environment.


Whereas most proteins fold spontaneously into their native state in vitro, the remarkable speed with which cells promote protein folding in vivo is probably due to the so-called chaperone proteins. These chaperones are located in every cell compartment and facilitate protein folding in a crowded environment. They bind and stabilize unfolded and partially folded proteins in order to protect them from damage, and they help proteins reach their most stable folded form.

5.3 Native conformation prediction and energetic functions

Since it was established that proteins adopt specific and well-defined conformations, numerous research projects have been devoted to the theoretical study of protein folding and to the development of methods able to predict the native conformation of proteins (see for example Jayachandran et al., 2006 or Arunachalam et al., 2006). Such studies aim to better understand how proteins fold so quickly and reliably, or attempt to determine the protein structure without using experimental methods such as X-ray crystallography or nuclear magnetic resonance.

As the native conformation of a protein corresponds to its most stable folded form, characterized by a strongly negative folding free energy, all the applications used to study proteins in silico rely on an energy function able to evaluate the compatibility between a given sequence and a given structure. Such an energy function has to be able to discriminate the native conformation from other structures. Two main classes of energy functions have been developed: semi-empirical potentials and data-derived ones.

The first kind of potential corresponds to analytic expressions describing the interactions between the different atoms in a protein (Lazaridis and Karplus, 2000; Gourlay et al., 2007). They are established from quantum mechanical computations or from experimental results obtained on small molecules (Brooks et al., 1983). The advantage of such semi-empirical potentials lies in the physical meaning associated with well-defined interactions. However, they have to be combined with a detailed description of the protein at the atomic level, which can lead to time-consuming computations.

The alternative to semi-empirical potentials is data-derived potentials. Such energy functions attempt to extract information about residue interactions from available structural data. Two approaches can be considered to derive such potentials from databases. The first kind of potential consists of analytical expressions optimized to maximize the energy gap between the native conformation and alternative structures (Hao and Scheraga, 1996; Pillardy et al., 2002; Loose et al., 2004). The second kind is derived from the observed relative frequencies of small sequence and structure patterns in the considered database (Rooman and Wodak, 1995; Russ and Ranganathan, 2002; Buchete et al., 2004). Such statistical functions evaluate, for instance, the propensities of residues to be associated with certain values of the backbone torsion angles (figure 5.4) or to be separated by a certain spatial distance. In contrast to semi-empirical potentials, data-derived ones are suitable for any protein description, detailed or not. However, their less clear physical meaning frequently raises questions about their validity.

These kinds of energy functions are not only used for theoretical studies of protein folding or for the prediction of native conformations. Since they evaluate protein energy, they are also used to estimate the stability changes caused by mutations, as explained in the next section.

5.4 Protein mutations and stability changes

Protein design and analysis make wide use of point mutations. The study of such sequence alterations allows a better understanding of the relationships between sequence and structure. For instance, using the experimental determination of stability changes caused by mutations, Fersht and Serrano (1993) studied the modification of the interactions that stabilize tertiary structures. They highlight the dominating influence of hydrophobic effects in the protein core, even though hydrogen bonds and electrostatic interactions are non-negligible.

On a practical level, protein design has numerous applications related to the modification of specific physicochemical or biological properties. In industrial processes, mutations at specified positions are used, for instance, to design proteins with higher solubility or proteins that remain active under non-physiological pH or temperature conditions. Nevertheless, whatever the modified property, the structure and stability of the mutant have to be checked. Of course, the experimental determination of the change in folding free energy between wild-type and mutant proteins gives the most reliable information. However, it is time consuming and thus cannot be used to test all possible mutations in a protein. This is why predictive methods are developed: they reduce the number of mutations to be tested experimentally.

The next subsection gives a non-exhaustive review of the existing methods for predicting stability changes caused by mutations. Among the various tools presented, PoPMuSiC has been developed by members of the Genomic and Structural Bioinformatics Group of the Université Libre de Bruxelles. This tool is described in detail in subsection 5.4.2.

5.4.1 Stability prediction methods

Only a few theoretical methods have been developed to estimate stability changes caused by mutations. The first ones are based on detailed atomic models combined with semi-empirical potentials (Basch et al., 1987; Tidor and Karplus, 1991). However, such methods are computationally expensive, so that it is difficult to test a large set of mutations. To avoid this problem, faster methods have been developed; they rely on coarser descriptions of the protein structure. Such predictive methods are based on statistical potentials derived from known protein structures; in particular, hydrophobic potentials (Koehl and Delarue, 1994), secondary structure potentials (Muñoz and Serrano, 1994), residue contact potentials (Miyazawa and Jernigan, 1994) or distance-dependent residue-residue interaction potentials (Sippl, 1995) are commonly used. Still others link stability changes to the shape, flexibility and volume of the substituted amino acids (van Gunsteren and Mark, 1992), to the number of some molecules in the environment of the mutated residues (Serrano et al., 1992), to the number of surrounding α-carbon atoms (Shortle et al., 1990) or to the formation of a cavity in the protein interior when a large amino acid is replaced by a smaller one (Eriksson et al., 1992). The performance of these methods is usually evaluated by comparing the estimated folding free energy changes with the measured ones, and it is reasonably good. However, most of the above-mentioned methods have been applied to mutations of residues buried in the protein core, where hydrophobic effects have a dominating influence. Moreover, the performance tests are restricted to selected mutations in a single protein, usually even at a single site. Are these methods general, and able to reproduce stability changes caused by mutations at any position in any protein? The answer is negative: as soon as mutations at different sites and in diverse proteins are mixed, the correlations between computed and measured free energy changes seem to break down.

To propose a more general prediction model, the Genomic and Structural Bioinformatics Group of the Université Libre de Bruxelles has studied the influence of various residue interactions in different ranges of solvent accessibility, i.e. at different depths in the proteins. Gilis and Rooman (1996) consider single-site mutations of solvent-accessible residues and compute the change in folding free energy using a linear combination of different types of data-derived potentials, assuming that the backbone conformation remains unchanged upon mutation. The performance of this first model is measured by the linear correlation coefficient between the computed and measured changes in folding free energy. This correlation coefficient is quite good: they obtain 0.87 on a subset of 96 out of 106 mutations, in six different proteins, at different sites and in diverse secondary structures. The 10 mutations that are difficult to model are removed from the database since they are suspected of modifying the backbone conformation. In this paper, the authors highlight that local interactions along the sequence dominate at the surface of the protein. Afterwards, Gilis and Rooman (1997) analyse the influence of various interactions for partially or totally buried mutated residues. Using a database of 190 mutants, they confirm that hydrophobic interactions dominate in the core, while local interactions remain non-negligible. The correlation coefficient of this model is equal to 0.8 in simple validation. On the basis of these two important studies, a program has been developed (Gilis and Rooman, 2000; Kwasigroch et al., 2002) and is available online at http://babylone.ulb.ac.be. PoPMuSiC (Prediction Of Proteins Mutations Stability Changes) evaluates the changes in stability of a given protein under all possible single-site mutations, either in the whole sequence or in a region specified by the user. Moreover, it returns a list of the most stabilizing or destabilizing mutations, or of the ones that do not affect stability. It has been tested on 344 experimentally studied mutations introduced at 132 different sites in seven different proteins.
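The performance figures quoted in this subsection are linear (Pearson) correlation coefficients between computed and measured changes in folding free energy. As a reminder of what that measure involves, here is a minimal Python sketch of the standard formula. The function name and the numerical values are hypothetical and serve only as an illustration; they are not data from any of the cited studies.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Linear correlation coefficient between two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up free energy changes, for illustration only.
computed = [0.5, 1.2, -0.3, 2.1, 0.9]
measured = [0.7, 1.0, -0.1, 2.5, 0.6]
r = pearson_r(computed, measured)
```

A value of r close to 1 means that the computed changes follow the measured ones up to a linear rescaling; it says nothing about the absolute error of individual predictions.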

Since 2002, other predictive methods have been developed. Guerois et al. (2002) propose a computer algorithm based on a set of semi-empirical potentials that describe the physical interactions contributing to protein stability. The linear correlation coefficient obtained for a database of 1088 mutants is equal to 0.83. However, the authors show that the statistical approach of Gilis and Rooman (2000) performs better for solvent-accessible mutated residues. Indeed, in this accessibility range, the model of Guerois et al. (2002) has a correlation coefficient of 0.68, much lower than that of Gilis and Rooman (1996). Khatun et al. (2004) test the ability of contact potentials to predict stability changes. Such energy functions are simplified inter-residue potentials in which amino acids interact only if they are located within a certain distance of each other. However, the results of this study suggest that it is not possible to predict the stability changes accurately for a set of proteins using contact potentials only. Capriotti et al. (2004) propose a neural-network-based method to predict whether a given mutation increases or decreases the thermodynamic stability of the protein with respect to the native structure. However, even if the predictor correctly classifies more than 80% of the 1615 tested mutations, the neural network used, a MultiLayer Perceptron with one hidden layer of 2 nodes and one output node, does not give any estimate of the free energy changes. Moreover, with its 43 inputs, one of which is the solvent accessibility of the mutated residue, the predictor is a black box without physical meaning. To address the first of these shortcomings, Capriotti et al. (2005) adapt their model to predict the value of the free energy change. Although the correlation coefficient of the predictor is lower (0.71) when structural information is available, the model is also able to predict stability changes when only the amino acid sequence is known, with a correlation coefficient of 0.62. Parthiban et al. (2006) and Parthiban et al. (2007) propose to model free energy changes using statistical potentials, similarly to Gilis and Rooman (2000). The advantage of this model is the amount of experimental data used to test it: 3141 mutants in 101 proteins. Finally, Huang et al. (2007) use a set of 48 physical, chemical, energetic and conformational properties of amino acid residues to estimate the stability changes. These properties are quantified, and their differences, computed for each mutant, are related to protein stability.

Among the numerous tools mentioned in the above non-exhaustive review, PoPMuSiC seems to be well regarded: it is cited by many specialists (Guerois et al., 2002; Capriotti et al., 2005; Parthiban et al., 2006). However, it suffers from some limitations and must be improved. This is the aim of this work. Before explaining the improvements that we propose, let us begin with the description of the principles of PoPMuSiC.

5.4.2 PoPMuSiC

Once the wild-type protein is provided to PoPMuSiC in the Protein Data Bank format (e.g. 1ank for E. coli adenylate kinase (Berman et al., 2000)), the algorithm evaluates all the possible single-site mutations in the whole amino acid sequence or a region specified by the user. On the basis of stability change estimations, the program returns a list of the most destabilizing, stabilizing or neutral mutations according to the user’s wishes.

As PoPMuSiC uses stability change estimations to classify mutations, it relies on a model of free energy changes. This model consists of linear combinations of statistical potentials, which differ according to the solvent-accessibility range of the mutated residue. Two kinds of data-derived potentials have been selected. First, distance potentials describe interactions between pairs of residues; they model, among others, hydrophobic and electrostatic interactions. Secondly, torsion potentials concern local interactions along the sequence; they describe the correlations between residue types and backbone torsion angles of residues close along the sequence (figure 5.4). Consequently, they depict the secondary structure.

The next section recalls the form of statistical potentials before explaining how to derive them from a database of known protein structures. Afterwards, section 5.4.2.2 describes the selected potentials, which are based on a specific protein representation, while section 5.4.2.3 deals with the linear combinations used to model stability changes.

5.4.2.1 Data-derived potentials

For many years, the amount of experimental structural information has been increasing continuously. In order to exploit this important source of information to predict protein structure, many authors attempt to extract statistical relationships from databases of known protein structures. From such relationships, one can deduce energy functions able to describe the complex set of interactions observed in proteins. Often called data-derived, such potentials can be divided into two groups. The first kind corresponds to analytical expressions optimized to maximize the energy gap between the native conformation and alternative structures (Hao and Scheraga, 1996; Pillardy et al., 2002; Loose et al., 2004). The second kind is derived from the observed relative frequencies of small sequence and structure patterns in a database of known protein structures (Rooman and Wodak, 1995; Russ and Ranganathan, 2002; Buchete et al., 2004).

In order to avoid imposing any prior knowledge about the existing interactions within proteins, the authors of PoPMuSiC have rejected the first potential kind based on analytical expressions and have selected statistical potentials which have the nature of a free energy.

In order to derive the considered statistical potentials from known protein structures, the observation probabilities of sequence and structure patterns have to be computed. This requires the definition of sequence and structure elements. According to the selected protein description, the protein sequence S is divided into sequence elements s, such as residues, pairs of residues or residue triplets, and the conformation C is divided into structural states c, such as ranges of backbone torsion angles or inter-residue distances. On the basis of these sequence and structure elements, a statistical potential is given by

$$\Delta G(c_i, s_j) = -kT \ln \frac{P(c_i, s_j)}{P(c_i)\,P(s_j)} \tag{5.1}$$

where the indices i and j indicate the positions along the sequence of the structural state c and sequence element s. k and T represent the Boltzmann constant and the conformational temperature. As for the probabilities, $P(c_i)$ and $P(s_j)$ are the observation probabilities of, respectively, $c_i$ and $s_j$ in the considered dataset, while $P(c_i, s_j)$ is the joint observation probability of $c_i$ and $s_j$. Note that these probabilities can be approximated by the relative observation frequencies F within the considered database.

$$\Delta G(c_i, s_j) \approx -kT \ln \frac{F(c_i, s_j)}{F(c_i)\,F(s_j)} \tag{5.2}$$
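To make the frequency-based estimate (5.2) concrete, the Python sketch below derives a potential from a list of observed (structural state, sequence element) pairs. It is only an illustration under simplifying assumptions: the function name and the toy observations are hypothetical, all observations are pooled regardless of the positions i and j, and energies are expressed in units of kT. Pairs that never occur in the data receive no entry, since the logarithm of a zero frequency is undefined.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple

Pair = Tuple[Hashable, Hashable]


def statistical_potential(observations: Iterable[Pair],
                          kT: float = 1.0) -> Dict[Pair, float]:
    """Estimate -kT ln[F(c,s) / (F(c) F(s))] from observed (c, s) pairs."""
    pairs = list(observations)
    total = len(pairs)
    if total == 0:
        raise ValueError("no observations")
    joint = Counter(pairs)                    # counts of (c, s)
    c_counts = Counter(c for c, _ in pairs)   # counts of c alone
    s_counts = Counter(s for _, s in pairs)   # counts of s alone
    potential = {}
    for (c, s), n_cs in joint.items():
        f_cs = n_cs / total
        f_c = c_counts[c] / total
        f_s = s_counts[s] / total
        potential[(c, s)] = -kT * math.log(f_cs / (f_c * f_s))
    return potential


# Toy observations (hypothetical): torsion domain label, residue type.
toy = ([("A", "ALA")] * 3 + [("B", "ALA")] * 1 +
       [("A", "GLY")] * 1 + [("B", "GLY")] * 3)
toy_potential = statistical_potential(toy)
```

In this toy set, alanine is seen in domain A more often than expected if sequence and structure were uncorrelated, so the pair ("A", "ALA") gets a negative (favourable) value, while ("B", "ALA") gets a positive one.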

From such a potential, an estimation of the folding free energy of a protein constituted of N residues, with a sequence S and a conformation C is given by

$$\Delta G(C, S) \approx \sum_{i,j} \Delta G(c_i, s_j) \tag{5.3}$$

This sum of contributions from the numerous pairs $(c_i, s_j)$ is a crude approximation of the folding free energy, since these contributions are not necessarily independent, as we will see in section 5.5.1.1.

As for the interpretation of equations (5.1), (5.2) and (5.3), $\Delta G(C, S)$ corresponds to the energy difference between the observed combinations of structure and sequence elements in the considered protein and a reference state where structure and sequence are uncorrelated. Note that the folding free energy is also measured relative to a reference state:

$$\Delta G_{\mathrm{measured}}(C_n, S; C_r, S) = G(C_n, S) - G(C_r, S) \tag{5.4}$$

where S is the sequence of the considered protein while $C_n$ and $C_r$ are the native and reference conformations, respectively. This reference conformation usually corresponds to the unfolded state of the protein, for which sequence and structure are assumed to be uncorrelated. Different experimental means can be used to obtain the unfolded, or denatured, form of a protein (Lehninger et al., 1993). For instance, the non-covalent bonds stabilizing the native conformation can be disrupted by heat, extreme pH or chemicals such as urea. However, the theoretical reference state used to compute $\Delta G(C, S)$ does not necessarily correspond to the unfolded state of the protein.

The next section presents the selected data-derived potentials used in PoPMuSiC to model the free energy changes caused by mutations.

5.4.2.2 Selected potentials and Protein representation

Two kinds of statistical potentials have been selected in PoPMuSiC to model the free energy changes caused by mutations. Distance potentials describe interactions between residues close to each other in space, while torsion potentials concern local interactions, such as the correlations between residue types and backbone torsion angles of residues close along the sequence (figure 5.4). Although a detailed protein description is compatible with such potentials, PoPMuSiC relies on a simplified protein representation. The macromolecules are represented by their main chain atoms (N, $C_\alpha$, C and O; figure 5.1) and by pseudo-atoms $C_\mu$, which model the side chains. For each amino acid type, $C_\mu$ is defined as the geometric centre of all heavy (non-hydrogen) side chain atoms of the considered residue type in a dataset of known structures (figure 5.8; Kocher et al., 1994). Therefore, $C_\mu$ has a well-defined position for each amino acid type, which amounts to neglecting the side chain degrees of freedom.
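The definition of the pseudo-atom as a geometric centre can be sketched in a few lines of Python. This is an illustration only, not the procedure of Kocher et al. (1994): the function names and the input layout are hypothetical, and the coordinates are assumed to be already expressed in a common backbone-fixed frame so that side chains from different structures can be averaged together. Residue types without heavy side chain atoms (glycine) are simply skipped here, which is a choice of this sketch.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def geometric_centre(points: Sequence[Point]) -> Point:
    """Unweighted mean position of a non-empty set of 3D points."""
    if not points:
        raise ValueError("no points given")
    n = len(points)
    return (sum(p[0] for p in points) / n,
            sum(p[1] for p in points) / n,
            sum(p[2] for p in points) / n)


def pseudo_atoms(side_chains: Dict[str, List[List[Point]]]) -> Dict[str, Point]:
    """One pseudo-atom per residue type.

    side_chains maps a residue type to the observed side chains of that
    type, each given as a list of heavy-atom coordinates in a common
    backbone-fixed frame (an assumption of this sketch).
    """
    centres = {}
    for res_type, observed in side_chains.items():
        atoms = [atom for chain in observed for atom in chain]
        if atoms:  # skip types with no heavy side chain atoms
            centres[res_type] = geometric_centre(atoms)
    return centres


# Toy coordinates (hypothetical), two observations of one residue type.
toy_side_chains = {
    "ALA": [[(1.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)]],
    "GLY": [[]],
}
toy_centres = pseudo_atoms(toy_side_chains)
```

Because a single averaged position is kept per residue type, any information about individual side chain conformations is lost, which is the simplification described in the text.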

As for the backbone conformation of a protein, it is defined by the torsion angles of its residues (figure 5.9). These angles are grouped in seven domains called A, C, B, P, G, E and O whose limits can be found in figure 5.10.

Figure 5.8: examples of $C_\mu$ atom positions for 2 different amino acids represented with all the side chain conformations in a database of known protein structures (Kocher et al., 1994)

Figure 5.9: backbone torsion angles (Dehouck, 2005)


Figure 5.10: Ramachandran plot which represents the accessible ranges for the three backbone torsion angles (Dehouck, 2005)

On the basis of this protein representation, two statistical potentials have been selected to estimate the free energy changes caused by mutations. These energy functions are computed from the observed frequencies of sequence and structure patterns in a dataset of high-resolution protein X-ray structures. These reference proteins are chosen to present less than 25% sequence identity or no structure similarity with the proteins to be mutated. Indeed, too strong a similarity with the tested mutants may bias the values of the energy functions.

Torsion potentials describe only local interactions along the sequence. They account for the propensities of $(\varphi, \psi, \omega)$ backbone torsion angle domains, and of pairs of such domains, to be associated with a given residue. Their expressions are the following:

$$\Delta G_{ts}(t_i, s_j) \approx -kT \ln \frac{F(t_i, s_j)}{F(t_i)\,F(s_j)} \tag{5.5}$$

$$\Delta G_{tts}(t_i, t_k, s_j) \approx -kT \ln \frac{F(t_i, t_k, s_j)}{F(t_i, t_k)\,F(s_j)} \tag{5.6}$$

where $t_i$ and $t_k$ represent the torsion domains observed at positions i and k along the amino acid sequence while $s_j$ corresponds to the nature of the residue at position j.
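As a rough illustration of equation (5.5), the sketch below derives a torsion potential from a list of observed (torsion domain, residue type) pairs, using relative frequencies for F. The function name `torsion_potential`, the toy observations and the value of `KT` are assumptions of this example. Pairs that are never observed get no entry, since no correction for sparse data is attempted here.

```python
import math
from collections import Counter

KT = 0.5925  # approximate kT in kcal/mol near 298 K (assumed for the example)


def torsion_potential(observations):
    """Return {(t, s): -kT ln[F(t, s) / (F(t) F(s))]} from (t, s) pairs."""
    n = len(observations)
    pair_counts = Counter(observations)
    t_counts = Counter(t for t, _ in observations)
    s_counts = Counter(s for _, s in observations)
    potential = {}
    for (t, s), count in pair_counts.items():
        f_ts = count / n
        f_t = t_counts[t] / n
        f_s = s_counts[s] / n
        potential[(t, s)] = -KT * math.log(f_ts / (f_t * f_s))
    return potential


# Toy data: domain "A" is over-represented with Ala, under-represented with Gly.
observations = (
    [("A", "Ala")] * 6 + [("A", "Gly")] * 2
    + [("B", "Ala")] * 2 + [("B", "Gly")] * 6
)
potential = torsion_potential(observations)
```

With these toy counts the (A, Ala) association is observed more often than expected from the marginal frequencies and receives a negative (favourable) energy, while (A, Gly) receives a positive one.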

Thanks to these potentials (5.5) and (5.6), contributions to the folding free energy can be defined by

$$\Delta G_{ts}(C, S) \approx \sum_{i,j} \Delta G_{ts}(t_i, s_j) \tag{5.7}$$

$$\Delta G_{tts}(C, S) \approx \sum_{i,j,k} \Delta G_{tts}(t_i, t_k, s_j) \tag{5.8}$$

However, the torsion contributions are not used in these forms: the concepts of short-range and middle-range interactions are introduced. Equations (5.7) and (5.8) are unchanged, but conditions are imposed on the indices i and k. For the torsion short-range contributions, i and k lie between j-1 and j+1 ($j - 1 \le i, k \le j + 1$), while for the torsion middle-range contributions they lie between j-8 and j+8 ($j - 8 \le i, k \le j + 8$).
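The index conditions can be sketched for the pair term of equation (5.7) as follows. The function name `windowed_torsion_energy`, the choice of keying the potential on the offset i - j, and the toy values are assumptions of this example. The triplet term of equation (5.8) would be restricted in the same way on both i and k.

```python
def windowed_torsion_energy(domains, sequence, potential, window):
    """Sum dG_ts(t_i, s_j) over all i, j with j - window <= i <= j + window.

    `potential` maps (i - j, t_i, s_j) to an energy. Keying on the offset
    i - j is an assumption of this sketch; missing entries count as 0.
    """
    if len(domains) != len(sequence):
        raise ValueError("domains and sequence must have the same length")
    n = len(sequence)
    total = 0.0
    for j in range(n):
        for i in range(max(0, j - window), min(n, j + window + 1)):
            total += potential.get((i - j, domains[i], sequence[j]), 0.0)
    return total


# Toy three-residue example with invented energies.
domains = "AAB"    # torsion domain of each residue
sequence = "GAG"   # one-letter residue types
toy_potential = {(0, "A", "G"): -1.0, (1, "B", "A"): 0.5, (-1, "A", "G"): 0.25}

short_range = windowed_torsion_energy(domains, sequence, toy_potential, window=1)
middle_range = windowed_torsion_energy(domains, sequence, toy_potential, window=8)
```

Setting `window=1` gives the short-range condition and `window=8` the middle-range one; on a real protein the two sums differ because the larger window includes many more (i, j) pairs.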

As for distance potentials, $C_\mu$-$C_\mu$ potentials dominated by non-local and hydrophobic interactions are considered. They are based on the propensities of pairs of amino acids $(s_i, s_j)$ at positions i and j along the sequence to be separated by a spatial distance $d_{ij}$, computed between the average side chain centroids $C_\mu$ (Kocher et al., 1994). Once more, two variants of $C_\mu$-$C_\mu$ potentials are considered. First, the $C_\mu$-$C_\mu$ long-range potential describes purely non-local interactions along the sequence and only considers pairs of residues at least 15 positions apart along the sequence ($j \ge i + 15$). Secondly, the potential simply called $C_\mu$-$C_\mu$ is dominated by non-local interactions but also has a more local component. Its non-local component is obtained from the frequencies of all residue pairs with seven or more residues between them ($i + 8 \le j$), while the local component is obtained from the frequencies of residue pairs with one to six residues between them ($i + 1 < j < i + 8$).
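The sequence-separation conditions quoted above can be summarized in a small helper that tells which potential components a residue pair contributes to. The function name `separation_components` and the component labels are hypothetical; the thresholds are those of the inequalities in the text.

```python
def separation_components(i, j):
    """Return the C_mu-C_mu components fed by the residue pair (i, j).

    Thresholds follow the inequalities of the text:
      long-range potential:        j >= i + 15
      non-local component:         j >= i + 8
      local component:     i + 1 < j <  i + 8
    The pair is treated symmetrically (only |j - i| matters).
    """
    separation = abs(j - i)
    components = set()
    if separation >= 15:
        components.add("long-range")
    if separation >= 8:
        components.add("non-local")
    elif 2 <= separation <= 7:
        components.add("local")
    return components


examples = {sep: separation_components(0, sep) for sep in (1, 2, 7, 8, 15)}
```

Adjacent residues (separation 1) contribute to none of the components, and pairs at least 15 positions apart contribute both to the long-range potential and to the non-local component of the $C_\mu$-$C_\mu$ potential.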

Once the statistical potentials are defined, they can be computed from the frequencies observed within the protein database. From the resulting potential values and the structures of the considered mutants, the free energy contributions can be calculated: each kind of structural element observed in the mutant is characterized by its observation frequencies within the database of wild-type proteins.

Once the free energy contributions are computed, they have to be combined to determine the free energy changes caused by the considered mutation. This is the subject of the next section.

5.4.2.3 Folding free energy changes

To estimate the stability changes caused by single-site mutations, the folding free energy change has to be computed. Its expression is the following:

$$\Delta\Delta G(S_m, C_m; S_w, C_w) = \Delta G(S_m, C_m) - \Delta G(S_w, C_w) \tag{5.9}$$

where $C_m$ and $C_w$ are the mutant and wild-type conformations and $S_m$ and $S_w$ the mutant and wild-type sequences, respectively. With this convention, $\Delta\Delta G$ is positive when the mutation is destabilizing and negative when it is stabilizing.

Note that $C_m$ and $C_w$ are assumed to be nearly identical. More precisely, the backbone conformations are identical while the position of the pseudo-atom $C_\mu$ of the mutated residue is different in the mutant and the wild-type structures.

As mentioned in the introduction of section 5.4.2, the authors of PoPMuSiC propose to compute the folding free energy changes $\Delta\Delta G$ as linear combinations of the statistical potentials described in the previous section. Previous analyses (Gilis and Rooman, 1996; Gilis and Rooman, 1997), based on a test set of 344 experimentally studied mutations introduced in seven proteins and a synthetic peptide, show that the optimal combination depends on the solvent accessibility of the mutated residue. This accessibility A is computed as the ratio of the solvent-accessible surface of the mutated residue in the wild-type structure to its solvent-accessible surface in an extended tripeptide Gly-X-Gly (Rose et al., 1985). In the work of Gilis and Rooman (2000), the solvent accessibility is computed with the SurVol algorithm (Alard, 1991) and is expressed as a percentage.
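The definition of the accessibility A can be written as a one-line ratio. The sketch below assumes that the two solvent-accessible surfaces have already been computed by some external tool; the function name `relative_accessibility` and the surface values are invented for the example, and SurVol itself is not reproduced.

```python
def relative_accessibility(surface_in_protein, surface_in_tripeptide):
    """Return A in percent: surface of the residue in the wild-type
    structure over its surface in an extended Gly-X-Gly tripeptide."""
    if surface_in_tripeptide <= 0:
        raise ValueError("the reference surface must be positive")
    return 100.0 * surface_in_protein / surface_in_tripeptide


# Invented surfaces (same unit for both, for instance square angstroms).
accessibility = relative_accessibility(30.0, 120.0)
```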

Gilis and Rooman (2000) propose to divide the mutations into three subsets depending on the solvent accessibility of the mutated residue. A first linear combination applies to mutated residues at the protein surface, with a solvent accessibility $A \ge 50\%$. A second linear combination applies to half-buried residues, with a solvent accessibility between 20 and 40%. Finally, a third linear combination applies to totally buried residues, with a solvent accessibility below 20%.

$$\Delta\Delta G_{A \ge 50\%} = 0.87\,\Delta\Delta G_{\text{torsion}}^{\text{short-range}} - 0.25\ \text{kcal/mol} \tag{5.10}$$

$$\Delta\Delta G_{20\% \le A < 40\%} = 0.89\,\Delta\Delta G_{\text{torsion}}^{\text{short-range}} + 0.62\,\Delta\Delta G_{C_\mu - C_\mu} + 0.47\ \text{kcal/mol} \tag{5.11}$$

$$\Delta\Delta G_{A < 20\%} = 0.42\,\Delta\Delta G_{\text{torsion}}^{\text{middle-range}} + 1.05\,\Delta\Delta G_{C_\mu - C_\mu}^{\text{long-range}} + 1.18\ \text{kcal/mol} \tag{5.12}$$

In these equations, the parameters have been determined by trial and error and the $\Delta\Delta G_i$ (i = torsion short-range, $C_\mu$-$C_\mu$, torsion middle-range, $C_\mu$-$C_\mu$ long-range) correspond to the difference between the previously computed statistical potentials for the mutant and the wild-type protein:

$$\Delta\Delta G_i = \Delta G_i(S_m, C_m) - \Delta G_i(S_w, C_w) \tag{5.13}$$

As for the interpretation of equations (5.10-5.12), these expressions show that the weight of the local interactions described by the torsion potentials increases towards the protein surface, whereas the weight of the hydrophobic interactions described by the $C_\mu$-$C_\mu$ potentials increases towards the protein core.
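Equations (5.10) to (5.12) amount to a piecewise linear model selected by the accessibility A. The sketch below is one possible transcription. The function name `popmusic_ddg` and the dictionary keys are hypothetical, the coefficients are read from the equations as printed, and the treatment of the boundaries at exactly 20% and 40% is an assumption. No combination is defined for accessibilities between 40% and 50%, so the function returns `None` there.

```python
def popmusic_ddg(accessibility, terms):
    """Evaluate equations (5.10)-(5.12) for one mutation.

    accessibility: solvent accessibility A of the mutated residue, in percent.
    terms: the ddG_i of equation (5.13), in kcal/mol, under the hypothetical
        keys 'torsion_short', 'torsion_middle', 'cmu_cmu', 'cmu_cmu_long'.
    Returns the predicted ddG in kcal/mol, or None when 40 <= A < 50.
    """
    if accessibility >= 50.0:  # surface residues, eq. (5.10)
        return 0.87 * terms["torsion_short"] - 0.25
    if 20.0 <= accessibility < 40.0:  # half-buried residues, eq. (5.11)
        return 0.89 * terms["torsion_short"] + 0.62 * terms["cmu_cmu"] + 0.47
    if accessibility < 20.0:  # buried residues, eq. (5.12)
        return (0.42 * terms["torsion_middle"]
                + 1.05 * terms["cmu_cmu_long"] + 1.18)
    return None  # no linear combination is defined in this range


# Invented ddG_i values, all set to 1 kcal/mol for readability.
toy_terms = {"torsion_short": 1.0, "torsion_middle": 1.0,
             "cmu_cmu": 1.0, "cmu_cmu_long": 1.0}
surface = popmusic_ddg(60.0, toy_terms)
half_buried = popmusic_ddg(30.0, toy_terms)
buried = popmusic_ddg(10.0, toy_terms)
undefined = popmusic_ddg(45.0, toy_terms)
```

Only the torsion term enters at the surface, whereas both terms of the buried model are of longer range, which matches the interpretation given in the text.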

The next section discusses the performances of PoPMuSiC as well as its limitations.

5.4.2.4 PoPMuSiC performances and limitations

The tests carried out to validate the method described above for predicting stability changes upon single-point mutations confirm the good performance of the PoPMuSiC program. The linear correlation coefficients between measured and computed $\Delta\Delta G$ lie between 0.80 and 0.87 for 279 of the 344 mutations in the test set. The considered mutations are introduced in various environments of seven different proteins and a synthetic peptide, which suggests that the approach is general.

However, in spite of its predictive power, PoPMuSiC suffers from limitations. Among the 344 experimentally studied mutations, only 279 give good results. Indeed, no linear combination of the considered potentials makes it possible to evaluate the mutations of residues with a solvent accessibility between 40% and 50%. Moreover, if a mutation causes drastic structural rearrangements in the protein, equations (5.10-5.12) are not valid as they rely on the assumption that the wild-type and mutant structures are very similar. Since mutations involving proline are particularly likely to modify the backbone structure, they are excluded, or the corresponding predictions are at least handled with care.

However, single-point mutations which affect the protein structure constitute a small minority of all possible mutations; the inability to represent such a phenomenon is thus a minor limitation.

Another weak point of PoPMuSiC is the absence of classical cross-validation. However, the statistical significance of the proportionality coefficients of the model (5.10-5.12) is tested with the following procedure (Gilis and Rooman, 1997):

1. 20 randomly chosen mutations are excluded from the original learning set;

2. the optimal weight factors are determined as well as the correlation coefficient on the remaining mutations;

3. these two steps are repeated 1000 times to assess the reproducibility of the results.
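The three steps above can be sketched as a simple resampling loop. To keep the example short, the model is reduced to a single computed term fitted by least squares, which is a simplification of the multi-term combinations (5.10-5.12). The function names, the synthetic data and the random seed are assumptions of this example, not the procedure as implemented by Gilis and Rooman.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares slope, intercept and Pearson r for y = a * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / math.sqrt(sxx * syy)


def leave_out_resampling(xs, ys, n_excluded=20, n_rounds=1000, seed=0):
    """Repeat: exclude n_excluded random mutations, refit on the others."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    fits = []
    for _ in range(n_rounds):
        excluded = set(rng.sample(indices, n_excluded))        # step 1
        kept = [k for k in indices if k not in excluded]
        fits.append(fit_line([xs[k] for k in kept],            # step 2
                             [ys[k] for k in kept]))
    return fits                                                # step 3


# Synthetic "computed" and "measured" values with a small alternating error.
xs = [0.1 * k for k in range(60)]
ys = [0.87 * x - 0.25 + 0.05 * (-1) ** k for k, x in enumerate(xs)]
fits = leave_out_resampling(xs, ys)
slopes = [slope for slope, _, _ in fits]
```

The spread of `slopes` (and of the correlation coefficients) over the repetitions indicates how sensitive the fitted weights are to the composition of the learning set.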

Finally, some authors (Parthiban et al., 2006) criticize PoPMuSiC for the limited size of the databases used to validate it. Today, large mutant databases, like ProTherm (Gromiha et al., 2000), exist and are exploited by other teams working on the subject.

5.5 PoPMuSiC improvements

Considering the good performance of PoPMuSiC and its limitations, it seems worthwhile to improve the program. First, since the development of PoPMuSiC, large databases of known protein structures have been built. In order to exploit this large amount of structural data, more complex potentials can be derived. Such informative energy functions can consist of residue contact potentials or distance potentials depending simultaneously on the residue solvent accessibility, the backbone conformation or the relative orientation of the side chains. They can also be local potentials which characterize the propensities of the different amino acids to be associated with certain local chain conformations. For instance, such potentials can correlate several residues with several backbone torsion angles at different positions. Such approaches make it possible to compensate for some of the interdependence between potentials, as explained in the next section.

The second possible improvement of PoPMuSiC concerns the structure of the free energy model. Since local potentials dominate at the protein surface while hydrophobic interactions are stronger in the protein core, the idea is to replace the folding free energy model (5.10-5.12) by a single linear combination of statistical potentials whose proportionality coefficients vary according to the solvent accessibility.

Hence, we propose to exploit the more complex statistical potentials developed in Dehouck (2005) and Dehouck et al. (2006), derived systematically from the database $DB_{1403}$ of 1403 known structures (Appendix 5.1). As for the variable proportionality coefficients, we adopt artificial neural networks. Two neural network solutions are tested. First, radial basis function networks are used to cover the different solvent accessibility ranges; secondly, a multilayer perceptron is used to activate or not a potential according to the solvent accessibility of the mutated residue. These structures are identified and validated on the ProTherm database (Gromiha et al., 2000).
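The idea of accessibility-dependent coefficients can be pictured with Gaussian radial basis functions. The sketch below is only one possible reading of that idea and is not the network architecture evaluated in this chapter. The function names, the centres, the width, the amplitudes and the $\Delta\Delta G_i$ values are all hypothetical.

```python
import math


def rbf(accessibility, centre, width):
    """Gaussian radial basis function of the accessibility (in percent)."""
    return math.exp(-((accessibility - centre) ** 2) / (2.0 * width ** 2))


def variable_weight(accessibility, centres, width, amplitudes):
    """Weight w(A) written as a sum of Gaussian basis functions."""
    return sum(amp * rbf(accessibility, centre, width)
               for amp, centre in zip(amplitudes, centres))


def combined_ddg(accessibility, terms, amplitudes,
                 centres=(10.0, 30.0, 75.0), width=15.0):
    """Single linear combination sum_i w_i(A) * ddG_i.

    `terms` maps a potential name to its ddG_i and `amplitudes` maps the
    same names to one amplitude per centre. Every value is hypothetical.
    """
    return sum(
        variable_weight(accessibility, centres, width, amplitudes[name]) * value
        for name, value in terms.items()
    )


# A torsion term active near the surface, a distance term active in the core.
toy_terms = {"torsion": 1.0, "distance": 2.0}
toy_amplitudes = {"torsion": (0.0, 0.0, 1.0), "distance": (1.0, 0.0, 0.0)}
at_surface = combined_ddg(75.0, toy_terms, toy_amplitudes)
in_core = combined_ddg(10.0, toy_terms, toy_amplitudes)
```

Unlike the three separate equations (5.10-5.12), such weights vary smoothly with A, so no accessibility range is left without a model.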

The next two subsections present the considered mean force potentials and the proposed artificial neural networks, while subsection 5.5.3 describes our mutant database before showing the different results.

5.5.1 Mean force potentials

In Dehouck (2005) and Dehouck et al. (2006), a general procedure for deriving complex potentials has been developed. It yields energy functions based simultaneously on several descriptors of sequence and structure.

Typically, local potentials analyse the correlations between the residue type, the backbone torsion angles and the solvent-accessibility of residues in close vicinity along the sequence whereas the influence of these three descriptors on the spatial distance between two residues is considered within distance potentials. The performances of these energetic functions have been assessed on the basis of their ability to discriminate genuine proteins from decoy models.

Note that these potentials rely on the protein representation used in PoPMuSiC and described in section 5.4.2.2.

5.5.1.1 Local potentials

Based on backbone torsion angles

From equation (5.1), it is possible to develop a basic potential $\Delta G_{ts}$ which describes the propensities of the different amino acid types to be associated with given backbone torsion angle ranges:

$$\Delta G_{ts}(t_i, s_j) = -kT \ln \frac{P(t_i, s_j)}{P(t_i)\,P(s_j)} \tag{5.14}$$

From such a potential, an estimation of the folding free energy of a protein constituted of N residues, with a sequence S and a conformation C, is given by

$$\Delta G_{ts}(C, S) \approx \sum_{i,j} \Delta G_{ts}(t_i, s_j) \tag{5.15}$$

where the sum runs over a window $|i - j| \le F_{loc}$, with $F_{loc} = 2$ the window size whose value was determined during the assessment of the statistical potentials (Dehouck, 2005; Dehouck et al., 2006). Such a basic potential is already used in PoPMuSiC (eq. 5.5). However, the sum (5.15) of the different contributions related to the numerous couples $(t_i, s_j)$ is a strong approximation since these contributions are not necessarily independent. For instance, the conformation of a residue depends not only on the nature of the amino acid but also on the type of the neighbouring residues; this dependence is too complex to be reduced to the single sum (5.15). In order to avoid such a problem, other potentials can be introduced. For instance, a small sequence element can be defined by a pair of residues $(s_j, s_k)$ instead of a single residue. This makes it possible to express the correlation between the conformation of a residue at position i and the nature of the residues at positions j and k.

$$\Delta G'_{tss}(t_i, s_j, s_k) = -kT \ln \frac{P(t_i, s_j, s_k)}{P(t_i)\,P(s_j, s_k)} \tag{5.16}$$

However, this potential cannot be directly summed on the set of triplets at positions i, j and k. Actually, the contribution related to one triplet includes effects of the couples $(t_i, s_j)$ and $(t_i, s_k)$ and thus is not independent of the contribution related to the triplet $(t_i, s_l, s_k)$. In order to cancel out this interdependence, Dehouck (2005) and Dehouck et al. (2006) propose a new kind of potential which is obtained by rewriting equation (5.16).

$$\Delta G'_{tss}(t_i, s_j, s_k) = -kT \ln \left[ \frac{P(t_i, s_j)}{P(t_i)\,P(s_j)} \cdot \frac{P(t_i, s_k)}{P(t_i)\,P(s_k)} \cdot \frac{P(t_i, s_j, s_k)\,P(t_i)\,P(s_j)\,P(s_k)}{P(t_i, s_j)\,P(t_i, s_k)\,P(s_j, s_k)} \right] \tag{5.17}$$

Indeed, thanks to this mathematical development, the terms related to couple effects ($\Delta G_{ts}(t_i, s_j)$, $\Delta G_{ts}(t_i, s_k)$) can be isolated while a new potential $\Delta G_{tss}(t_i, s_j, s_k)$, called coupling term, compensates the interdependence of the first two terms.

$$\Delta G'_{tss}(t_i, s_j, s_k) = \Delta G_{ts}(t_i, s_j) + \Delta G_{ts}(t_i, s_k) + \Delta G_{tss}(t_i, s_j, s_k) \tag{5.18}$$

where

$$\Delta G_{tss}(t_i, s_j, s_k) = -kT \ln \frac{P(t_i, s_j, s_k)\,P(t_i)\,P(s_j)\,P(s_k)}{P(t_i, s_j)\,P(t_i, s_k)\,P(s_j, s_k)} \tag{5.19}$$

Hence, the independent contributions of the different triplets $\Delta G_{tss}(t_i, s_j, s_k)$ can be summed on positions i, j and k.

$$\Delta G_{tss}(C, S) \approx \sum_{i,j,k} \Delta G_{tss}(t_i, s_j, s_k) \tag{5.20}$$

where $|i - j| \le F_{loc}$, $|i - k| \le F_{loc}$, $|j - k| \le F_{loc}$, with $F_{loc} = 2$ the window size determined during the assessment of the potentials.
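The decomposition (5.18) can be checked numerically on a toy set of observed triplets: the triplet potential of equation (5.16) should equal the two couple terms of equation (5.14) plus the coupling term of equation (5.19). In the sketch below, the function name `coupling_decomposition`, the value of `KT` and the triplet counts are invented, and relative frequencies stand in for the probabilities P.

```python
import math

KT = 0.5925  # approximate kT in kcal/mol near 298 K (assumed for the example)


def coupling_decomposition(triplets, t, sj, sk):
    """Return (dG'_tss, dG_ts(t, sj), dG_ts(t, sk), dG_tss) for one triplet,
    from a list of observed (t_i, s_j, s_k) triplets."""
    n = len(triplets)

    def freq(predicate):
        return sum(1 for triplet in triplets if predicate(*triplet)) / n

    p_tjk = freq(lambda a, b, c: (a, b, c) == (t, sj, sk))
    p_t = freq(lambda a, b, c: a == t)
    p_j = freq(lambda a, b, c: b == sj)
    p_k = freq(lambda a, b, c: c == sk)
    p_tj = freq(lambda a, b, c: (a, b) == (t, sj))
    p_tk = freq(lambda a, b, c: (a, c) == (t, sk))
    p_jk = freq(lambda a, b, c: (b, c) == (sj, sk))

    full = -KT * math.log(p_tjk / (p_t * p_jk))                 # eq. (5.16)
    couple_j = -KT * math.log(p_tj / (p_t * p_j))               # eq. (5.14)
    couple_k = -KT * math.log(p_tk / (p_t * p_k))               # eq. (5.14)
    coupling = -KT * math.log(
        p_tjk * p_t * p_j * p_k / (p_tj * p_tk * p_jk))         # eq. (5.19)
    return full, couple_j, couple_k, coupling


# Invented observations of (torsion domain, residue j, residue k).
triplets = (
    [("A", "G", "L")] * 3 + [("A", "G", "V")] * 1 + [("B", "G", "L")] * 1
    + [("A", "P", "L")] * 2 + [("B", "P", "V")] * 3
)
full, couple_j, couple_k, coupling = coupling_decomposition(
    triplets, "A", "G", "L")
residual = full - (couple_j + couple_k + coupling)
```

The residual vanishes up to rounding, which confirms that the coupling term carries exactly what the two couple terms leave out of the triplet potential.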

Equation (5.18) suggests using the two kinds of potential together to estimate the folding free energy: first, $\Delta G_{ts}$ (eq. 5.14), which represents the contribution due to couples of sequence and structure elements and, secondly, $\Delta G_{tss}$, which corresponds to the sum of the independent elements $\Delta G_{tss}(t_i, s_j, s_k)$ and compensates the interdependence of the $\Delta G_{ts}(t_i, s_j)$.

In the same way, a coupling term $\Delta G_{tts}(t_i, t_j, s_k)$ can be defined to take the possible interdependence of $\Delta G_{ts}(t_i, s_k)$

) and ∆ G

_ts

( t

_j

, s

_k