Some physical approaches to protein folding


HAL Id: jpa-00246718

https://hal.archives-ouvertes.fr/jpa-00246718

Submitted on 1 Jan 1993


J. Bascle, T. Garel, Henri Orland

To cite this version:

J. Bascle, T. Garel, Henri Orland. Some physical approaches to protein folding. Journal de Physique I, EDP Sciences, 1993, 3 (2), pp.259-275. 10.1051/jp1:1993128. jpa-00246718


J. Phys. I France 3 (1993) 259-275, FEBRUARY 1993, PAGE 259

Classification
Physics Abstracts
05.90, 61.40D, 87.10

Some physical approaches to protein folding

J. Bascle, T. Garel and H. Orland

Service de Physique Théorique (*), CE-Saclay, 91191 Gif-sur-Yvette Cedex, France

(Received 15 May 1992, accepted 5 June 1992)

Résumé. Protein folding is a problem with numerous biological implications. In this article, we present a physicist's point of view in two different ways. We first introduce simple statistical mechanics models that exhibit folding transitions in the thermodynamic limit. These models can be divided into (i) spin glasses (possibly à la Mattis), where one can look for correlations between the intrachain interactions and the folded structure, and (ii) glasses, where the emphasis is on the geometrical competition between one- or two-dimensional local order (which models $\alpha$ helix or $\beta$ sheet structures) and the global constraint of compactness. Both types of models are too simple for the study of real proteins, but they should apply to the glass transition, collapsed polymers, ... The second line of study is a Monte-Carlo method in which the protein is grown atom by atom (or residue by residue), using a given form of the total energy of the protein (CHARMM, ...). This method can then be compared with other numerical methods; we thus compare our results with molecular dynamics calculations for the case of poly-alanines. This twofold approach is a good illustration of the difficulties encountered in the protein folding problem (numerous metastable states, ...).

Abstract. To understand how a protein folds is a problem which has important biological implications. In this article, we would like to present a physics-oriented point of view, which is twofold. First of all, we introduce simple statistical mechanics models which display, in the thermodynamic limit, folding and related transitions. These models can be divided into (i) crude spin glass-like models (with their Mattis analogs), where one may look for possible correlations between the chain self-interactions and the folded structure, (ii) glass-like models, where one emphasizes the geometrical competition between one- or two-dimensional local order (mimicking $\alpha$ helix or $\beta$ sheet structures), and the requirement of global compactness. Both models are too simple to predict the spatial organization of a realistic protein, but are useful for the physicist and should have some feedback in other glassy systems (glasses, collapsed polymers, ...). These remarks lead us to the second physical approach, namely a new Monte-Carlo method, where one grows the protein atom-by-atom (or residue-by-residue), using a standard form (CHARMM, ...) for the total energy. A detailed comparison with other Monte-Carlo schemes, or Molecular Dynamics calculations, is then possible; we will sketch such a comparison for poly-alanines. Our twofold approach illustrates some of the difficulties one encounters in the protein folding problem, in particular those associated with the existence of a large number of metastable states.

(*) Laboratoire de la Direction des Sciences de la Matière du Commissariat à l'Energie Atomique.


1. Introduction.

Proteins are weakly branched polymers, built out of twenty species of monomers (amino acids). They have the property of folding into an (almost) unique compact native structure, which is the biologically interesting object [1]. The compactness is largely due to the existence of the hydrophobic amino acid residues, since these biological objects are usually designed to work in water. Both the compactness and the chemical heterogeneity of a given protein tend to slow down dynamical processes, and the question periodically arises as to whether the protein folding problem is under thermodynamic or kinetic control. This question is not unfamiliar in the physics of glassy systems, where the same problem of a very rugged phase space is present.

In physical terms, the frustration in a protein can be naively described in two different ways:

(i) the energy of a protein is the sum of bonded (geometrical) and non-bonded (Coulomb, Van der Waals) terms, which cannot be simultaneously satisfied.

(ii) experimentally (crystallography, NMR), a folded protein has a local order due to hydrogen bonds. This order is, roughly speaking, of one-dimensional ($\alpha$ helix) or two-dimensional ($\beta$ sheet) nature and is therefore incompatible with the requirement of global compactness.

In this chapter, we shall follow the thermodynamic approach to folding; simple statistical mechanics models will be studied for points (i) and (ii). For the former, one is naturally led to draw a parallel with the spin glass problem (the quenched disorder being linked to the primary structure), whereas the latter is more akin to glasses. Both of these approaches have interesting outputs. The spin glass [2] case and its Mattis analogs [3] suggest a connection with the physics [4] of neural networks, the Hopfield model, ..., for the major unsolved problem of the coding of the tertiary structure in the primary structure. The "glassy glass" case [5] points towards interesting differences between helices and sheets, and revives Flory-like models of polymer melting, as well as the Gibbs-Dimarzio theory of the glass transition [6, 7].

On a more realistic level, we also wish to benefit from the biologists' experience with their complicated systems. It is therefore necessary to go beyond the above qualitative picture and study well-defined entities. We have therefore devised a new Monte-Carlo (MC) method to generate a Boltzmannian ensemble of configurations of a protein. This method uses an empirical form for the total energy of the protein, and may be loosely described as an atom-by-atom growth of the protein, in marked contrast to other MC methods [8] or to Molecular Dynamics [9] (MD) calculations. This growth procedure was introduced [10] to try to explore efficiently the rugged landscape of a protein phase space and may be coupled to more traditional techniques (simulated annealing [11], minimization procedures, ...). We have tested [12] the method on peptides with a small number $N$ of atoms (alanine dipeptide ($N = 22$), penta-alanine ($N = 53$)), by comparing the energy minima with those obtained by MD simulations (CHARMM). A short review of the same comparison [13a] with hepta-alanine ($N = 73$) is given below, together with preliminary results [13b] on twenty-alanine ($N = 203$).

At this point a caveat seems in order: all the methods, including ours, are faced with the problem of the solvent, and use at best an effective energy function that takes into account, in a crude way, the effect of the water molecules (for instance, in the collapsed regime, a one-hundred residue protein has something like half of its atoms on the surface). Our simulations are usually done with a dielectric constant equal to unity (vacuum-type calculations).

The layout of the paper is as follows. Section 2 briefly deals with the spin glass-like approach to the folding transition, as well as the related Mattis models (coding à la Hopfield, ...). The "glassy glass" case is studied in section 3, where the link with the Flory-Gibbs-Dimarzio theory of the glass transition is discussed. We emphasize, in this context, the existence of a disorder point. Section 4 describes the new MC growth method and its application to poly-alanines (Sect. 5).


2. Spin glasses and folding transitions.

We model a protein as a chain of $N$ links (residues), where link $i$ (at $\mathbf{r}_i$) and link $j$ (at $\mathbf{r}_j$) interact through a potential $v_{ij}(\mathbf{r}_i - \mathbf{r}_j)$, which depends on the chemical nature of links $i$ and $j$. Physically, $v_{ij}$ is expected to be relatively short-ranged (screened Coulomb or van der Waals interactions, ...). We take for simplicity

$$v_{ij}(\mathbf{r}_i - \mathbf{r}_j) = v_{ij}\,\delta(\mathbf{r}_i - \mathbf{r}_j) \tag{1}$$

Two types of model can be studied [2, 3].

(i) the spin glass model: the interactions $\{v_{ij}\}$ are taken as independent random variables distributed, for instance, according to a Gaussian distribution

$$P(v_{ij}) = \frac{1}{\sqrt{2\pi v^2}}\,\exp\left(-\frac{(v_{ij} - v_0)^2}{2v^2}\right) \tag{2}$$

In equation (2), $v_0$ denotes the excluded volume effect (in appropriate units). A "biological" interpretation of this approach is the following: the interactions $\{v\}$ between the same couples of residues, but at different places in the primary sequence, are totally uncorrelated because of their different environment. (One may also link this approach to the travelling salesman optimization problem [14].)

(ii) the separable model: the interactions $\{v_{ij}\}$ are taken as a sum of $M$ separable terms

$$v_{ij} = v_0 + \sum_{p=1}^{M} \xi_i^p\,\xi_j^p \tag{3}$$

where the $\{\xi_i^p\}$ are taken, for instance, as independent random variables with Gaussian distribution. Apart from the excluded volume effect, the "charges" $\{\xi_i^p;\ p = 1, \ldots, M\}$ can represent the Coulomb charge, the hydrophobicity, the helix-forming or breaking tendency, ... The "biological" interpretation is opposite to the previous one: here, each residue is defined by $M$ independent "charges", or characters, which depend only on its chemical nature and not on its position along the primary sequence.
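The two kinds of quenched disorder can be made concrete with a short sketch. It is only an illustration under stated assumptions: the function names, the zero diagonal (no self-interaction), the unit variance of the "charges" and the numerical values of $v_0$ and of the width are all hypothetical choices, and the separable form is taken to be $v_{ij} = v_0 + \sum_p \xi_i^p \xi_j^p$.

```python
import random

def spin_glass_couplings(n_links, v0, width, rng):
    """Symmetric matrix of independent Gaussian couplings v_ij (mean v0, std `width`)."""
    v = [[0.0] * n_links for _ in range(n_links)]
    for i in range(n_links):
        for j in range(i + 1, n_links):
            # One independent draw per pair: no memory of the chemical species.
            v[i][j] = v[j][i] = rng.gauss(v0, width)
    return v

def separable_couplings(n_links, n_characters, v0, rng):
    """Couplings v_ij = v0 + sum_p xi_i^p xi_j^p built from Gaussian 'charges' xi."""
    # Each link carries n_characters charges (unit variance is an assumption).
    xi = [[rng.gauss(0.0, 1.0) for _ in range(n_characters)] for _ in range(n_links)]
    v = [[0.0] * n_links for _ in range(n_links)]
    for i in range(n_links):
        for j in range(i + 1, n_links):
            v[i][j] = v[j][i] = v0 + sum(a * b for a, b in zip(xi[i], xi[j]))
    return xi, v

rng = random.Random(0)
v_glass = spin_glass_couplings(6, v0=1.0, width=0.5, rng=rng)
xi, v_sep = separable_couplings(6, n_characters=2, v0=1.0, rng=rng)
```

In the first case the $N(N-1)/2$ couplings are all independent, whereas in the second they are fixed by only $N \times M$ numbers, which is what allows the sequence to "code" for a structure.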

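In the continuous limit, the partition function of these models reads [16]

$$Z = \int \mathcal{D}\mathbf{r}(s)\,\exp\left(-\frac{1}{2}\int_0^S ds \left(\frac{d\mathbf{r}}{ds}\right)^2 - \frac{\beta}{2}\int_0^S\!\!\int_0^S ds\,ds'\; v(s,s')\,\delta\big(\mathbf{r}(s) - \mathbf{r}(s')\big) - \frac{w}{6}\int_0^S\!\!\int_0^S\!\!\int_0^S ds\,ds'\,ds''\;\delta\big(\mathbf{r}(s) - \mathbf{r}(s')\big)\,\delta\big(\mathbf{r}(s') - \mathbf{r}(s'')\big)\right) \tag{4}$$

The last term (a three-body repulsion of strength $w$) is included to avoid a total collapse of the chain. As usual, the parameters in equation (4) are the space dimension $d$, the inverse temperature $\beta = 1/T$, and $S = Na^2$ (where $a$ is the common length of the links).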

Introducing replicas [4] to perform quenched averaging over the disordered interactions $\{v(s,s')\}$, we get the following results [2, 3]: there are three phases, namely, a high temperature coil state, an intermediate temperature collapsed phase with a macroscopic entropy (similar to a polymer below the $\theta$ point), and finally a low temperature collapsed frozen phase.

Detailing the above models, we have:

(i) the spin glass model: the low temperature phase is a Potts glass with $p \to \infty$ states [16], at least for high enough dimensions. The (mean field) order parameter of the freezing transition is, up to a constant prefactor,

$$Q_{\alpha\beta}(\mathbf{r}, \mathbf{r}') \propto \int_0^S ds\,\big\langle \delta\big(\mathbf{r} - \mathbf{r}_\alpha(s)\big)\,\delta\big(\mathbf{r}' - \mathbf{r}_\beta(s)\big)\big\rangle \tag{5}$$

where $1 \le \alpha < \beta \le n$ are replica indices (and $n \to 0$), and $\langle \ldots \rangle$ denotes a thermal average with respect to the replicated Hamiltonian of equation (4). Alternatively, the freezing transition can be studied by the overlaps of two (real) copies of the system [17]. As in Ising spin glasses [4], one may argue that there are few dominant states in the system, which could be interpreted in terms of a few dominant folded structures. Numerical calculations along these lines, including dynamics, have been recently reported [18]. Note that, by construction, these "protein models" possess ultrametric properties [19]. For a more realistic case, see section 4.

To conclude on this type of approach, it is interesting to point out that a variational function introduced by Shakhnovich and Gutin [20] in the context of proteins has been used in other disordered solid state situations [21].

(ii) the separable model: when one performs the quenched average over the $\{\xi^p\}$ in equation (4), some mean field order parameters appear naturally in the system, such as

$$m_{p,\alpha}(\mathbf{r}) = \int_0^S ds\;\overline{\xi^p(s)\,\big\langle \delta\big(\mathbf{r} - \mathbf{r}_\alpha(s)\big)\big\rangle} \tag{6}$$

In equation (6), $\alpha$ is an unimportant replica index (to be omitted from now on); $\langle \ldots \rangle$ and $\overline{\phantom{x}\ldots\phantom{x}}$ denote respectively thermal and disorder averages. These order parameters à la Hopfield [22] are a measure of the correlation between the chemical nature of a link (characterized by $\{\xi^p(s)\}$) and its position $\mathbf{r}$ in space. There is a Mattis-like freezing phase transition where some $\{m_p(\mathbf{r});\ p = 1, 2, \ldots, M_0\}$ condense. For these $M_0$ characters, the primary sequence codes for the spatial structure of the chain. If there is only one "charge" or character (e.g. hydrophobicity), the folding transition will translate into a spatial separation between hydrophobic and hydrophilic links. In general, a Mattis-like transition implies a single dominant spatial structure, with a large number of metastable states [22(b)].

Note that when $M$ increases, at fixed $N$, we expect a smooth crossover from the separable to the spin glass model. From simple qualitative arguments [23], it can be inferred that in real proteins, one should have $M = 8$ relevant (and independent) characters for each residue. Thus for $\alpha = M/N$ small (i.e. long chains), one deals with the separable case, whereas for $\alpha$ larger (short chains), the glassy model is more appropriate. The critical $\alpha$ is of order [22, 23] $\alpha_c \simeq 0.1$ (which gives, in this model, a critical length of $N_c \simeq 80$).
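As a quick check of the orders of magnitude quoted here (a sketch that simply takes the ratio $\alpha = M/N$ at face value, with $M = 8$ and $\alpha_c \simeq 0.1$), the critical length follows from

```latex
N_c = \frac{M}{\alpha_c} \simeq \frac{8}{0.1} = 80 .
```

Chains much longer than $N_c$ then have $\alpha \ll \alpha_c$ and fall in the separable regime, while chains much shorter than $N_c$ fall in the glassy one.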

Similar coding schemes using Protein Data Banks have been studied by Wolynes and coworkers [24].

3. Glasses and folding transitions.

3.1 THE MODEL. The energetical frustration described above is not quite satisfactory, since there is no real disorder in proteins. We will see in section 4 that a commonly used form of the total energy of a protein is the sum of a bonded (geometric) part and of a non-bonded (Coulomb, Van der Waals) part. In particular, the Coulomb part is responsible, in this formalism, for the formation of hydrogen bonds [25] that tend to locally stabilize one-dimensional ($\alpha$ helix) or two-dimensional ($\beta$ sheets) structures (see figure 1). Since we know that the biologically active protein is compact [1], we are typically faced with the problem of geometrical frustration, where local and global orders are incompatible. This approach is familiar in glasses, where one tries to solve this contradiction by a mapping onto a curved space [26].

We choose here a thermodynamic approach and model, as an example, the $\alpha$ helix case in the following way: we consider a $d$-dimensional hypercubic lattice of $N = L^d$ sites, with periodic boundary conditions, and its associated Hamiltonian paths. We recall that a Hamiltonian path visits all sites of the lattice once and only once. Hamiltonian paths have often been used to model collapsed polymer globules [15]. Following Flory [6a], we take each link of the Hamiltonian path to represent a helical turn. Since hydrogen bonds have a tendency to favor long helices, that is, to align the links of our model, we attribute an energy penalty $\varepsilon$ to the breaking of a helix, that is, whenever the Hamiltonian path makes a turn (corner). This model has attracted a lot of attention in the theory of polymer melting [7, 27]. For simplicity, we consider closed paths but, as is well known in polymer theory [16], boundary conditions play a role only in subdominant terms of the free energy. The partition function of the system, at inverse temperature $\beta = 1/T$, reads

$$Z = \sum_{\{\mathcal{H}\}} e^{-\beta\varepsilon N_c(\mathcal{H})} \tag{7}$$

where $\{\mathcal{H}\}$ denotes the ensemble of all Hamiltonian paths, and $N_c(\mathcal{H})$ denotes the number of corners present in path $\mathcal{H}$.

Fig. 1. Schematic representation of hydrogen bonds in (a) an $\alpha$-helix, (b) an (antiparallel) $\beta$-sheet.
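The sum over Hamiltonian paths weighted by their number of corners can be evaluated by brute force on a very small lattice. The sketch below is only a toy illustration and departs from the model of the text in ways that should be kept in mind: it uses a two-dimensional $L \times L$ square lattice with free (not periodic) boundaries, and it enumerates open, directed paths rather than closed ones. All function names are hypothetical.

```python
import math

def hamiltonian_paths(L):
    """Yield every directed Hamiltonian path on an L x L square lattice (free boundaries)."""
    n_sites = L * L
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]

    def extend(path, visited):
        if len(path) == n_sites:
            yield list(path)  # copy, since `path` is mutated during backtracking
            return
        x, y = path[-1]
        for dx, dy in moves:
            nxt = (x + dx, y + dy)
            if 0 <= nxt[0] < L and 0 <= nxt[1] < L and nxt not in visited:
                path.append(nxt)
                visited.add(nxt)
                yield from extend(path, visited)
                path.pop()
                visited.remove(nxt)

    for x in range(L):
        for y in range(L):
            yield from extend([(x, y)], {(x, y)})

def count_corners(path):
    """Number of sites where two consecutive steps of the path differ in direction."""
    corners = 0
    for a, b, c in zip(path, path[1:], path[2:]):
        if (b[0] - a[0], b[1] - a[1]) != (c[0] - b[0], c[1] - b[1]):
            corners += 1
    return corners

def partition_function(L, beta_eps):
    """Z = sum over paths of exp(-beta * eps * N_c), with beta_eps = eps / T."""
    return sum(math.exp(-beta_eps * count_corners(p)) for p in hamiltonian_paths(L))

for beta_eps in (0.0, 1.0, 3.0):
    print(beta_eps, partition_function(3, beta_eps))
```

At $\beta\varepsilon = 0$ the sum reduces to the number of paths; as $\beta\varepsilon$ grows, the paths with the fewest corners (the most "helical" ones) dominate. The enumeration cost grows exponentially with the number of sites, which is why analytical approaches are needed for large $N$.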

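Following reference [28], one may rewrite $Z$ as

$$Z = \lim_{n \to 0} \frac{1}{n}\; \frac{\displaystyle \int \prod_{\mathbf{r},\alpha} d^n\varphi_\alpha(\mathbf{r})\; e^{-A_G} \prod_{\mathbf{r}} \Big( \frac{1}{2}\sum_{\alpha} \varphi_\alpha^2(\mathbf{r}) + e^{-\beta\varepsilon} \sum_{\alpha<\gamma} \varphi_\alpha(\mathbf{r}) \cdot \varphi_\gamma(\mathbf{r}) \Big)}{\displaystyle \int \prod_{\mathbf{r},\alpha} d^n\varphi_\alpha(\mathbf{r})\; e^{-A_G}} \tag{8a}$$

with

$$A_G = \frac{1}{2} \sum_{\alpha} \sum_{\mathbf{r},\mathbf{r}'} \varphi_\alpha(\mathbf{r})\,\big(\Delta^{\alpha}_{\mathbf{r}\mathbf{r}'}\big)^{-1}\,\varphi_\alpha(\mathbf{r}') \tag{8b}$$

where $\varphi_\alpha(\mathbf{r})$ is an $n$-component ($n = 0$) real field, defined in each direction $\alpha = 1, \ldots, d$, attached to all points $\mathbf{r}$ of the lattice. The operator $\Delta^{\alpha}_{\mathbf{r}\mathbf{r}'}$ is 1 if $\mathbf{r}$ and $\mathbf{r}'$ are nearest neighbours in direction $\alpha$ and 0 otherwise; $(\Delta^{\alpha}_{\mathbf{r}\mathbf{r}'})^{-1}$ denotes its inverse. Using Wick's theorem and extracting, through the $n = 0$ trick [29], the contribution of all connected paths, it is easily shown that (8a) and (7) are equal.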

Note that in the above description, one does not consider the primary sequence anymore, in marked contrast to the approach of section 2. In the non-weighted problem ($\varepsilon = 0$), the saddle-point (SP) method of reference [28] yields

$$Z_{SP}(\varepsilon = 0) = \left(\frac{q}{e}\right)^N \tag{9}$$

where $q = 2d$ is the lattice coordination number and $e \simeq 2.71828\ldots$ Equation (9) is in excellent agreement with numerical data, in marked contrast to the "old" Flory theory [6a] which gives

$$Z_F(\varepsilon = 0) = \left(\frac{q-1}{e}\right)^N \tag{10}$$

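3.2 THE HIGH TEMPERATURE ISOTROPIC APPROACH. We have extended the SP approach to the model defined in equations (8). We get

$$\sum_{\mathbf{r}'} \big(\Delta^{\alpha}_{\mathbf{r}\mathbf{r}'}\big)^{-1} \varphi_\alpha(\mathbf{r}') = \frac{\partial}{\partial \varphi_\alpha(\mathbf{r})} \log\Big( \frac{1}{2}\sum_{\gamma} \varphi_\gamma^2(\mathbf{r}) + e^{-\beta\varepsilon} \sum_{\gamma<\delta} \varphi_\gamma(\mathbf{r}) \cdot \varphi_\delta(\mathbf{r}) \Big) \tag{11}$$

At high temperature, it is natural to look for a homogeneous and isotropic solution, $\varphi_\alpha(\mathbf{r}) = \varphi$. We break the $O(n)$ symmetry by choosing $\varphi$ in a given "direction", say 0, and obtain

$$\varphi^2 = \frac{4}{d} \tag{12a}$$

and

$$Z_{SP}(\varepsilon) = \left(\frac{q(\beta)}{e}\right)^N \tag{12b}$$

with

$$q(\beta) = 2 + 2(d-1)\,e^{-\beta\varepsilon} \tag{12c}$$

The "old" Flory theory [6a] would yield

$$Z_F(\varepsilon) = \left(\frac{q_F(\beta)}{e}\right)^N \tag{13a}$$

where

$$q_F(\beta) = 1 + 2(d-1)\,e^{-\beta\varepsilon} \tag{13b}$$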

Both approaches have the following properties:

(i) there exists a temperature $T_G$ where the entropy vanishes. This remark, in the framework of the Flory theory, is the basis of the Gibbs-Dimarzio theory of the glass transition [6b].

(ii) before one reaches $T_G$, there is a first order freezing transition at $T_c$, such that

$$q(\beta_c) = e \tag{14}$$

(with $q$ replaced by $q_F$ in the Flory theory).

Fig. 2. Various approximations to the free energy of the glass model of III as a function of temperature. Curve (1) is the "old" Flory theory. Curve (2) is the low temperature anisotropic saddle point result (with the disorder point at $T_D \simeq 2.24\,\varepsilon$). Curve (3) is the high temperature isotropic saddle point result. In all cases, the transition occurs when the free energy vanishes.

The low temperature phase is frozen (Fig. 2), since it consists of fully stretched paths making turns at the surface. Using (12c) and (13b), we get, for $d = 3$,

$$T_c^{SP} \simeq 0.58\,\varepsilon \tag{15a}$$

for the SP approach to model (8) and

$$T_c^{F} \simeq 1.18\,\varepsilon \tag{15b}$$

for Flory's theory.
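These two numbers can be recovered directly from the freezing condition. The sketch below assumes that the condition takes the form $c + 2(d-1)\,e^{-\varepsilon/T_c} = e$, with $c = 2$ for the saddle-point expression $q(\beta)$ and $c = 1$ for Flory's $q_F(\beta)$; the function and argument names are hypothetical.

```python
import math

def freezing_temperature(d, straight_weight):
    """T_c / eps solving straight_weight + 2 (d - 1) exp(-eps / T_c) = e."""
    # Rearranged: exp(eps / T_c) = 2 (d - 1) / (e - straight_weight).
    return 1.0 / math.log(2 * (d - 1) / (math.e - straight_weight))

tc_sp = freezing_temperature(3, 2.0)     # saddle point: q(beta) = 2 + 2(d-1) exp(-beta eps)
tc_flory = freezing_temperature(3, 1.0)  # Flory: q_F(beta) = 1 + 2(d-1) exp(-beta eps)
print(f"T_c(SP) = {tc_sp:.2f} eps, T_c(Flory) = {tc_flory:.2f} eps")
# prints: T_c(SP) = 0.58 eps, T_c(Flory) = 1.18 eps
```

The only difference between the two estimates is the weight given to going straight, which is enough to shift the freezing temperature by a factor of about two.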

However, as pointed out by Gujrati and coworkers [7], such a freezing transition cannot be thoroughly correct, since the free energy may be shown to be strictly negative at low temperatures. This (slight) correction to the Flory freezing scenario comes from one-dimensional excitations that are not well treated in an isotropic SP approach.

3.3 THE LOW TEMPERATURE ANISOTROPIC APPROACH. Considering the above mentioned criticism of the isotropic SP approach, we have considered [30] an anisotropic approach to the model described in (8a) and (8b): we treat exactly one direction of the lattice, say 1, and treat the $(d-1)$ remaining directions in a mean field (saddle-point) approach. Using the fact that the denominator of equation (8a) goes to one when $n$ goes to zero, we rewrite (8a) as

$$Z = \lim_{n \to 0} \frac{1}{n} \int \prod_{\mathbf{r},\alpha} d^n\varphi_\alpha(\mathbf{r})\; e^{-A_G} \prod_{\mathbf{r}} \Big( \frac{1}{2}\sum_{\alpha} \varphi_\alpha^2(\mathbf{r}) + e^{-\beta\varepsilon} \sum_{\alpha<\gamma} \varphi_\alpha(\mathbf{r}) \cdot \varphi_\gamma(\mathbf{r}) \Big) \tag{16}$$

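which we approximate by

$$Z_1 \simeq \lim_{n \to 0} \frac{1}{n} \int \prod_{\mathbf{r}} d^n\varphi_1(\mathbf{r})\; e^{-\left(A_1 + \frac{d-1}{4} N \varphi^2\right)} \prod_{\mathbf{r}} \Big( \frac{1}{2}\varphi_1^2(\mathbf{r}) + A\,\varphi_1(\mathbf{r}) \cdot \varphi + C\,\varphi^2 \Big) \tag{17}$$

with

$$A_1 = \frac{1}{2} \sum_{\mathbf{r},\mathbf{r}'} \varphi_1(\mathbf{r})\,\big(\Delta^{1}_{\mathbf{r}\mathbf{r}'}\big)^{-1}\,\varphi_1(\mathbf{r}') \tag{18}$$

and

$$A = (d-1)\,e^{-\beta\varepsilon} \tag{19}$$

and

$$C = \frac{d-1}{2}\,\Big(1 + (d-2)\,e^{-\beta\varepsilon}\Big) \tag{20}$$

In (17), $\varphi$ is the (mean-field) value of $\varphi_\alpha$, $\alpha \neq 1$.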

Integrating exactly (17) yields a free energy per site

$$\beta f_1 = \frac{d-1}{4}\,\varphi^2 - \mathrm{Log}\left[\frac{1 + C\varphi^2 + \Big(\big(1 + C\varphi^2\big)^2 - 4\,(C - A^2)\,\varphi^2\Big)^{1/2}}{2}\right] \tag{21}$$

Equation (21) exhibits, at $T' \simeq 0.68\,\varepsilon$ ($d = 3$), a first order transition (Fig. 2) between a frozen phase (crystal) with $\varphi = 0$ and a high temperature (liquid) phase with $\varphi \neq 0$. At this order, the free energy is zero in the frozen phase, but becomes negative if fluctuations (in $\varphi$) are taken into account [30]. In any case, the corrections to Flory's freezing picture are weak.

In the high temperature phase however, we have found [30] a disorder point of the second kind [31], where the nature of the correlations along direction 1 changes. The disorder point $T_D$ is given by

$$C = A^2 \tag{22}$$

For $d = 3$, we get $T_D \simeq 2.24\,\varepsilon$; in a polymeric chain, such a disorder point is likely to have more severe dynamic implications than in usual spin systems [32].
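Both characteristic temperatures of the anisotropic approach can be checked numerically. The sketch below rests on assumptions that should be stated plainly: it takes $A = (d-1)e^{-\varepsilon/T}$, $C = \frac{d-1}{2}\big(1 + (d-2)e^{-\varepsilon/T}\big)$ and $\beta f = \frac{d-1}{4}\varphi^2 - \log\frac{1}{2}\big[1 + C\varphi^2 + \sqrt{(1 + C\varphi^2)^2 - 4(C - A^2)\varphi^2}\,\big]$, minimizes $\beta f$ over a grid of $\varphi^2$ values, and locates the first order transition where the liquid minimum crosses the frozen value $\beta f = 0$. Function names and grid bounds are hypothetical choices.

```python
import math

def couplings(t, d=3):
    """A and C at temperature t (in units of eps)."""
    x = math.exp(-1.0 / t)
    return (d - 1) * x, 0.5 * (d - 1) * (1.0 + (d - 2) * x)

def beta_f(phi2, t, d=3):
    """Free energy per site (times beta) as a function of phi^2."""
    a, c = couplings(t, d)
    root = math.sqrt((1.0 + c * phi2) ** 2 - 4.0 * (c - a * a) * phi2)
    return 0.25 * (d - 1) * phi2 - math.log(0.5 * (1.0 + c * phi2 + root))

def liquid_minimum(t, d=3):
    """Lowest beta*f on a grid 0.5 <= phi^2 <= 8, away from the frozen solution phi = 0."""
    return min(beta_f(0.01 * k, t, d) for k in range(50, 801))

def transition_temperature(d=3, lo=0.5, hi=1.0):
    """Bisection for the temperature at which the liquid minimum reaches beta*f = 0."""
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if liquid_minimum(mid, d) > 0.0:
            lo = mid   # liquid still less favorable than the frozen phase
        else:
            hi = mid
    return 0.5 * (lo + hi)

def disorder_temperature(d=3):
    """T_D from C = A^2, i.e. 2 (d-1) x^2 - (d-2) x - 1 = 0 with x = exp(-eps / T_D)."""
    x = ((d - 2) + math.sqrt((d - 2) ** 2 + 8 * (d - 1))) / (4 * (d - 1))
    return -1.0 / math.log(x)

print(f"T' = {transition_temperature():.2f} eps, T_D = {disorder_temperature():.2f} eps")
```

For $d = 3$ the condition $C = A^2$ reduces to $4x^2 - x - 1 = 0$, whose positive root gives the disorder temperature in closed form, while the transition temperature requires the numerical minimization.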

3.4 CONCLUSION. We have also considered [33] the case of $\beta$ sheets and found similar conclusions. The isotropic SP approach should be "better" in this case, since two-dimensional long range order may exist at finite temperature: we get a first order freezing transition à la Flory.

The results of these geometrically frustrated models should be relevant for other thermodynamic systems, such as glasses [34], polyelectrolytes in a bad solvent [35], chiral liquid crystals [36], ... For instance, it is rather tempting, in the case of glasses, to identify the low temperature phase as the (unreachable) crystal phase, and to link the disorder point with the glass transition. In the case of proteins however, one deals with finite systems: we thus cautiously identify the low temperature phase as the native structure, whereas the high temperature phase looks like a "molten globule" [37]. We now consider a more "realistic" approach, which will allow us to benefit from the biologists' experience with these rather complicated systems.

4. The Monte Carlo growth method.

4.1 INTRODUCTION. As previously mentioned, one of the main difficulties of the protein folding problem is the existence, in phase space, of a large number of local minima. Traditional single move MC methods are therefore doomed to fail, as large collective motions will be necessary to "untrap" the chain. One may improve these methods by using simulated annealing procedures, or any other minimization scheme. We have chosen to devise a new MC method, where one grows an ensemble of chains atom-by-atom (or residue-by-residue), replicating and deleting chains so as to generate an ensemble that obeys Boltzmann statistics. (Note that there are other methods of growing chains atom by atom [38].) A central idea in this method is to avoid going over large energy barriers (as in MC methods where the chain is completed), and to go around them instead. As far as comparison with MD calculations is concerned, our method does not assume any particular guess for the initial state. We will illustrate the method for the case of linear polymers [10] and describe its application to poly-alanines [13a,b].

Fig. 3. Ramachandran plots for the third residue of a hepta-alanine chain: (a) MC results, (b) MD results.

4.2 DESCRIPTION FOR THE CASE OF LINEAR POLYMERS. In this section we recall the principles on which the method is based. For simplicity, we shall illustrate it on the case of linear polymers [10].

Our aim is to construct a Boltzmann ensemble of chains, that is, a statistical ensemble of $M$ chains such that the probability to find a chain of energy $E$ in the ensemble is proportional to its Boltzmann weight $e^{-\beta E}/Z$, where $\beta = 1/kT$ and $Z$ is a normalization factor, i.e., the partition function of the ensemble. In other words, the number of chains of energy $E$ in the ensemble should be $M e^{-\beta E}/Z$. Since $M/Z$ is a constant independent of $E$, we shall say that a chain of energy $E$ should be replicated a number of times proportional to $e^{-\beta E}$ in the ensemble.

To generate these chains, we use a recursive procedure. Assume that we have a Boltzmann population of chains of size $n$. In order to obtain a Boltzmann population of chains of size $n+1$, we add one atom to each of the previously generated chains of size $n$, and replicate the new chain a number of times proportional to $e^{-\beta \Delta E}$, where $\Delta E$ is the energy cost of adding the last atom.

To illustrate the method in more detail, we assume that the partition function of the chain is

$$Z = \int \prod_{i=2}^{N} d^3 r_i\; \exp\left(-\frac{\beta k_b}{2} \sum_{i=1}^{N-1} \big(|\mathbf{r}_{i+1} - \mathbf{r}_i| - a\big)^2 - \frac{\beta}{2} \sum_{i \neq j} v(\mathbf{r}_i, \mathbf{r}_j)\right) \tag{23}$$

where $\mathbf{r}_1 = 0$, $\{\mathbf{r}_i\}$ being the position of the $i$-th atom in the chain. The first term represents the elastic energy of a link (of average length $a$ and elastic constant $k_b$), and $v$ is a 2-body potential acting between the atoms. We have deliberately used a simple form for the energy in (23), but the generalization to a peptide chain is easily performed as discussed in section 5 below.

4.3 REPLICATION-DELETION PROCEDURE. We start with an ensemble of $M_1$ atoms ($n = 1$) at $\mathbf{r}_1 = 0$. Each of these is a seed for a chain.

To build chains of length $n = 2$, for each of the $M_1$ seeds, we draw randomly a position $\mathbf{r}_2$. The Boltzmann weight associated with the configuration $(\mathbf{r}_1, \mathbf{r}_2)$ is proportional to

$$w_2(\mathbf{r}_2 | \mathbf{r}_1) = \exp\left(-\frac{\beta k_b}{2} \big(|\mathbf{r}_2 - \mathbf{r}_1| - a\big)^2 - \beta\, v(\mathbf{r}_2, \mathbf{r}_1)\right) \tag{24}$$

In order to obtain a population of chains obeying the Boltzmann distribution, we must replicate each $(\mathbf{r}_1, \mathbf{r}_2)$-chain a number $w_2(\mathbf{r}_2 | \mathbf{r}_1)$ of times. Since $w_2$ is not an integer, the replication is actually done in the following way:

Define $i_2 = \mathrm{Int}(w_2)$, the integer part of $w_2$, and $r_2 = w_2 - i_2 < 1$, the rest. Then, replicating statistically $w_2$ times means replicating $i_2$ times, plus one additional time with probability $r_2$. That is to say, one randomly generates a number $0 < r < 1$. If $r > r_2$, the chain is replicated $i_2$ times. Otherwise, it is replicated $(i_2 + 1)$ times. Since $w_2$ can be smaller than 1, the replication can in fact amount to a deletion, and the chain is no longer considered in future calculations. For this reason we call this a replication-deletion procedure (RDP). Once the RDP has been applied to each chain, we obtain a Boltzmann-distributed population of $M_2$ chains of two atoms.
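The statistical rounding at the heart of the RDP fits in a few lines. The following is a minimal sketch (the function name and the use of Python's `random.Random` generator are choices made here for illustration): the number of copies is the integer part of the weight, plus one more with probability equal to the remainder, so that the average number of copies equals the weight itself.

```python
import random

def replication_count(weight, rng):
    """Number of copies for a chain of weight w: Int(w), plus one with probability w - Int(w)."""
    whole = int(weight)        # integer part of the weight
    rest = weight - whole      # remainder, 0 <= rest < 1
    # A uniform random number below `rest` triggers the extra copy.
    return whole + 1 if rng.random() < rest else whole

rng = random.Random(0)
counts = [replication_count(1.3, rng) for _ in range(20000)]
print(sum(counts) / len(counts))   # close to 1.3 on average
```

A weight below 1 gives either one copy or none, which is how deletion arises naturally from the same rule.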

We can now iterate the procedure as follows. Assume that we have a Boltzmann population of $M_n$ chains of size $n$. The number $\mathcal{M}_n(\mathbf{r}_1, \ldots, \mathbf{r}_n)$ of chains $(\mathbf{r}_1, \ldots, \mathbf{r}_n)$ in the ensemble is proportional, within statistical errors, to its Boltzmann weight:

$$\mathcal{M}_n(\mathbf{r}_1, \ldots, \mathbf{r}_n) = A_n \exp\big(-\beta E_n(\mathbf{r}_1, \ldots, \mathbf{r}_n)\big) \tag{25}$$


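For each chain of the ensemble, we draw the $(n+1)$-st atom randomly at the point $\mathbf{r}_{n+1}$. We compute the weight:

$$w_{n+1}(\mathbf{r}_{n+1} | \mathbf{r}_1, \ldots, \mathbf{r}_n) = \exp\left(-\frac{\beta k_b}{2} \big(|\mathbf{r}_{n+1} - \mathbf{r}_n| - a\big)^2 - \beta \sum_{i=1}^{n} v(\mathbf{r}_{n+1}, \mathbf{r}_i)\right) \tag{26}$$

We replicate the new chain $w_{n+1}(\mathbf{r}_{n+1} | \mathbf{r}_1, \ldots, \mathbf{r}_n)$ times. Then the number of $(\mathbf{r}_1, \ldots, \mathbf{r}_n, \mathbf{r}_{n+1})$-chains is:

$$\mathcal{M}_{n+1}(\mathbf{r}_1, \ldots, \mathbf{r}_n, \mathbf{r}_{n+1}) = w_{n+1}(\mathbf{r}_{n+1} | \mathbf{r}_1, \ldots, \mathbf{r}_n)\, \mathcal{M}_n(\mathbf{r}_1, \ldots, \mathbf{r}_n) = A_n \exp\left(-\beta \Big[ E_n(\mathbf{r}_1, \ldots, \mathbf{r}_n) + \frac{k_b}{2} \big(|\mathbf{r}_{n+1} - \mathbf{r}_n| - a\big)^2 + \sum_{i=1}^{n} v(\mathbf{r}_{n+1}, \mathbf{r}_i) \Big]\right) \tag{27}$$

The term in brackets in the exponential is just the total energy $E_{n+1}(\mathbf{r}_1, \ldots, \mathbf{r}_{n+1})$ of the chain $(\mathbf{r}_1, \ldots, \mathbf{r}_{n+1})$. We thus have:

$$\mathcal{M}_{n+1}(\mathbf{r}_1, \ldots, \mathbf{r}_{n+1}) = A_n \exp\big(-\beta E_{n+1}(\mathbf{r}_1, \ldots, \mathbf{r}_{n+1})\big)$$

and the new ensemble of chains of length $(n+1)$ is again Boltzmann distributed.

By iterating the procedure, we see that at each stage of the process we construct a Boltzmann-distributed ensemble of chains of increasing size. We stop when the required length is obtained.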
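The weight attached to a newly drawn atom can be written as a small function. This is a sketch under assumptions: the weight is taken to be $\exp\{-\beta[\frac{k_b}{2}(|\mathbf{r}_{n+1} - \mathbf{r}_n| - a)^2 + \sum_{i \le n} v(\mathbf{r}_{n+1}, \mathbf{r}_i)]\}$, positions are 3-tuples, and `soft_repulsion` is a purely hypothetical pair potential chosen for the example (the text leaves $v$ unspecified).

```python
import math

def growth_weight(new_atom, chain, beta, k_b, a, pair_potential):
    """Boltzmann factor of the energy cost of adding `new_atom` at the end of `chain`."""
    bond = math.dist(new_atom, chain[-1])
    energy = 0.5 * k_b * (bond - a) ** 2                              # elastic term of the new link
    energy += sum(pair_potential(new_atom, atom) for atom in chain)   # 2-body terms
    return math.exp(-beta * energy)

def soft_repulsion(r1, r2, strength=1.0, radius=1.0):
    """Hypothetical short-range repulsion, zero beyond `radius`."""
    distance = math.dist(r1, r2)
    return strength * (1.0 - distance / radius) ** 2 if distance < radius else 0.0

chain = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
print(growth_weight((3.0, 0.0, 0.0), chain, beta=1.0, k_b=8.0, a=1.5,
                    pair_potential=soft_repulsion))
```

Because only the energy of the added atom enters, the cost of one growth step is linear in the current chain length, and the product of the successive weights reproduces the full Boltzmann factor of the completed chain.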

The procedure can be modified without alteration of the Boltzmann character of the statistics if we allow $\mathbf{r}_{n+1}$ to be drawn several times for each chain.

Although the method seems applicable as it is, one immediately encounters a major problem, namely, an exponential increase (or decrease) of the population of chains. Indeed, if we deal, for example, with a model of a polymer with steric repulsion, the potential $v(r)$ is repulsive (positive) at short distances and thus the replication weight $w_n$ is smaller than unity. Thus, iteration of the process will result in a decrease in the total population of chains, and eventually we may end up at some stage with an empty ensemble of chains.

Conversely, if the interaction $v$ is attractive (e.g., a polymer chain in a bad solvent), the replication weight $w_n$ is larger than 1, leading to an exponential increase of the population. This also causes a computational problem, since the available computer memory is finite.

However, the problem can be easily handled if one recalls that all one needs is a population in which each chain is replicated proportionally to its Boltzmann weight.

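4.4 POPULATION CONTROL. Instead of replicating each chain with a factor $w_{n+1}(\mathbf{r}_{n+1} | \mathbf{r}_1, \ldots, \mathbf{r}_n)$ (Eq. (27)), it is perfectly legitimate to replicate it with a factor $g_{n+1}\, w_{n+1}(\mathbf{r}_{n+1} | \mathbf{r}_1, \ldots, \mathbf{r}_n)$, where $g_{n+1}$ is an arbitrary scaling factor which can be adjusted so as to keep the population of chains under control. Equation (27) becomes

$$\mathcal{M}_{n+1}(\mathbf{r}_1, \ldots, \mathbf{r}_n, \mathbf{r}_{n+1}) = g_{n+1}\, w_{n+1}(\mathbf{r}_{n+1} | \mathbf{r}_1, \ldots, \mathbf{r}_n)\, \mathcal{M}_n(\mathbf{r}_1, \ldots, \mathbf{r}_n) \tag{28}$$

The new population of chains has the size

$$M_{\mathrm{tot}} = \sum_{\mathrm{chains}} \mathcal{M}_{n+1}(\mathbf{r}_1, \ldots, \mathbf{r}_n, \mathbf{r}_{n+1}) = g_{n+1} \sum_{\mathrm{chains}} w_{n+1}(\mathbf{r}_{n+1} | \mathbf{r}_1, \ldots, \mathbf{r}_n)\, \mathcal{M}_n(\mathbf{r}_1, \ldots, \mathbf{r}_n) \tag{29}$$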


From this we see in which way one should choose $g_n$ so as to keep the population under control. The iteration of equation (28) for a chain of size $N$ yields:

$$\mathcal{M}_N(\mathbf{r}_1, \ldots, \mathbf{r}_N) = g_1 g_2 g_3 \cdots g_N\, \exp\left(-\frac{\beta k_b}{2} \sum_{i=1}^{N-1} \big(|\mathbf{r}_{i+1} - \mathbf{r}_i| - a\big)^2 - \beta \sum_{1 \le i < j \le N} v(\mathbf{r}_i, \mathbf{r}_j)\right) \tag{30}$$

(where we set $g_1 = 1$). Equation (30) proves that the final population is indeed Boltzmann-distributed.

Note that the product of the $g_i$ provides a simple evaluation of the free energy. Indeed, summing equation (30) over all chains of the ensemble, we obtain

$$M_N = \prod_{i=1}^{N} g_i\; Z\, M_1 \tag{31}$$

and the free energy is given by

$$F = \frac{1}{\beta} \left( \log \frac{M_1}{M_N} + \sum_{i=1}^{N} \log g_i \right) \tag{32}$$

,=1
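Equation (32) follows from (31) by writing $Z = e^{-\beta F}$ and taking logarithms. A minimal sketch of the bookkeeping is given below, assuming the scaling factors $g_i$ have been stored during the run. The function name and argument names are hypothetical.

```python
import math


def free_energy(beta, m_final, m_initial, g_factors):
    """Free energy estimate from Eqs. (31)-(32).

    log Z = log(M_N / M_1) - sum_i log g_i, and F = -(1/beta) log Z.
    `g_factors` holds g_1, ..., g_N (with g_1 = 1 in the text).
    """
    log_z = math.log(m_final / m_initial) - sum(math.log(g) for g in g_factors)
    return -log_z / beta


# If the population is conserved (M_N = M_1), F is fixed by the g_i alone.
f_example = free_energy(1.0, 100, 100, [1.0, 2.0, 0.5])
```

Accumulating $\sum_i \log g_i$ rather than the product itself avoids overflow or underflow for long chains.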

In practice, the scaling factors $g_n$ can be determined in two ways:

- for simple problems (polymers in good or bad solvents), one can use $g_{n+1} = g_n$ and adjust (increase or decrease) $g_{n+1}$ whenever the population $M_{n+1}$ leaves some fixed range $M_{\min} < M_{n+1} < M_{\max}$;

- for more complicated problems (e.g., proteins), it is preferable to make a trial run of adding the $(n+1)$-st atom at each stage with the scaling factor $g_{n+1} = \gamma_{n+1}$, where $\gamma_{n+1}$ is a properly chosen factor, count the total population of chains $M'_{n+1}$, and then make the actual run with

$$
g_{n+1} = \gamma_{n+1} \frac{M_1}{M'_{n+1}}
$$

so as to conserve approximately the initial population $M_1$. In this work, we chose $\gamma_{n+1} = g_n$. Thus, every time we add an atom, we adjust the scaling factor so as to conserve the total number of chains.
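The two ways of fixing $g_{n+1}$ described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. The multiplicative `step` used to nudge $g$ in the first method is an assumption, since the text does not say by how much $g$ is increased or decreased, and the function names are hypothetical.

```python
def adjust_in_range(g, population, m_min, m_max, step=1.1):
    """First method: keep g unless the population leaves [m_min, m_max].

    Assumption: g is raised or lowered by a fixed multiplicative `step`.
    """
    if population > m_max:
        return g / step  # too many chains: replicate less
    if population < m_min:
        return g * step  # too few chains: replicate more
    return g


def rescaled_factor(gamma, trial_population, target_population):
    """Second method: g_{n+1} = gamma_{n+1} * M_1 / M'_{n+1}.

    `trial_population` is M'_{n+1}, counted in a trial run made with the
    scaling factor gamma; `target_population` is the initial size M_1.
    """
    return gamma * target_population / trial_population


# A trial run with gamma = 1 that doubles the population (1000 -> 2000)
# leads to g = 0.5 for the actual run.
g_next = rescaled_factor(1.0, 2000, 1000)
```

The second method costs one extra pass per added atom, but it restores the population to about $M_1$ at every stage instead of only reacting once it has drifted out of range.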

4.5 THE GUIDING FIELD. Assume that the elastic constant $k_b$ in equation (23) is large. Then, if we distribute $r_{n+1}$ uniformly, the factor $k_b \left( |r_{n+1} - r_n| - a \right)^2$ will be large, and the replication weight $w_{n+1}$ in Eq. (26) small. Thus, the sampling will be very inefficient, since it will assign a very large scaling factor $g_{n+1}$ to the rare configuration for which $|r_{n+1} - r_n| \approx a$, leaving a very small weight to other configurations. In other words, if one configuration is such that $|r_{n+1} - r_n| \approx a$, then it will be replicated a large number of times, while the others will be deleted from the ensemble. This results in a deterioration of the quality of the ensemble and a buildup of correlations among the chains, i.e. many chains redundantly follow similar paths in configurational space.

This difficulty can be avoided. The replication weight for the atom $(n+1)$ is of the form

$$
w_{n+1}(r_{n+1} \mid r_1, \ldots, r_n) = g_{n+1} \exp\left( -\beta\, \Delta E(r_{n+1} \mid r_1, \ldots, r_n) \right) \tag{33}
$$

where $\Delta E$ is the energy cost of adding the atom $(n+1)$. This equation can be factorized as follows:

$$
w_{n+1}(r_{n+1} \mid r_1, \ldots, r_n) = P_{n+1}(r_{n+1})\; g_{n+1}\, \frac{\exp\left( -\beta\, \Delta E(r_{n+1} \mid r_1, \ldots, r_n) \right)}{P_{n+1}(r_{n+1})} \tag{34}
$$

where the function $P_{n+1}(r_{n+1} \mid r_1, \ldots, r_n)$ is an arbitrary probability distribution. The procedure is now simple: draw $r_{n+1}$ with the probability distribution $P_{n+1}$, and replicate it $g_{n+1}\, e^{-\beta \Delta E_{n+1}} / P_{n+1}$ times. In what follows, we write $P_{n+1} \propto \exp(-\beta V_{n+1})$, and we call $V_{n+1}$ the guiding field.
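Explicitly, and ignoring the integer rounding of the number of copies, a position $r_{n+1}$ is proposed with probability density $P_{n+1}$ and then copied $g_{n+1}\, e^{-\beta \Delta E} / P_{n+1}$ times, so the mean density of copies at $r_{n+1}$ is

```latex
P_{n+1}(r_{n+1}) \times g_{n+1}\,
  \frac{e^{-\beta\,\Delta E(r_{n+1}\mid r_1,\ldots,r_n)}}{P_{n+1}(r_{n+1})}
  = g_{n+1}\, e^{-\beta\,\Delta E(r_{n+1}\mid r_1,\ldots,r_n)}
```

which does not depend on $P_{n+1}$, provided $P_{n+1}$ is nonzero wherever the Boltzmann factor is.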

It can easily be seen that this procedure indeed conserves the Boltzmann distribution. It is also clear that statistical independence is best achieved when the replication factor is close to unity.

For example, for the linear polymer chain it seems natural to take

$$
P_{n+1}(r_{n+1}) \propto \exp\left( -\beta \frac{k_b}{2} \left( |r_{n+1} - r_n| - a \right)^2 \right) \tag{35}
$$

that is, to draw $r_{n+1}$ with the correct Gaussian distribution. Then the $(n+1)$-st atom is sampled at a correct distance from $r_n$, and the residual replication weight will be closer to one.
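A minimal sketch of such a guided draw is given below, under assumptions that go beyond the text. The bond length is drawn from a one-dimensional Gaussian of mean $a$ and variance $1/(\beta k_b)$, and the bond direction is drawn uniformly on the sphere. This reproduces (35) only up to the radial measure factor $|r_{n+1} - r_n|^2$, which is nearly constant when $k_b$ is large. The function name and its arguments are hypothetical.

```python
import math
import random


def draw_bond(r_prev, a, k_b, beta, rng):
    """Draw r_{n+1} at a distance close to `a` from r_prev (3D).

    Assumption: Gaussian bond length with sigma = 1/sqrt(beta * k_b)
    and an isotropic direction; the r^2 measure factor is neglected.
    """
    sigma = 1.0 / math.sqrt(beta * k_b)
    length = rng.gauss(a, sigma)
    # Isotropic direction: normalize a vector of three Gaussian components.
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:
            break
    return tuple(p + length * c / norm for p, c in zip(r_prev, v))


rng = random.Random(1)
r_new = draw_bond((0.0, 0.0, 0.0), a=1.5, k_b=100.0, beta=1.0, rng=rng)
```

With the bond term absorbed into the draw, the residual replication weight contains only the scaling factor and the Boltzmann factor of the remaining interactions $v$.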

The ideal choice for the sampling function would be

$$
P_{n+1}(r_{n+1}) \propto \exp\left( -\beta\, \Delta E(r_{n+1} \mid r_1, \ldots, r_n) \right) \tag{36}
$$

which would lead to unit replication factors, and thus a completely uncorrelated statistical ensemble. However, in the presence of two-body interactions, there are no known techniques for sampling distributions like (36).

The optimal choice for the sampling function $P_n$ is

$$
P_{n+1}(r_{n+1}) \propto \exp\left( -\beta\, U_{n+1}(r_{n+1}) \right) \tag{37}
$$

where $U_{n+1}(r_{n+1})$ is the mean potential seen by the atom $(n+1)$. But in general, the determination of this mean field is difficult, and one must resort to intuition in the choice of $P_n$.

At this point, it is of interest to note that another use of the guiding field is to introduce an extra potential term to bias the MC procedure if one wishes to directly incorporate experimental (or other) information into the search. One may, for instance, guide the sampling using Ramachandran's plot information [1].

4.6 THE RESCALING PROCEDURE. Even if the choice of $P_n$ is nearly optimal, the replication factors are not strictly equal to one, and following the argumentation given in section 4.3 above, correlations between the chains build up in the statistical ensemble. This effect becomes more important as the chains become longer. The final number of uncorrelated chains in the ensemble is proportional to the population of chains. Depending on the temperature, the form of the interaction, and the size of the chain, it may be necessary to consider very large ensembles to get sufficient configurational sampling. In reference [35], for instance, it was shown that, for the case of polyelectrolytes in good or bad solvents, good statistics are achieved when the population $M$ is of the order of 10 times the length of the chain.

It can be seen that algorithmically (not taking into account the possibilities of vectorization or parallelization) the computational time scales as $M N^2$. Thus, for short chains, it is possible to use large populations, whereas for large chains, one can only use small