Critical Assessment of Predicted Interactions at Atomic Resolution


UNIVERSITE LIBRE DE BRUXELLES FACULTE DES SCIENCES

Service de Conformation des Macromolécules Biologiques et de Bioinformatique. Centre de Biologie Structurale.

Dissertation presented for the degree of Doctor of Science

Critical Assessment of Predicted Interactions at Atomic Resolution

Raúl Méndez Giráldez

Brussels, September 2007

Jury:

Prof. Louis Droogmans (Chair)

Prof. Vincent Raussens (Secretary and Reviewer)

Prof. Jacques van Helden (Laboratory Director)

Prof. Shoshana J. Wodak (Supervisor)

Prof. Valérie Ledent (Examiner)

Prof. Martine Prévost (Reviewer)

Prof. Joel Janin (External Member)

SUMMARY

Molecular biology has allowed the characterization and manipulation of the molecules of life in the wet lab, and the structures of those macromolecules are continuously being elucidated. During the last decades of the past century, there was increasing interest in how genes are organized within the genomes of different organisms and how those genes are expressed as proteins to achieve their functions. The sequences of many genes across several genomes have now been determined. In parallel, efforts to determine the structures of the proteins encoded by those genes continue. However, it is experimentally much harder to obtain the structure of a protein than its sequence. For this reason, the number of protein structures available in databases is an order of magnitude or so lower than the number of protein sequences. Furthermore, to understand how living organisms work at the molecular level, we need information about the interactions between those proteins. Elucidating the structure of macromolecular protein assemblies is more difficult still. For that reason, the use of computers to predict the structure of these complexes has gained interest over the last decades.

The main subject of this thesis is the evaluation of currently available computational methods that predict protein – protein interactions and build an atomic model of the complex.

The core of the thesis is the evaluation protocol I developed at the Service de Conformation des Macromolécules Biologiques et de Bioinformatique, Université Libre de Bruxelles, and its computer implementation. This method has been used extensively to evaluate the results of blind protein – protein interaction predictions in the context of the worldwide experiment CAPRI, which have been thoroughly reviewed in several publications [1-3]. In this experiment, the structure of a protein complex (‘the target’) had to be modeled starting from the coordinates of the isolated molecules, prior to the release of the structure of the complex (this is commonly referred to as ‘docking’).

The assessment protocol lets us compute a set of parameters to rank docking models according to their quality into four main categories: ‘Highly Accurate’, ‘Medium Accurate’, ‘Acceptable’ and ‘Incorrect’. The effectiveness of our evaluation and ranking is clearly shown, even for borderline cases between categories. The correlation between the ranking parameters is analyzed further. In the same section where the evaluation protocol is presented, the ranking that participants give to their own predictions is also studied, since good solutions are often not easily recognized among the pool of computer-generated decoys.

An overview of the CAPRI results is given per target structure and per participant, with regard to the computational method used and the difficulty of the complex. CAPRI also includes a new, ongoing experiment in which participants score models previously and anonymously generated by other participants (the ‘Scoring’ experiment). Its promising results are analyzed in relation to the original CAPRI experiment. The Scoring experiment is a step towards the use of combined methods to predict the structure of protein – protein complexes. We discuss its possible application to this problem, based on a clustering study of the different results.

In the last chapter of the thesis, I present the preliminary results of an ongoing study on the conformational changes in protein structures upon complexation, as those rearrangements pose serious limitations to current computational methods for predicting the structure of protein complexes. Protein structures are classified according to the magnitude of their conformational rearrangement, and the involvement of interfaces and particular secondary structure elements is discussed. At the end of the chapter, some guidelines and future work are proposed to complete the survey.


ACKNOWLEDGEMENTS

I would like to express my deep gratitude to my family, to my mother Carmen and my sister Rocío, for putting up with me at home for almost a year, with no duty other than finishing this thesis. In particular, I would like to dedicate this thesis to the memory of my father, José, who, as a consequence of a fatal disease, could see neither the end of this work nor the end of my days in Belgium. Wherever he is, his critical and great analytical skills will always be with me.

My adventure in Brussels would never have been possible without the joint efforts of many people. First of all, I would like to thank Prof. Jesús Giraldo at UAB, Barcelona, who took me on when my scientific career was less than a seed and not many would have taken that risk; put me in touch with Prof. Shoshana J. Wodak, formerly at ULB, Brussels; and let me start the Ph.D. with her, at the expense of getting no benefit for himself.

I would also like to thank Prof. Shoshana Wodak, my promoter, for her great ability to obtain financial support for my project as well as for the whole lab, and for accepting me into her group. I will never forget my first colleague in Brussels, Dr. Leonardo De Maria, formerly at the SCMBB lab and currently at Novozymes in Copenhagen, who babysat me during my first year of work, at a personal and professional level. I would also like to thank Prof. Jacques van Helden, currently the head of the Laboratoire de Bioinformatique des Génomes et des Réseaux (formerly the SCMBB lab), for his scientific, technical and human support, always willing to help people no matter the degree of personal involvement required. Prof. van Helden also lent me the laptop on which I wrote the whole manuscript of this thesis and with which I have already traveled to several different countries. However, life at the SCMBB in Brussels as we knew it would have been almost impossible, or at least completely different, without the help of Dr. Raphaël Leplae, who has spent a large part of his life setting up, maintaining and even procuring computers for the lab, as well as patiently training Ph.D. students like me in computer programming.

Obviously, the CAPRI experiment could not have been set up without the important contribution of its management committee, in particular Prof. Joel Janin at Université Paris-Sud and Prof. Shoshana Wodak, currently at the Hospital for Sick Children, Toronto. At this point, I would also like to thank the crystallographers who submitted the structures of a number of CAPRI targets, because without their collaboration no docking experiment would have been possible.

Next, I would like to thank my colleagues at the SCMBB in Brussels for always making me feel accepted and for sharing friendship and fun outside the work environment. In particular, I would like to thank: my friends Gipsi and Abel, for the great deal of personal and professional support they have given me, including hosting me many times this year and last, so that I could spend short periods in Brussels to solve several administrative matters linked to the thesis and to meet Prof. Wodak whenever she passed through Brussels; Dr. Didier Gonze, former member of the SCMBB laboratory and now at the Unité de Chronobiologie Théorique, ULB, who shared an office, fun and some adventures with me and provided logistic help in some of my several moves; Karoline, for her enthusiasm in both scientific and cultural discussions, as well as her help with my moves; Benoît, for his sharp scientific remarks and humor, as well as for sharing fun and drinks at some parties; Dr. Marc F. Lensink, for sharing an office and scientific discussions with me; Rekin’s, for his sociable and sporting spirit; Sylvain, for his favors concerning the ULB administration as well as for his easy-going character; Prof. Ariane Toussaints, for her interesting conversations and ideas during seminars; Fernanda and Sebastian, for scientific and moral support; and Didier Croes, who also took his Ph.D. at the SCMBB under Prof. Wodak and is now at BIOXPR, for his training in databases and his constant support and, more importantly, for organizing something as international, great and healthy as our weekly soccer match during spring and summer, which brought together workmates and his own friends, accounting for 6 different nationalities.

Still in Belgium, formerly in Brussels and currently in Gent, my special thanks go to my friends Erick, Cecilia and their son Nicolas; they have always been there to share fun, sports, parties, social life and support in each and every moment I have spent in Belgium or abroad.

Deserving special mention is my friend Giacomo, in Milano, whom I would like to thank for his collaboration in the work of the third chapter, and also for his friendship during the year we worked together in Brussels.

Last but not least, I cannot end without thanking my friends in Barcelona, whom I have neglected a bit because of the thesis, in particular: Susana, for making me go out from time to time and unwind during the writing of this manuscript; David, who also re-introduced me to the social life of Barcelona (well, he tried at least); and the group of Míriam, Óscar, Gemma, Joan, Núria, Elisabeth and Lourdes, for having the courage to visit me every spring while I was living in Brussels, and for discovering Belgium together.


“No man is an island, entire of itself; every man is a piece of the continent, a part of the main.”

John Donne, 1572 - 1631.

“… everything that living things do can be understood in terms of the jiggling and wiggling of atoms.”

Richard P. Feynman, 1918 – 1988.

TABLE OF CONTENTS

SUMMARY
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
PREFACE

1. INTRODUCTION
1.1. Protein Sequence, Structure and Function
1.2. The Structural Basis of Protein – Protein Recognition
1.2.1. Interface Area
1.2.2. Crystal Contacts versus Biologically Relevant Interfaces
1.2.3. Protein – Protein Complexes
1.2.4. Characterizing the Surface Complementarity and Atomic Packing
1.2.5. Chemical Composition and Interactions
1.2.6. The Presence of Water Molecules at Protein – Protein Interfaces
1.2.7. Structural Similarity Measure: Root Mean Square Deviation
1.3. The Role of Protein Docking in Computational Biology
1.4. General Protein Docking Protocol
1.5. Protein Flexibility: Background

2. THE CAPRI EXPERIMENT
2.1. Definition and Main Motivation
2.2. CAPRI Submission and Evaluation Protocol
2.2.1. The Docking Submission
2.2.2. The Scoring Experiment
2.2.3. Evaluation Protocol
2.2.4. Protocol Implementation Details
2.2.4.1. Identification of the Interfaces and Conformational Changes in the Target
2.2.4.2. Identifying the Interacting Subunits, and Receptor and Ligand Moieties in the Predicted Complexes
2.2.4.3. Residue Contacts, Residues at the Interface and Atomic Clashes
2.2.4.4. Defining Common Fragments and Residue Equivalences between the Target and all Predicted Complexes
2.2.4.5. Scoring the Residue Contact and the Interface Residue Lists
2.2.4.6. Computing Rigid Body Fits and Related Parameters
2.2.5. The Target Structures
2.3. CAPRI Results
2.3.1. Illustration of the Performance of the Evaluation Method
2.3.1.1. ‘Highly Accurate’ versus ‘Medium Accurate’ Model Discrimination
2.3.1.2. ‘Medium Accurate’ versus ‘Acceptable’ Model Discrimination
2.3.1.3. ‘Acceptable’ versus ‘Incorrect’ Model Discrimination
2.3.2. The Complementarity of the Ranking Parameters
2.3.3. Towards Evaluating Conformational Adjustments
2.3.4. Assessing how well Predictors Rank their Models
2.4. Summary of the Results
2.4.1. Overall Performance of Docking Results per Target
2.4.2. Improvement of the Docking Models after Rescoring
2.4.3. Overall Performance of Participants
2.4.3.1. Results from Participants who Contributed all Targets in Rounds 1 - 5
2.4.3.2. Results from Participants who Contributed 15 – 13 Targets in Rounds 1 - 5
2.4.3.4. Results from Participants who Contributed 7 Targets in Rounds 1 - 5
2.4.3.5. Results from Participants who Contributed 6 - 5 Targets in Rounds 1 - 5
2.4.4. The Results of Combined Docking Methods: Towards the Use of Meta-servers

3. ANALYSIS OF CONFORMATIONAL CHANGES IN PROTEIN – PROTEIN COMPLEXES
3.1. Summary
3.2. Current Approaches to Study Flexibility in Protein Structures
3.3. Materials and Methods
3.3.1. Building the Dataset of Bound and Unbound Protein Structures
3.3.2. Reducing the Dataset Size and Minimizing Redundancy
3.3.2.1. Minimizing Redundancy in the Dataset of Protein Complexes
3.3.2.2. Minimizing Redundancy in the Unbound Structures
3.3.3. Distinguishing Meaningful Conformational Changes from Errors in Atomic Coordinates
3.3.4. Comparing Bound – Unbound Structures: the Structural Alignment Protocol
3.3.4.1. Searching for a First ‘Static Core’
3.3.4.2. Re-defining the Static Core
3.3.4.3. Recovering Difficult Alignments
3.3.5. Identifying Local Backbone Conformations
3.3.6. Quaternary Structure for the Protein Complexes: Interface Definition
3.3.7. Correlating Local Backbone Deformations with Various Structural Features
3.4. Results Obtained So Far
3.4.1. Classifying Protein Structures According to their Overall Plasticity
3.4.2. Mapping the Different Residue Features onto the 3D Structures
3.4.3. Mutual Associations Between Backbone Plasticity, Structural and Functional Features
3.5. Where do we go from here?

4. CONCLUSIONS

5. REFERENCES

6. ANNEX
6.1. Amino Acids and Their Chemical Formula
6.2. Molecular Modeling Tools
6.2.1. Molecular Mechanics Approach to the Potential Energy of a System
6.2.2. Molecular Simulation Techniques
6.2.2.1. Molecular Dynamics
6.2.2.2. Monte Carlo Simulations
6.2.2.3. Simulated Annealing
6.3. Main Protein Docking Methods
6.3.1. First Protein – Protein Docking Algorithm: Wodak and Janin
6.3.2. Docking Methods Based on FFT Sampling
6.3.2.1. The Katchalski-Katzir First FFT Algorithm
6.3.2.2. MolFit by Miriam Eisenstein
6.3.2.3. DOT by Lyn Ten Eyck
6.3.2.4. 3D-DOCK by Michael Sternberg
6.3.2.5. ZDOCK by Zhiping Weng
6.3.2.6. HEX by David Ritchie
6.3.3. Docking Methods Based on Force Fields
6.3.3.1. ICM-Disco by Ruben Abagyan
6.3.3.2. RosettaDock by Jeffrey Gray and David Baker
6.3.3.3. SmoothDock by Carlos Camacho
6.3.3.4. HADDOCK by Alexandre Bonvin
6.3.3.5. ATTRACT by Martin Zacharias
6.3.4. Other Docking Methods
6.3.4.1. PatchDock, FlexDock and SymmDock by Haim Wolfson
6.4. Official Publications of the CAPRI Results
6.5. List of Protein Complexes with the Corresponding Unbound Partners


PREFACE.

Throughout the history of science, and of Biology in particular, we can see a progression from a descriptive period to a more quantitative, predictive and theoretical stage. From Aristotle in ancient Greece (4th century BC) until the 18th century, including the work of the Swedish naturalist Linnaeus, Biology was mainly a descriptive science (which is not to say it was unimportant), a classification of animals and plants (a ‘Natural History’). By the 18th century, Chemistry had started to be expressed using mathematical formalism (Boyle, Lavoisier), progressively leaving ‘Alchemy’ behind. During the 19th century there was increasing interest in natural products; the German chemist Wöhler demonstrated that urea could be synthesized chemically, and hence that natural and chemical products were essentially the same. During the same period, biologists began to look at microorganisms such as bacteria with the help of microscopes.

Later on, this work led to the identification of many infectious microbes, thanks, for example, to Pasteur and Koch. The cell was established as the minimum unit of life for animals and plants, while physicists established the atom as the constituent unit of matter. So, as science progressed, scientists looked in ever more detail at our surroundings to elucidate their ultimate agents, constituents or particles. During the 20th century, advances in physics, especially in the knowledge of atomic particles and the laws that govern them, paved the way for the development of new spectroscopic techniques (which help to gain insight into molecules and atoms) and microscopic techniques (which made it possible to identify subcellular compartments and organelles).

It was during the 20th century that the great advances in Biology, partially influenced by the knowledge achieved in Physics and Chemistry, culminated in the isolation and characterization of the molecule coding the information of life: deoxyribonucleic acid (DNA), whose three-dimensional structure was elucidated by Watson and Crick in 1953 after analyzing X-ray diffraction patterns of the molecule. This breakthrough completely changed the way biologists looked at biomolecules: the third dimension had come of age.

Six years later, the three-dimensional structure of the first protein, that of myoglobin, was elucidated by Perutz and Kendrew. Since then, a new interest arose in Biology: determining the structures of biomolecules, mainly nucleic acids and proteins. In subsequent years, biologists learned how to manipulate these molecules in the lab: ‘Molecular Biology’ was born.

Simultaneously, the study of the genetic information in cells aimed to discover genes, the DNA fragments coding for the sequences of single proteins.

Molecular genetics was in fact ‘gene hunting’, involving big research teams and laboratory machinery to characterize single genes. Consequently, during the 1980s and 1990s the genes causing several diseases, e.g. Huntington’s disease, were isolated, which was also a great medical success. Analogously, the determination of single protein structures involved a considerable amount of work and many technical obstacles. As computers and robotics were added to Molecular Biology techniques, our interest and expectations grew. Hunting single genes was fine, but knowledge of isolated genes will never provide all the pieces of the puzzle needed to understand whole cells. By the end of the 1990s, big public and private consortia inaugurated the ‘Genomic Era’, aiming to obtain the order in which nucleotides are placed in the chromosomal DNA (DNA sequencing) of entire organisms, including humans. The first human genome draft was completed by 2001, and in 2006 the final draft covered about 92 % of the chromosomal DNA maps. Also very remarkable in the past decade, as in the human genome project, was the massive use of computers to handle huge amounts of biological data and to make predictions from them based on statistical models. Yet another field was born: ‘Computational Biology’.

Even after this enormous technical, human and economic effort, we still do not have the complete collection of human genes that we expected a decade ago (we do not even know their exact number, about 25,000 to 30,000 according to experts). There are, however, computer tools to predict them [4], which require a lot of a posteriori experimental verification. The human genome is not the only one that has been sequenced; tens of others have been or are being sequenced. Those initiatives are producing an enormous number of DNA sequences coding for proteins.

As a consequence, we now seem to be in the so-called ‘post-genomic era’, in which we would like to know, as the highest level of knowledge, the three-dimensional structures of the myriad existing proteins and not just their sequences. ‘Structural Genomics’ programs, which are devoted to solving the structure of the product of each individual gene, are distributed worldwide by organism and will take a long time to achieve significant coverage of full ‘proteomes’, because of the experimental difficulty of obtaining crystal structures of all the proteins.

However, even having the structures of all possible proteins available will not give us a full understanding of living cells: it would be equivalent to having a computer disassembled into its component parts; we need to know which parts interact, and how, in order to assemble the computer again. Interestingly, the results of the human genome projects have revealed a hypothetical number of genes several times lower than expected, not so different from the presumed number of genes of a much simpler organism such as C. elegans. What, then, are the determinants of the complexity that makes us human beings so different from worms? There is probably no clear answer to this yet, but molecular biologists believe that the difference in complexity might be related to the different number (and type) of protein – protein (and perhaps protein – nucleic acid) interactions in the two organisms, which are much more abundant in complex organisms such as humans.

Protein – protein association is the subject of this thesis, in particular the atomic characterization of these interactions. Despite the large ongoing structural genomics projects, we still lack structural data on protein – protein complexes, which will presumably help to reveal the secrets of the molecular machinery operating in cells and, ultimately, to understand the underlying molecular mechanisms of diseases. That is why many computational methods for predicting the structure of a protein – protein complex have appeared. Most of these methods are described in the Annex, but the core of the thesis is the objective evaluation of their results as part of a large international experiment, in which several protein complexes of unknown structure were proposed for prediction and groups from different countries submitted their best predictions. The main purpose of the thesis is thus the implementation of a protocol to automatically evaluate the results of these methods and to compare their performance.

As part of this exciting trip through one of the most recent and multidisciplinary fields of science, this thesis first introduces protein structures, and then illustrates, in a subsequent section, the structural features of protein – protein association. Next, the main computational approaches to predicting protein – protein interactions are presented, with special emphasis on the methods that provide three-dimensional models of protein – protein complexes given the coordinates of the unbound proteins, together with a general scheme used by most of these methods. At the end of the introductory chapter, a background section on protein flexibility links with the third chapter.

The second chapter of the thesis presents the CAPRI experiment on protein – protein structure prediction and its evaluation. The details of the evaluation protocol and its computer implementation, which are my own contribution, are rigorously described there, together with the list of target complexes and their inherent difficulties. The same chapter also analyzes the effectiveness of my evaluation protocol in correctly ranking docked models, even in borderline cases, through a series of examples; the correlation between the evaluation parameters; and the participants’ ability to model protein flexibility, as well as how well their rankings discriminate good predictions from false positives in their submission sets. Readers are referred to the publications in the Annex for the complete results of the 5 editions of the CAPRI experiment that have already been surveyed. At the end of the chapter, the quality of the protein – protein predictions for all the evaluated models is discussed, including the latest CAPRI rounds, assessed by Marc F. Lensink in our group¹, which are currently being surveyed in a forthcoming publication (Lensink MF, Méndez R, Wodak SJ, in press) and are already publicly accessible at the European Bioinformatics Institute web site; as well as the quality of the models obtained after mixing the participants’ top-100 contributions and of those submitted after re-scoring the uploaded pool (a new experiment within the CAPRI initiative). The same section gives a more detailed analysis of the results per participant, restricted to the 5 rounds published to date, with regard to the methods they employed. At the end of the section, the possibility of using combined docking methods to predict the structure of protein – protein complexes is addressed.

The third chapter studies conformational changes in proteins upon binding to other proteins, given their importance in protein – protein complex structure prediction. This work began as a collaboration with Giacomo De Mori and Giorgio Colombo from the Istituto di Chimica del Riconoscimento Molecolare, Milano, Italy. The chapter describes part of an ongoing study: an introduction to the methods used to analyze protein conformational changes; the selection of the dataset; the structural analysis method used to compare the bound structures with their unbound counterparts, and the criteria applied; and some preliminary results.

At the end of this manuscript there is an Annex that contains a description of some important concepts and techniques used in Computational Biology, a description of the most important computational methods used in CAPRI to predict protein complexes, reprints of our publications on the CAPRI results, and the list of bound – unbound protein pairs used in our conformational change study.

¹ Marc’s evaluation protocol is essentially based on mine, with small implementation differences.


1. INTRODUCTION

1.1. Protein Sequence, Structure and Function.

Protein molecules are a special kind of polymer, made of repeated amino acids (the constitutive chemical monomers), covalently linked through an amide bond generically termed a ‘peptide’ bond. Thus, a protein can be chemically regarded as a very complex polypeptide.

All the naturally occurring amino acids have the same stereochemistry (referred to as ‘L’) at the alpha-carbon, but they differ considerably in their side chains, which can have different sizes, shapes, hydrogen-bonding capabilities and charge distributions; these confer on proteins the vast range of biological functions required by living organisms. Figure 56 in Annex 6.1 compiles the chemical features of the 20 biologically synthesized (or “standard”) amino acids.

We will not explain here the whole process by which proteins are synthesized in vivo. Following the ‘Central Dogma’ of Molecular Biology, as stated by Francis Crick in 1970 [5], deoxyribonucleic acid (DNA) can self-replicate and be transcribed into ribonucleic acid (RNA). RNA can travel from the cell nucleus to the cytoplasm, where it can be translated into proteins with the help of ribosomes and ‘transfer RNA’ (tRNA). RNA itself can be converted back into DNA through ‘reverse transcription’, as happens in retroviruses, and it can also self-replicate.
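The flow of sequence information described above can be illustrated with a minimal Python sketch. It is only an illustration and not part of the work presented in this thesis: the function names are hypothetical, the example sequence is invented, and the codon table is deliberately partial (it covers only the codons used in the example, taken from the standard genetic code).

```python
# Illustrative sketch of the information flow DNA -> RNA -> protein.
# Assumption: a deliberately partial codon table (standard genetic code),
# restricted to the codons that occur in the invented example below.
CODON_TABLE = {
    "AUG": "M",  # methionine (start codon)
    "GCU": "A",  # alanine
    "UGG": "W",  # tryptophan
    "AAA": "K",  # lysine
    "UAA": "*",  # stop codon
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA matching a DNA coding strand (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def reverse_transcribe(rna: str) -> str:
    """Return the DNA coding strand matching an RNA (U replaced by T)."""
    return rna.upper().replace("U", "T")


def translate(mrna: str) -> str:
    """Read the mRNA codon by codon until a stop codon is reached."""
    residues = []
    for start in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[start:start + 3]]
        if residue == "*":
            break
        residues.append(residue)
    return "".join(residues)


mrna = transcribe("ATGGCTTGGAAATAA")
peptide = translate(mrna)
print(mrna, peptide)
```

The sketch treats each step as a simple string operation, which ignores all the molecular machinery (polymerases, ribosomes, tRNA) but keeps the essential point: the amino acid sequence is fully determined by the nucleotide sequence.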

The biological function of a protein ultimately depends on its three-dimensional structure, more precisely on the conformation(s) it can adopt. In general, proteins exist in a single native state. This native state is reached during the folding process (almost instantaneous) under the physiological conditions found in living cells (aqueous solvent near neutral pH at 20 – 40 °C). In a series of experiments, Anfinsen was able to denature and refold the RNase enzyme in vitro, and hence concluded that the protein sequence (the order of the amino acids in a protein from its N- to its C-terminus) encodes the information needed for the protein to fold [6, 7].

There are exceptions to the previous statement: some proteins exist in solution without a defined three-dimensional structure, as an ensemble of conformations. They are called ‘natively unfolded’ proteins [8].

However, protein native states are usually in thermal equilibrium under physiological conditions with other, less stable states (or conformations), separated from them by energy barriers. Sometimes the energy barriers are not high enough to prevent the corresponding states from being significantly populated at room temperature. When the different conformers are energetically close to the native state, the equilibrium can be altered by the binding of other molecules or macromolecules and/or by different pH or solvation conditions. This is the physical explanation for the conformational changes in protein structures that will be analyzed in chapter 3.

Two major experimental techniques have typically been used to elucidate the structure of proteins and of macromolecules in general: X-ray crystallography [9] and Nuclear Magnetic Resonance (NMR) [10]. Both techniques have advantages and disadvantages. NMR provides some hints about the dynamics of a macromolecular structure, at the expense of lower resolution, greater uncertainty in the atomic positions and a limited size of the protein that can be analyzed (smaller than about 30,000 Da). In contrast, X-ray crystallography has a longer tradition in protein structure determination and provides better resolution for larger macromolecules (up to about 10⁶ Da), at the expense of losing the dynamic properties of the system because of the experimental restrictions imposed by crystal formation. From here on, ‘protein structures’ will refer to X-ray protein structures, unless the NMR technique is specifically mentioned.


Unfortunately, data on protein structures are always generated more slowly than data on protein or DNA sequences, especially nowadays, when genome sequencing programs have provided genome sequences for entire organisms. If genes encode protein sequences, these sequences determine protein structures, and protein structures ultimately determine protein function, then it must be possible to infer protein function from protein sequences, given all these sequenced genomes. That is the aim of ‘Functional Genomics’.

In practice this is far from trivial, for several reasons. First of all, genome-sequencing projects do not provide gene sequences but DNA sequences, so discovering genes is another problem in itself, as mentioned in the Preface. Second, even knowing the sequence of a protein, we still cannot predict its complete three-dimensional structure, because the ‘folding problem’ has not yet been solved. However, there are many computational tools that can help in many cases to infer the three-dimensional structure of a protein from its sequence. It is beyond the scope of this text to explain them all. Briefly, they are mainly based on comparisons with proteins of similar sequence whose structure is known (‘comparative modeling’), or on very intensive calculations that simulate the folding of a protein under the laws of physics (‘ab initio’ modeling) [11]. In any case, all these structure prediction methods must be handled cautiously: the quality of the models is strongly conditioned by the resemblance of the native structure to the template structure in comparative modeling, and by the sampling methods and energy functions in ab initio modeling.

It thus seems unreliable to try to predict function from sequences alone. Why, then, do we (i.e. crystallographers and/or NMR spectroscopists) not make a bit more effort and put protein structure elucidation in a pipeline? This is indeed the aim of the large-scale ‘Structural Genomics’ projects. But again, quite often this is not the end of the story. Protein crystal structures produced under these large, worldwide initiatives frequently appear isolated, without any ligand or macromolecular interacting partner, and hence provide no direct clue to their function. Again, computers can help by comparing unknown structures to known ones of similar topology to infer function (using, for example, structural alignment programs such as the one described in sections 2.2.4.1. and 3.3.4.). Still, the new paradigm of Structural Genomics is the discovery of new protein structures unrelated to known ones. When that situation occurs, there are computational tools to predict functional sites from the isolated structure of a single protein, based on energy stability/instability criteria [12]. However, we must recall that a given protein may have different functions, depending on the environment in which it is expressed and/or the cellular needs. Such multifunctional proteins are called ‘moonlighting proteins’ [13]. Moreover, if two proteins are known to interact and the structures of the isolated proteins can be obtained by any means, there exists another set of methods to model the structure of the corresponding complex; these will be introduced in the next section and analyzed further in the rest of this thesis.

Both Functional Genomics and Structural Genomics rely on computational methods (Computational Biology) to collect, manage, and analyze biological data: statistical and/or physicochemical computations are made in order to extract biologically relevant features from the data that high-throughput experiments are producing.

Let us return to the protein structures themselves, the main subject of this subsection. What do they look like? As a result of 50 years of protein X-ray crystallography, 38,033 structures are deposited and publicly available in the Protein Data Bank² (http://www.rcsb.org/pdb/home/home.do) [14]. They have revealed that proteins do not adopt regular or completely symmetrical structures (although protein aggregates do), but are more complex. The topology of a protein structure can be really complicated, but it can be described as the arrangement of certain three-dimensional motifs. Due to their hydrogen bond³ pattern, polypeptide chains can adopt two major motifs: helices and strands (Figure 1). There are several types of helices, depending on the size of the repeating turn. The most common type of helix in proteins is the α-helix, in which the C=O carbonyl group of residue i forms a hydrogen bond with the NH amide group of residue i+4. Other types of helices are the 3₁₀ helix, where the hydrogen bond is between the C=O of residue i and the NH of residue i+3, and the π-helix, where the hydrogen bond is between the C=O of residue i and the NH of residue i+5. β-strands, in contrast, are made of residues in a more extended conformation. β-strands are usually hydrogen-bonded to each other, in a parallel or antiparallel way, forming β-sheets (Figure 2). Helices and strands constitute the ‘secondary structure’ of proteins.

Secondary structure elements are connected by less regular elements generally referred to as ‘loops’. Secondary structure elements together with loops can form more complex topological motifs termed ‘supersecondary structure’. If we continue higher in the hierarchy and extend the supersecondary structure, we end up with the full description of the three-dimensional protein structure, or ‘tertiary structure’. Finally, if several monomers of the same or of different proteins interact as an assembly or protein complex, we refer to it as the ‘quaternary structure’ (the spatial arrangement of several tertiary structures). A typical protein structure rendered in ribbons is shown in Figure 3. This picture depicts how the different secondary structure elements can be combined into different supersecondary structure motifs (a β-barrel, for instance) to give the tertiary structure of a monomer of triosephosphate isomerase. Two monomers of this protein form the quaternary structure of the enzyme (Figure 3).

² This number of protein structures refers to the July 2007 release.

When analyzing protein structure we usually refer to two different kinds of atoms: backbone and side chain atoms. The first group corresponds to the atoms of the main polypeptide chain, common to all amino acid types, i.e. N, Cα, C, O; side chain atoms are those covalently bound to the Cα, which confer the specific chemical properties of each amino acid type. The complete list of the 20 standard amino acids with their corresponding side chain atoms is shown in Figure 56, in Annex 6.1. Side chains can be grouped according to their electrostatic and hydrophobic/hydrophilic character into: acidic side chains, with a terminal acid group that is negatively charged at physiological conditions (Aspartic acid, Glutamic acid); basic side chains, with a terminal basic group that is positively charged at physiological conditions (Lysine, Arginine, Histidine); polar side chains, with a terminal amide group (Asparagine, Glutamine); small side chains (Glycine, Alanine, Serine, Threonine); and medium-sized and large hydrophobic side chains (Cysteine, Valine, Isoleucine, Leucine, Proline, Phenylalanine, Tyrosine, Methionine, Tryptophan).

³ Hydrogen bonds are non-covalent interactions of electrostatic nature. They form when a hydrogen atom is covalently bound to an electronegative atom such as nitrogen or oxygen (the donor atom), so that its electron density is strongly polarized towards that atom. If such a polarized hydrogen atom is close enough (at distances of 1.6 – 2.0 Å) to another electronegative atom such as oxygen, which carries an excess of negative electron density, the latter partially shares its electron density with the hydrogen, creating a non-covalent hydrogen bond.


Figure 1. The structure of α-helix (left) and β-strand (right).

Figure 2. Parallel and antiparallel β-sheets.

Figure 3. Crystal structure of chicken triosephosphate isomerase rendered in cartoons (PDB id 1sw0). The protein appears as a homo-dimer, in which each monomer has the typical TIM β/α-barrel fold, with the parallel β-sheet barrel in yellow, the helices in red and the loop regions in green.


Figure 4A. The torsion angles that define the conformation of an amino acid.

Figure 4B. Ramachandran plot of the distribution of the backbone torsion angles φ, ψ. The regions corresponding to right- and left-handed α-helices, as well as to β-strands, are indicated. The dashed lines enclose the ‘allowed’ regions, assuming a trans conformation of the residues (ω = 180°). Dots correspond to the backbone torsion angles of a high-resolution crystal structure of Ribonuclease A (PDB id 7RSA). This figure is reproduced from Lesk 2003 [15].

The conformation of an amino acid residue in a protein can be described by its rotatable bonds, ignoring the small variations in bond angles and bond lengths. There are three backbone torsion angles, named φ, ψ and ω (Figure 4A). The conformation of a side chain is characterized by the torsion angles χ₁ up to χ₅ (starting at the Cα-Cβ rotatable bond and moving outwards), depending on the length of the side chain.
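As an illustration of how any of these torsion angles can be obtained from atomic coordinates, the following is a minimal sketch and not the procedure used in this thesis. It assumes NumPy is available; the function name `dihedral_deg` and the four-point interface are hypothetical. For φ the four atoms would be C(i−1), N, Cα, C, and for χ₁ they would be N, Cα, Cβ and the first γ atom.

```python
import numpy as np

def dihedral_deg(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond (hypothetical helper)."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    b0 = p0 - p1            # bond pointing from p1 back to p0
    b1 = p2 - p1            # central (rotatable) bond
    b2 = p3 - p2            # bond pointing from p2 to p3
    b1 = b1 / np.linalg.norm(b1)
    # Project the two outer bonds onto the plane perpendicular to the central bond
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    # The angle between the projections, with its sign, is the torsion angle
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.degrees(np.arctan2(y, x)))
```

With this construction a cis arrangement of the four atoms gives 0° and a trans arrangement gives ±180°, which matches the convention used for ω in the text.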

The ω angle corresponds to the amide (peptide) bond, whose partially delocalized electron density gives it partial double-bond character; hence the energy barriers for rotations away from planarity are quite high, and ω rarely deviates from 0° or 180°. In general ω = 180° (trans conformation) is preferred over ω = 0° (cis conformation), except for prolines, which show a much higher proportion of cis peptide bonds than other residues because of the five-membered ring involving the backbone. Because of this, the backbone conformation of each residue is determined mainly by the φ, ψ torsion angles. These angles can rotate, but cannot take all possible values because of the steric collisions that some values would produce.

The Indian scientists Sasisekharan, Ramakrishnan and Ramachandran predicted this result in 1963 from theoretical calculations based on X-ray diffraction data on polypeptide conformations. They also compared the theoretical φ, ψ values with experimental measurements and demonstrated good agreement. It must be taken into account that, at the time they did their torsion angle calculations, very few protein structures had been completely solved⁴. The plot of ψ versus φ angles for a given protein structure is called a ‘Ramachandran plot’ (Figure 4B); it typically shows the zones of torsion angle values of the common secondary structure elements, as well as their theoretically predicted ‘allowed’ regions. The boundaries of those regions depend on the van der Waals radii selected for each atom. As shown in Figure 4B, in reality some values fall outside the Ramachandran allowed regions; e.g. Glycine residues tolerate a much wider range of rotations.

⁴ The first two protein structures completely solved, by Kendrew, Perutz and collaborators, were those of myoglobin in 1959 and hemoglobin in 1960.

But what about side chain conformations? Is there anything equivalent to the Ramachandran plot for side chain torsion angles? Not really. The values of the side chain χ angles show rather skewed distributions. However, the values of the first side chain torsion angle χ₁ tend to cluster around 60°, 180° and –60°, avoiding eclipsed conformations. The conformations of a side chain corresponding to different combinations of χ angle values are called ‘rotamers’. Side chains most of the time have a finite repertoire of preferred conformations, as has been demonstrated by statistical analyses of χ angle patterns in well-determined protein structures [16]. These studies have produced sets of preferred side chain conformations, or ‘rotamer libraries’, which are widely used in many Computational Biology procedures (see Annex 6.3. for examples of applications in predicting protein – protein interactions).
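The clustering of χ₁ around the three staggered values can be made concrete with a small sketch that assigns an observed angle to the nearest of the wells at 60°, 180° and –60°. This is only an illustration under simplifying assumptions: real rotamer libraries are derived statistically and combine several χ angles, and the function name `chi1_well` is hypothetical.

```python
def chi1_well(chi1_deg):
    """Return the staggered well (60, 180 or -60 degrees) closest to a chi-1 angle."""
    wells = (60, 180, -60)

    def circular_distance(a, b):
        # Angular separation on a circle, so that -170 and 180 are 10 degrees apart
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    return min(wells, key=lambda well: circular_distance(chi1_deg, well))
```

For example, `chi1_well(-65.0)` falls into the –60° well, while `chi1_well(-170.0)` is assigned to the 180° well because angles are compared on a circle.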


1.2. The Structural Basis of Protein – Protein Recognition.

In the previous section I briefly described protein structures, and stated that protein structures determine protein function. In this section we will review how proteins interact with other biomolecules to achieve their function at the molecular level, and what the main strategies are for studying these interactions from a structural and computational point of view.

Many of the most important cellular functions are carried out by proteins, including regulation, enzymatic activity, immune response, signal transduction, etc.

Proteins can interact with many different partners, which may be small molecules or macromolecules such as other proteins, nucleic acids, polysaccharides, lipids, and composite molecules. Now that genome sequencing and large-scale Structural Genomics projects are providing the scientific community with complete sets of protein sequences of living organisms (strictly, the sequences of the genes that code for those proteins), the challenge is to understand molecular assemblies and interacting systems.

The analysis of pairwise interactions is a first step towards this goal. Much effort has been devoted to the study of protein – protein recognition [17, 18], involving multidisciplinary tools ranging from genetics to physical chemistry and computer science. This has helped to illustrate the role of macromolecular recognition in many processes, such as the regulation of enzyme activity, gene expression, signal transduction and the immune response. It has also produced a wealth of chemical and geometric data, which have been extensively exploited by a number of web servers such as the Protein–Protein Interaction server (http://www.biochem.ucl.ac.uk/bsm/PP/server/) [19, 20], as well as by specialists in the field to rationalize the process of protein – protein recognition [20, 21].

The aim of these studies is to identify the features that are common to most of the analyzed examples of pairwise protein – protein interaction. The chemical and geometric properties of the protein – protein interaction are, in turn, expected to correlate with the thermodynamic and kinetic parameters in solution. Experimental data have been compiled extensively for several protein – protein complexes [22, 23]. The most important parameters for studying protein – protein interactions at the atomic level are: the interface area, the geometric complementarity, the atomic packing, the chemical composition of protein – protein interfaces and the solvation of the proteins by water molecules.

In systems where the correlation between the structural and the kinetic or thermodynamic parameters holds, we can analyze the energetics of association and perform computer simulations of the recognition process. While still far from satisfactory, structure-based theoretical approaches to macromolecular recognition represent an active field of research with a high impact in the ‘post-genomic era’. In the coming years, large-scale initiatives in protein structure determination are expected to characterize most protein folds [24]. However, information on all the possible modes of specific protein – protein interaction, whose number is orders of magnitude larger than the number of folds, is being generated much more slowly.

Computer methods to predict protein – protein interaction, in particular docking algorithms, will hence be valuable tools, despite the approximations they make. The most important and widely used docking algorithms will be reviewed in Annex 6.3.

Some of the most important concepts in structural biology regarding protein – protein complexes are reviewed next. They are needed to understand subsequent sections of this work.


1.2.1. Interface Area.

The interface area is the geometric quantity that accounts for the loss of solvent accessible surface area upon complex formation [25, 26]. It is derived as in eq. 1, where B is the interface area, A₁ is the solvent accessible surface area of the dissociated component 1, A₂ is the solvent accessible surface area of the dissociated component 2, and A₁₂ is the solvent accessible surface area of the complex.

$$B = A_1 + A_2 - A_{12} \qquad (1)$$

Normally the solvent accessible surface areas of the dissociated components are evaluated on the components as taken from the complex, i.e. in their bound conformations, so eq. 1 is not valid if major backbone rearrangements take place between the unbound and bound conformations. By convention, interface areas are measured as if the conformational change, if any, had happened just before protein association.

One of the most widely used methods to compute solvent accessible surface areas is the rolling sphere algorithm of Lee and Richards [26]. The solvent accessible surface is the surface traced by the centre of a probe sphere as it rolls over the van der Waals surface of a protein⁵. In the next sections, interface area computations refer to this algorithm. The radius of the solvent probe is about 1.5 Å, to mimic a water molecule. The contribution of each individual molecule to the interface area can be evaluated, and is approximately B/2. This is especially true if the surfaces in contact do not have strong curvature. If they do, the convex side contributes more interface area than the concave side, because accessible surface areas are measured one probe radius away from the van der Waals surface.
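To make eq. 1 concrete, the sketch below estimates solvent accessible surface areas numerically and combines them into an interface area. It is not the Lee and Richards algorithm referred to in the text: it swaps in a simple point-counting scheme in the spirit of Shrake and Rupley, in which test points are spread over each probe-expanded atomic sphere and those not buried inside a neighbouring sphere are counted. NumPy is assumed, and all function names, the default number of points and the default probe radius are illustrative assumptions.

```python
import numpy as np

def sphere_points(n):
    """Roughly uniform points on the unit sphere (golden-spiral construction)."""
    i = np.arange(n) + 0.5
    polar = np.arccos(1.0 - 2.0 * i / n)
    azimuth = np.pi * (1.0 + 5.0 ** 0.5) * i
    return np.column_stack((np.cos(azimuth) * np.sin(polar),
                            np.sin(azimuth) * np.sin(polar),
                            np.cos(polar)))

def sasa(coords, radii, probe=1.5, n_points=960):
    """Total solvent accessible surface area (in squared Angstrom) by point counting."""
    coords = np.asarray(coords, dtype=float)
    expanded = np.asarray(radii, dtype=float) + probe  # surface of the probe centre
    unit = sphere_points(n_points)
    total = 0.0
    for i in range(len(coords)):
        points = coords[i] + expanded[i] * unit
        exposed = np.ones(n_points, dtype=bool)
        for j in range(len(coords)):
            if j != i:
                # A point is buried if it lies inside another expanded sphere
                d2 = np.sum((points - coords[j]) ** 2, axis=1)
                exposed &= d2 >= expanded[j] ** 2
        total += 4.0 * np.pi * expanded[i] ** 2 * exposed.mean()
    return total

def interface_area(coords1, radii1, coords2, radii2, probe=1.5):
    """B = A1 + A2 - A12 (eq. 1), with both components in their bound positions."""
    a1 = sasa(coords1, radii1, probe)
    a2 = sasa(coords2, radii2, probe)
    a12 = sasa(np.vstack((coords1, coords2)),
               np.concatenate((radii1, radii2)), probe)
    return a1 + a2 - a12
```

The double loop over atoms is quadratic and only suitable for small examples; production codes use neighbour lists. Two components that are far apart give B = 0, and B grows as their probe-expanded surfaces overlap.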

1.2.2. Crystal Contacts versus Biologically Relevant Interfaces.

The Protein Data Bank (PDB) contains many examples of intermolecular contacts. The most ubiquitous are crystal contacts [27-29]. They constitute a category of non-specific, low-affinity interactions, which can be seen as the background level for the study of the relevant interactions.

Janin et al. [27], in a survey of 1260 pairwise crystal packing contacts observed in a sample of 152 crystal forms of monomeric proteins, showed that these contacts have interface areas ranging from 100 to 2500 Å², although the majority are small, about 600 Å², that is 300 Å² per molecule, only a few percent of the normal solvent accessible surface area of a protein. Nevertheless, each protein forms 6 – 12 such contacts and the total buried surface area can be large, sometimes even larger than half of the protein surface area.

A few pairwise crystal interfaces lie in the 1300 – 1900 Å² interval, considered the standard size for biological interfaces [21] (Figure 5). Almost all of them result from the presence of two-fold and other point group symmetry elements, which are uncommon in crystals of monomeric proteins. Hence the formation of dimers and/or small oligomers in solution precedes crystallization under the conditions in which these crystals are obtained, usually at concentrations in the range 10⁻⁵ – 10⁻³ M.

⁵ The van der Waals surface of a molecule corresponds to the outward-facing surfaces of the van der Waals spheres of its atoms.


Crystal contacts are mainly associated with lattice translations and screw rotations not found in oligomeric proteins. Their size distribution looks like that of the transient interfaces created by random collisions in computer simulations, thus crystal contacts can be interpreted as random protein association [30].

Figure 5. Interface area distribution in protein crystals and in protein complexes. The data include 75 interfaces in protein – protein complexes from Lo Conte et al. [21] and 1260 pairwise interfaces observed in 152 crystal forms of monomeric proteins [27]. For this second type of interface, the vertical scale should be multiplied by 20. This figure is reproduced from Wodak and Janin, 2002 [31].

1.2.3. Protein – Protein Complexes.

The reference survey of protein – protein complexes is the one by Lo Conte et al. [21], which includes the crystal structures of 75 protein – protein complexes. The interface areas in this study ranged 1,100 – 1,500 Å². The distribution has a very well defined maximum at B ~ 1,500 Å² (Figure 5). This interval is remarkably different from the one corresponding to crystal contacts: the lack of interfaces below 1,100 Å² indicates that the formation of a stable and specific complex between two proteins requires making a sufficient number of contacts and removing water from part of the protein surface. In the dataset, 70 % of the complexes had interfaces in the range 1,600 ± 400 Å², which is taken as the standard protein interface size, and 27 % had interfaces larger than 2,000 Å².

1.2.4. Characterizing the Surface Complementarity and Atomic Packing.

The geometric complementarity of two surfaces in contact can be seen as a measure of how well the van der Waals contacts have been optimized. There are several ways to compute it, depending on the purpose. A classic definition is the one by Lawrence and Colman [32], eq. 2:

$$S(x) = \cos(u) \times \exp(-w\,d^2) \qquad (2)$$

Here x denotes a grid point of one surface, d is the distance from x to the nearest grid point on the other surface, u is the angle between the normal vectors to the surfaces at these two points, and w is a weight factor. Grid points are defined by the program MS of Connolly [33], and the weight factor w is set to 0.5 Å⁻². A protein – protein interface typically contains many thousands of grid points, and the values of S(x) range between –1 and 1. Normally most of the negative and low positive values come from the edge of the contact region. Removing the points in a peripheral band of 1.5 Å gives a peak in the 0.6 – 1.0 interval, which means that the molecular surfaces are in contact (small d) and their normal vectors are approximately parallel (small u). However, in common FFT-based docking methods, shape complementarity is computed differently, rather as a ‘surface complementarity’ (Annex 6.3.2.).

The observed distribution of S(x) is skewed and wide. In order to compare interfaces from different complexes, Lawrence and Colman used the median value of each distribution as the shape correlation index $S_c$. It has been shown to be in the range 0.71 – 0.76 for a set of four protease – inhibitor complexes, and in the range 0.65 – 0.67 for five antibody – antigen complexes. According to the shape correlation index, the geometric match in protease – inhibitor binding sites is better than in antigen – antibody binding sites.
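The two steps described above, evaluating S(x) at each surface point and taking the median as the shape correlation index, can be sketched as follows. This is a simplified illustration and not the original program: it assumes that the paired surface normals and the nearest-point distances have already been computed (for instance from a Connolly-type surface), that the normals are oriented so that a good fit corresponds to a small angle u, and that NumPy is available. The function names and the default weight of 0.5 (in Å⁻²) are assumptions for illustration.

```python
import numpy as np

def shape_complementarity(normals_a, normals_b, distances, w=0.5):
    """S(x) = cos(u) * exp(-w d^2) for each pair of facing surface points (eq. 2)."""
    na = np.asarray(normals_a, dtype=float)
    nb = np.asarray(normals_b, dtype=float)
    d = np.asarray(distances, dtype=float)
    # Normalise so that the dot product is the cosine of the angle u
    na = na / np.linalg.norm(na, axis=1, keepdims=True)
    nb = nb / np.linalg.norm(nb, axis=1, keepdims=True)
    cos_u = np.sum(na * nb, axis=1)
    return cos_u * np.exp(-w * d ** 2)

def shape_correlation_index(s_values):
    """Median of the S(x) distribution, used as the summary statistic."""
    return float(np.median(s_values))
```

Using the median rather than the mean makes the index less sensitive to the long tail of low values that come from the edge of the contact region.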

Another measure of surface complementarity, based on the compactness of the interface, is the gap index proposed by Jones and Thornton [19, 20], defined as the ratio of the gap volume to the interface area (eq. 3).

$$i_{gap} = V_{gap} / B \qquad (3)$$

The gap volume is computed by placing a set of spheres in contact with atoms on the solvent accessible surfaces of both components of the protein complex. The spheres with diameter < 1 Å or > 10 Å are removed, and then the volumes of the remaining spheres are summed. This parameter is clearly biased towards the volume at the edge of the interface, since the spheres there are larger than those in the core of the interface and hence contribute more to $i_{gap}$. This bias may be the reason for its reported variability, from 0.35 to 5.2 Å in a set of 25 protein – protein complexes [20], with an average value of 2.5 Å, and for its lack of correlation with $S_c$. A somewhat lower value was reported for enzyme – inhibitor complexes (2.2 Å) and a higher value for antigen – antibody complexes (3 Å), in agreement with the trend observed in the $S_c$ calculations.

An alternative measure of geometric complementarity is based on atomic packing. Complementary surfaces must form compact interfaces with well-packed atoms and few cavities. The packing density can be estimated by first defining a Voronoï polyhedron around each atom, then computing the polyhedron volumes and comparing them with a set of reference volumes. The use of Voronoï polyhedra to analyze protein packing was first proposed by Richards [34] and by Finney [35]. Later it was applied to protein – protein interfaces by Janin and Chothia [36]. A more recent set of atomic Voronoï volumes is the one published by Tsai [37]. These volumes represent averages of the Voronoï volumes of buried atoms inside globular proteins and are 10 % smaller than the corresponding volumes in crystals of small organic molecules. This means that atoms in the interior of protein structures are in general closely packed, despite the presence of some cavities.

The compactness of a set of atoms relative to this reference can be calculated as the ratio of the sum V of their Voronoï volumes to the sum V₀ of the reference volumes. A ratio V/V₀ = 1 corresponds to close-packed atoms, while a larger value corresponds to looser packing.
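The ratio itself is simple to evaluate once the per-atom Voronoï volumes are known. The sketch below assumes that these volumes and the matching reference volumes have already been obtained from a Voronoï construction and a published reference table; the function name `packing_ratio` is hypothetical and no real reference values are included.

```python
def packing_ratio(voronoi_volumes, reference_volumes):
    """V/V0: summed Voronoi volumes divided by summed reference volumes."""
    if len(voronoi_volumes) != len(reference_volumes):
        raise ValueError("each atom needs exactly one reference volume")
    return sum(voronoi_volumes) / sum(reference_volumes)
```

Note that the ratio is taken between the two sums, not averaged atom by atom, so large atoms weigh more than small ones.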

When applied to interface atoms, Voronoï computations have a serious drawback. Polyhedra can be drawn only around fully buried atoms. In most of the known cases this is true for only 1/3 of the interface atoms, in particular those located at the center of the interface. Consequently, measuring atomic compactness on fully buried interface atoms biases the values towards interface cores, as opposed to the gap index, which is biased towards interface peripheries. A solution is to include the crystallographic water molecules at the interface, so that Voronoï polyhedra can be closed for atoms that are surrounded at the same time by protein atoms and water molecules (Figure 6). Doing so increases the number of atoms whose volume can be computed to 2/3 of the interface atoms, on average.

Figure 6. Voronoï polyhedra and packing volumes. A schematic representation is shown, in which the polygon drawn around atom A is the two-dimensional equivalent of a three-dimensional Voronoï polyhedron. In this example, the position of the water molecule is needed to draw the polygon. This figure is reproduced from Wodak and Janin, 2002 [31].

Computing the V/V₀ ratio for the 75 protein – protein complexes of the Lo Conte et al. dataset [21], without considering water molecules, showed that it lies in the 0.97 – 1.06 range. Thus the packing of the buried atoms at the interface is as good as in the interior of protein structures. Repeating the computations over a subset of 36 protein – protein complexes of 2.5 Å resolution or better, this time including crystallographic water molecules, revealed a narrower interval, 0.97 – 1.03, for the V/V₀ ratio. According to these results, it is generally accepted that protein interfaces are as well packed as protein cores, except that water molecules, which are almost excluded from protein cores, contribute significantly to the packing at protein – protein interfaces.

It is worth noting that in the Lo Conte et al. dataset, unlike $S_c$ and the gap index, the volume ratios for the protease – inhibitor and the antigen – antibody complexes, 1.00 and 1.01 respectively, reflect the same close packing.

1.2.5. Chemical Composition and Interactions.

The average composition of interfaces in hetero-complexes is 56 % non-polar (carbon-containing) groups, 29 % neutral polar groups⁶ and 15 % charged groups. This is similar to the composition of protein surfaces [21]. Interfaces of homo-oligomeric proteins tend to be more hydrophobic, with 65 % non-polar, 22 % neutral polar and 13 % charged groups [38]. This difference in composition is thought to be related to the different status of the two types of interfaces: homo-oligomers are considered to be more permanent assemblies, while hetero-complexes are more transient. However, from a strict physico-chemical point of view, many hetero-complexes must be in equilibrium with their corresponding homo-oligomers in solution, so many homo-oligomers are not as permanent as they appear in these surveys.

There are also some differences in interface composition between different categories of complexes. Protease – inhibitor interfaces have 61 % non-polar area, and are on average more hydrophobic than other types of protein – protein interfaces. Antigen – antibody interfaces have on average 51 % non-polar area and thus tend to be more polar, with 25 – 27 % charged area.

⁶ These comprise all groups with N and O atoms not carrying a full electric charge.

The process of protein – protein recognition takes place through non-covalent interactions, which are individually weak but collectively strong enough to produce association. Association results from a trade-off of forces, since the desolvation of the proteins prior to binding must be compensated by favorable van der Waals and electrostatic interactions with the protein partner.

Polar groups at interfaces form hydrogen bonds either with solvent molecules or with each other (direct H-bonds). On average there are about 10 direct interface hydrogen bonds per complex [21], but the number is widely distributed, from a minimum of 1 up to a maximum of 34 [21]. Interestingly, they are thought to be less optimal, and probably weaker, than intramolecular hydrogen bonds [39]. In structures of 2.5 Å resolution or better, there is on average one H-bond per 170 Å² of interface area (the correlation coefficient between the number of H-bonds and B is a mediocre 0.85), and one per 72 Å² of polar area. These values are also in agreement with those found by Jones and Thornton [20]: 1.13 H-bonds per 100 Å² of B/2. The same authors found 0.7 H-bonds per 100 Å² of B/2 in homo-dimeric proteins, in line with the more hydrophobic character of these interfaces compared with those of hetero-complexes.
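The agreement between the two surveys becomes apparent once both densities are expressed per unit of B/2. Since B counts the area buried on both partners, one H-bond per 170 Å² of B is the same as one H-bond per 85 Å² of B/2:

$$\frac{1\ \text{H-bond}}{170\ \text{Å}^2\ \text{of}\ B} = \frac{1\ \text{H-bond}}{85\ \text{Å}^2\ \text{of}\ B/2} \approx \frac{1.18\ \text{H-bonds}}{100\ \text{Å}^2\ \text{of}\ B/2}$$

This is close to the 1.13 H-bonds per 100 Å² of B/2 reported by Jones and Thornton for hetero-complexes, and clearly above the 0.7 found for homo-dimers.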

1.2.6. The Presence of Water Molecules at Protein – Protein Interfaces.

Water molecules are relatively abundant at protein – protein interfaces. Analyses restricted to high-resolution structures (2 Å or better) show that there is on average 1 crystallographic water molecule per 115 Å² of interface area. Even though the correlation between the number of water molecules and the interface area is weaker than for H-bonds, this ratio suggests that about 1 out of 12 of the solvent molecules solvating protein surface atoms remains at the interface after the complex is formed [40]. Since all these water molecules are involved in bridging hydrogen bonds, protein interfaces have at least as many water-mediated interactions as direct H-bonds or salt bridges. Very often, this fact is not properly considered in structural biology. Furthermore, these numbers may be underestimates, because there is no established practice among crystallographers for describing the solvent structure.

In some structures of protease – inhibitor complexes, such as chymotrypsin – ovomucoid, the water molecules line the edge of the interface and form a ring around a dry central patch. In contrast, in some antigen – antibody complexes, whose interfaces are less hydrophobic than those of protease – inhibitor complexes, water molecules are found throughout the interface (wet interface). Homo-dimeric interfaces, which are generally at least as hydrophobic as that of chymotrypsin – ovomucoid, also tend to form dry patches surrounded by rings of water molecules [31]. However, there are examples of dry antigen – antibody interfaces and of wet protease – inhibitor interfaces.

1.2.7. Structural Similarity Measure: Root Mean Square Deviation.

When we compare two protein structures, considering all the common atoms or just subsets of them, we need a measure of the structural similarity between them. The most commonly used measure is the root mean square deviation, or rmsd (eq. 4). The rmsd can be used as a deviation measure between two protein structures regardless of whether they are isolated or complexed, but I have included it here, together with the rest of the structural parameters that characterize protein – protein interactions, for the sake of simplicity.


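$$\mathrm{rmsd} = \sqrt{\frac{\sum_{i=1}^{N_{atoms}} d_i^2}{N_{atoms}}}\,; \qquad d_i = \sqrt{(x_i - x_0)^2 + (y_i - y_0)^2 + (z_i - z_0)^2} \qquad (4)$$

Here $d_i$ is the distance between atom i, at $(x_i, y_i, z_i)$, and its equivalent atom in the other structure, at $(x_0, y_0, z_0)$.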

The rmsd is computed between two sets of equivalent atoms in the two proteins. Normally the atomic distances are measured after one of the molecules has been fitted onto the other. The fitting process minimizes the sum of the squared atomic distances (it is a least-squares fit) by means of a rigid-body translation that brings the geometric centre of the first molecule onto that of the second, followed by a rigid-body rotation that optimally superimposes the two molecules. The most commonly used algorithms to compute least-squares molecular fits are the one by Kabsch [41] and the one by McLachlan [42].
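The translation-then-rotation fit and the rmsd of eq. 4 can be sketched in a few lines. The example below is an illustration only: it uses the singular value decomposition form of the Kabsch fit rather than McLachlan's method, it assumes NumPy, and it expects two coordinate arrays of equal length in which row i of one array is the atom equivalent to row i of the other. The function names are hypothetical.

```python
import numpy as np

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally ordered coordinate sets (eq. 4)."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

def superpose(mobile, target):
    """Least-squares rigid-body fit of `mobile` onto `target` (Kabsch, SVD form)."""
    mobile = np.asarray(mobile, dtype=float)
    target = np.asarray(target, dtype=float)
    # Step 1: translation, bring both geometric centres to the origin
    mobile_centre = mobile.mean(axis=0)
    target_centre = target.mean(axis=0)
    p = mobile - mobile_centre
    q = target - target_centre
    # Step 2: optimal rotation from the SVD of the covariance matrix
    u, _, vt = np.linalg.svd(p.T @ q)
    # Guard against an improper rotation (a reflection)
    sign = 1.0 if np.linalg.det(vt.T @ u.T) >= 0.0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T
    # Rotate the centred mobile set and move it onto the target centre
    return (rotation @ p.T).T + target_centre
```

A typical use is `rmsd(superpose(model, reference), reference)`, which gives the deviation after optimal superposition; calling `rmsd` directly on unfitted coordinates also includes the rigid-body displacement between the two structures.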

In the rest of this thesis, molecular fits refer to McLachlan’s method. Likewise, rmsd values refer to polypeptide backbone atoms (N, Cα, C, O), unless another set of atoms is explicitly mentioned.