Local Protein Structures

HAL Id: inserm-00175058

https://www.hal.inserm.fr/inserm-00175058

Submitted on 11 May 2010

Bernard Offmann, Manoj Tyagi, Alexandre de Brevern

To cite this version:

Bernard Offmann, Manoj Tyagi, Alexandre de Brevern. Local Protein Structures. Current Bioinformatics, Bentham Science, 2007, 2, pp.165-202. ⟨inserm-00175058⟩

Preprint for Current Bioinformatics 2007

Local Protein Structures

Offmann B.¹, Tyagi M.¹⁺ & de Brevern A.G.²*

1 Laboratoire de Biochimie et Génétique Moléculaire, Université de La Réunion, 15, avenue René Cassin, BP7151, 97715 Saint Denis Messag Cedex 09, La Réunion, France

2 Equipe de Bioinformatique Génomique et Moléculaire (EBGM), INSERM UMR-S 726, Université Paris Diderot, case 7113, 2, place Jussieu, 75251 Paris, France

* Corresponding author:

mailing address: Dr. de Brevern A.G., Equipe de Bioinformatique Génomique et Moléculaire (EBGM), INSERM UMR-S 726, Université Paris Diderot, case 7113, 2, place Jussieu, 75251 Paris, France

E-mail : debrevern@ebgm.jussieu.fr Tel: (33) 1 44 27 77 31

Fax: (33) 1 43 26 38 30

key words: secondary structure, protein folds, structure-sequence relationship, structural alphabet, protein blocks, molecular modeling, ab initio.

+ Present address: Computational Biology Branch, National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), 8600 Rockville Pike, Bethesda, MD 20894.

HAL author manuscript inserm-00175058, version 1

Abstract

Protein structures are classically described as composed of two regular states, the α-helices and the β-strands, and one non-regular and variable state, the coil. Nonetheless, this simple definition of secondary structures hides numerous limitations. In fact, the rules for secondary structure assignment are complex. Thus, numerous assignment methods based on different criteria have emerged, leading to heterogeneous and diverging results. In the same way, three states may over-simplify the description of protein structure; 50% of all residues, i.e., the coil, are not genuinely described even though they encompass precise local protein structures.

Descriptions of local protein structures have hence focused on the elaboration of complete sets of small prototypes or “structural alphabets”, able to analyze local protein structures and to approximate every part of the protein backbone. They have also been used to predict the protein backbone conformation and in ab initio / de novo methods. In this paper, we review different approaches to the description of local structures, mainly in terms of secondary structures and of structural alphabets. We provide some insights into their potential applications.

Introduction

Protein folds are often described as a succession of secondary structures. Their repetitive parts (α-helices and β-strands) have been intensively analyzed since their initial description by Pauling and Corey [1]. Nonetheless, this description of 3D structures in terms of secondary structures is not simple, and several major drawbacks must be carefully addressed. Indeed, the rules for secondary structure assignment are not trivial, and so numerous assignment methods based on different criteria have emerged. The greatest discrepancies are found mainly at the caps of the repetitive structures. These small differences can result in different lengths for the repetitive structures, depending on the algorithm used. In addition, a classification of the backbone conformation limited to three states (the classical repetitive secondary structures and coils) does not precisely describe protein structures, because it fails to describe the relative orientation of connecting regions. Besides, the coil state, which covers almost 50% of all residues, corresponds to a large set of distinct local protein structures.

Thus, to circumvent these difficulties, other approaches were developed. They led to a new view of 3D protein structures which are now thought to be composed of a combination of small local structures or fragments, also called prototypes. A given complete set of prototypes defines a “structural alphabet.” Different groups described these local protein structures according to different criteria. Structural alphabets have been used to approximate and analyze local protein structures and to predict backbone conformation.

This paper is divided into two parts. First, we focus on the detailed analysis of known secondary structures with respect to the different secondary structure assignment methods. Second, we present the complete panorama of known structural alphabets, i.e., libraries of protein local structures used in ab initio methods.

Thus, in the Secondary structures section we present the classical and less represented repetitive structures, the irregularities within these structures, the different kinds of turns, the Polyproline II helix and the loops. We focus on the problematic issue of secondary structure assignment and conclude with the step from secondary structure to 3D. The Structural alphabets section presents first the structural libraries dedicated to structure approximation, then the prediction methods based on structural alphabet descriptions, and finally some applications.

Secondary structures

Introduction. The description of protein structures in terms of secondary structures is widely used for analysis or prediction purposes (see the example of Human Liver Glycogen Phosphorylase A [2] taken from the Protein DataBank [3] in Figure 1). The secondary structures are composed of the well-known α-helix [4] and β-sheet [5]. Secondary structure assignment is directly implemented in all 3D structure visualization software (e.g. Rasmol [6], molmol [7] or VMD [8]), which helps in analyzing the protein scaffold. Secondary structures are also used as the basis for classifying protein structures, as in the SCOP [9] and CATH [10] databases. Since precise modeling of the structure of a protein remains a challenge, the prediction of secondary structures is an important research area [11] and has been included in many sophisticated prediction methods, like threading [12] or de novo approaches [13, 14].

Figure 1. Example of analysis of a protein structure fragment (Human Liver Glycogen Phosphorylase A [2], Protein DataBank [3] code 1EXV, residues 400 to 500) described by secondary structures. We can observe a long α-helix (residues 400–417), a Polyproline II helix (438–440), a π-helix (489–494) and a β-sheet (452–454 and 479–481).

Classical secondary structures. Before the first protein structure was solved [15], Pauling and Corey had proposed many stable local protein structures [1, 4, 5], including two major local folds: (i) the α-helix (or 3.6₁₃-helix), characterized by intramolecular hydrogen bonds between amino acid residues i and i+4, and (ii) the β-sheet, composed of extended chains with hydrogen bonds between adjacent chains. They represent roughly one third and one fifth of the residues found in proteins, respectively. Long and short α-helices do not have the same amino acid composition according to Richardson and Richardson [16] and Pal and co-workers [17]. The extremities of α-helices have specific amino acid propensities [18-22] and specific physicochemical stabilizations [23]. For instance, C-capping motifs of α-helices are often stabilized by hydrophobic interactions between helical residues and residues outside the repetitive structures [24], e.g. the Pro C-capping motif [25]. As another example, helix 9, the major structural element in the C-terminal region of class Alpha glutathione transferases (GSTs), forms part of the active site of these enzymes, where its dynamic properties modulate both catalytic and binding functions. The importance of the conserved aspartic acid N-capping motif for helix 9 was identified by Dirr and co-workers using sequence alignments of the C-terminal regions of class GSTs and in silico approaches [26, 27]. Indeed, the replacement of the N-cap residue Asp-209 destabilizes the complete region.

A β-sheet is formed by the association of several β-strands via hydrogen bonds between residues from two distinct strands [5]. Thus, a fundamental difference between the two main regular secondary structures, α-helices and β-sheets, is the non-local nature of the hydrogen bonds in the latter: partners can be far from each other in the sequence. Depending on the strand orientation, a β-sheet can be parallel, anti-parallel or mixed, resulting in different hydrogen-bonding patterns [28]. This kind of planar arrangement introduces a periodicity in the side-chain orientation: side-chains point alternately toward one side and the other side of the sheet. As for the α-helix, the sequence specificity of β-strands has been widely studied [29], as well as the terminal residues of strands [30]. Nonetheless, neither experimental work [31-33] nor statistical work on pair correlation [34, 35] has yielded simple conclusions about the specificity of pair interactions between neighboring residues of adjacent β-strands. The β-sheet assembly is more complex than simple pair complementarities [28, 36].

A consequence of this difference is the more complex nature of β-sheet formation and our weaker understanding of the underlying mechanism [37]. Synthetic peptide combinatorial libraries arose as a source of new lead compounds [38]. Combinatorial libraries of genes provided new proteins or protein domains [39], and peptide libraries built on α-helical scaffolds appeared as a useful strategy for the identification of new antimicrobial and catalytic synthetic α-helical peptides [40]. On the other hand, the construction of peptide libraries that fold as β-sheet structures is more recent [41]. This is mainly due to the scarcity of data and the incomplete understanding of the factors determining the formation of such secondary structure motifs [36, 42]. Over the last three decades, more than a thousand secondary structure prediction methods have been elaborated, from the early statistical approaches [43-46] to complex Artificial Neural Networks and hidden Markov models [47-51].

Other repetitive structures. 3₁₀- and π-helices are less frequent helical states, representing roughly 4% and 0.02% of the residues in proteins, respectively. The 3₁₀-helix is characterized by intramolecular hydrogen bonds between amino acid residues i and i+3 [52-54]. The majority of 3₁₀-helices are short, containing three (one turn) or four residues, but two-turn and longer 3₁₀-helices have been reported [53]. They are commonly found at the termini of α-helices [55, 56] and act as connectors between two α-helices [54, 56]; their sequence content differs from that of the α-helix [57]. An analysis of the sequence and structural features of 3₁₀-helices adjoining α-helices and β-strands has recently been carried out. It shows that composites of 3₁₀-helices and β-strands are much more conserved among members of families of homologous structures than those between two types of helices; often, the 3₁₀-helix constitutes the loop in β-hairpin or β-β-corner motifs [58].

In the π-helix (i.e., the 4.4₁₆-helix), hydrogen bonds are formed between amino acid residues i and i+5. This helix conformation is less stable due to steric constraints [59-62]. Fodje et al. [63] showed that π-helices occur more frequently in protein structures than was previously described and are conserved within functionally related proteins. Weaver found that, in 8 out of 10 confirmed crystal structures containing π-helices, the unique conformation of the π-helix was directly linked to the formation or stabilization of a specific binding site within the protein [64]. A dynamic relationship may exist between the different kinds of helices, as shown for instance between α- and π-helices [65]. 3₁₀-helices, and to a lesser extent π-helices, have been proposed to be intermediates in the folding/unfolding of α-helices [66-68].
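The hydrogen-bond patterns quoted above for the three helical states (i to i+3, i to i+4 and i to i+5) can be summarized in a short Python sketch. The function names are hypothetical and only label a single backbone hydrogen bond; actual assignment methods additionally require several consecutive bonds and other criteria that are not modeled here.

```python
# Sequence offsets of the backbone hydrogen bonds quoted in the text.
# The labels recall the "residues per turn / atoms in the H-bonded ring" notation.
HELIX_BY_OFFSET = {
    3: "3_10-helix",
    4: "alpha-helix (3.6_13)",
    5: "pi-helix (4.4_16)",
}


def helix_type(i: int, j: int) -> str:
    """Name the helical state suggested by one H-bond between residues i and j."""
    return HELIX_BY_OFFSET.get(abs(j - i), "non-helical offset")


def label_bonds(bonds):
    """Label a list of (i, j) residue pairs; purely illustrative input format."""
    return [(i, j, helix_type(i, j)) for i, j in bonds]


if __name__ == "__main__":
    for row in label_bonds([(10, 13), (10, 14), (10, 15), (10, 40)]):
        print(row)
```

A bond between residues far apart in the sequence, such as the last pair in the usage example, falls outside the helical offsets, which matches the non-local hydrogen bonding described for β-sheets.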

Since the description of β-strands, several analyses have shown that a strand can be found independently of a β-sheet, i.e. the isolated E-strand [69]. These isolated E-strands are clearly distinct from classical β-strands involved in β-sheets: (i) they exhibit a particular sequence specificity, for example an over-representation of Proline residues, and (ii) they display high solvent exposure in the structures. Hence, the isolated strands are related to loops but have an extended geometry. Due to the low occurrence of these different states, the number of prediction methods dedicated to them is very limited. One method worth noting is SSPRO8, which performs reasonably well [49].

Irregularities in repetitive structures. The π-bulges form a particular kind of discontinuity in helical structures. Like π-helices [64], they are not frequently observed, but they seem to be directly associated with protein function [70]. For instance, the Erb2 protein transmembrane domain has been shown, using molecular dynamics approaches [71-73], to display a transition state from an α-helix to a π-bulge motif, and this has further been confirmed by experimental approaches [74]. The π-bulge is also named α-aneurism, as this structural motif was revealed in an insertion mutant of staphylococcal nuclease [75]. Since then, other cases have been found, e.g. Plasmodium falciparum 1-Cys peroxiredoxin [76] or the µ-opioid receptor [77].

In the same way, most of the observed α-helices are distorted due to the presence of proline residues [78, 79], solvent-induced distortions [80] or peptide bond distortions [81]. Due to these local modifications, the three-dimensional path of an α-helix often becomes non-linear [82]. Barlow and Thornton found in their set only 15% of linear helices, while 17% were kinked and 58% curved [56]. These conclusions were confirmed by Kumar and Bansal with an enlarged dataset [83] and very recently by Martin and co-workers [84]. Bansal and co-workers have analyzed helix geometry [85] and developed specific software to classify the helices, showing again that an important proportion are in fact curved or composed of two (or more) distinct helices [86].

A β-bulge is defined as a region between two consecutive β-type hydrogen bonds, which includes two or more residues on one strand opposite a single residue on the other strand [87, 88]. Found primarily in anti-parallel β-sheets, β-bulges are common, occurring on average twice per protein [89]. These irregularities were first classified by Richardson and co-workers into two types [87] and later into five classes by Thornton [89].

The extra residue(s) on the bulged strand not only disrupt the normal alternation of side-chain direction, but also affect the directionality of β-strands and accentuate the typical right-handed twist of β-sheets. For these reasons, β-bulges are often well conserved in proteins. Their role is not clear; they may facilitate insertions or deletions in β-strands or position crucial residues by accentuating the local twist of the strands [90, 91], as has been shown with insertions and deletions in a β-bulge region of Escherichia coli dihydrofolate reductase [91] and ubiquitin [92]. As they are more exposed than other β-strand residues, they play an important role in protein–protein interaction and in protein function [93, 94] and have been suggested to be associated with some pathologies, like the aggregation of proteins into a fibrillar structure in the case of several neurodegenerative disorders [95]. However, the underlying molecular basis for the formation of β-bulges in proteins remains poorly understood.

Turns. Regions connecting repetitive helical and extended structures, known as loops, have been extensively studied over the last decades. However, their classification is difficult to achieve, particularly for loop regions composed of more than 8 residues [96-99], where more precise descriptions are needed to encompass their whole diversity.

Alongside the helices and the strands, turns are perhaps among the most interesting local folds. By definition, turns are small elements of secondary structure. They consist of n consecutive residues (denoted i to i+n-1), with a distance between the Cα atoms of residues i and i+n-1 that must be smaller than 7 Å (or 7.5 Å, depending on the authors). The tight turns comprise γ-turns (n = 3), β-turns (n = 4), α-turns (n = 5) and π-turns (n = 6). The restrictive distance of 7 Å implies a particular backbone geometry, which allows the chain to turn back on itself or, more generally, to change direction. As they orient α-helices and β-strands, turns play a major role in the final protein topology. To avoid confusion with α-helices (which can obviously be considered as a succession of turns), the central residues of turns must not be helical, e.g. residues i+1 and i+2 for β-turns. Often, a hydrogen bond between the C=O of residue i and the N-H of residue i+n-1 stabilizes the turn structure. Turns are classified into types according to the values of the dihedral angles φ and ψ of the central residues. For β- and α-turns, a deviation of ±30° from the canonical values is allowed on three of these angles, while the fourth can deviate by ±45° [100].
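The distance criterion can be illustrated with a minimal Python sketch. It assumes that the two end residues compared are separated by `span` positions in the sequence (3 for a β-turn, i.e. residues i and i+3) and that Cα coordinates are given as (x, y, z) tuples in Å. The function name and input format are hypothetical, and the additional requirement that the central residues be non-helical is not checked.

```python
import math


def turn_candidates(ca_coords, span=3, max_dist=7.0):
    """Return start indices i where Calpha(i) and Calpha(i + span) are closer than max_dist.

    ca_coords: list of (x, y, z) tuples in angstroms, one per residue.
    span:      sequence separation of the two end residues (assumed 3 for a beta-turn).
    max_dist:  distance cutoff; 7.0 or 7.5 angstroms depending on the authors.
    """
    hits = []
    for i in range(len(ca_coords) - span):
        if math.dist(ca_coords[i], ca_coords[i + span]) < max_dist:
            hits.append(i)
    return hits


if __name__ == "__main__":
    # Fully extended toy chain: consecutive Calpha atoms 3.8 A apart on a line.
    straight = [(3.8 * k, 0.0, 0.0) for k in range(6)]
    # Toy chain that folds back on itself after two residues.
    bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (1.2, 3.6, 0.0)]
    print(turn_candidates(straight))  # no window satisfies the cutoff
    print(turn_candidates(bent))      # the single window starting at residue 0
```

Windows that pass this filter are only candidates; the turn type is then decided from the φ and ψ angles of the central residues.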

The two most studied turns are the γ-turns (3 residues) and the β-turns (4 residues). The γ-turns fall into two categories, classic and inverse (see Table 1) [101-105]. The β-turns as defined by Venkatachalam are characterized by a hydrogen bond between the backbone C=O and N-H groups of residues i and i+3; types I, II, III and their corresponding mirror images I’, II’ and III’ were characterized [106]. These results have been confirmed with a limited set of proteins [107, 108]. Lewis enlarged this definition to several new categories: the β-turns V and V’, the β-turn VI, which is characterized by the presence of a Proline, the β-turn VII, which is associated with a kink, and the β-turn IV, corresponding to all the non-classified β-turns [109].

The very first documented analyses of turns in protein structures used this classification scheme [110-114]. However, several turn types have since been excluded. The β-turns III and III’ are too close to the 3₁₀-helix, and the turns V, V’ and VII are too rare and their definitions are inaccurate [100]. On the other hand, type VI was divided into two sub-types, VIa and VIb. Lastly, Venkatachalam also noticed that some distorted type I β-turns have their φ(i+2) in the β-strand region (instead of α). Later, Wilmot and Thornton precisely defined type VIII [115], which is essentially based on Richardson’s type Ib. Finally, Hutchinson and Thornton [116] divided type VIa into two sub-types, VIa1 and VIa2. The definitions used by Thornton’s group [89, 117] are nowadays considered the standard (see Table 1). β-turns are widely analyzed in molecular dynamics [118], and prediction methods have been developed [119-126]. Motifs and conformational analysis of amino acid residues adjoining β-turns in proteins have also been extensively described [127].

So, γ- and β-turns are the most important secondary structures after the α-helix and β-sheet. β-turns account for roughly 25 to 30% of the residues [128]. Interestingly, they are often observed as repeated tandems, sometimes leading to long series of γβ, βγ, ββ or γγ turns [129]. It is also noteworthy that γ- and β-turns are found associated with the same residues [130, 131].

γ-turn (a)    φ(i+1)    ψ(i+1)
Classic        75.00    -64.00
Inverse       -79.00     69.00

β-turn (b)    φ(i+1)    ψ(i+1)    φ(i+2)    ψ(i+2)
I             -60.00    -30.00    -90.00      0.00
I’             60.00     30.00     90.00      0.00
II            -60.00    120.00     80.00      0.00
II’            60.00   -120.00    -80.00      0.00
III           obsolete
III’          obsolete
IV (c)          ----      ----      ----      ----
V             obsolete
VIa1 (d)      -60.00    120.00    -90.00      0.00
VIa2 (d)     -120.00   -120.00    -60.00      0.00
VIb (d)      -135.00    135.00    -75.00    160.00
VII           obsolete
VIII          -60.00    -30.00   -120.00    120.00

α-turn (b)    φ(i+1)    ψ(i+1)    φ(i+2)    ψ(i+2)    φ(i+3)    ψ(i+3)
I RS          -60.00    -29.00    -72.00    -29.00    -96.00    -20.00
I LS           48.00     42.00     67.00     33.00     70.00     32.00
II RS         -59.00    129.00     88.00    -16.00    -91.00    -32.00
II LS          53.00   -137.00    -95.00     81.00     57.00     38.00
I RU           59.00   -157.00    -67.00    -29.00    -68.00    -39.00
I LU          -61.00    158.00     64.00     37.00     62.00     39.00
II RU          54.00     39.00     67.00     -5.00   -125.00    -34.00
II LU         -65.00    -20.00    -90.00     16.00     86.00     37.00
I C          -103.00    143.00    -85.00      2.00    -54.00    -39.00

Table 1. Values (in degrees) of the dihedral angles of γ-turns [105], β-turns [117] and α-turns [134].

(a) Allowed angle variations: ±40°.

(b) Allowed angle variations: ±30°, with at most one angle allowed to deviate by ±45°.

(c) Turns which do not fit any of the above criteria are classified as type IV.

(d) Types VIa1, VIa2 and VIb are characterized by a cis-proline at position i+2.
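As an illustration of how Table 1 is used, the following Python sketch assigns a β-turn type from the four central dihedral angles. The canonical values are copied from Table 1; the function names, the periodic treatment of angle differences, and the choice to test the type VI entries only when a cis-proline is flagged at position i+2 are assumptions of this sketch, not a published implementation.

```python
# Canonical (phi(i+1), psi(i+1), phi(i+2), psi(i+2)) values, copied from Table 1.
BETA_TURNS = {
    "I":    (-60.0,  -30.0,  -90.0,   0.0),
    "I'":   (60.0,    30.0,   90.0,   0.0),
    "II":   (-60.0,  120.0,   80.0,   0.0),
    "II'":  (60.0,  -120.0,  -80.0,   0.0),
    "VIa1": (-60.0,  120.0,  -90.0,   0.0),
    "VIa2": (-120.0, -120.0, -60.0,   0.0),
    "VIb":  (-135.0, 135.0,  -75.0, 160.0),
    "VIII": (-60.0,  -30.0, -120.0, 120.0),
}
CIS_PRO_TYPES = {"VIa1", "VIa2", "VIb"}  # footnote (d): cis-proline at i+2


def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (periodic)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def matches(angles, canonical, tol=30.0, loose_tol=45.0):
    """Footnote (b): all angles within tol, except at most one within loose_tol."""
    devs = [angle_diff(x, c) for x, c in zip(angles, canonical)]
    if any(d > loose_tol for d in devs):
        return False
    return sum(d > tol for d in devs) <= 1


def classify_beta_turn(angles, cis_pro_at_i2=False):
    """Return the first matching type from Table 1, or 'IV' (footnote c)."""
    for name, canonical in BETA_TURNS.items():
        # Assumption: type VI entries are tested only for cis-proline turns.
        if (name in CIS_PRO_TYPES) != cis_pro_at_i2:
            continue
        if matches(angles, canonical):
            return name
    return "IV"


if __name__ == "__main__":
    print(classify_beta_turn((-65.0, -25.0, -95.0, 5.0)))
    print(classify_beta_turn((-60.0, 120.0, -90.0, 0.0), cis_pro_at_i2=True))
```

The input must already be a four-residue window that satisfies the turn distance criterion; this sketch covers only the dihedral-angle step.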

Shorter turns (e.g. 2-residue δ-turns) [132] and longer ones (e.g. 5-residue α-turns [133-135] and 6-residue π-turns [136]) have been less studied. Only α-turns have been the object of a classification scheme (see Table 1) [134]. α-turns have a functional role in molecular recognition and protein folding. For instance, residues in the α-turn of human lysozyme participate in a cluster of hydrogen bonds and are located in the active site cleft, suggesting the possibility of a functional role [137]. Some are also involved in metal ion coordination [138, 139]. Moreover, α-turns are also relevant structural domains in small peptides, particularly in cyclopeptides containing 7–9 residues in their sequence [140-142]. Recently, a very elegant classification of α-turns has been proposed, and the analysis of sequence–structure correspondence has highlighted the potential implication of α-turns in helix folding [143].

Polyproline II. The Polyproline II (PII) helices correspond to a specific local fold first discovered in fibrous proteins [144-146]. They contribute to the creation of the coiled-coil supersecondary structures characteristic of these fibrous proteins but are also found in numerous globular proteins. Because of its characteristic backbone angles and trans peptide bonds, the PII helix is a left-handed helical structure with an overall shape resembling a triangular prism. It is extended, with a helical pitch of 9.3 Å per turn and 3 residues per turn. This helical conformation is characterized by canonical values of φ around -75° and ψ around +145°, i.e. dihedral angle values characteristic of β-strands. There has recently been an increase of interest in PII conformations [147-151], especially in the field of molecular dynamics [148, 152-154]. Even though they are called polyproline, they are not composed only of successive Proline residues, and some PII helices have no Proline at all [155-159], like short stretches of poly-glutamines [160]. Adzhubei and Sternberg [155] found 96 PII helices in a databank of 80 proteins, which was considered unexpectedly frequent. They found that these PII helices were highly solvent-exposed and tended to have high crystallographic temperature factors. PII helices are not stabilized by salt bridges [161]. It was suggested that PII helices are often stabilized by main-chain–water hydrogen bonds (in the absence of main-chain–main-chain H-bonds) and tend to have a regular pattern of hydrogen bonds with water [162]. They are, however, still much less solvent-accessible than experimentally studied peptides. Stapley and Creamer [157] additionally suggested that local side-chain to main-chain hydrogen bonds are important in stabilizing PII helices. Cubellis and co-workers recently highlighted that PII helices are stabilized by non-local interactions [150]. They do not display strong sequence propensities, in contrast with other extended conformations such as β-strands [163]. The non-local stabilization of hydrogen-bond donors and acceptors does, however, result in PII conformations being well suited for participating in protein-protein interactions. They are suspected to be implicated in amyloid formation [164, 165] and nucleic acid binding [166].

As recently highlighted, current molecular dynamics parameters seem to underestimate the polyproline II conformation and thus lower its observed frequency [167].
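A minimal Python sketch of how residues could be flagged as PII-like from the canonical values quoted above (φ around -75°, ψ around +145°) is given below. The ±30° tolerance and the minimum run length of three residues are assumptions chosen for illustration, not values taken from the cited studies. Because this region of the (φ, ψ) map overlaps that of β-strands, an actual assignment would also have to exclude residues engaged in β-sheet hydrogen bonds, which this sketch does not do.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (periodic)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def is_pii(phi, psi, tol=30.0, phi0=-75.0, psi0=145.0):
    """True if (phi, psi) lies within tol degrees of the canonical PII values."""
    return angle_diff(phi, phi0) <= tol and angle_diff(psi, psi0) <= tol


def pii_stretches(dihedrals, min_len=3, tol=30.0):
    """Return (start, end) pairs, end exclusive, of runs of PII-like residues.

    dihedrals: list of (phi, psi) tuples in degrees, one per residue.
    min_len:   assumed minimum number of consecutive PII-like residues.
    """
    runs, start = [], None
    for k, (phi, psi) in enumerate(dihedrals):
        if is_pii(phi, psi, tol):
            if start is None:
                start = k
        else:
            if start is not None and k - start >= min_len:
                runs.append((start, k))
            start = None
    if start is not None and len(dihedrals) - start >= min_len:
        runs.append((start, len(dihedrals)))
    return runs


if __name__ == "__main__":
    chain = [(-60, -45), (-60, -45), (-75, 145), (-70, 150), (-80, 140), (-120, 130)]
    print(pii_stretches(chain))
```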

Loops. Even after classification of the protein backbone using the three classical states described above, many residues (nearly half of them) remain associated with the coil state. Several studies have hence focused on distinct conformational subsets of loops linking specific secondary structures. There are four distinct loop classes (α-α, α-β, β-α and β-β) [168], and most of the studies have focused on loops of fewer than 9 residues.

The β-hairpins correspond to loops connecting two adjacent antiparallel strands. They have been widely analyzed since they are widespread in globular proteins. Different classes have been identified, resulting in the definition of structural families [169-172]. Interestingly, short hairpins are often characterized by a specific turn, i.e., a quick return of the protein backbone [173], like a β-turn [174]. Sometimes, stabilization by disulfide bonds is observed [175]. Characteristic sequence patterns have also been highlighted, e.g. in erythropoietin receptor agonist peptides [176], and used to aid loop homology modelling [177]. The β-hairpins have been well studied in molecular dynamics [178, 179]. Other types of motifs connecting two β-strands have also been analyzed, like the β-β corners [173].

Orthogonal ββ motifs, i.e. consecutive strands forming an ‘L’ structure with an angle of 90°, have been identified [173, 180]. These motifs are often associated with particular types of loops making the connection.

α-α turn motifs and corners in proteins have also been described in detail [181, 182]. A recent study showed that two predominant linking backbone conformations are observed for a given short link length, and that some linking backbone conformations correlate strongly with distinctive inter-helical geometric parameters [183]. Wintjens and co-workers [184] presented an automatic classification procedure for short protein fragments and described ten α-α turn families that tend to exhibit some conserved sequence features. As for α-β and β-α loops, preferred conformations have also been found [96, 185, 186].

Other interesting local structures, less frequent than the turns, have also been described in the coil state. For instance, the Ω-loops constitute a particular category characterized by a small distance between their extremities and a large number of contacts within their structure [187-189]. They correspond to compact globular loops mainly located at the surface of proteins [190]. They may be directly associated with protein function [191-195] and folding [192]. A large number of studies have focused on cytochrome c and the use of compatible Ω-loops to replace existing local 3D conformations [191, 196-198]. These studies have to be viewed as complementary to investigations on closed loops, Tight End Fragments (TEF) and MIR (Most Interacting Residues), which define loop fragments whose ends nearly close in three-dimensional (3D) space [199-201]. These fragments are composed not only of coil residues but also of residues belonging to regular secondary structures [202].

Loops within protein families have been extensively analyzed with respect to their role in the stabilization of proteins [104]. Tramontano and Lesk used rmsd (root mean square deviation) criteria to describe the structural determinants of the conformations of medium-sized loops in proteins, focusing on immunoglobulins [203, 204]. This early work already highlighted the difficulty of analyzing long loops. Complete classifications have been attempted only for short and medium-sized loops, due to the low occurrence of longer loops and to their larger variability. For instance, Rice and co-workers showed the example of a helix–turn–strand motif found in α-β proteins that was well characterized in short loops, but not in longer loops due to the absence of local constraints [205].
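For reference, the rmsd between two loop conformations with N equivalent atoms is commonly written as follows. The notation is ours rather than that of the cited studies: x_i and y_i denote the coordinates of atom i in the two loops, usually after an optimal rigid-body superposition.

```latex
\mathrm{rmsd} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left\lVert \mathbf{x}_i - \mathbf{y}_i \right\rVert^{2}}
```

Two identical conformations give an rmsd of 0 Å, and the value grows with the average displacement between equivalent atoms, which is why a fixed rmsd threshold can be used to decide whether two loops belong to the same cluster.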

Ring and co-workers [206] are the only ones to have proposed a classification scheme for loops based on their linearity and planarity, defining three categories: the linear strap loops, the non-linear and planar omega loops (not to be confused with the Ω-loops), and the non-linear and non-planar zeta loops. Their databank was composed of 432 loops. Interestingly, they proposed to place the longest loops in a fourth category, defined as any combination of the first three. They used their analysis to propose a prediction approach based on genetic algorithms, named Bloop [207].

Sun and Jiang used a non-redundant databank of 240 proteins with a resolution better than 2.5 Å, focusing on loops of one to five residues [208]. The classification is based on a clustering of the φ-ψ space into zones; 34 classes of supersecondary motifs occurring at least five times were identified, most of them being commonly occurring supersecondary structure motifs.

In Sloop, elaborated by Donate and co-workers [209, 210], loops were classified according to their length (from one to eight residues), the type of bounding secondary structures and the conformation of the main chain. The clustering was performed by hierarchical clustering based on the rmsd distance between loops. In this way, 161 well populated conformational classes were determined and further grouped into families. For each conformational class, amino acid sequence preferences were identified. Residues located in highly conserved positions were shown to be mainly involved in the stabilization of the loop conformation or to be associated with specific functions. New classes included a 2:4 type IV hairpin, a helix-capping loop, and a loop that mediates dinucleotide binding [210]. Their databank comprises 2,024 loops taken from 223 proteins (resolution < 2.7 Å). An approach for loop prediction was further proposed, based on the identification of preferred loop conformational classes in the databank [211]. For every query, the procedure consisted in identifying, among the 161 conformational classes, those that were compatible in terms of sequence preferences and disposition of bounding secondary structures. Further prediction was performed with a new evaluation dataset comprising 1,785 loops extracted from 138 new proteins that share less than 35% sequence identity with the initial set of proteins.

Updates of this databank of supersecondary fragments were then performed, with a considerable increase in the number of conformational classes, amounting to 560 well populated categories with loops up to 20 residues in length [212, 213].

Geetha and Munson [214, 215] used a set of 330 proteins sharing less than 45% sequence identity and with a resolution better than 2.25 Å. The clustering algorithm used two criteria: Cα distances within the loop fragments and dihedral angles of the protein backbone. They analyzed 3,313 loops of two to eight residues, highlighting for instance the orthogonal architecture of the α-class proteins. They described new clusters and new relationships between sequences and structures.

Wloop is an interesting approach developed since 1996 by Chomilier and his group [216] that proposes a taxonomy of loops. Wloop proceeds by clustering loops of three to eight residues in length. Loops of the same length were placed in a common reference frame and classified within families of similar three-dimensional structures. The dataset used was composed of 243 proteins sharing less than 50% sequence identity. Contrary to most known loop classification procedures, the clustering methodology does not rely on the nature of the neighbouring secondary structures. In total, 1,586 loops were grouped into 183 clusters. Sequence and conformational signatures were then deduced. The loop taxonomy differentiates clusters by relying on the mean distance between the first and last alpha carbons and the distance to the centre of gravity of the cluster. The database was then extended to 13,563 loops extracted from 1,411 protein structures sharing less than 50% sequence identity [97]. Using this new classification scheme, prediction was assessed on a new evaluation dataset of 47 and 48 entries sharing, respectively, less than and more than 95% redundancy with the PDB. The Wloop web service has recently been upgraded to support the newly implemented prediction scheme [217].

Wintjens and co-workers used a two-step methodology to define their loop clusters [96, 184]. The first step consisted in clustering the loop fragments according to zones within Ramachandran maps. In the second step, the loops within each class were superimposed to evaluate the quality of the clusters. A cluster was split if rmsd values exceeded a fixed threshold. From a dataset of 141 proteins sharing less than 20% sequence identity, they analyzed 15 αβ and 15 βα kinds of loops [96]. Previously, they had characterized 10 αα categories of loops. They focused on the most populated clusters. This databank was used by Boutonnet and co-workers to characterize αββ and ββα supersecondary structures [98].

ArchDB was developed by Oliva and co-workers [218]. They analyzed 3,005 loops from a non-redundant databank of 283 proteins sharing less than 25% sequence identity and classified them into five major types according to their flanking secondary structures: α-α, β-β links, β-β hairpins, α-β and β-α. The clustering algorithm, based on both the loop main-chain dihedral angles and the geometry of the bracing secondary structures, generated 56 classes that were further subdivided into 121 sub-classes. Consensus sequences were then derived. The clustering procedure was then improved and fully automated, resulting in the ArchDB database [219]. In addition, updates enabled the inclusion of clusters for many long loops. ArchDB was also intended to provide functional information. The authors therefore used this approach to classify the loops obtained from a set of 141 protein structures classified as kinases. A total of 1,813 loops were classified into 133 subclasses (9 ββ links, 15 ββ hairpins, 31 αα, 46 αβ and 32 βα). Functional information and specific features relating subclasses and function were included in the classification. Functional loops were classified into structural motifs, e.g. the P-loop shared by different folds. Hence, a common mechanism for catalysis and substrate binding was supported for most kinases [220]. ArchDB has also been used for prediction with excellent results [221]; the dataset used was based on SCOP 40 of the 1.61 SCOP release [9]. A recent application of such an approach has found more than 500 new putative function-related motifs not reported in PROSITE [222].

Li et al. [223] developed a database of loops extracted from a set of homologous proteins taken from the FSSP database [224], where the structures had a resolution better than 2.5 Å. In their study, loops were grouped into families when they had well-superimposed bounding secondary structures. They used a hierarchical average linkage cluster analysis, which resulted in 84 loop families with loops 2 to 13 residues long. Subfamilies were generated and sequence features were characterized. This work enabled them to observe the diversity of loops on specific protein frameworks.
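As an illustration of this kind of analysis, the following Python sketch groups loops by average-linkage hierarchical clustering. It is not the procedure of Li et al.: the dissimilarity used here (an rmsd between internal Cα distance matrices, which avoids an explicit superposition), the function names and the cutoff value are assumptions chosen for the example. Loops must have the same length and at least two residues.

```python
from itertools import combinations
from math import dist, sqrt

def distance_rmsd(loop_a, loop_b):
    """Superposition-free dissimilarity between two loops of equal length.

    Each loop is a list of (x, y, z) C-alpha coordinates. The score is the
    root mean square difference between the two internal distance matrices
    (an assumption of this sketch, not a published loop metric).
    """
    pairs = list(combinations(range(len(loop_a)), 2))
    squared = sum((dist(loop_a[i], loop_a[j]) - dist(loop_b[i], loop_b[j])) ** 2
                  for i, j in pairs)
    return sqrt(squared / len(pairs))

def average_linkage(items, dissimilarity, cutoff):
    """Merge the two closest clusters until their average distance exceeds cutoff."""
    clusters = [[i] for i in range(len(items))]
    d = {(i, j): dissimilarity(items[i], items[j])
         for i, j in combinations(range(len(items)), 2)}

    def linkage(c1, c2):
        # Average of all pairwise dissimilarities between the two clusters.
        total = sum(d[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (a, b), best = min(
            (((a, b), linkage(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda t: t[1])
        if best > cutoff:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy data: two extended fragments (one translated) and one bent fragment.
extended = [(3.8 * i, 0.0, 0.0) for i in range(4)]
extended_shifted = [(3.8 * i, 5.0, 1.0) for i in range(4)]
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
clusters = average_linkage([extended, extended_shifted, bent],
                           distance_rmsd, cutoff=1.0)
```

With these toy coordinates the two extended fragments end up in one family and the bent fragment in another; the cutoff plays the role of the level at which the clustering tree is cut.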

The “Loops In Proteins” (LIP) database developed by Michalsky and co-workers [225] is based on a non-redundant protein databank (sharing less than 20% sequence identity) of excellent resolution (better than 1.8 Å). It included all protein segments ranging from 1 to 15 residues in length contained in the Protein Data Bank, which amounts to about 10⁸ segments. This database was used for loop prediction in the framework of homology modelling. The prediction strategy consisted in efficiently selecting loop candidates from the database and in ranking them. The main-chain atoms of the top-scoring loop candidates were chosen as templates. Accurate prediction results were obtained, particularly for long loops.

| Name | Web address | Database | Prediction | Last update |
|---|---|---|---|---|
| Sloop [211] | http://www-cryst.bioc.cam.ac.uk/sloop/ | yes | no | 2002 |
| Archdb [218] | http://gurion.imim.es/cgi-bin/archdb/loops.pl | yes | no | 2004 |
| Wloop [97] | http://bioserv.rpbs.jussieu.fr/cgi-bin/WLoop ; http://psb11.snv.jussieu.fr/wloop/ (obsolete version) | no | yes | 2006 |
| LIP [225] | http://www.drug-redesign.de/LIP/ | no | no | Test set (2003) |

Table 2. Web services about protein loops.

Table 2 summarizes the available web services. Nonetheless, it must be noted that the major difficulty remains the definition of the regular secondary structure elements, since the assignment of their boundaries directly defines the loops (see the assignment methods section below). Similarly, most of these studies did not take all protein loops into account: the rarely occurring loops were usually discarded.

The assignment methods. Secondary structure assignment methods (SSAMs) are often not considered a problem in their own right, since visualization tools perform the assignment “naturally”. However, as noted by Arthur Lesk in his book [226], “what is unfortunate is that people use these secondary structure assignments unquestioningly; perhaps the greatest damage the programs do is to create an impression (for which Levitt, Greer, et al., [i.e., authors of SSAMs] cannot be blamed) that there is A RIGHT ANSWER. Provided that the danger is recognized, such programs can be useful”. Indeed, different SSAMs exist. The difference between prediction and assignment of true structures has been known for a long time [227]. However, the difference between secondary structure assignments is less frequently highlighted [228]. As noted earlier by Colloc’h and Cohen [30] and Woodcock and co-workers [229], a serious issue raised by the variety of methods for secondary structure assignment is that they often yield diverging results. Methodologies also differ in the level of detail they offer (i.e., the number of secondary structures they distinguish). In the following paragraphs, we describe some of the existing assignment methods. Table 3 presents a summary of available SSAMs with the states that they can assign.

The first software developed was proposed by Levitt and Greer and used only the Cα positions, as these atoms are the most precisely defined by X-ray crystallography [230]. In this paper, the authors described other assignment criteria based on the torsion angle α and on hydrogen bonds. They compared their assignments with the available assignment of 33 α-helices and 25 β-sheets. They highlighted the difficulty of precisely assigning short β-sheets using Cα atoms only, so they combined the inter-Cα distance assignment with the H-bond assignment. Nowadays, the most widely used approaches are based on the identification of hydrogen bond patterns (DSSP [231], DSSPcont [232], SECSTR [63] and STRIDE [233]).
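The torsion angle α mentioned above is the dihedral angle defined by four consecutive Cα atoms. The following Python sketch computes it along a Cα trace; the function names are hypothetical, and the idealized helix used as input (radius 2.3 Å, rise 1.5 Å and 100° rotation per residue) relies on assumed textbook values rather than on parameters taken from [230].

```python
from math import atan2, cos, degrees, radians, sin, sqrt

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def torsion(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return degrees(atan2(y, x))

def alpha_angles(ca_trace):
    """Torsion angle for every window of four consecutive C-alpha atoms."""
    return [torsion(*ca_trace[i:i + 4]) for i in range(len(ca_trace) - 3)]

# Idealized alpha-helix C-alpha trace (assumed geometry, see lead-in).
helix = [(2.3 * cos(radians(100 * k)), 2.3 * sin(radians(100 * k)), 1.5 * k)
         for k in range(6)]
angles = alpha_angles(helix)
```

For this idealized helix, every window gives an angle of about 50°, which is the kind of regularity a Cα-only criterion can exploit.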

To date, DSSP (Dictionary of Secondary Structure of Proteins) [231] is the most popular method. In this methodology, secondary structure segments are identified by particular hydrogen bond patterns detected from the protein geometry and an electrostatic model. After computing all the H-bonds, the algorithm first assigns helical states (with a minimum length of four residues for the α-helix, three for the 3₁₀-helix and five for the π-helix) and then the β-sheets (with a minimum length of one residue). DSSP assigns eight different types of secondary structure, including the aperiodic coil. DSSP is the basis of the assignment done by the Protein Data Bank [3, 234]. Most prediction methods use the secondary structure assignment performed by DSSP to derive their parameters.
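The electrostatic model can be sketched as follows in Python. The constants (partial charges of 0.42e and 0.20e, a dimensional factor of 332 and a cutoff of -0.5 kcal/mol) are the values usually quoted for DSSP [231]; they are not given in the present text and should be checked against the original publication. The function names are hypothetical, and the amide hydrogen coordinates are assumed to be already available.

```python
from math import dist

Q1, Q2 = 0.42, 0.20   # partial charges on C=O and N-H, in electron units
F = 332.0             # dimensional factor, energies in kcal/mol
CUTOFF = -0.5         # a bond is accepted below this energy (kcal/mol)

def hbond_energy(c, o, n, h):
    """Electrostatic energy between a C=O group and an N-H group.

    Arguments are (x, y, z) coordinates in angstroms of the carbonyl C and O
    of one residue and of the amide N and H of another residue.
    """
    return Q1 * Q2 * F * (1.0 / dist(o, n) + 1.0 / dist(c, h)
                          - 1.0 / dist(o, h) - 1.0 / dist(c, n))

def is_hbond(c, o, n, h):
    return hbond_energy(c, o, n, h) < CUTOFF

# Toy collinear geometry: C=O ... H-N with an O...H distance of 2.0 angstroms.
energy = hbond_energy((0.0, 0.0, 0.0), (1.23, 0.0, 0.0),
                      (4.23, 0.0, 0.0), (3.23, 0.0, 0.0))
```

Helical and extended states are then derived from recurring patterns of such bonds along the chain, a step that is not reproduced in this sketch.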

Table 3. Different available SSAMs with the states they can assign.

| Method | Year | Helical state | Extended state | Coil |
|---|---|---|---|---|
| DSSP [231] (DSSPcont [232]) | 1983 (2001) | α-helix, 3₁₀-helix, π-helix | β-strand, β-bridge | turn, bend, coil |
| DEFINE [82] | 1988 | α-helix | β-strand | coil |
| PCURVE [247] | 1989 | α-helix | β-strand | coil |
| Consensus [253] | 1992 | α-helix | β-strand | coil |
| STRIDE [233] | 1995 | α-helix, 3₁₀-helix, π-helix | β-strand, β-bridge | turns, coil |
| PSEA [241] | 1997 | α-helix | β-strand | coil |
| XTLSSTR [246] | 1999 | α-helix, 3₁₀-helix | β-strand, polyproline II | h-bonded turn, un h-bonded turn, coil |
| PROSS [242] | 1999 | α-helix | β-strand, polyproline II | coil |
| STICK [258] | 2001 | α-helix | β-strand | coil |
| SECSTR [63] | 2002 | α-helix, 3₁₀-helix, π-helix | β-strand | coil |
| VoTap [248] | 2004 | α-helix | β-strand | coil |
| t-number [250] | 2005 | α-helix | β-strand | coil |
| KAKSI [84] | 2005 | α-helix | β-strand | coil |
| Beta-Spider [251] | 2005 | α-helix (DSSP) | β-strand | coil |
| SEGNO [150] | 2005 | α-helix, 3₁₀-helix, π-helix | β-strand, polyproline II | coil |
| PALSSE [252] | 2005 | α-helix | β-strand | coil |
| HELANAL [86] | 2000 | α-helix (5) | / | / |
| EXTENDED-BETA [259] | 2002 | / | β-sheet (5), β-strand | / |
| PROMOTIF [117] | 1996 | α-helix | β-strand, β-bulge (10), β-hairpins | γ-turn (2), β-turn (10) |

A recent version of DSSP called DSSPcont (Continuous DSSP) was proposed by Rost [232]. It is based on the principle that any discrete assignment is incomplete, because the continuum of thermal fluctuations cannot be simply described. Hence, a continuous assignment of secondary structure that replaces 'static' by 'dynamic' states is used, similarly to NMR studies, which have emphasised the importance of structural changes over multiple length and time scales. Protein structure determination by NMR spectroscopy yields many models, i.e. the ensemble that is consistent with the experimental constraints. The variations between these models result partially from experimental inconsistencies and incomplete data sets, but they are also believed to result partially from intrinsic fluctuations. Thus, DSSPcont assignments are obtained as weighted averages over ten DSSP assignments with different hydrogen bond thresholds. The continuous DSSP assignments calculated from a single set of coordinates may reflect the structural variations due to thermal fluctuations. The goal is to compensate as well as possible for the fluctuations of the assignment between the different models [232, 235-237].
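The weighted averaging can be illustrated by a short Python sketch. It only shows the principle: the three toy assignments and their weights are hypothetical and do not correspond to the ten thresholds and weights actually used by DSSPcont.

```python
from collections import defaultdict

def continuous_assignment(assignments, weights):
    """Per-residue state probabilities from several discrete assignments.

    `assignments` is a list of equal-length strings (one per threshold) and
    `weights` gives the relative weight of each of them.
    """
    total = float(sum(weights))
    profile = []
    for position in range(len(assignments[0])):
        scores = defaultdict(float)
        for states, weight in zip(assignments, weights):
            scores[states[position]] += weight / total
        profile.append(dict(scores))
    return profile

# Three hypothetical runs obtained with different hydrogen bond thresholds.
runs = ["CHHHC", "CHHHH", "CCHHC"]
profile = continuous_assignment(runs, weights=[1, 2, 1])
```

Here the second residue is described as 75% helix and 25% coil instead of receiving a single discrete state.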

SECSTR is an evolution of the DSSP method. As crystallographers do not correctly identify existing π-helices [238, 239], Fodje and Al-Karadaghi developed improved π-helix detection parameters. In particular, the hierarchy of detection was modified in order to focus on the correct assignment of 3₁₀- and π-helices [63]. This method logically assigns more π-helices than the others (10 times more than DSSP).

STRIDE [233] is also a widely used method. It was developed because DSSP often assigns helices that are too short. The principle of STRIDE is identical to that of DSSP, but it also takes dihedral angles into account. Hydrogen bonds are detected with an empirical energy function. The different parameters have been optimized with respect to the definition of helices and strands in PDB files. The number of distinct states in this method is seven (including aperiodic); the bend defined by DSSP is here merged with classical turns. Its assignment is very close to that of DSSP (95% identity). The differences are due to confusion between α-helices and coil (1%) and to confusion between β-strand and coil (4%) [99].

PROMOTIF also derives from the DSSP approach (through an unpublished program called SSTRUC [240]) but focuses on the characterization of γ- and β-turns, β-hairpins and β-bulges [117]. It detects repetitive structures by reading the remarks of PDB files and, when none is available, it supplies them using an assignment done by SSTRUC.

The periodicity of α-helices and β-strands generates some regularity in the backbone topology. Hence, some assignment methods do not rely on the detection of hydrogen bonds, but on other characteristics of repetitive secondary structures.

The DEFINE method [82], like Levitt and Greer’s method, uses only the Cα positions. It computes an inter-Cα distance matrix and compares it with matrices produced by ideal repetitive secondary structures.
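The comparison of distance matrices can be sketched as follows in Python. This is not the DEFINE implementation: the idealized Cα traces (helix: radius 2.3 Å, rise 1.5 Å, 100° per residue; strand: radius 0.95 Å, rise 3.3 Å, 180° per residue), the window length and the function names are assumptions made for the illustration.

```python
from math import cos, dist, radians, sin, sqrt

def helix_trace(n, radius, rise, turn_deg):
    """C-alpha trace of a regular helix; a strand is treated as a two-fold helix."""
    return [(radius * cos(radians(turn_deg * k)),
             radius * sin(radians(turn_deg * k)),
             rise * k) for k in range(n)]

def distance_matrix(ca_trace):
    return [[dist(a, b) for b in ca_trace] for a in ca_trace]

def matrix_deviation(m1, m2):
    """Root mean square difference over the upper triangles of two matrices."""
    n = len(m1)
    squared = sum((m1[i][j] - m2[i][j]) ** 2
                  for i in range(n) for j in range(i + 1, n))
    return sqrt(squared / (n * (n - 1) / 2))

def best_template(window, templates):
    """Name of the ideal matrix closest to the observed window."""
    observed = distance_matrix(window)
    return min(templates, key=lambda name: matrix_deviation(observed, templates[name]))

WINDOW = 6
templates = {
    "helix": distance_matrix(helix_trace(WINDOW, 2.3, 1.5, 100)),
    "strand": distance_matrix(helix_trace(WINDOW, 0.95, 3.3, 180)),
}

# A translated helix and a slightly irregular zigzag as observed windows.
shifted_helix = [(x + 10.0, y - 3.0, z + 7.0)
                 for x, y, z in helix_trace(WINDOW, 2.3, 1.5, 100)]
zigzag = [((-1) ** k * 1.0, 0.0, 3.4 * k) for k in range(WINDOW)]
```

Because only internal distances are compared, the result does not depend on the position or orientation of the fragment, which is why no superposition is needed.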

KAKSI is a novel assignment method that uses inter-Cα distances and dihedral angles as criteria [84]. Its principle is hierarchical (see Figure (2)). Firstly, the helices are assigned if the inter-Cα distances and / or dihedral angles are in a defined range. Secondly, the β-sheets are assigned if both inter-Cα distances and dihedral angles are in a defined range. The ranges of allowed inter-Cα distances and dihedral angle patterns have been optimized using the helices and sheets found in the PDB. KAKSI has some specific features, such as a procedure for kink detection in α-helices, resulting in the assignment of several distinct short helices instead of a long curved one. It is also less affected by the resolution of the protein structure. Indeed, the higher the resolution values (i.e. the poorer the resolution), the lower the secondary structure contents.

Figure 2. Principle of the KAKSI assignment process. Firstly, α-helices are assigned using (1) distance criteria and / or (2) angle criteria. (3) Kinks are then detected. Secondly, the β-sheets are detected using sliding windows: if both (4) distance criteria and (5) angle criteria are within the selected ranges, the two β-strands are assigned. (6) If an α-helix and a β-strand are contiguous, the α-helix is shortened.

PSEA [241] assigns secondary structures solely from Cα positions, using distance and angle criteria. This approach is also not very sensitive to the quality of the structures, as the Cα atoms are always the best resolved atoms. It is particularly sensitive with respect to the assignment of small β-strands.

PROSS [242] is based only on the computation of φ and ψ dihedral angles. The Ramachandran map is divided into meshes of 30° or 60° and the secondary structures (helices, sheets, polyproline II) are assigned according to the succession of encoded meshes.

This approach has been widely used to analyze the folding of polyproline II [148, 159], the continuity between the C-terminal end of an α-helix and the N-terminal end of a β-strand [243], or the compilation of the coil library, i.e. a convenient repository of all the structure remaining after these two regular secondary structure elements [244].
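The division of the Ramachandran map into meshes, on which PROSS relies, can be sketched in Python as follows. Only the encoding step is shown; the cell numbering is an arbitrary convention of this sketch, and the rules that PROSS applies to successions of cells are not reproduced.

```python
def mesh_cell(phi, psi, size=30):
    """Index (column, row) of the mesh cell containing a (phi, psi) pair.

    Angles are in degrees in [-180, 180]; `size` is the mesh width (30 or 60).
    Cells are numbered from the (-180, -180) corner, and +180 wraps to cell 0.
    """
    if 360 % size:
        raise ValueError("mesh size must divide 360")
    n = 360 // size
    column = int((phi + 180) // size) % n
    row = int((psi + 180) // size) % n
    return column, row

def encode(dihedrals, size=30):
    """Encode a list of (phi, psi) pairs as a succession of mesh cells."""
    return [mesh_cell(phi, psi, size) for phi, psi in dihedrals]

# Two helix-like residues followed by two extended residues (toy values).
cells = encode([(-60, -45), (-62, -41), (-120, 130), (-125, 135)], size=60)
```

Consecutive residues falling in the same cell, as in the first two and last two residues of this toy example, are the kind of regular succession from which a repetitive structure can be assigned.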

A new method called SEGNO [245] also uses the φ and ψ dihedral angles, coupled with other angles, to assign the secondary structures. This method has been used to analyze the polyproline II helix [150].

XTLSSTR uses all the backbone atoms to compute two angles and three distances [246]. It is especially dedicated to spectroscopy and focuses on amide-amide interactions.

PCURVE methodology [247] is based on the helical parameters of each peptide unit and generates a global peptide axis. The global shapes of secondary structures are then taken into account. This approach makes use of an extended least-squares minimization procedure to yield the optimal helical description where structural irregularities are distributed between changes in the orientation of the successive peptide groups and curvature of the overall helical axis.

A recent method uses Voronoï tessellation around Cα positions to compute a contact map [248]. It is called VoTap (Voronoï Tessellation Assignment Procedure) and is based on the Voro3D software [249]. This geometrical tool associates with each amino acid a Voronoi polyhedron, the faces of which define contacts between residues. It permits the distinction between strong and normal contacts. This new definition yields new contact matrices, which are analyzed and used to assign secondary structures. This assignment is performed in two stages. The first one uses contacts between residues along the primary structure and is mainly dedicated to local assignment, e.g. helices. The second step focuses on the strand assignment and uses contacts between distant residues.

In the same way, Vaisman and co-workers have developed a simple, five-element descriptor, derived from the Delaunay tessellation of a protein structure in a single point per residue representation, which can be assigned to each residue in the protein [250]. The descriptor characterizes main-chain topology and connectivity in the neighborhood of the residue and does not explicitly depend on putative hydrogen bonds or any geometric parameter, including bond length, angles, and areas.

Beta Spider is the name of a European car of the 1970s, but also the name of an SSAM [251]. It focuses only on β-sheets (the α-helix assignment is performed by DSSP) and, for this purpose, considers all the stabilizing forces involved in β-sheet formation. Thus, not only the C=O...H-N hydrogen bonds are considered, but also the C=O...C=O electrostatic dipoles and the bifurcated C=O...H-Cα H-bonds. Beta-Spider also uses some geometrical factors to make sure that the side chains of the β-sheet partners are pointing in the same direction.

Grishin and co-workers have recently developed a new method called PALSSE (Predictive Assignment of Linear Secondary Structure Elements) [252]. It delineates secondary structure elements from protein Cα coordinates and specifically addresses the requirements of vector-based protein similarity searches. This program identifies two types of secondary structures, α-helix and β-strand, typically those that can be well approximated by vectors. Unlike all the other SSAMs, this approach can lead to surprising assignments in which a residue is associated with both an α-helix and a β-strand. It assigns about 80% of the protein chain to regular secondary structure. The authors state that their method is robust to coordinate errors and can be used to define secondary structure elements even in poorly refined and low-resolution structures. This method is not dedicated to the analysis of protein structure but rather to potentially performing predictions.

As a consequence, these different assignment methods have generated particular problems. For example, DSSP can generate very long helices which do not correspond to reality [56]. It is the main reason why Bansal and co-workers have analyzed and classified the helices as linear, curved or kinked [86]. In the same way, Woodcock and co-workers [229] noted that these methods do not assign the same state to certain residues, especially those located at the beginnings and ends of repetitive structures. This observation has led to the development of a consensus approach [253], which represents an average measure of DSSP, DEFINE and PCURVE. This study has shown that less than 2/3 of the residues are associated with the same state by these three algorithms. The problem of overly long helices was one of the motivations of the KAKSI methodology, i.e. to define linear helices instead of long kinked helices (see L-mandelate dehydrogenase [254] in Figure (3)).
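The idea of a consensus between several assignments can be illustrated with a short Python sketch. A simple per-residue majority vote is assumed here; it is not necessarily the averaging scheme of [253], and the three toy assignments are hypothetical.

```python
from collections import Counter

def consensus(assignments, default="C"):
    """Majority-vote consensus and fraction of unanimously assigned residues.

    `assignments` is a list of equal-length strings over the states H, E and C.
    Residues without a strict majority receive the `default` state.
    """
    states = []
    unanimous = 0
    for column in zip(*assignments):
        state, votes = Counter(column).most_common(1)[0]
        if votes == len(column):
            unanimous += 1
        states.append(state if votes > len(column) / 2 else default)
    return "".join(states), unanimous / len(states)

# Hypothetical three-state assignments from three different methods.
method_1 = "CHHHHCEEC"
method_2 = "HHHHHCEEE"
method_3 = "CCHHHCCEC"
cons, agreement = consensus([method_1, method_2, method_3])
```

In this toy case only five of the nine residues receive the same state from the three methods, and the disagreements are concentrated at the ends of the helix and of the strand.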

Figure 3. (a) Assignment of an α-helix of the L-mandelate dehydrogenase [254] (PDB code: 1P4C) by DSSP. (b) This helix is split in two by KAKSI.

The use of one or another method does not reflect the same type of reality. For instance, the α-helix defined by DSSP, when its eight states are grouped into only three states, does not correspond only to the α-helix (3.6₁₃-helix), but also incorporates the 3₁₀-helix and the π-helix (4.4₁₆-helix). In the same way, β-sheets (DSSP ‘E’ state) correspond to β-strands involved in characteristic parallel or anti-parallel patterns, but not to isolated β-strands. This can induce difficulties in analyzing protein structures or dynamic features.

A recent study has compared five assignment programs (DSSP, STRIDE, DEFINE, PCURVE and PSEA) [99]. It used an agreement rate, denoted as C3, which is the proportion of residues associated with the same state between two assignment methods. The results of this work clearly highlight three points: (i) DEFINE yielded results very different from the other methods, as shown by its C3 values, close to 62%; (ii) DSSP and STRIDE produced nearly identical assignments, with C3 equal to 95%; (iii) all the other comparisons gave a mean C3 of 80%, with 6–7% confusion between α-helices and coils and 12–13% between β-strands and coils. In addition, DEFINE was the only method where confusion between α-helices and β-strands was observed. These results show the difficulty of defining an appropriate length for α-helices, β-strands and coils and of locating their ends. Another recent study has also shown the consequences of these differences on β-turn assignment [255].
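The C3 agreement rate can be computed as in the following Python sketch. The reduction of DSSP states to three states is an assumption of the example (H, G and I as helix, E as strand, everything else as coil, with isolated bridges optionally counted as strand); the function names and the toy assignments are hypothetical.

```python
from collections import Counter

HELIX_LIKE = {"H", "G", "I"}

def to_three_states(dssp_states, bridge_as_strand=False):
    """Reduce DSSP one-letter states to helix (H), strand (E) and coil (C)."""
    strand_like = {"E", "B"} if bridge_as_strand else {"E"}
    reduced = []
    for state in dssp_states:
        if state in HELIX_LIKE:
            reduced.append("H")
        elif state in strand_like:
            reduced.append("E")
        else:
            reduced.append("C")
    return "".join(reduced)

def c3(assign_a, assign_b):
    """Proportion of residues given the same state by two assignments."""
    if len(assign_a) != len(assign_b):
        raise ValueError("assignments must have the same length")
    return sum(a == b for a, b in zip(assign_a, assign_b)) / len(assign_a)

def confusion(assign_a, assign_b):
    """Counts of (state in a, state in b) pairs for the disagreeing residues."""
    return Counter((a, b) for a, b in zip(assign_a, assign_b) if a != b)

method_a = to_three_states("HHHGTTEEB-")   # toy DSSP-like output
method_b = "HHHCCCEEEC"                    # toy three-state output
rate = c3(method_a, method_b)
errors = confusion(method_a, method_b)
```

On this toy example the agreement is 80%, and the two disagreements are a helix/coil and a coil/strand confusion located at segment ends, the same type of discrepancy as reported above.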

Figure 4. 2D projections of the distance between different SSAMs (adapted from [84, 99, 256] and unpublished data).

Figure (4) [84, 99, 256] shows a projection of comparison studies between SSAMs. A small compact cluster is found; it encompasses all the “DSSP-like” hydrogen-bond-based methods, i.e. DSSP, STRIDE and SECSTR. The other methods, which are based on different criteria, spread around them; they have average disagreement rates around 20%.

DEFINE always remains distinct from all these methods because it over-assigns regular secondary structure and, with respect to this, is closer to PALSSE than the other SSAMs. It is important to note that the repetitive structures definitions only reflect a given classification and can disagree on structure description, especially on the segment extremities and on the presence of very short segments.

Figure 5: an example of multiple SSAMs for the beginning of bovine pancreatic deoxyribonuclease I protein (PDB code: 1ATN) [257].

Figures (5) and (6) show an example of multiple SSAMs on bovine pancreatic deoxyribonuclease I protein [257]. Figure (5) clearly highlights the variability of the assignments even for helices and for sheets. Figure (6) gives a visual representation of discrepancies in the various assignments.

Figure 6: an example of multiple SSAMs of bovine pancreatic deoxyribonuclease I protein. (a) DSSP, (b) DEFINE, (c) P-SEA, (d) P-CURVE, (e) SECSTR, (f) BETA-SPIDER.

From SSEs to 3Ds. It is obvious from the above sections that the organization of three-dimensional protein structures can be represented as an assembly of different secondary structure elements arranged in a particular topology [100, 258], which characterizes a unique and particular fold [259]. Several distinct secondary structure combinations, generally between 2 and 4, form particular supersecondary motifs that can be found in many different folds. Many of them have been well characterized, such as the simple β-hairpins [177] or more complex associations like triple-strand β-sheets [186] and the Greek key [100, 260].

Unfortunately, many folds contain very few or no super-secondary structure, e.g. the knottins [261], or contain secondary structure arrangements that are not very frequent in known protein structures.

Figure 7. The different Protein Units composing the 2aak protein. The positions of the PUs are given in brackets, together with their corresponding Compaction Index (CI), an index measuring the number of internal contacts [267, 268].

Since the 1980s, many authors have proposed different methods to hierarchically split protein structures into small compact units, with the aim of describing the different levels of protein structure organization [262-264]. The rules used by these methods are quite different.

To identify compact units, Lesk & Rose described the protein fragments as inertial ellipsoids and selected the most compact ones using a progressive growing approach [262]. The method proposed by Sowdhamini & Blundell to identify protein domains and supersecondary elements was based on Cα-Cα distances between secondary structures [263]. The algorithm developed by Tsai & Nussinov used a complex scoring function, based on compactness, hydrophobicity and isolatedness, that measures the stability of a candidate building block [264]. Another description at an intermediate level of organization, between secondary structure elements and domains, called Protein Units (PUs), has recently been proposed [265, 266]. A PU is a compact sub-region of the 3D structure corresponding to one sequence fragment. The basic principle is that each PU must have a high number of intra-PU contacts and a low number of inter-PU contacts (see Figure (7)).
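The principle of many intra-unit and few inter-unit contacts can be illustrated by the Python sketch below. It is not the published PU algorithm: the contact definition (two Cα atoms closer than 8 Å, ignoring sequence neighbours), the function name and the toy coordinates are assumptions made for the example.

```python
from math import dist

def contact_counts(ca_trace, segments, cutoff=8.0):
    """Count contacts inside each unit and between different units.

    `segments` lists the (start, end) residue indices, end inclusive, of each
    candidate unit. Pairs of sequence neighbours (i, i+1) are ignored.
    """
    unit_of = {}
    for unit, (start, end) in enumerate(segments):
        for i in range(start, end + 1):
            unit_of[i] = unit
    intra = [0] * len(segments)
    inter = 0
    for i in range(len(ca_trace)):
        for j in range(i + 2, len(ca_trace)):
            if i not in unit_of or j not in unit_of:
                continue
            if dist(ca_trace[i], ca_trace[j]) > cutoff:
                continue
            if unit_of[i] == unit_of[j]:
                intra[unit_of[i]] += 1
            else:
                inter += 1
    return intra, inter

# Toy chain made of two compact groups of four residues, 50 angstroms apart.
chain = [(0, 0, 0), (4, 0, 0), (4, 4, 0), (0, 4, 0),
         (50, 0, 0), (54, 0, 0), (54, 4, 0), (50, 4, 0)]
good_split = contact_counts(chain, [(0, 3), (4, 7)])
bad_split = contact_counts(chain, [(0, 1), (2, 7)])
```

Cutting the toy chain between the two compact groups leaves no inter-unit contact, whereas cutting it inside the first group turns all of its internal contacts into inter-unit contacts, which is the situation a splitting procedure tries to avoid.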

Thus, organization of protein structures can be considered in a hierarchical manner: secondary structures are the smallest elements, protein units are intermediate elements leading to the structural domains.

Conclusion. Secondary structures are a truly powerful tool for analyzing protein structures. The number of secondary structure prediction methods amounts to as many as one thousand [11], ranging from simple statistical methods [43] to complex Artificial Neural Networks combined with homology information, like PSI-PRED [47] or SSPRO [49], which reach 80% correct prediction. Similarly, 3-state secondary structures have been used in threading / fold recognition approaches [12] and de novo approaches [267].

Nonetheless, the assignment rules are not trivial. This is not due to the difficulty of accepting a common definition with fixed values, but rather to the classical problem of classification, where rules must be applied to delimit the frontier of one class, such as the α-helix, and also to the intrinsic flexibility of protein structure [268, 269]. This question is crucial, and the scientific community seems to appreciate it more and more as the number of different points of view grows. For instance, four new SSAMs were proposed in 2005 alone (KAKSI, PALSSE, Delaunay tessellation and Beta-Spider). A very recent and elegant example of such interest has been shown by Raveh and co-workers [270]. In a fully unsupervised manner, and without assuming any explicit prior knowledge, they were able to rediscover the existence of α-helices, parallel and anti-parallel β-sheets and loops, as well as various non-conventional hybrid structures.

Libraries of local protein structures

Introduction. As we have seen in the previous paragraphs, secondary structures have directed our conception of protein structure analysis [271]. Nonetheless, secondary structures focus on two kinds of regular local structures that compose only a part of the protein backbone. The remaining residues are only assigned if they can be associated with some particular structures such as the β-turns. In fact, the secondary structure assignment is highly hierarchical. The absence of assignment for an important proportion of the residues has led some scientific teams to develop local protein structure libraries (i) able to approximate all (or almost all) the local protein structures and (ii) that do not take into account the classical secondary structures. The pioneering work is that of Unger and co-workers [272], which has led to numerous applications, from the reconstruction of protein structures [273] to the prediction of short loops [99]. These libraries brought about the categorization of 3D structures, without any a priori, into small prototypes that are specific to local folds found in proteins. The complete set of local structure prototypes defines a structural alphabet [274-276]. The term “structural alphabet” was first introduced by Ring and co-workers to define a more precise description of the loops using 3 categories [206]. Numerous structural alphabets and names have been defined: Building Blocks (BBs) for Unger and co-workers [272], Short Structural Motifs (SSM) for Unger and Sussman [277], Substructures for Prestrelski and co-workers [278], Local Structural Motifs (LSMs) for Schuchhardt and co-workers [279], Recurrent Local Structural Motifs (RLSMs) for Rooman and co-workers [280], Structural Building Blocks (SBBs) for Fetrow and co-workers [281], Local Structures (LSs) for Bystroff and Baker [267], Short Structural Building Blocks (SSBs or SSBBs) for Camproux and co-workers [282, 283], oligons for Micheletti and co-workers [284] and Protein Blocks (PBs) for de Brevern and co-workers [285]. They differ in the parameters used to describe the protein backbone, such as Cα coordinates, Cα distances, α or dihedral angles, and in the methods used to define them, such as k-means [286], empirical functions, Kohonen maps [287, 288], artificial neural networks [289] or Hidden Markov Models [290]. Each structural alphabet or fragment library is defined as a series of N prototypes of l residues in length. N is highly variable, whereas l only varies between 4 and 8. In the following paragraphs, we will present the most important works in this area. They are summarized in Table 4.
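To make the notion of a structural alphabet concrete, the following Python sketch encodes a backbone as a string of prototype letters. It is a toy example under stated assumptions: the backbone is described by (φ, ψ) pairs, the dissimilarity is a root mean square deviation on angular values, and the two prototypes of l = 5 residues ("a", helix-like, and "b", extended) are hypothetical and do not belong to any published alphabet.

```python
from math import sqrt

def angular_difference(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)

def angular_rmsd(fragment, prototype):
    """Root mean square deviation between two lists of angles."""
    squared = sum(angular_difference(a, b) ** 2 for a, b in zip(fragment, prototype))
    return sqrt(squared / len(fragment))

def encode(dihedrals, prototypes, length):
    """Assign the closest prototype to each window of `length` residues.

    `dihedrals` is a list of (phi, psi) pairs; `prototypes` maps a letter to a
    flat list of 2 * length angles (phi1, psi1, phi2, psi2, ...).
    """
    letters = []
    for start in range(len(dihedrals) - length + 1):
        window = [angle for pair in dihedrals[start:start + length] for angle in pair]
        letters.append(min(prototypes,
                           key=lambda k: angular_rmsd(window, prototypes[k])))
    return "".join(letters)

L_RESIDUES = 5
toy_alphabet = {
    "a": [-60.0, -45.0] * L_RESIDUES,    # helix-like prototype (assumed)
    "b": [-120.0, 130.0] * L_RESIDUES,   # extended prototype (assumed)
}
backbone = [(-60.0, -45.0)] * 5 + [(-120.0, 130.0)] * 5
encoded = encode(backbone, toy_alphabet, L_RESIDUES)
```

A chain of n residues is thus translated into a string of n - l + 1 letters; a real alphabet differs mainly in the number N of prototypes and in the way they were learned.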

History. The increasing number of protein sequences and structures has supported the concept of protein evolution through the divergence of the sequences and conservation of protein structures and, in some cases, convergence of protein sequences to a common structure. The number of related protein sequences has led to the generalization of homology modeling with software like Modeller [291-293] or Composer [294].

In 1986, Jones and Thirup reconstructed a retinol-binding protein (RBP) using fragments of the main chain from three unrelated proteins leading to a model with a Cα rmsd of 1.0 Å from the known structure [295]. It was the first usage of short 3D structural motifs.

This work has led to the use of known substructures for completing / refining low-resolution X-ray structures and suggested a potential use in homology modelling for insertions. In 1989, Claessens [296] followed a similar approach to rebuild the protein backbone with recurrent motifs derived from 66 high-resolution structures. The model was built using overlapping fragments of variable length. The final model also deviated by less than 1.0 Å Cα rmsd from the crystal structures. Levitt [297] suggested the construction of full-atom models including side chains by pulling fragments from the PDB based on both sequence and structure considerations. This strategy is particularly efficient, as most of the local protein structures are present in the PDB [298], even for the coils.
