Understanding signal sequences with machine learning


Thesis


FALCONE, Jean-Luc

Abstract

Proteins synthesized in the cell must be transported to the correct cellular compartment so that they can achieve their function. This process is a fundamental aspect of cell protein metabolism. All proteins that must be secreted carry a particular region of conserved function, the signal sequence (SS) or signal peptide, located at the N-terminal extremity. To address the problem of correctly discriminating secreted proteins from the other (cytosolic) ones, artificial intelligence techniques have been considered. The training set was composed of E. coli proteins whose location was determined experimentally. We used a set of wild-type proteins completed by two mutant sets: (i) 15 SS which have lost their function and (ii) 240 proteins which gained SS function. We used evolutionary computing to generate new features able to better predict secretion. The idea here was to extend existing theory. To reach this goal, we designed a generic framework to describe physico-chemical requirements. To reduce the huge number of amino-acid properties to a tractable amount, we proposed a clustering method based on a novel correlation [...]

FALCONE, Jean-Luc. Understanding signal sequences with machine learning. Thèse de doctorat : Univ. Genève, 2008, no. Sc. 4073

URN : urn:nbn:ch:unige-182215

DOI : 10.13097/archive-ouverte/unige:18221

Available at:

http://archive-ouverte.unige.ch/unige:18221


Université de Genève

Faculté des sciences, Département d'informatique: Prof. Bastien Chopard

Faculté de médecine, Département de pathologie et d'immunologie: Prof. Dominique Belin

Understanding Signal Sequences with Machine Learning

THESIS

presented to the Faculté des sciences of the Université de Genève to obtain the degree of Docteur ès sciences, mention Interdisciplinaire

by

Jean-Luc Falcone

of Carouge (GE)

Thesis No. 4073

Geneva 2008


In fact, I believe it will be necessary to have special programming systems that prevent the programmer from imposing on the model his ideas of how it ought to work. With the advent of computing to biological research, there is often the thought that we could build computing systems that would themselves provide us the answers to the problems we are facing. It seems to me that this combination of artificial intelligence and human stupidity is the wrong one; what we need to combine is human intelligence with artificial stupidity.

Sydney Brenner, Biological Computation in The limits of reductionism in biology, 1998.

We

are a tribe

of philosophers, theologians, magicians, scientists,

artists, clowns,

and similar maniacs [. . . ]

Mal-2 & Lord Omar Khayyam Ravenhurst, in Principia Discordia, 1965.


Acknowledgements

I would particularly like to thank Profs. Bastien Chopard and Dominique Belin for having embarked on this small interdisciplinary adventure by supervising this thesis. I hope that the results of our work will encourage them to persevere along this path, which is sometimes difficult but often rewarding.

I would like to thank Profs. Antoine Danchin and Amos Bairoch for agreeing to be members of my thesis jury.

I am grateful to everyone who helped me consider a change of discipline, in particular Prof. Hans Bill, who accepted me into his statistical thermodynamics course, and Olivier Powell, then teaching assistant for the artificial intelligence course.

During my thesis I had many stimulating and enriching discussions with several researchers and colleagues. Among them I would particularly like to thank Michel Droz, Sylvain Biéler, Alexandre Dupuis, Stéphane Marconi, Jean-Pierre Hurni, Olivier Powell, Thierry Zwissig, Phillipe Moser, Patrick Roth, Jonas Latt, Bernhard Sonderegger, Frédéric Ehrler, Alain Hugentobler, Fokko Beekhof and Rafik Ouared.

Many thanks to Paul Albuquerque, who helped me a great deal in getting this research started and who kindly proofread part of my thesis.

I would like to thank Filo Silva and Julien Cachin, who carried out most of the biological experiments presented in this work.

Finally, I am very grateful to my friends and family, who have always supported me over these last years.


List of Figures

1.1 The amino-acids
1.2 Proteins structure
1.3 Protein Translocation
1.4 Co-Translational Translocation
1.5 Post-Translational Translocation
1.6 Examples of Signal-Sequences
1.7 Signal Sequence
1.8 E. coli Signal Sequence Length Distribution
1.9 Signal sequences AA composition
1.10 E. coli Signal Sequences Logo
1.11 E. coli Non-Signal Sequence Logo
1.12 E. coli Cleavage Sites Sequence Logo
1.13 The Loop-Model
1.14 Hydrophobic Moment
2.1 E. coli down mutants
2.2 Ovalbumine family examples and maspin construction
2.3 Maspin random mutagenesis
2.4 Maspin position 32 and 38 mutagenesis
2.5 Leucine substitution in Maspin
2.6 Maspin mutants fusions
2.7 Maspin synthesis after mutagenesis on position 32
2.8 Maspin synthesis according to four fusion contexts
3.1 Decision tree example
3.2 Hidden-Markov Model Example
3.3 Relation between two hydrophobic scales
3.4 H-region features thresholds
3.5 H-region feature on E. coli proteins
3.6 N-region features on E. coli proteins
3.7 C-region feature on E. coli proteins
3.8 CS prediction error
3.9 SigTreeC decision tree classifier
3.10 SigTree decision tree classifier
3.11 SigTreeMicro decision tree classifier
3.12 SigTreeMicro scatter plot
4.1 Example of PCP
4.2 Minimal spanning tree of PCP
4.3 K-means example
4.4 PCP clustering assessment
4.5 Minimal spanning tree of the 92 clusters
5.1 Evolutionary Computing Workflow
5.2 Stack-based GP evaluation algorithm
5.3 Stack-based GP evaluation example
5.4 Forward-cleaning algorithm
5.5 Backward-cleaning algorithm
5.6 n-genes base system UML class diagram
5.7 Object recycling class diagram
5.8 Genes canonicalization class diagram
5.9 Fitness lazy evaluation
5.10 Symbolic regression of the 2D Rosenbrock function
6.1 Sliding Window Feature Algorithm
6.2 Best-Stretch pseudo-code
6.3 Best Helix Feature Pseudo code
6.4 Comparison of reduction functions
6.5 Distribution of PCP cluster occurrences
6.6 Accuracy improvement
6.7 SigTreeGA DT using generated features
6.8 Maspin mutants prediction vs. AP units
6.9 Correlation between generated features and cleavage detection


List of Tables

1.1 Amino-Acids List
2.1 Maspin mutant collection
3.1 Example of Training Set
3.2 SigTree classifiers performances on wt proteins
3.3 SigTree classifiers on down mutants
3.4 Classifiers comparison on WT proteins
3.5 Classifiers comparison on maspin up-mutants
4.1 PCP cluster results
5.1 n-genes GA benchmarks
5.2 n-genes individual caching benchmarks
5.3 Linear vs tree comparison
6.1 Genes description
6.2 SigTree classifiers performances on wt proteins
6.3 Classifiers comparison on maspin up-mutants
6.4 SigTree classifiers on down mutants
6.5 Generated Features


Abbreviations and Conventions

List of Abbreviations

k-CV    k-fold Cross Validation
AA      Amino-acid(s)
AP      Alkaline Phosphatase
API     Application Program Interface
CS      Cleavage Site
CV      Cross-Validation
DT      Decision Tree
EC      Evolutionary Computing
GA      Genetic Algorithm
GP      Genetic Programming
MCC     Matthews Correlation Coefficient
PCP     Physico-chemical property
SERPIN  Serine Protease Inhibitor
SS      Signal sequence(s)
wt      Wild-type

Pseudocode

The pseudocode used in this work was generated with the LaTeX algorithmic package. We use the following conventions:

keyword          All pseudolanguage keywords are in boldface.
variable, v, τ   Variables are written in roman shape, in math font or in Greek letters.
i ← j            Assignment operator: the value of j is assigned to i.
//               Beginning of a line comment.


Abstract

Proteins synthesized in the cell must be transported to the correct cellular compartment so that they can achieve their function. This process is called protein targeting and is a fundamental aspect of cell protein metabolism. For instance blood plasma proteins and polypeptidic hormones must be delivered to the extracellular space. We are interested in the secretion pathway termed translocation, which involves the targeting and transport of the proteins out of the cell. The protein complex (called translocon) which exports the proteins varies from one species to another.

All proteins that must be exported carry a particular region of conserved function, the signal sequence (SS) or signal peptide, located at the N-terminal extremity. Its length varies slightly, from 10 to 50 amino-acids (AA). The protein is exported before folding and the SS is usually cleaved after the export.

The most interesting feature of SS is their inter- and intra-species variability. Their sequence as well as their length vary. Thus, they do not carry any systematic consensus. However, three properties have been proposed as distinguishing features of SS: (i) They begin with an N-terminal region which includes one or several positively charged lysine or arginine residues. This region is called the N-Region. (ii) Following the N-Region, SS contain a stretch of hydrophobic AA forming the so-called H-Region. (iii) In the majority of secreted proteins there is a third region, the C-Region, located between the H-Region and the cleavage site. It carries a weak consensus recognized by the leader-peptidase.

The above properties are too vague to easily determine whether or not a protein will be secreted. The hypothesis that these three regions characterize an exported protein is based on observations made on known SS. However, there are no experiments that confirm that these three properties are sufficient or even necessary for the recognition process. Indeed, several experiments indicate more complex interactions and specific structural requirements.

To address the problem of correctly discriminating secreted proteins from the other (cytosolic) ones, artificial intelligence techniques have been considered. However, instead of using black box classifiers such as neural networks or SVM, we chose decision trees, which are simpler white box classifiers. The advantage of these methods is that the resulting classifier can be analysed to validate and extend existing theory. Moreover, although decision trees usually perform less well than neural networks or SVM, specific high-level features can usually counterbalance this drawback.

The training set was composed of E. coli proteins whose location was determined experimentally. We used a set of wild-type proteins composed of 104 secreted and 161 cytosolic proteins. This set was completed by two mutant sets: (i) 15 SS which have lost their function and (ii) 240 proteins which gained SS function.

In a first phase, we built classifiers using only features derived from the three properties described above. The goal was to determine whether these three properties were necessary and sufficient. The resulting trees were trained on wild-type sequences and reached good performance on this set (94.3 % cross-validated accuracy). However, on both mutant sets the performance was very poor, showing that the three properties were too coarse to take into account subtle changes such as single amino-acid substitutions. Moreover, classifiers without cleavage detection scored almost as well. This confirms that the cleavage property is not necessary for export.

In a second phase, we used evolutionary computing to generate new features able to better predict secretion. The idea here was to extend existing theory. To reach this goal, we designed a generic framework to describe physico-chemical requirements. This description was based on the properties of the amino-acids, combined in several ways so as to take into account possible secondary structures. To reduce the huge number of amino-acid properties to a tractable amount, we proposed a clustering method based on a novel correlation-based distance. A new genetic algorithm and genetic programming framework (termed n-genes) was developed for this project, to permit a flexible high-level description along with computational efficiency.

This method allowed us to complement our initial feature set with four new features. Performance is higher than in our preceding attempt not only on wild-type proteins (95.8 % cross-validated accuracy), but also on the mutant collections (100 % with the loss-of-function mutant collection and 87.6 % with the gain-of-function mutants).

Furthermore, the new properties give new insights into the signal sequence requirements. Indeed, while one of the properties is based on a hydrophobicity measurement (although different from the most frequent ones), the other three reflect structural constraints. An analysis of these new features, along with their usage in the decision trees, allowed us to explain some apparently contradictory experiments about signal sequence secondary structure.


Résumé

Proteins synthesized in the cell must be transported to the appropriate cellular compartment so that they can fulfil their function. This process is called protein targeting and is a fundamental aspect of cell metabolism. For example, blood plasma proteins and polypeptide hormones must be transported to the extracellular space. We are interested in the secretion pathway called translocation, which involves the targeting and transport of proteins out of the cell. The protein complex (called the translocon) that exports the proteins varies from one species to another.

All proteins that must be exported carry a particular region of conserved function, the signal sequence (SS) or signal peptide, located at the N-terminal extremity. Its length varies slightly, between 10 and 50 amino-acids (AA). The protein is exported before folding and the SS is almost always cleaved after export.

The most interesting characteristic of SS is their inter- and intra-species variability. Not only do their sequence and length vary, but they also lack any systematic consensus. However, three properties have been proposed as distinguishing attributes of SS: (i) They begin with an N-terminal region that includes one or several positively charged lysine or arginine residues. This region is called the N-region. (ii) Following the N-region, SS contain a stretch of hydrophobic AA forming the so-called H-region. (iii) In the majority of secreted sequences there is a third region, the C-region, located between the H-region and the cleavage site. This region carries a weak consensus recognized by the leader-peptidase enzyme.

The above properties are, however, too vague to determine whether or not a protein will be secreted. The hypothesis that these three regions characterize an exported protein is based on the observation of known SS. However, there is no experiment that can confirm that these three properties are sufficient or even necessary. Indeed, several experiments indicate more complex interactions and specific structural constraints.

To tackle the problem of discriminating secreted proteins from the others, we considered artificial intelligence techniques. However, instead of using "black box" classifiers such as neural networks or SVM, we chose decision trees, which are simpler "white box" classifiers. The advantage of these methods is that the classifier can be analysed to validate, or even extend, existing theory. Moreover, although decision trees usually perform less well than neural networks or SVM, the use of specific "high-level" features can offset this drawback.

The training set was built from E. coli proteins whose location was determined experimentally. We used a set of wild-type proteins comprising 104 secreted and 161 cytosolic proteins. This set was completed by two sets of mutant proteins: (i) 15 SS that have lost their function and (ii) 240 proteins that have gained SS function.

In a first phase, we built a classifier using only features derived from the three properties described above. The goal was to determine whether these properties were necessary and sufficient. The resulting trees were trained on wild-type sequences and reach good performance on this set (94.3 % cross-validated accuracy). However, on both mutant sets the performance is poor, which indicates that these three properties are too coarse to account for subtle changes such as the substitution of a single AA. Furthermore, classifiers without cleavage site detection perform almost as well. This confirms that the presence of a cleavage site is not necessary for export.

In a second phase, we used evolutionary computing methods to generate new features able to better predict secretion. The idea was to extend existing theory. To reach this goal, we designed a formalism for describing physico-chemical constraints. This description is based on the properties of the AA, combined in several ways so as to take possible secondary structures into account. To reduce the very large number of physico-chemical properties of the AA to a usable number, we proposed a clustering method using a new correlation-based distance. A new framework for genetic algorithms and genetic programming (named n-genes) was developed for this project, to allow the use of a flexible high-level description while maintaining a high execution speed.

This method allowed us to complement our initial feature set with four new features. Performance is better not only on wild-type proteins (95.8 % cross-validated accuracy) but also on the mutant collections (100 % with the loss-of-function mutants and 87.6 % with the gain-of-function mutants).

Furthermore, the new properties shed new light on the requirements of SS. Indeed, while one of the properties is based on a hydrophobicity measure (although different from the most common ones), the other three reflect structural constraints. An analysis of these new features, along with their place in the decision tree, allowed us to explain some apparently contradictory experiments about the secondary structure of SS.


Chapter 1 The Signal Sequence

Proteins are macromolecules involved in every cell process. Coded by the organism's genome, they are synthesized by the cell as linear polymers of amino-acids. After synthesis, a number of specific proteins are secreted out of the cell. These proteins carry special sequences of amino-acids at their initial extremity, called Signal Sequences. Thus, retention in the cytoplasm is the default localization of proteins lacking Signal Sequences. Although Signal Sequences vary tremendously in size, composition and sequence, their functional activity is largely conserved across evolution.

This important variability makes it very difficult for researchers to predict secretion by looking at a protein sequence. We are interested in using computer science to extract the relevant criteria defining Signal Sequences.

We present in this chapter a basic introduction to proteins (Sec. 1.1). Then we will introduce Signal Sequences, their known properties and some open questions (Sec. 1.2).

1.1 Proteins

In this section, we briefly introduce some biological concepts for non-biologist readers. This is only a rough sketch of the field and more detailed information can be found in the textbooks of Stryer (1995, Chap. 2) and Lewin (1997).

Proteins are macromolecules found in all living organisms including viruses. They are not only essential structural constituents of cells but also their functional units. Almost all biochemical reactions (like synthesis of cell constituents, processing of cell waste, etc.) are mediated (catalyzed or not) by proteins called enzymes. Each protein is very specific to one task and this specificity depends on: (i) the molecular composition; (ii) the location of the molecules, i.e. the protein structure.


Figure 1.1: (a) AA general structure, where R is the side chain. (b-d) Three examples of AA: lysine, glutamic acid and histidine. (e) Example of an AA chain of size 3.

1.1.1 Amino-Acids

Each protein is a linear chain built from 20 different molecules¹ called amino-acids (AA). We can thus describe a given protein as a word composed with an alphabet of 20 letters. Since proteins vary in size, the number of different possible proteins is tremendous. For instance, for typical protein sizes of 100 to 300 AA we can produce $\sum_{i=100}^{300} 20^{i}$ different proteins.
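To give a feeling for the size of this sequence space, the sum can be evaluated exactly with arbitrary-precision integers. The sketch below assumes that the number of chains of length i over a 20-letter alphabet is 20^i; the function name `count_sequences` is purely illustrative.

```python
def count_sequences(min_len: int, max_len: int, alphabet_size: int = 20) -> int:
    """Count the distinct words whose length lies in [min_len, max_len]."""
    # A word of length n over an alphabet of k letters can be written in k**n ways.
    return sum(alphabet_size ** n for n in range(min_len, max_len + 1))


total = count_sequences(100, 300)
# Report only the order of magnitude, since the exact integer is very long.
print(f"order of magnitude: 10^{len(str(total)) - 1}")
```

The longest chains dominate the sum, so the result is only slightly larger than the 20^300 sequences of length 300.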

The AA have a common structure composed of an amine group (NH2), a central carbon atom and a carboxyl group (COOH). A side chain, which is different for each AA, is attached to the central carbon (Fig. 1.1a). The different side chains confer different properties to the AA. Tab. 1.1 lists the 20 AA of the canonical set, with their abbreviations and some of their properties.
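Per-residue properties such as those of Tab. 1.1 can be turned into simple numerical descriptors of a peptide. The following sketch is only an illustration, not the feature set developed later in this work: it stores the Kyte and Doolittle hydrophobicity and the charge of each AA in dictionaries and computes the mean hydrophobicity and the net charge of a sequence. The names `KD_HYDROPHOBICITY`, `CHARGE`, `mean_hydrophobicity` and `net_charge` are hypothetical.

```python
# Kyte and Doolittle (1982) hydrophobicity, indexed by one-letter code (values of Tab. 1.1).
KD_HYDROPHOBICITY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "E": -3.5, "Q": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Charge at physiological pH: Lys and Arg are positive, Asp and Glu negative.
# Every other residue is treated as neutral (an assumption of this sketch).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def mean_hydrophobicity(sequence: str) -> float:
    """Average Kyte-Doolittle hydrophobicity over all residues of the sequence."""
    if not sequence:
        raise ValueError("empty sequence")
    return sum(KD_HYDROPHOBICITY[aa] for aa in sequence) / len(sequence)


def net_charge(sequence: str) -> int:
    """Sum of the residue charges."""
    return sum(CHARGE.get(aa, 0) for aa in sequence)


# First 20 residues of the human hemoglobin subunit shown in Fig. 1.2.
peptide = "VLSPADKTNVKAAWGKVGAH"
print(mean_hydrophobicity(peptide), net_charge(peptide))
```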

Two AA can react together to form a special type of bond called the peptide bond, by condensation of the amine and carboxyl groups. The AA are linked together in this way in proteins, and Fig. 1.1e shows an example of three AA bound together. Since the peptide bond is asymmetric, the chain is oriented from the amine extremity, termed the N-terminus, to the carboxyl extremity, the C-terminus. AA in the protein chain are often called residues.

The AA sequences of the proteins of an organism are coded by its genes. The transfer of this information is called, in a global way, the biosynthesis of proteins. This process occurs in two steps: (i) An RNA molecule is synthesized which is almost a copy of the gene coding for the protein. This step is called transcription. (ii) The RNA molecule is scanned by biological complexes (ribosomes) which assemble the AA linearly according to the information encoded in the RNA. This step is called translation.

¹ Biologists recognize the existence of 22 AA. However, the two recently discovered AA (selenocysteine and pyrrolysine) are rare and there exists no evidence of their relevance for signal-sequence recognition. Cis and trans forms of proline could also be considered as two different AA. We will nevertheless keep to the classical set of 20 AA.
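The two steps of protein biosynthesis, transcription and translation, can be sketched in a few lines. This toy example assumes that the gene is given as its coding DNA strand, uses the standard genetic code, and ignores everything else the cell does (regulation, start-site selection, RNA processing). The function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Map each RNA codon (U instead of T) to its one-letter amino-acid code.
CODON_TABLE = {
    "".join(codon).replace("T", "U"): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Step (i): the RNA is a copy of the coding strand with U in place of T."""
    return coding_dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Step (ii): read the RNA three bases at a time until a stop codon."""
    protein = []
    for start in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("ATGAAAGTTCTGTAA")
print(rna, translate(rna))
```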

1.1.2 Structure and modifications

As seen above, the sequence of AA gives the molecular composition of proteins. However, proteins are not linear ribbons floating in the cytoplasm. During and after translation the nascent protein folds into a complex three-dimensional structure.

Biologists distinguish four levels of structural organization, illustrated in Fig. 1.2:

Primary structure: The linear structure of the proteins formed by the AA sequence.

Secondary structure: Local folding in simple and well-defined motifs like α-helices or β-sheets.

Tertiary structure: More complex folding, by assembly of secondary motifs that form one or more protein domains. At this level the protein assumes its final structure.

Quaternary structure: Assembly of several proteins to form a protein complex. The assembled proteins (called subunits) can be identical (same proteins) or different.

Finally, proteins are often chemically modified after translation (post-translational modifications). For instance, the first AA is often removed or sugars can be covalently added on the side chains of the AA.

1.1.3 Protein Targeting

As their name implies, cells are separated from the exterior medium by walls made of lipid membranes. Moreover, all eukaryotic cells² contain several internal compartments (like mitochondria, nucleus, vacuoles, etc.) also separated by one or more lipid membranes.

Therefore, border crossing mechanisms exist to allow exchange of matter, energy and information with the cell surroundings. These selective mechanisms allow the transport of ions, sugars and electric signals through specialized protein machinery lying in the cell or compartment membranes.

² Eukaryotic cells (like yeast or human cells) are more complex than bacteria, and their DNA is contained in a compartment called the nucleus. Bacteria, in contrast, are called prokaryotes.

Table 1.1: Amino-Acid List. For each AA we detail: the full name, the three-letter abbreviation (3LC), the one-letter abbreviation (1LC), the Van der Waals volume in Å³ (Vol.), the molecular weight (Mass), the charge at physiological pH and the Kyte and Doolittle (1982) hydrophobicity (HΦ).

Name           3LC  1LC  Vol.  Mass    Charge  HΦ
Alanine        Ala  A    67    89.09   0       1.8
Arginine       Arg  R    148   174.20  +       -4.5
Asparagine     Asn  N    96    132.12  0       -3.5
Aspartic acid  Asp  D    91    133.10  -       -3.5
Cysteine       Cys  C    86    121.16  0       2.5
Glutamic acid  Glu  E    109   147.13  -       -3.5
Glutamine      Gln  Q    114   146.15  0       -3.5
Glycine        Gly  G    48    75.07   0       -0.4
Histidine      His  H    118   155.16  0       -3.2
Isoleucine     Ile  I    124   131.17  0       4.5
Leucine        Leu  L    124   131.17  0       3.8
Lysine         Lys  K    135   146.19  +       -3.9
Methionine     Met  M    124   149.21  0       1.9
Phenylalanine  Phe  F    135   165.19  0       2.8
Proline        Pro  P    90    115.13  0       -1.6
Serine         Ser  S    73    105.09  0       -0.8
Threonine      Thr  T    93    119.12  0       -0.7
Tryptophan     Trp  W    163   204.23  0       -0.9
Tyrosine       Tyr  Y    141   181.19  0       -1.3
Valine         Val  V    105   117.15  0       4.2

(a)
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR

Figure 1.2: The different levels of structure of the human hemoglobin. (a) The primary structure of a human hemoglobin subunit. Each letter represents one of the 20 AA. (b) The secondary and tertiary structures are shown together. First the protein folds locally in α-helices (represented as wide helical ribbons) and then the protein folds itself into a globular shape. (c) The final protein complex is formed by the assembly of four subunits represented in different shades (i.e. quaternary structure).

In the case of proteins, some are systematically exported to the extracellular medium after their synthesis. For instance, insulin is exported into the blood by pancreatic cells, and some bacteria produce and secrete the β-lactamase enzyme, which confers resistance to a large family of antibiotics such as penicillin and ampicillin. This phenomenon will be called here protein secretion or protein translocation. As eukaryotic cells have several internal compartments, there exist specialized mechanisms for each compartment. The category containing all these phenomena is called protein targeting or protein sorting.

For the right proteins to be sorted to the right compartment (either outside space or internal compartments), there exist targeting signals which consist of particular AA sequences. The proteins carrying those sequences are recognized by the appropriate protein machinery and transported to the right compartment. A usual metaphor is to see the transport machinery as a security control and the signal as an accreditation.

1.1.4 Protein Secretion

In this study, we will restrict our attention to the main pathway of protein secretion, the Sec pathway, originally discovered by Blobel and Dobberstein (1975). We will therefore skip both protein targeting to other compartments and other pathways translocating proteins out of the cell. The Sec pathway exists in all living organisms and is essential to life. We will hereafter call the specific transporter the translocon and the specific signal the Signal Sequence.

The Sec translocon is composed of a selective multimeric protein channel leading the secreted proteins across the membrane, SecYEG in eubacteria (Danese and Silhavy 1998) and Sec61αβγ in eukaryotes (Osborne et al. 2005); and several essential soluble factors recognizing the SS (for instance SRP or SecA), providing energy (SecA, BiP), keeping the protein unfolded (SecB), etc.

In prokaryotes, the translocon transfers proteins directly across the cytoplasmic membrane into the periplasmic space, a soluble compartment in direct contact with the extracellular medium. In eukaryotes, the translocon leads proteins to the lumen of the endoplasmic reticulum, whose space is topologically outside the cell but which is not in direct contact with the extracellular medium, as illustrated in Fig. 1.3. The proteins are then transferred to the exterior by an elaborate network of membrane vesicles. This pathway is also used for the membrane insertion of integral membrane proteins. In this case, proteins are released laterally into the membrane, where they remain bound by one or more transmembrane domains.

Originally the localizations of a protein, cytoplasmic or extracellular, were thought to be exclusive. This model relied mostly on steady-state levels of selected proteins. Bitopology was detected later with proteins that can be stable in both compartments (Belin et al. 1989).

Figure 1.3: Protein translocation in eukaryotes and prokaryotes. The arrow in the translocon represents the direction of the translocation. The thin curved arrows in the eukaryotic cell represent vesicle traffic. E. R.: endoplasmic reticulum.

Protein translocation can occur during translation (co-translational) or after it (post-translational) (Mitra et al. 2006). In the case of co-translational translocation, the protein factor SRP (signal recognition particle) recognizes the Signal Sequence during synthesis and arrests translation (Nagai et al. 2003). The ribosome then binds to the Sec channel and translation resumes. The energy required for protein movement across the membrane is provided by the ribosome (GTP hydrolysis). In the case of post-translational translocation, the protein is first synthesized, but is maintained in an unfolded conformation by chaperones (for instance SecB in E. coli). Here the energy is provided by other soluble factors, like SecA in eubacteria (Economou and Wickner 1994) and BiP in eukaryotes (Panzner et al. 1995). The choice between co- and post-translational secretion depends on the Signal Sequence. In E. coli, various soluble factors probe the ribosome exit for nascent proteins (Eisner et al. 2003; Ullers et al. 2007). If SRP binds durably to an SS, the secretion will occur co-translationally. If another factor, such as trigger factor, has a better affinity, the secretion will be post-translational.

In this study, we will consider the translocon as a whole system in order to focus on the Signal Sequence. This black box will keep our approach simpler, but will not allow us to figure out interactions between individual translocon components and the SS. Similarly, we will not be able to take into account the timing in the sequence of interactions.


Figure 1.4: Co-Translational Translocation. (1) The ribosome starts translating the protein. (2) The SRP recognizes the Signal Sequence and binds, stopping translation. (3) The SRP binds to its membrane receptor. (4) The ribosome docks to the translocon, the SRP is released and translation resumes, translocating the nascent protein. (5) The Leader-Peptidase cleaves the Signal Sequence and the protein begins folding.


Figure 1.5: Post-Translational Translocation in E. coli. (1) Chaperones (SecB) bind to the nascent protein, blocking its folding. (2) Soluble SecA recognizes the Signal Sequence. (3) SecA binds to the translocon and the membrane. (4) Translocation begins; SecA provides the energy (ATP hydrolysis). (5) The Leader-Peptidase cleaves the Signal Sequence and the protein begins folding.


Some uncleaved and highly hydrophobic SS can anchor proteins in the membrane. They are termed signal anchors because of their double role. The protein can be inserted in either orientation, depending on the signal anchor characteristics (Wahlberg and Spiess 1997).

1.2 The Signal Sequence

The Signal Sequence (SS), also called Signal Peptide or Leader Peptide, is a protein region that exists only in secreted proteins and carries the information recognized by the secretion machinery. SS are located at the N-terminal extremity of this class of proteins and are variable in length and sequence (Martoglio and Dobberstein 1998; Hegde and Bernstein 2006).

In spite of this variability, SS function is well conserved across evolution. Several experiments have shown that an SS from one species can direct secretion through the translocon of another species. This cross-species conservation also holds for evolutionarily very distant groups. For example:

Hen ovalbumin can be expressed and translocated in E. coli (Fraser and Bruce 1978).

Rat preproinsulin is exported efficiently by the E. coli translocon (Talmadge et al. 1980).

Bacterial TEM β-lactamase is exported by microsomes obtained from dog pancreas (Müller et al. 1982).

Bacterial TEM β-lactamase can also be translated and secreted in Xenopus oocytes (Wiedmann et al. 1984).

A fusion of bee Prepromelittin and mouse Dihydrofolate reductase is secreted by E. coli (Cobet et al. 1989).

However, there are some limits:

Yeast carboxypeptidase Y is not translocated into dog microsomes. However, it is translocated with a mutation increasing the hydrophobicity (converting one or two Glycines into Leucines) or when the carboxypeptidase Y is fused with the influenza virus hemagglutinin (Bird et al. 1987), whose initial mature portion carries negative charges instead of positive ones.

A fusion of the Gram⁺ S. aureus protein A SS with an engineered human protein is not secreted by the Gram⁻ E. coli. However, it can be secreted by the insect baculovirus expression system (Allet et al. 1997).


One of the most conserved features of protein export is the removal of the SS of most proteins, during or after export, by a specialized protease, the leader-peptidase, also called signal-peptidase. This phenomenon is called cleavage, and the exact location where it occurs in the SS is called the Cleavage Site (CS). We can then distinguish the proteins before cleavage (precursor proteins) from the proteins after cleavage (mature proteins). The cleaved SS may be released or remain in the membrane.

In bacteria, there exists a class of proteins that are cleaved by a different peptidase, Signal Peptidase II. These proteins, termed lipoproteins, are secreted by the same Sec system but bear a different cleavage site, called the lipobox, which carries the four-residue consensus [ILMFTV]x[AGS]C (Paetzel et al. 2002; Gonnet et al. 2004). The cleavage site is located just before the Cysteine residue, which is modified by lipid addition after secretion.
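The lipobox consensus maps directly onto a regular expression. The following minimal Python sketch searches the N-terminal part of a precursor for the pattern and returns the index of the Cysteine. The function name `find_lipobox`, the 40-residue search window and the toy sequence are illustrative assumptions; they are not taken from the cited works.

```python
import re

# Lipobox consensus quoted in the text: [ILMFTV]x[AGS]C, where x is any residue.
LIPOBOX = re.compile(r"[ILMFTV].[AGS]C")


def find_lipobox(precursor, window=40):
    """Return the 0-based index of the lipobox Cysteine, or None if absent.

    Only the first `window` residues are searched. The window size is an
    assumption of this sketch, motivated by the N-terminal location of SS.
    """
    match = LIPOBOX.search(precursor[:window])
    if match is None:
        return None
    # The Cysteine is the last residue of the four-residue match.
    return match.end() - 1


# Hypothetical toy precursor, not a real protein.
toy = "MKRTLLALAVSLLAGCSQEK"
cys = find_lipobox(toy)
if cys is not None:
    # Cleavage occurs just before the Cysteine.
    print("SS:", toy[:cys], "| mature:", toy[cys:])
```

A match of the pattern alone does not establish that a protein is a lipoprotein; it only flags a candidate site.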

Disruption of SS activity can have severe effects. For instance, some rare genetic diseases are caused by SS mutations:

Seppen et al. (1996) found two patients suffering from Crigler-Najjar disease and carrying a homozygous SS mutation in the bilirubin UDP-glucuronosyltransferase precursor.

Familial Expansile Osteolysis and Paget Disease of Bone can be caused by SS mutants of the Receptor Activator of Nuclear Factor-κB (Hughes et al. 2000).

Hemophilia B is caused by a dysfunction of clotting factor IX (Roberts 1993). Among all compiled mutations in the factor IX gene, 11 fall in the preproprotein SS (Green 2004).

1.2.1 General Description

Authors (von Heijne 1990; Izard and Kendall 1994) often distinguish three contiguous parts in the SS. According to this description, SS are composed of:

A first region rich in the positively charged AA Lysine and Arginine, the N-region;

A region rich in hydrophobic AA, like Leucine or Alanine, the H-region, also called hydrophobic core;

A third region carrying a consensus (conserved pattern of AA) recognized by the leader-peptidase, the C-region. In the bacterium Escherichia coli, this region can include negatively charged residues (Aspartic and Glutamic acid).
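The three-region description can be illustrated with a rough sketch that locates a candidate hydrophobic core by a sliding-window average and counts the positive charges that precede it. The Kyte-Doolittle hydropathy scale, the window width of 7 and the function names are assumptions chosen for this illustration; the text above does not prescribe a particular scale or procedure.

```python
# Kyte-Doolittle hydropathy values (an assumed choice of scale for this sketch).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_core(seq, width=7):
    """Return (start, mean hydropathy) of the most hydrophobic window."""
    if len(seq) < width:
        raise ValueError("sequence shorter than the window")
    best_start, best_score = 0, float("-inf")
    for start in range(len(seq) - width + 1):
        score = sum(KD[aa] for aa in seq[start:start + width]) / width
        if score > best_score:  # keep the first window with the best score
            best_start, best_score = start, score
    return best_start, best_score


def n_region_positive_charges(seq, core_start):
    """Count Lysine and Arginine residues located before the core."""
    return sum(aa in "KR" for aa in seq[:core_start])


# Hypothetical toy SS, not a real protein.
toy = "MKRSLLLLLLLASAQA"
start, score = hydrophobic_core(toy)
print(start, round(score, 2), n_region_positive_charges(toy, start))
```

As noted in the text, real SS regions are not sharply delimited, so the window returned here is only a candidate core.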


Figure 1.6: Examples of E. coli Signal Sequences aligned on the cleavage site. Charged AA are indicated on the AA letter. A possible hydrophobic core is shown by a gray box. The CS location is indicated by the solid black vertical line. MBP: Maltose Binding Protein, MalE (P0AEX9); PhoA: Alkaline Phosphatase (P00634); LPP: Major outer membrane lipoprotein (P69776); RBP: Ribose Binding Protein (P02925); OmpA: Outer membrane protein A (P0A910); LamB: Lambda receptor protein/Maltoporin (P02943). Note that LPP is cleaved by Signal Peptidase II (see above).


Figure 1.7: The Signal Sequence model. The bold N represents the N-terminus of the protein.

[Histogram. x-axis: Length (10 to 60); y-axis: Frequency (0 to 35)]

Figure 1.8: Histogram of the SS length of 104 well-defined E. coli SS (see Sec. 2 on page 123).

Fig. 1.7 illustrates this model. We must point out that these regions are not clearly defined. For instance, the C-region often carries only hydrophobic residues and thus cannot be delimited from the H-region. Furthermore, polar AA are often present in the H-region. The variability in length makes alignment difficult, if not impossible. Fig. 1.8 shows the distribution of SS lengths in our E. coli datasets.

1.2.2 Amino-Acids Composition and Sequence Logos

Signal Sequences differ significantly in composition from the N-terminal region of non-secreted proteins. These differences are illustrated for E. coli proteins in Fig. 1.9. The first 20 AA of secreted proteins are notably enriched in Alanine, Lysine, Leucine, Methionine and Serine, and slightly in Cysteine and Phenylalanine. In contrast, they are depleted in Aspartic and Glutamic acids, Histidine, Asparagine, Glutamine and Arginine. However, although SS are hydrophobic, they are not enriched in Isoleucine and Valine.
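A composition profile of this kind can be computed in a few lines. The sketch below pools the first `length` residues of every sequence and returns relative abundances; pooling residues over all proteins (rather than averaging per protein) and the function name `n_terminal_composition` are assumptions of this example, and the two input strings are toy data.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def n_terminal_composition(sequences, length=20):
    """Relative abundance of each AA in the first `length` residues.

    Residues are pooled over all sequences; characters outside the
    20 standard AA are ignored.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(aa for aa in seq[:length] if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Toy input, not real proteins.
composition = n_terminal_composition(["MKLA", "MKAA"], length=4)
for aa in AMINO_ACIDS:
    if composition[aa] > 0:
        print(aa, composition[aa])
```

Running the function separately on a secreted and a non-secreted set, with `length=20` and `length=50`, would give the four series compared in the text.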

[Bar chart. x-axis: the 20 amino acids (A C D E F G H I K L M N P Q R S T V W Y); y-axis: abundance (0.00 to 0.15); series: Secreted (l=20), Secreted (l=50), Non secreted (l=20), Non secreted (l=50)]

Figure 1.9: Amino-acid composition of the N-terminal region of 104 secreted and 161 non-secreted E. coli proteins (see Sec. 2 on page 123). The abundance is presented for the first 20 and 50 AA.

Another way to represent some characteristics of SS is the sequence logo (Schneider and Stephens 1990). A logo represents an alignment of sequences as stacks, each corresponding to a different position in the protein. The height R(i) of a stack at position i is proportional to the conservation of that position (the more conserved the position, the higher the stack) and is defined by:

\[ R(i) = \log_2 20 + \sum_{j=1}^{20} f_j(i) \log_2 f_j(i) \tag{1.1} \]

where f_j(i) is the frequency of AA j at position i; that is, R(i) is the maximal entropy log₂20 minus the Shannon entropy of the position. This quantity is measured in bits (Shannon 1948) and its maximum for a position is log₂20 ≈ 4.3 (for example, if only Alanines were present in all proteins at position +5, the corresponding stack would be approximately 4.3 bits high). The relative height r_j(i) of each AA within a stack is proportional to its relative abundance at that position in the alignment:

\[ r_j(i) = \frac{f_j(i)}{\sum_{k=1}^{20} f_k(i)} \tag{1.2} \]
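The two quantities can be checked numerically on a single alignment column. The following Python sketch computes the stack height as log₂20 minus the Shannon entropy of the column, and the relative height of each AA within the stack. It assumes a column given as a string of standard AA letters without gaps and applies no small-sample correction; the function names are hypothetical and this is not the code of the alpro or makelogo programs.

```python
import math
from collections import Counter


def stack_height(column):
    """Conservation R(i) in bits for one alignment column.

    Computed as log2(20) minus the Shannon entropy of the observed
    AA frequencies in the column.
    """
    n = len(column)
    counts = Counter(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(20) - entropy


def relative_heights(column):
    """Relative height r_j(i) of each AA present in the column."""
    n = len(column)
    counts = Counter(column)
    # The frequencies already sum to 1, so r_j(i) equals f_j(i).
    return {aa: c / n for aa, c in counts.items()}


print(stack_height("AAAA"))                   # fully conserved column
print(stack_height("ACDEFGHIKLMNPQRSTVWY"))   # uniform column, close to 0
print(relative_heights("AAKK"))
```

In a drawn logo, the absolute height of each letter is its relative height multiplied by the stack height R(i).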

Figs. 1.10 and 1.11 show sequence logos for E. coli SS proteins and E. coli proteins without SS. We can clearly differentiate the two classes and verify that the SS proteins are more structured (they reach scores of more than 1 bit) than the non-SS proteins. Furthermore, the N- and H-regions can be guessed: the N-region (rich in Lysine and Arginine) extends from position 1 to 3 or 4 and the H-region (rich in Leucine, Alanine and Valine) follows. However, in contrast to the above description, the H-region also includes non-hydrophobic AA, like Tyrosine, Serine or Threonine.

The variable length of SS makes alignment difficult, so the logos are blurred for positions far from the N-terminus. However, the C-region consensus appears clearly if we align the sequences on the CS, as shown in Fig. 1.12. We can clearly see the importance of positions -1 and -3, which are highly conserved. At position -1, Alanine, Serine and Glycine are by far the most frequent residues. Although Alanine is the most frequent at position -3, other residues are also prominent (Val, Ser, Leu, Thr).
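The preferences at positions -1 and -3 can be turned into a simple filter for candidate cleavage sites. The sketch below uses only the residues named above; treating them as hard sets, the length bounds and the function names are simplifying assumptions of this example, and the toy precursor is hypothetical.

```python
# Frequent residues at the two conserved positions, as listed in the text.
MINUS_1 = set("ASG")
MINUS_3 = set("AVSLT")


def matches_cs_pattern(precursor, ss_length):
    """Check positions -1 and -3 for an SS of `ss_length` residues.

    Position -1 is the last residue of the SS, i.e. index ss_length - 1.
    """
    if ss_length < 3 or ss_length > len(precursor):
        raise ValueError("ss_length out of range")
    return (precursor[ss_length - 1] in MINUS_1
            and precursor[ss_length - 3] in MINUS_3)


def candidate_sites(precursor, min_len=15, max_len=40):
    """List every SS length in [min_len, max_len] that fits the pattern."""
    upper = min(max_len, len(precursor))
    return [n for n in range(min_len, upper + 1)
            if matches_cs_pattern(precursor, n)]


# Hypothetical toy precursor, not a real protein.
toy = "MKRSLLLLLLLASAQAEE"
print(candidate_sites(toy, min_len=10, max_len=18))
```

On the toy sequence several lengths pass the filter, which illustrates that this pattern alone narrows down the candidate sites without identifying a unique one.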

It is interesting to note that the conserved positions extend into the mature sequence, with negatively charged residues at position +2 (Aspartic and Glutamic acid). With such a conserved pattern contrasting with the rest of the SS, it is tempting to think that the CS consensus must play a role in SS recognition. However, Kuhn and Wickner have described three mutants of the M13 procoat protein, at positions -6 (Serine to Proline), -3 (Serine to Phenylalanine) and -1 (Alanine to Threonine), which are well translocated but not cleaved. They thus showed that this consensus plays a role only in SS cleavage by the leader-peptidase, which occurs after (or during) translocation.


[Sequence logo. x-axis: positions 1 to 25, one stack per position; y-axis: bits]

Figure 1.10: Sequence logo of 104 E. coli SS (see Sec. 2 on page 123), produced with the alpro and makelogo programs from T. D. Schneider (available online at http://www.lecb.ncifcrf.gov/~toms/). Each stack represents a position after the initial Methionine, so the third stack represents position +4. The colors are chosen according to the basic properties of the AA: green: polar AA (S, G, T, Y, C); purple: neutral (N, Q); blue: basic and positively charged (H, K, R); red: acidic and negatively charged (D, E); black: hydrophobic (A, F, I, L, M, P, V, W).

[Sequence logo. x-axis: positions 1 to 25, one stack per position; y-axis: bits]

Figure 1.11: Sequence logo of 161 E. coli non-SS sequences (see Sec. 2 on page 123). The logo was produced as detailed in Fig. 1.10. The first residue was removed. We have found no explanation for the high variability at position 15 (almost equiprobability).
