
Analysis of large biological data: metabolic network modularization and prediction of N-terminal acetylation


Academic year: 2022


Thesis

Reference

Analysis of large biological data: metabolic network modularization and prediction of N-terminal acetylation

CHARPILLOZ, Christophe

Abstract

Over the last decades, advances in biotechnology have made it possible to gather a huge amount of biological data. These data range from genome composition to the chemical interactions occurring in the cell. Such a huge amount of information requires complex algorithms to reveal how it is organized and thereby to understand the underlying biology. Metabolism forms a class of very complex data, and the graphs that represent it are composed of thousands of nodes and edges. In this thesis we propose an approach to modularize such networks to reveal their internal organization. We analyzed red blood cell networks corresponding to pathological states, and the in silico results obtained were corroborated by known in vitro analyses. In the second part of the thesis we describe a learning method that analyzes thousands of sequences from the UniProt database to predict N-alpha-terminal acetylation. This is done by automatically discovering discriminant motifs that are combined in the manner of a binary decision tree. Prediction performance on N-alpha-terminal acetylation is higher than that of the other published classifiers.

CHARPILLOZ, Christophe. Analysis of large biological data: metabolic network modularization and prediction of N-terminal acetylation. Thèse de doctorat : Univ. Genève, 2015, no. Sc. 4883

URN : urn:nbn:ch:unige-860463

DOI : 10.13097/archive-ouverte/unige:86046

Available at:

http://archive-ouverte.unige.ch/unige:86046

Disclaimer: layout of this document may differ from the published version.


UNIVERSITÉ DE GENÈVE, FACULTÉ DES SCIENCES

Department of Computer Science (Département d’informatique), Professor Bastien Chopard

Analysis of Large Biological Data:

Metabolic Network Modularization and Prediction of N-Terminal Acetylation

THESIS

presented to the Faculty of Science of the University of Geneva

to obtain the degree of Doctor of Science (Docteur ès sciences), with specialization in computer science

by

Christophe CHARPILLOZ

of Bévilard (BE)

Thesis No. 4883

GENEVA

Atelier de reproduction Uni-Mail 2015


ACKNOWLEDGMENTS

I would like to begin by thanking my thesis supervisor, Bastien Chopard, for giving me the opportunity to pursue a doctorate in the Scientific and Parallel Computing Group (SPC). His curiosity and his interest in computational science allowed me to explore the field of in silico metabolism analysis freely and rigorously. His encouragement and support helped me to complete this work under the best possible conditions. I also warmly thank Jean-Luc Falcone for his supervision and for all the help he gave me. His advice, ranging from biology to scientific writing, allowed me to bring this work to completion.

I also express my gratitude to the members of the jury: to Anne-Lise Veuthey for her expertise in proteomics, her suggestions and her remarks on my work, and to Alexandre Masselot for also agreeing to apply his skills as a (bio)informatician to evaluating the quality of my work.

I thank Alexandros Kalousis for sharing his expertise in machine learning and data mining. His assistance and his teaching in these fields were of great help. I am also grateful to Felix Kwok, Martin Jakob Gander and Pierre-Alain Cherix. Thanks to their mathematical know-how and their kindness, a complete section of this manuscript could be written.

This work would not have been possible without the support of many people, starting with those with whom I spent almost all of my years at the SPC. Thanks to Orestis Pileas Malaspinas, whose scientific and friendly support and whose experience in supervising academic work were a very great help to me. Many thanks also to my colleague and friend Daniel Walter Lagrava Sandoval, whose company was much appreciated and helped make my academic journey stimulating and fun. I also thank Xavier Meyer; our exchanges allowed me to approach my work with more calm and serenity.

Nor do I forget my colleagues at the SPC and the members of the Department of Computer Science, with whom I shared many good moments and who also put up with my mood swings. Thanks to Alexandre, Andrea, Aziza, Gregor, Jonas, Kae, Mohamed, Pablo, Pierre, Reto and Yann.

Some of them have become friends with whom I hope to stay in touch well beyond this doctoral work. Thanks also to all the people not mentioned here with whom I interacted during all these years.

Finally, immense gratitude to my mother and my father, who encouraged and supported me constantly and unfailingly from the beginning to the end of this work. Without them, this work would certainly never have been completed.


ABSTRACT

Biotechnology has made it possible to gather a huge amount of biological data. These data range from the nucleotides that compose the genome to the chemical interactions occurring between molecules in the cell. Some of these data can be interpreted by experts, but others require complex algorithms in order to extract knowledge. The development of such algorithms is now a major research field in computational biology (or bioinformatics). In this work we develop such approaches to analyze two types of biological data, stoichiometric models and protein sequences, to discover how these data are structured or organized in order to understand the underlying biology in silico.

Chapters 1 and 2 introduce the basic concepts needed to understand this manuscript. The first chapter introduces basic molecular biology. This gives the reader an intuition of the objects represented by the data extracted from biological databases. The second chapter describes the models used to represent the metabolism, or metabolic network, mathematically, namely stoichiometric matrices and graphs.

In Chapter 3 the problem of extreme pathways computation is tackled. An algorithm based on network reduction and hierarchical computation of the extreme pathways is described in detail. To implement our algorithm, the concept of meta-reaction is introduced. A meta-reaction groups a subset of chemical reactions connected by their substrates or products in the network. A meta-reaction summarizes the subset, or subsystem, only by its inputs and outputs, thus ignoring the intermediate metabolites and allowing the size of the network to be reduced. The meta-reactions are built with respect to the stoichiometry of the encapsulated subsystems. Experiments that assess the efficiency of the reduction and of the hierarchical computation are also described in this chapter. The chapter ends with the description of a new approach for detecting intractable systems by considering the network reduced with meta-reactions.
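The idea of summarizing a subsystem only by its inputs and outputs can be illustrated with a minimal sketch. It assumes a toy subsystem of two reactions and a hand-chosen flux assignment that balances the intermediate metabolite; the function name `net_reaction` and the data layout are hypothetical and are not the construction used in Chapter 3.

```python
from fractions import Fraction

def net_reaction(stoich, fluxes):
    """Net stoichiometry of a subsystem for the given reaction fluxes.

    stoich maps a reaction name to {metabolite: coefficient}
    (negative = consumed, positive = produced).
    """
    net = {}
    for rxn, coeffs in stoich.items():
        for met, coeff in coeffs.items():
            net[met] = net.get(met, Fraction(0)) + Fraction(coeff) * Fraction(fluxes[rxn])
    # Balanced intermediates cancel out and disappear from the summary.
    return {met: coeff for met, coeff in net.items() if coeff != 0}

# Hypothetical subsystem: r1 is A -> 2 B, r2 is B -> C.
subsystem = {
    "r1": {"A": -1, "B": 2},
    "r2": {"B": -1, "C": 1},
}

# r2 has to run twice per r1 so that the intermediate B is balanced.
meta = net_reaction(subsystem, {"r1": 1, "r2": 2})
print(meta)  # only A (consumed) and C (produced) remain: A -> 2 C
```

The resulting dictionary keeps the stoichiometry of the encapsulated reactions (one A gives two C) while hiding B, which is the property the text attributes to a meta-reaction.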

Chapter 4 describes a metric based on the extreme pathways that measures the similarity between chemical reactions in a metabolic network. This metric allows clustering algorithms to be used to detect functional modules in the network. As the definition of the proposed metric requires the complete enumeration of the extreme pathways, an approximation of the metric is proposed. Then, to assess the quality of the detected modules, we applied the approach to the human erythrocyte metabolism. A quantitative experiment that detects pairs of co-expressed genes was also carried out. This produces a score for our modules and thus allows our metric to be compared with other approaches.
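The abstract does not state the metric itself, so the following is only an illustrative stand-in: it assumes that two reactions are similar when they carry flux in the same extreme pathways and uses a Jaccard index for that purpose. The function names and the toy pathways are hypothetical; the actual measure is the one defined in Chapter 4.

```python
def pathway_sets(pathways):
    """Map each reaction to the set of pathway indices in which it carries flux."""
    membership = {}
    for idx, pathway in enumerate(pathways):
        for rxn, flux in pathway.items():
            if flux != 0:
                membership.setdefault(rxn, set()).add(idx)
    return membership

def jaccard_similarity(a, b):
    """Shared pathways divided by all pathways used by either reaction."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Three made-up extreme pathways, each given as {reaction: flux}.
pathways = [
    {"r1": 1, "r2": 1},
    {"r1": 1, "r3": 2},
    {"r4": 1},
]

sets = pathway_sets(pathways)
sim_12 = jaccard_similarity(sets["r1"], sets["r2"])  # 0.5: one shared pathway out of two
sim_14 = jaccard_similarity(sets["r1"], sets["r4"])  # 0.0: no shared pathway
```

A pairwise distance such as `1 - similarity` could then be handed to a hierarchical clustering routine to group reactions into candidate modules.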

We also propose a supervised learning method to predict the initiator methionine cleavage and Nα-terminal acetylation. Chapter 5 is therefore a reminder on supervised learning. It also reviews the existing approaches to detecting Nα-terminal acetylation. Chapter 6 then describes the criteria used to fetch the proteomic datasets. Those datasets are the ones used as learning and test datasets for our model.

Our model is described and evaluated in Chapters 7 and 8. The model is based on a combination of discriminant motifs in the manner of a binary decision tree. A discriminant motif selects a protein according to the level of detection of the motif in the protein’s primary structure. We called our model motifs-tree. Such a tree recursively splits a set of proteins into two subsets: one undergoing a given post-translational modification, the other not. To select the motifs that compose the nodes of the decision tree, an evolutionary algorithm was used to explore the space of all variable-size motifs. Our model is then compared to the state of the art. Our automatically built model provides scores on par with the experts’ state of the art. Moreover, it was able to detect subtle features that correctly identify acetylated sequences not detected by experts (e.g. the proteins acetylated by NatB and NatC). The model was also used to explain the initiator methionine cleavage and Nα-terminal acetylation in H. sapiens. This was successfully done for the initiator methionine cleavage but with less success in the case of Nα-terminal acetylation. Indeed, for the latter the wide range of acetylated proteins makes the model difficult to analyze.
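The routing of a sequence through such a tree can be sketched as follows. Everything here is an assumption made for illustration: the token representation (a set of allowed residues, or a wildcard), the score (fraction of tokens matched at the N-terminus), the threshold and the motifs themselves are arbitrary and are not taken from the thesis, whose motif alignment and scoring are described in Chapter 7.

```python
from dataclasses import dataclass
from typing import Union

ANY = None  # wildcard token: matches any amino acid

def motif_score(motif, sequence):
    """Fraction of motif tokens matched at the start of the sequence."""
    if len(sequence) < len(motif):
        return 0.0
    hits = sum(1 for token, residue in zip(motif, sequence)
               if token is ANY or residue in token)
    return hits / len(motif)

@dataclass
class Node:
    motif: list                      # list of tokens (sets of residues or ANY)
    threshold: float                 # minimal score to follow the "matched" branch
    matched: Union["Node", bool]     # subtree or leaf label
    unmatched: Union["Node", bool]   # subtree or leaf label

def predict(tree, sequence):
    """Walk down the tree until a boolean leaf is reached."""
    while not isinstance(tree, bool):
        goes_left = motif_score(tree.motif, sequence) >= tree.threshold
        tree = tree.matched if goes_left else tree.unmatched
    return tree

# Arbitrary two-node tree, for illustration only.
tree = Node(
    motif=[{"M"}, {"A", "S", "T"}],
    threshold=1.0,
    matched=True,
    unmatched=Node(motif=[{"M"}, {"D", "E"}, ANY], threshold=1.0,
                   matched=True, unmatched=False),
)

print(predict(tree, "MSGK"))  # True: the root motif matches fully
print(predict(tree, "MKLK"))  # False: neither motif reaches its threshold
```

In this sketch each node splits the set of sequences in two, as the text describes; searching for good motifs (done in the thesis with an evolutionary algorithm) is not shown.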

Chapter 9 is the final chapter; it contains a general conclusion about this work and briefly assesses the problem of validating bioinformatics approaches. We also highlight the growing role of computer science in biology.


RÉSUMÉ

Biotechnology has made it possible to collect large quantities of biological data, ranging from the nucleotide sequences that make up the genome to the chemical interactions between the molecules necessary for cellular life. While some of these data can be easily analyzed or interpreted by experts, others, because of their size or complexity, require algorithmic approaches in order to exploit them and extract as much knowledge from them as possible. Developing such approaches to understand these data is a major challenge in bioinformatics and an active research field. In this work we contributed in this direction by proposing approaches to analyze two types of biological data: stoichiometric models and protein sequences. The common goal is to analyze large biological databases in an attempt to discover how these data are biologically structured (in a broad sense), in order to try to understand the underlying biology in silico.

The first chapter is an introduction to the biological objects to which the analytical approaches described in this manuscript are applied. The basics of molecular biology are presented there, without any claim to be an in-depth or exhaustive reference. In the second chapter, another brief introduction describes two models used to represent the metabolism (or metabolic network) mathematically: stoichiometric matrices and graphs.

After these introductions, we deal with the in silico analysis of metabolism. Stoichiometric models and graphs make it possible to represent the known set of chemical reactions occurring in the cell. These models offer scientists mathematical objects that can be manipulated and processed in order to extract knowledge from them, such as the analysis of metabolic pathways or functional modules.

In the third chapter, the problem of computing the extreme pathways of a metabolic network is addressed, an extreme pathway being a mathematical description of a metabolic pathway. The problem of enumerating these extreme pathways is generally intractable (that is, the solution is not found within a sufficiently short time) when applied to genome-scale metabolic networks. We therefore propose an approach to compute the extreme pathways hierarchically, so as to handle a set of problems on smaller networks (that is, networks with few chemical reactions) that are therefore potentially simpler. The idea is to recursively replace subsets of chemical reactions connected to one another by metabolites with meta-reactions. A meta-reaction thus represents a subsystem and summarizes it solely by the metabolites it consumes and produces, disregarding the intermediate metabolites, while respecting the stoichiometry imposed by the subsystem. Using meta-reactions reduces the size of the graph (or metabolic network), in the hope of obtaining a system for which the computation of the extreme pathways is simpler. A detailed description of the algorithm is provided, together with a formalization in terms of matrix operations. The chapter then addresses the performance issues of using meta-reactions and ends with the presentation of a new approach for identifying intractable systems by means of the graph reduction through meta-reactions.

The fourth chapter describes a metric that exploits the extreme pathways to compute a similarity between the chemical reactions that make up the graph (or metabolic network). This similarity is used to perform clustering in order to detect functional modules within the cell. Since building the metric requires the complete computation of the extreme pathways, a solution is proposed to approximate this similarity without having to compute the complete set of extreme pathways of the metabolic network. To validate the detection of functional modules, we applied it to the metabolism of human erythrocytes. The results allowed the extraction of modules that are already well identified, qualitatively confirming the proposed approach. Enzymopathies were then simulated in order to evaluate their consequences on the detected modules. This made it possible to correctly deduce certain alterations of erythrocyte metabolism due to some of the pathologies studied. Finally, a quantitative validation was applied to detect pairs of co-expressed genes in a metabolic network. This made it possible to compute a score measuring the quality of the detected modules and thus to compare our approach with those already published in the literature.

The second part is devoted to the analysis of protein sequences. Databases such as UniProtKB contain hundreds of thousands of annotated sequences. These annotations provide a great deal of information about the sequences, ranging from the gene encoding the polypeptide to the organism in which it was identified. In this manuscript, we describe how these data were exploited to automatically build a model that predicts two co-translational modifications: initiator methionine cleavage and N-terminal acetylation. The fifth chapter is a brief reminder of supervised learning and of its applications to the problem of predicting N-terminal acetylation. Post-translational modifications, for their part, were introduced in the first chapter.

The sixth chapter deals with the construction of the datasets required by supervised learning algorithms. Criteria for selecting suitable sequences were therefore developed. Since these criteria are far from trivial, a chapter is dedicated to them.

The seventh chapter is devoted to the model chosen to predict the two post-translational modifications considered. We opted for a binary decision tree combining discriminant motifs. A discriminant motif selects a protein if the protein contains the motif at a sufficiently significant level. In this work, this model was named motifs-tree. The purpose of such a decision tree is to recursively split a set of proteins into two subsets: one containing the proteins that undergo a modification, the other those that do not. The computational difficulty lies in searching for the motifs that make up the nodes of the tree. To this end, an evolutionary algorithm was used to search the space of variable-size motifs for the most discriminant motifs at each level of the tree.

In the eighth chapter, the results obtained by the motifs-trees are presented and compared with the state of the art at the time (August 2015, when this manuscript was written). It is shown that an automatic method achieves performance equivalent to that obtained by experts in proteomics. In some particular cases, the model presented is able to detect subtle features in the sequences that were not detected by the experts (such as the proteins acetylated by NatB or NatC). The models produced were also used to understand which features are necessary for these enzymatic modifications to take place in humans. This was accomplished successfully for initiator methionine cleavage and less effectively for N-terminal acetylation, given the complexity of the model produced and the wide variety of proteins acetylated at their N-terminus.

The last chapter is a general conclusion and a short essay briefly discussing the problems of validating the approaches proposed in certain areas of bioinformatics, as well as the importance of computer science in biology.


CONTENTS

Acknowledgments
Abstract
Résumé
Contents
List of Figures
List of Tables

I Basic molecular biology

1 Biological concepts
1.1 Proteins
1.2 Genes
1.3 Enzymes
1.4 Post-translational modifications
1.4.1 Initiator methionine cleavage
1.4.2 Nα-terminal acetylation
1.5 The metabolism

II Metabolic network analysis

2 Metabolic network models
2.1 Stoichiometric modeling
2.1.1 Metabolic pathways
2.2 Graph modeling
2.3 Metabolic network reconstruction

3 Hierarchical computation of extreme pathways
3.1 Overview of the approach
3.2 Simplifying metabolic networks
3.3 Meta-reaction
3.4 Metabolic subnetworks
3.5 Packing the metabolic network
3.6 Extreme pathways unpacking
3.6.1 The straightforward case
3.6.2 The shared case
3.6.3 The encapsulated case
3.6.4 Last independence check
3.7 Matrix description of the complete algorithm
3.7.1 The constraints matrix
3.7.2 Example
3.8 Efficiency of the network packing
3.8.1 Random subnetwork packing
3.8.2 Hierarchical packing
3.8.3 Comparison of the results
3.9 Performance of hierarchical extreme pathways computation
3.10 Detection of intractable systems
3.11 Conclusion and perspective

4 Module detection in metabolic network
4.1 Motivation
4.2 An extreme pathways similarity measure
4.2.1 Example of extreme pathways similarity
4.3 The ε-graph
4.4 Hierarchical clustering
4.5 Computation of the distance
4.5.1 Approximation
4.6 Red blood cell functional modules analysis
4.6.1 Glucose-6-phosphate dehydrogenase deficiency
4.6.2 Pyruvate kinase deficiency
4.7 Cluster analysis of the E. coli metabolism
4.7.1 Detection of intra-operonic pairs of genes in E. coli
4.7.2 Exploring genes pairs in E. coli
4.8 Conclusion

III Sequence analysis

5 Background in post-translational modifications classification
5.1 Classification
5.2 Prediction of Nα-terminal acetylation

6 Proteins datasets
6.1 General criteria
6.2 Nα-terminal acetylation criterion
6.3 Non-Nα-terminal acetylation criteria
6.4 Quality of the datasets
6.5 Datasets composition
6.6 Conclusion

7 Motifs-trees
7.1 Motivation
7.2 Sequence motif
7.2.1 Aligned motif
7.3 Tokens
7.3.1 Any amino acid
7.3.2 Fixed amino acid
7.3.3 Included or excluded amino acids
7.3.4 Amino acid physicochemical similarity
7.4 Motif search by genetic algorithm
7.4.1 Individual
7.4.2 Genetic operators
7.4.3 Fitness computation
7.5 Motifs-tree: motif combination
7.5.1 Motifs-tree growth
7.5.2 Motifs-tree pruning

8 Motifs-trees performances and proteomic analysis for H. sapiens
8.1 Initiator methionine cleavage
8.1.1 Parameters selection and classification performance
8.1.2 Human MetAPs specificity analysis
8.2 N-terminal acetylation
8.2.1 Classification performance
8.2.2 NatB and NatC potential substrates
8.2.3 Can a motifs-tree learn like Martinez et al.?
8.3 Human NATs specificity analysis
8.3.1 The root motif
8.3.2 The second motif
8.3.3 UniProtKB release 2015_07
8.4 Ensemble learning
8.4.1 Motifs forest
8.4.2 Classification performances of the motifs forest
8.5 Conclusion and perspective

IV Conclusion

Conclusion & perspective

Bibliography

Curriculum vitae


LIST OF FIGURES

Figure 1: Structure of an amino acid.
Figure 2: Peptide bond formation between two amino acids.
Figure 3: Representation of a polypeptide.
Figure 4: The steps involved in the biosynthesis.
Figure 5: Example of a small partial regulatory network.
Figure 6: Illustration of the feedback inhibition.
Figure 7: N-terminal acetylation by N-terminal acetyltransferases.
Figure 8: Illustration of the three stages in the metabolism.
Figure 9: The Kyoto encyclopedia of genes and genomes metabolism map.
Figure 10: A simple linear pathway.
Figure 11: Example of high dimensional cone in the fluxes-space.
Figure 12: Example of a simple system and its two extreme pathways.
Figure 13: Example of a directed and undirected graph.
Figure 14: A directed bipartite stoichiometric graph.
Figure 15: Transformation of a bipartite metabolic network into a reactions network and a compounds network.
Figure 16: A schematic view of the XML KEGG files.
Figure 17: Four cases of reactions which will never be part of an extreme pathway.
Figure 18: Derivation of a meta-reaction from a system of chemical equations.
Figure 19: Derivation of two meta-reactions from a system of chemical equations.
Figure 20: Derivation of two meta-reactions from a system of chemical equations.
Figure 21: A network composed of four exchange fluxes.
Figure 22: Packing of a network with meta-reactions.
Figure 23: Wrong extreme pathways matrix reconstruction.
Figure 24: Metabolic sub-network improperly extracted from the network.
Figure 25: Packing of metabolic network with cycle.
Figure 26: The metabolic network and its division.
Figure 27: The chosen metabolic subnetworks.
Figure 28: The packed network.
Figure 29: Fowlkes-Mallows index between all pairs of 50 randomly packed E. coli networks.
Figure 30: Distribution of the ratios of the rejected partition during a compression step.
Figure 31: Binary split of the vertices of a graph.
Figure 32: The pruning process and the selection process of the partitions.
Figure 33: Conversion of one metabolite into an external metabolite.
Figure 34: Sizes of the sample networks used to assess the performance of hierarchical packing.
Figure 35: Plot of the computation times on the uncompressed networks versus the compressed networks.
Figure 36: Percentage of 24-hour timeouts as a function of the number of vertices.
Figure 37: Plot of the computation times on the sample networks versus the number of vertices composing the samples.
Figure 38: Plot of the log of the degrees distribution for the compounds and reactions of the 10 easy networks.
Figure 39: Empirical cumulative function and log of the degrees distribution for the reactions.
Figure 40: Empirical cumulative function and the log of the distribution of the common compounds degrees in the K12 networks.
Figure 41: Two-sample Kolmogorov-Smirnov tests for all pairs of networks having similar sizes.
Figure 42: A toy metabolic network.
Figure 43: Extreme pathways of the toy metabolic network.
Figure 44: The two clusters and three clusters produced by the spectral clustering algorithm.
Figure 45: Hierarchical clustering of the toy metabolic network using UPGMA.
Figure 46: Measure of the quality of the extreme pathways distance approximations.
Figure 47: Network representation of the erythrocyte’s metabolism.
Figure 48: Undirected hierarchical clustering of the erythrocyte metabolism and the resulting modules.
Figure 49: Separation of the non-oxidative PPP undirected module into two directed modules.
Figure 50: Directed hierarchical clustering of the erythrocyte metabolism and the resulting modules.
Figure 51: Directed hierarchical clustering of a healthy and a G6PD deficient erythrocyte.
Figure 52: Directed hierarchical clustering of a G6PD deficient erythrocyte and the resulting modules.
Figure 53: Directed hierarchical clustering of a healthy and a PK deficient erythrocyte.
Figure 54: Directed functional modules of a PK deficient erythrocyte.
Figure 55: Distribution of extreme pathways distances in E. coli for all pairs of reactions.
Figure 56: Hierarchical clustering of the reactions in the E. coli metabolism.
Figure 57: Cluster isolated after a cut in the dendrogram in the E. coli metabolism.
Figure 58: Performance of the intra-operonic pairs detection in E. coli through the hierarchy.
Figure 59: Receiver operating characteristic curve for the detection of intra-operonic pairs in E. coli.
Figure 60: Comparison of three different linkage criteria for the detection of intra-operonic pairs in E. coli.
Figure 61: Extracted subnetwork for gpp, spoT, ppx and relA.
Figure 62: Sequence logo for the initiator methionine cleavage of the 2012 dataset.
Figure 63: Sequence logo for the initiator methionine cleavage of the 2015 dataset.
Figure 64: Sequence logo for the Nα-terminal acetylation of the 2012 dataset.
Figure 65: Sequence logo for the Nα-terminal acetylation of the 2015 dataset.
Figure 66: The crossover operators.
Figure 67: The mutation operators.
Figure 68: Example of bloat in alignments with a motif containing bloat and a motif without bloat.
Figure 69: A graphical representation of a motifs-tree.
Figure 70: The motifs-tree predicting the initiator methionine cleavage from the H. sapiens 2012 dataset.
Figure 71: The motif score profile difference for H. sapiens initiator methionine cleavage root node.
Figure 72: The motif score profile difference for H. sapiens initiator methionine cleavage node at depth one.
Figure 73: The motif score profile difference for H. sapiens initiator methionine cleavage node at depth two.
Figure 74: The motifs-tree for the prediction of Nα-terminal acetylation in H. sapiens.
Figure 75: Manually pruned motifs-tree for Nα-terminal acetylation prediction in H. sapiens.
Figure 76: The average scores difference and histograms of aligned positions of the root motif of the motifs-tree predicting Nα-terminal acetylation in H. sapiens.
Figure 77: Average score difference and histograms of aligned position for the second motif in the simplified motifs-tree for the prediction of Nα-terminal acetylation in H. sapiens.
Figure 78: Construction of an ensemble classifier based on decision trees.


LIST OF TABLES

Table 1: List of functions accomplished by proteins.
Table 2: Names and abbreviations of the 22 amino acids.
Table 3: The protein’s structure levels.
Table 4: The genetic code.
Table 5: The six major classes of enzymes.
Table 6: List of some common post-translational modifications.
Table 7: Supposed substrate specificities of the six N-terminal acetyltransferases.
Table 8: The result obtained by applying the simplification algorithm to reconstructed networks of H. sapiens and E. coli.
Table 9: The criterion to decide the type of exchange flux.
Table 10: Size of the reconstructed E. coli network.
Table 11: Compression of the reconstructed E. coli metabolic network.
Table 12: Packing efficiency on the reconstructed E. coli metabolic network.
Table 13: Identification of tractable systems with the Kolmogorov-Smirnov statistic.
Table 14: Partitions produced by spectral clustering of the toy network.
Table 15: The percentage of network components represented by a subnetwork.
Table 16: The standard deviations for each pair of parameters used in the subnetworks sampling.
Table 17: List of chemical abbreviations used in the human erythrocyte metabolic network.
Table 18: List of enzyme abbreviations used in the human erythrocyte metabolic network.
Table 19: Outliers in the hierarchy for the considered human red blood cell states.
Table 20: Confusion matrix for the genes pairs.
Table 21: The area under the curve for the detection of intra-operonic pairs in E. coli.
Table 22: Discovered pair in the E. coli hierarchical clustering.
Table 23: Pair of genes encoding for proteins interacting with ppGpp.
Table 24: Pair of genes that encode for PLP dependent proteins.
Table 25: Criteria used to build the Nα-acetylated and the non Nα-acetylated datasets.
Table 26: Criteria for the protein existence in UniProtKB.
Table 27: Number of sequences and content of the different datasets extracted from the two releases of UniProtKB (2012_07 and 2015_07).
Table 28: Hydropathy index (KYTJ820101) from the AAIndex1.
Table 29: Summarized list of the type of tokens used to build a motif.
Table 30: Numbers of possible tokens.
Table 31: Parameters used to build the motifs-trees.
Table 32: Results obtained by outer and inner cross-validation.
Table 33: Results assessing the quality of the initiator methionine cleavage prediction on the 2012 dataset.
Table 34: Cross-validated results for the Nα-terminal acetylation prediction to select the N-terminus length on the 2012 datasets.
Table 35: McNemar’s tests to assess difference between classifiers.
Table 36: Performance for Nα-terminal acetylation prediction by TermiNator3 on the 2012 datasets.
Table 37: Cross-validated scores obtained by Eukaryota classifiers versus TermiNator3 on NatB or NatC proposed substrates.
Table 38: Predictions of proteins with known Nats using the Terminus H. sapiens classifier.
Table 39: Performance of the motifs-trees when the models are built on the complete 2012 datasets to predict Nα-terminal acetylation.
Table 40: Results assessing the quality of the algorithm in predicting Nα-terminal acetylation with a reduced dataset.
Table 41: Motif used in the H. sapiens motifs-tree for Nα-terminal acetylation prediction.
Table 42: Description of the used physico-chemical property token in the H. sapiens Nα-terminal acetylation motifs-tree.
Table 43: Amino acid frequency for the second residue in the H. sapiens proteome.
Table 44: Simplified rules of the second motif in the simplified motifs-tree for Nα-terminal acetylation in H. sapiens.
Table 45: Lysine, proline and arginine scores when aligned on the physico-chemical property tokens of the motif at depth two.
Table 46: Simplified rules of the second motif in the simplified motifs-tree for Nα-terminal acetylation in H. sapiens.
Table 47: 10-fold cross-validated results for Nα-terminal acetylation prediction with the motifs-trees on the 2015 version of the datasets.
Table 48: Results assessing TermiNator3 quality of prediction for Nα-terminal acetylation.
Table 49: Cross-validated performances of the ensemble learning method on the 2012 datasets.


Part I

Basic molecular biology


1 Biological concepts

In this chapter we introduce the basic biological knowledge needed to understand the objects studied in this work, namely metabolic networks, proteins and their chemical modifications. This chapter does not claim to be exhaustive in its description of those objects, but it should give the unfamiliar reader a clearer picture of what metabolism, proteins and gene regulation are. Unless specified otherwise, the biological processes described take place in eukaryotic cells. The content of sections 1.1 and 1.3 is inspired by [Berg et al., 2002]; this reference is not necessarily cited again in those sections.

1.1 Proteins

Proteins are large biological molecules synthesized by the cell. They are formidable molecular machines that carry out functions in virtually every process within the cell and the organism, such as food digestion or immunity. These diverse roles are possible because proteins come in a wide variety of shapes and sizes. Indeed, the structure, or shape, of a protein defines its function, and studying it teaches biologists and biochemists how the protein works. Proteins are also key structural components of biological materials like cartilage, hair or spider silk. The remarkable scope of their functions is exemplified in table 1.

Chemically, the proteins in Eukaryota are built from a set of 22 amino acids (table 2). Those amino acids form the basic structural units of the protein and come in different shapes, sizes and chemical properties. In other words, amino acids are the building blocks of proteins. More precisely, an amino acid is a molecule consisting of a central carbon, called the α carbon, bonded to an amino group (–NH2), a carboxyl group (–COOH), a hydrogen atom and a variable lateral chain, called the R group. All amino acids share this common structure (figure 1) and differ only by the lateral chain (or R group). It is this chain that gives amino acids their variability.

Amino acids have the important property of being able to bind to one another. The amino group of one amino acid can react with the carboxyl group of another and form a covalent bond called a peptide bond (figure 2).

Several amino acids can bind and form a linear chain called a polypeptide or protein. In such a chain an amino acid is called a residue. The residues of a polypeptide are read from the amino group (N-terminal end) to the carboxyl group (C-terminal end) (figure 3).

Figure 1: Structure of an amino acid.


Table 1: List of functions accomplished by proteins. The list may not be exhaustive.

Type | Function | Example
Structure | Structural proteins create and maintain biological structures by giving shape to cells, tissues and organs | Collagen is a fibrous protein found, for example, in bones, tendons and cartilage
Transport | Transport proteins can bind to substances (small molecules or ions) and transport them from one location to another | Hemoglobin transports oxygen within the erythrocytes (red blood cells) from lungs to tissues
Defense | Some proteins play an active role in cell protection | Antibodies, or immunoglobulins, are proteins synthesized as a response to a foreign substance (virus, bacteria or parasite)
Regulation | Proteins are signal substances (hormones) or receptors | Insulin is a hormone produced by pancreatic cells, regulating glucose metabolism
Catalysis | Enzymes are catalysts accelerating biochemical reactions up to 10^16 times in comparison to the non-catalyzed reaction | Trypsin is a serine protease, a class of enzymes cleaving peptide bonds
Motion | Some proteins provide motion capability to cells, as in cellular division or muscle contraction | Actin and myosin form the contractile system of the cells
Storage | Storage proteins work as tanks for essential substances | Ferritin binds iron and allows the storage of this essential metal

Table 2: Names and abbreviations of the 22 amino acids directly encoded for protein synthesis by the genetic code of Eukaryota.

Name | Abbrev. | Name | Abbrev.
Alanine | Ala A | Glycine | Gly G
Arginine | Arg R | Histidine | His H
Asparagine | Asn N | Isoleucine | Ile I
Aspartic acid | Asp D | Leucine | Leu L
Cysteine | Cys C | Lysine | Lys K
Glutamic acid | Glu E | Methionine | Met M
Glutamine | Gln Q | Phenylalanine | Phe F
Proline | Pro P | Serine | Ser S
Threonine | Thr T | Tryptophan | Trp W
Tyrosine | Tyr Y | Valine | Val V
Pyrrolysine | Pyl O | Selenocysteine | Sec U
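As a small illustration of how these abbreviations are used, the following sketch converts a chain of residues, written with three-letter codes from the N-terminal end to the C-terminal end, into the compact one-letter notation. The function name and the example peptide are hypothetical and only serve the illustration.

```python
# Three-letter to one-letter amino acid codes, as listed in table 2.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Glu": "E", "Gln": "Q", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Pyl": "O", "Sec": "U",
}


def to_one_letter(residues):
    """Convert residues (N-terminal end first) into a one-letter string."""
    return "".join(THREE_TO_ONE[residue] for residue in residues)


# Hypothetical four-residue peptide, read from N-terminus to C-terminus.
print(to_one_letter(["Met", "Ala", "Ser", "Lys"]))  # MASK
```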


Figure 2: Peptide bond formation between two amino acids: the amine group reacts with the carboxyl group releasing a water molecule.

Figure 3: Representation of a polypeptide with the localization of the N-terminal end and the C-terminal end. Usually a polypeptide is read from the N-terminal end to the C-terminal end.

The sequence of amino acids that compose a protein is called the primary structure. This structure, or sequence, is defined by the nucleotide sequence of the gene encoding the protein. We recall that the gene is the molecular unit of heredity of a living organism and is a segment of a deoxyribonucleic acid (DNA) molecule. Proteins are biosynthesized from the information carried by those DNA molecules.

Roughly, the steps of protein biosynthesis are the following:

1. The gene is first transcribed into a precursor messenger RNA (pre-mRNA) molecule in the cell nucleus.

2. Then the pre-mRNA is processed into a messenger RNA (mRNA) by removing the introns (splicing). The introns are the parts of the gene that do not encode amino acids of the synthesized protein.

3. The next step is the translation, where the mRNA is decoded by a molecular machine, called the ribosome, to produce an amino acid chain.

Proteins are always synthesized from the N-terminal end to the C-terminal end.

Figure 4 illustrates these steps with the different molecules involved in the biosynthesis.

Proteins differ from one another mainly by their primary structure. This primary structure determines the shape and size of the protein in the cell. We say that the protein folds into a complex three-dimensional structure. Four levels of protein structure are usually distinguished (primary, secondary, tertiary and quaternary structure); they are all listed and described in table 3, together with the intermediate super-secondary level.

1.2 Genes

Genes are the molecular units of heredity in living organisms. Their function is to encode proteins, which perform the necessary functions in the cell (see section 1.1). A gene is a segment of a strand of deoxyribonucleic acid (DNA). This strand is itself part of a longer strand of DNA, which twists with a complementary strand to form a DNA double helix. Such a very long DNA double helix forms a chromosome. Hence a chromosome can encode up to thousands of genes.

A molecule of DNA is a polymer composed of repeating units called nucleotides. Each nucleotide is composed of a nitrogenous base, a five-carbon sugar (a deoxyribose) and one phosphate group. In DNA, four bases distinguish the nucleotides that form its building blocks: adenine (A), thymine (T), cytosine (C) and guanine (G). These nucleotides are often abbreviated by the letter given in parentheses.


Figure 4: The steps involved in the biosynthesis. The arrows indicate the information flows from one molecule to another (DNA replication, transcription, RNA replication and translation). The dashed arrows represent unusual flows (e.g. reverse transcription in viruses, direct translation in cell-free systems). The (-) sense represents the antisense RNA (asRNA), which is complementary to an mRNA. The (+) sense represents the mRNA transcribed in the cell.

Table 3: The protein’s structure levels and their description.

Structure | Description
Primary | The linear amino acid sequence of the polypeptide chain.
Secondary | Local structures stabilized by hydrogen bonds between the chain peptide groups. There are two main types of structure, the α-helix and the β-strand (or β-sheet).
Super secondary | Compact three-dimensional structure of several adjacent secondary structures, like β-hairpins, α-helix hairpins and β-α-β motifs.
Tertiary | The spatial positions of the secondary structures relative to one another, generally stabilized by nonlocal interactions (in other words, the overall shape of a single protein).
Quaternary | The structure formed by several proteins, called protein subunits, functioning as a single complex.


The nucleotides are able to bond in pairs by forming hydrogen bonds. These bonds hold together the two strands of the DNA helix. The pairing occurs between A and T and between G and C.
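The pairing rule can be sketched in a few lines of code. The example below computes the strand that pairs with a given DNA strand. It assumes, as is standard but not discussed in the text above, that the two strands of the helix run in opposite directions, so the complementary strand is returned reversed; the function name and the example sequence are hypothetical.

```python
# Watson-Crick pairing in DNA: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complementary_strand(strand):
    """Return the strand pairing with `strand`, read in the opposite direction.

    Assumption: the two strands are antiparallel, hence the reversal.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand))


print(complementary_strand("ATGGCC"))  # GGCCAT
```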

To pass the information from a gene to a protein, several steps are needed. The first step is the transcription of the DNA into messenger ribonucleic acid (mRNA), a molecule very similar to DNA. The main difference from DNA is that RNA uses uracil (U) instead of thymine. Moreover, RNA is mainly a single-stranded molecule. However, it can form double-stranded regions by complementary base pairing, as in transfer RNA (tRNA). The mRNA is thus composed of the bases complementary to those of a DNA strand. Transcription consists of four steps:

1. The pre-initiation and initiation are the events that allow the molecular machine called RNA polymerase to bind to the DNA. This machine produces the primary transcript RNA. At first, the RNA polymerase does not bind to the DNA directly but rather to proteins called transcription factors. Those transcription factors bind to a region of DNA, called the promoter, during the pre-initiation step and facilitate the binding of the RNA polymerase.

2. The promoter clearance is the transition from initiation to transcript elongation. During this intermediate phase, the contact with the initiation factors is lost and a stable association with the nascent transcript is established.

3. The elongation is the step in which one strand of the DNA is used as a template for the RNA synthesis. The RNA polymerase traverses the DNA and the RNA is assembled by base pairing (A–U and G–C).

4. The termination is the end of the transcription. This step is not yet well understood and is therefore not described in this thesis.
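The elongation step can be illustrated by a minimal sketch that builds the RNA base by base from a DNA template. It only models base pairing: the polymerase, the promoter and the termination are ignored. The orientation convention (template given in the 3'→5' direction, RNA returned 5'→3') is an assumption made for the illustration, and the template sequence is hypothetical.

```python
# Pairing between a DNA template base and the RNA base assembled opposite it.
# RNA uses uracil (U) where DNA would use thymine (T).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5):
    """Return the RNA assembled on a DNA template strand.

    Assumption: the template is written 3'->5', so the RNA is read 5'->3'
    in the same left-to-right order.
    """
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5)


print(transcribe("TACCGGATT"))  # AUGGCCUAA
```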

The mRNA serves as a template for protein biosynthesis in the translation process. It is made of codons, which are triplets of nucleotides. These codons form the template read by the ribosome, a large molecular machine that links amino acids together in the order specified by the mRNA. The first codon is called the start codon and is very often AUG. It corresponds to a methionine when translated into an amino acid. Each of the following codons encodes one of the twenty-two amino acids. There are also stop codons that mark the end of the template. The mapping between codons and amino acids is the genetic code. This code is highly similar among all organisms. As an example, the complete genetic code of E. coli is provided in table 4. The translation process consists of four steps:

1. The initiation: the ribosome assembles around the target mRNA and the first tRNA is attached to the start codon. Roughly, a tRNA is a small strand of folded RNA that is linked to an amino acid. It also carries an anticodon, made of three bases that can pair with a codon in the mRNA.

2. The elongation: the tRNA transfers the amino acid corresponding to the next codon.

3. The translocation: the ribosome moves to the next mRNA codon.

4. The termination: the ribosome releases the polypeptide when a stop codon is reached.
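The four steps above can be mimicked by a short sketch that reads an mRNA codon by codon with the genetic code of table 4. It is a simplification: translation starts at the first AUG found, and the recoding of UGA and UAG into selenocysteine and pyrrolysine is ignored. The compact construction of the codon table and the example mRNA are choices made for this illustration.

```python
from itertools import product

# Genetic code of table 4, one-letter amino acid codes, '*' for stop codons.
# Codons are enumerated with the bases ordered U, C, A, G at each position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA from its first AUG codon up to the first stop codon."""
    start = mrna.find("AUG")          # initiation at the start codon
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):   # translocation, codon by codon
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":         # termination at a stop codon
            break
        protein.append(amino_acid)    # elongation
    return "".join(protein)


# Hypothetical mRNA: AUG GCC UCA AAG UAA, flanked by untranslated bases.
print(translate("GGAUGGCCUCAAAGUAAGG"))  # MASK
```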

Some genes are constitutive, that is to say they are continually transcribed. Other genes are facultative and are only transcribed when needed. Indeed gene expression, the name given to the process


Table 4: The genetic code mapping a triplet of ribonucleotides to an amino acid. The methionine codon (AUG) also acts as the start codon. The Amber, Ochre and Opal codons are the stop codons. This table maps only to twenty amino acids, but it has been discovered that the UGA codon is translated into selenocysteine when the selenocysteine insertion sequence element is present in the mRNA. The UAG codon is translated into pyrrolysine in a similar way [Rother and Krzycki, 2010, Papp et al., 2007, Zhang et al., 2005].

The genetic code

UUU Phe UCU Ser UAU Tyr UGU Cys

UUC Phe UCC Ser UAC Tyr UGC Cys

UUA Leu UCA Ser UAA Ochre UGA Opal

UUG Leu UCG Ser UAG Amber UGG Trp

CUU Leu CCU Pro CAU His CGU Arg

CUC Leu CCC Pro CAC His CGC Arg

CUA Leu CCA Pro CAA Gln CGA Arg

CUG Leu CCG Pro CAG Gln CGG Arg

AUU Ile ACU Thr AAU Asn AGU Ser

AUC Ile ACC Thr AAC Asn AGC Ser

AUA Ile ACA Thr AAA Lys AGA Arg

AUG Met ACG Thr AAG Lys AGG Arg

GUU Val GCU Ala GAU Asp GGU Gly

GUC Val GCC Ala GAC Asp GGC Gly

GUA Val GCA Ala GAA Glu GGA Gly

GUG Val GCG Ala GAG Glu GGG Gly

by which the information in a gene is used to build a functional product (usually a protein), is subject to regulation. Gene regulation is a complex process that is not yet fully understood. In this thesis we only roughly describe the process of transcriptional regulation, which is the way the cell regulates the copying of DNA into an RNA molecule. More precisely, we focus on regulation through transcription factors.

Transcription factors are proteins that bind to specific DNA regions (e.g. a promoter) to regulate the expression of a gene. A transcription factor can act as an activator or an inhibitor by recruiting or repressing the RNA polymerase. Usually several transcription factors must bind to the DNA to recruit other factors and the RNA polymerase. Interestingly, only a small subset of the genome, that is to say the complete set of DNA, may encode transcription factors (≈2000, i.e. 7% of the human set of proteins, or proteome). But as they function in groups, the combinatorial use of this subset could mean that each gene is uniquely regulated [Brivanlou and Darnell, 2002]. In addition, post-translational modifications (i.e. chemical modifications of the protein) orchestrate all transcription factor functions, including subcellular localization, protein stability, protein-protein interactions (i.e. with cofactors) and transcriptional activities [Filtz et al., 2014, Tootle and Rebay, 2005].

This combinatorial use of transcription factors gives rise to complex interactions between these regulators. Such interactions can be modeled by a so-called gene regulatory network, which represents the interactions between the regulated DNA regions and other molecules (like proteins). These interactions can be direct (e.g. the binding of a transcription factor activating the transcription), but not only. Indeed, some transcription factors may activate or inhibit the transcription of another transcription factor or of a protein (an


Figure 5: Example of a small partial regulatory network. This diagram illustrates the crp regulation in E. coli and shows how the σ-factor (σ70) is recruited. This factor is needed to initiate the transcription. Pointed arrows represent activation, diamond arrows represent activation or inhibition and blunt arrows represent inhibition. The diagram shows that Cra, Crp and Fis have a direct regulating effect on the σ-factor. Cra activates the transcription initiation, Fis inhibits it and Crp-cAMP activates or inhibits it. The network also shows other activating and inhibiting effects, e.g. β-D-fructofuranose-1-P inhibits Cra and Fis activates Crp. All these interactions form the regulatory network. The other nodes of the diagram are DksA, IHF, PhoB, NsrR and ppGpp.

enzyme) responsible for a post-translational modification that affects another transcription factor. These kinds of interactions, and many more, make up the complex regulatory network. A small example of a network representing the crp regulation in E. coli is provided in figure 5. The reconstruction and study of these networks is an active subject of research in systems biology [Lee et al., 2002, Teichmann and Babu, 2004, Hecker et al., 2009].
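A regulatory network of this kind is naturally stored as a list of signed, directed edges. The sketch below encodes only the interactions spelled out in the caption of figure 5; the edge labels, the node spelling ("sigma70" for σ70) and the helper function are hypothetical choices made for the illustration, not the data structure used later in this thesis.

```python
# Signed edges (regulator, target, effect) taken from the caption of figure 5.
# "dual" stands for a regulator that activates or inhibits.
EDGES = [
    ("Cra", "sigma70", "activation"),
    ("Fis", "sigma70", "inhibition"),
    ("Crp-cAMP", "sigma70", "dual"),
    ("beta-D-fructofuranose-1-P", "Cra", "inhibition"),
    ("Fis", "Crp", "activation"),
]


def regulators_of(target, edges=EDGES):
    """Return the direct regulators of `target` with their effect."""
    return {source: effect for source, tgt, effect in edges if tgt == target}


print(regulators_of("sigma70"))
print(regulators_of("Cra"))
```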

It is also interesting to note that a promoter does not always regulate a single gene, but can control the expression of several genes, which together are called an operon. An operon is a cluster of genes controlled by a single promoter. That is to say, when the transcription process starts, all genes in the operon are transcribed. The transcripts are then translated together (polycistronic mRNA) or are trans-spliced, thus producing monocistronic mRNAs which are translated independently. An mRNA is monocistronic when it contains the information for only one polypeptide. A polycistronic mRNA encodes several polypeptides (see footnote 1). An operon is composed of the following components:

1. A promoter (see previous description in the text).

2. An operator which is a DNA region between the promoter and the genes where a repressor binds and obstructs the RNA polymerase.

3. The genes that are co-regulated by the operon.

Operons are found in prokaryotes and eukaryotes. The expression of eukaryotic operons usually leads to the transcription of monocistronic mRNAs, as opposed to prokaryotic operons, which lead to polycistronic mRNAs [Blumenthal, 2004].

1. A more correct way to explain it is to say that a polycistronic mRNA contains several open reading frames (ORFs). The description of an ORF is not provided in this document; the reader can find good descriptions in [Lodish et al., 2000, Berg et al., 2002].


Table 5: The six major classes of enzymes and the reaction’s types. The table is partially taken from [Berg et al., 2002].

EC | Class | Type of reaction | Example
1 | Oxidoreductases | Oxidation-reduction | Lactate dehydrogenase
2 | Transferases | Group transfer | N-acetyltransferase
3 | Hydrolases | Hydrolysis reactions (transfer of functional groups to water) | Methionine aminopeptidase
4 | Lyases | Addition or removal of groups to form double bonds | Fumarase
5 | Isomerases | Isomerization (intramolecular group transfer) | Triose phosphate isomerase
6 | Ligases | Ligation of two substrates at the expense of ATP hydrolysis | Aminoacyl-tRNA synthetase

Several other regulation processes can be cited, like histone rearrangement, the action of transcriptional enhancers [Levine, 2010] and other post-transcriptional regulation strategies. The gene expression processes are very complex and are not fully understood. We do not provide a description, even summarized, of these regulation strategies, as it goes far beyond the scope of this thesis.

1.3 Enzymes

Enzymes are biological molecules acting as the catalysts of biological systems. The majority of enzymes are proteins (see footnote 2) and they increase reaction rates by factors of a million or more. Several impressive and extreme examples are found in Radzicka and Wolfenden [1995]: the orotidine 5’-phosphate decarboxylase enhances the rate of reaction by a factor of 10^17 (non-enzymatic half-life, see footnote 3: 7.8·10^7 years) and the staphylococcal nuclease by a factor of 10^14 (non-enzymatic half-life: 1.3·10^5 years). Hence, without such catalysts, most biological reactions would not proceed at a rate able to sustain life in the cell. Enzymes are also highly specific and usually catalyze a single chemical reaction or a set of closely related reactions. They are classified based on the chemical reactions they catalyze (table 5) and there are currently 4867 active enzyme classes [Schomburg et al., 2012].
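To give a feeling for these orders of magnitude, the following sketch converts the non-enzymatic half-life quoted for the orotidine 5’-phosphate decarboxylase into a rate constant, and estimates the half-life once the reaction is accelerated. It assumes first-order kinetics, for which k = ln(2) / t1/2, and applies the rounded enhancement factor of 10^17 directly; both are simplifications, so the result is only an order-of-magnitude estimate.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def rate_constant(half_life):
    """First-order rate constant k = ln(2) / t_half (inverse unit of half_life)."""
    return math.log(2) / half_life


def catalyzed_half_life(uncatalyzed_half_life, enhancement):
    """Half-life once the rate constant is multiplied by `enhancement`."""
    return uncatalyzed_half_life / enhancement


# Values quoted in the text: half-life of 7.8e7 years, enhancement of 1e17.
t_half_seconds = 7.8e7 * SECONDS_PER_YEAR
print(rate_constant(t_half_seconds))              # about 2.8e-16 per second
print(catalyzed_half_life(t_half_seconds, 1e17))  # about 0.025 seconds
```

Under these assumptions, a reaction that would take tens of millions of years to be half completed is half completed in a few hundredths of a second.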

Enzymes are usually much larger than their substrates and only a few amino acids play a role in the catalysis. Those few amino acids compose the catalytic site, located next to binding sites where residues orient the substrates. The catalytic and binding sites together compose the enzyme’s active site. It should be noted that in some enzymes no amino acids are directly involved in catalysis. Rather, cofactors bind to the enzyme and themselves take part in the catalytic reaction.

2. Catalytic RNA molecules, or ribozymes (ribonucleic acid enzymes), are also capable of catalyzing specific biochemical reactions [Kruger et al., 1982].

3. The half-life (denoted t1/2) describes how fast a reaction occurs. It is the time needed for half of the reactant initially present to decompose.


Figure 6: Illustration of the feedback inhibition in a chain of three enzymatic reactions (partially reproduced from Berg et al. [2002]). The chemical “a” is transformed by enzymes to the final product “d”. Each arrow represents a reaction catalyzed by an enzyme. The dashed arrow represents the feedback inhibition of the first enzyme where the chemical “d” binds to the enzyme. The binding is reversible in order to allow the conversion from “a” into “d”.

Enzyme activity can be regulated by several strategies. For example, enzymes can be inhibited by their final product (feedback inhibition, figure 6) through a reversible allosteric interaction. Some enzymes change conformation when they interact with other molecules, which modifies their activity. This mechanism is called allostery. For example, an effector molecule binds to a site other than the active site. This binding often results in a conformational change of the protein, thus altering the catalytic activity of the enzyme. An effector molecule enhancing the enzyme’s activity is called an allosteric activator. Conversely, a molecule decreasing the activity is called an allosteric inhibitor. Many enzymes are synthesized in an inactive form (called a zymogen) and are activated by a digestive enzyme that cleaves them. The cleavage induces a conformational change that produces an active form of the enzyme. Such activation is called proteolytic activation. Covalent modification is another mechanism of regulation. It consists in the attachment (mostly reversible) of a chemical group that modifies the properties of the enzyme (e.g. by blocking substrate binding to the active site). Finally, since enzymes are proteins, they are also subject to the regulation of gene expression.
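The effect of feedback inhibition on the chain of figure 6 can be explored with a toy simulation. Everything quantitative in the sketch below is an assumption made for the illustration: the linear rate laws, the rate constant k, the constant supply of “a”, the consumption of “d” by the rest of the cell, and the inhibition term 1 / (1 + d / Ki) applied to the first reaction. It is not a model of a real pathway.

```python
def simulate(ki=None, k=1.0, a=1.0, dt=0.01, steps=5000):
    """Euler integration of the chain a -> b -> c -> d; returns the final d.

    ki is the (hypothetical) inhibition constant of d on the first enzyme;
    ki=None switches the feedback off. The level of a is held constant.
    """
    b = c = d = 0.0
    for _ in range(steps):
        inhibition = 1.0 if ki is None else 1.0 / (1.0 + d / ki)
        v1 = k * a * inhibition  # a -> b, slowed down when d accumulates
        v2 = k * b               # b -> c
        v3 = k * c               # c -> d
        v4 = k * d               # consumption of d (assumed)
        b += dt * (v1 - v2)
        c += dt * (v2 - v3)
        d += dt * (v3 - v4)
    return d


print(simulate())        # level of d without feedback
print(simulate(ki=0.1))  # lower level of d with feedback inhibition
```

With these arbitrary parameters, the final product settles at a markedly lower level when it inhibits the first enzyme, which is the qualitative behaviour expected from feedback inhibition.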

1.4 Post-translational modifications

Post-translational modifications are covalent modifications occurring during or after protein biosynthesis. These modifications, or processing events, change the properties of a protein by the attachment of functional groups (e.g. Nα-terminal acetylation), changes of the chemical nature of a residue (e.g. deamidation), cleavage of one or more residues (e.g. initiator methionine cleavage), or structural changes. For example, they can determine:

- The activation of a protein, as seen with the proteolytic activation of a zymogen in section 1.3.

- The localization and turnover, as will be illustrated with the description of Nα-terminal acetylation in section 1.4.2.

- The structure, as with the disulfide bonds between two cysteines, which stabilize the folded form of a protein by holding two portions of the protein together.

Post-translational modifications broaden the diversity of functional groups beyond those of the 22 standard amino acids and produce diverse forms of proteins that cannot be derived from the genes alone [Walsh, 2006, Schwartz et al., 2009]. Since the mature form of a protein cannot be inferred only from its


Table 6: List of some common post-translational modifications (PTM) and their functions. This list is non-exhaustive and is partially taken from Mann and Jensen [2003].

PTM type | Function | Notes
Phosphorylation | Activation/inactivation of enzyme activity, modulation of molecular interactions, signaling | Reversible
Acetylation | Protein stability, protection of N terminus, regulation of protein-DNA interactions |
Methylation | Regulation of gene expression |
Acylation | Cellular localization and targeting signals, membrane tethering, mediator of protein-protein interactions |
Glycosylation | Excreted proteins, cell-cell recognition/signaling, regulatory functions | Reversible
Hydroxyproline | Protein stability and protein-ligand interactions |
Sulfation | Modulator of protein-protein and receptor-ligand interactions |
Disulfide bond formation | Intra and intermolecular crosslink, protein stability |
Deamidation | Possible regulator of protein-ligand and protein-protein interactions |
Pyroglutamic acid | Protein stability, blocked N terminus |
Ubiquitination | Destruction signal |
Nitration of tyrosine | Oxidative damage during inflammation |

genes, knowledge of its post-translational modifications helps to understand the roles, the possible interactions, or the activity of a protein.

In this work we focus on two post-translational modifications: the initiator methionine cleavage and the Nα-terminal acetylation of eukaryotic proteins. They are introduced in the next sections (1.4.1 and 1.4.2). The content of this introduction may not be sufficient to fully understand these modifications. Readers interested in the details are invited to read [Berg et al., 2002, chap. 3, 5 and 8] and [Lodish et al., 2000, chap. 3, 4, 6 and 18].

1.4.1 Initiator methionine cleavage

A ribonucleic acid (RNA) molecule is a chain of ribonucleotides joined by covalent bonds. Each nucleotide carries one of the following bases: adenine, cytosine, guanine or uracil (respectively symbolized by A, C, G and U). During the translation step, an mRNA molecule is decoded by reading the nucleotides by triplets, or codons. The translated messenger RNA


usually starts with an AUG codon, which corresponds to a methionine in all genetic codes (see footnote 4). Hence the first residue of a newly synthesized protein is methionine [Meinnel et al., 1993, Nakamoto, 2009]. However, this methionine is usually removed in a process called N-terminal methionine cleavage. For any given proteome, this occurs for between 50% and 70% of the proteins [Meinnel et al., 1993, Frottin et al., 2006]. The initiator methionine cleavage seems to be conserved in all organisms and the rules driving the cleavage seem to be similar in those organisms.

The enzyme catalyzing the initiator methionine cleavage is the methionine aminopeptidase (MAP). It processes the nascent protein during its biosynthesis, as soon as the first residues (∼15 residues) are assembled by the ribosome [Arfin and Bradshaw, 1988]. The process is thus cotranslational and also irreversible. Two classes of MAPs have been identified and both can be expressed in the same organism (i.e. in Eukaryota): MAP Ib and MAP IIb [Bradshaw et al., 1998]. Recently it has been suggested that the activity of initiator methionine cleavage is controlled rather than constitutive [Giglione et al., 2004].

While in Eubacteria both MAP genes are essential, their essentiality in Eukaryota is less clear. In S. cerevisiae the disruption of both genes is lethal, suggesting that cytoplasmic initiator methionine cleavage is essential in lower Eukaryota. In higher Eukaryota the essentiality of the initiator methionine cleavage is unknown, but data suggest that disruption of MAP2 genes causes an abnormal development phenotype in Drosophila and that MAP2 inhibition causes apoptosis in malignant human cells. In addition, blocking MAP2 activity with a specific inhibitor interrupts the G1 phase (see footnote 5). Recent data strongly suggest that initiator methionine cleavage is involved in controlling protein half-life. However, despite these discoveries, the role of initiator methionine cleavage is poorly understood [Giglione et al., 2004].

The proteomics of initiator methionine cleavage is well characterized and the substrate specificity of MAP has been studied. The rule is that cleavage occurs if the side chain of the amino acid following the initiator methionine is small enough (Ala, Cys, Pro, Ser, Thr and Val). This rule suggests that the process is correlated with the length and the gyration radius of the residue’s side chain [Frottin et al., 2006]. A more detailed analysis of the proteomics of the initiator methionine cleavage is provided in chapter 8 to illustrate the efficiency of the approach presented in this thesis.
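The rule stated above translates directly into a few lines of code. The sketch below is a naive implementation of that rule only, using the residues listed in the text; it is not the prediction method developed in this thesis, and the function names and example sequences are hypothetical.

```python
# Residues with a side chain small enough for cleavage, as listed above:
# Ala, Cys, Pro, Ser, Thr and Val (one-letter codes).
SMALL_SIDE_CHAINS = set("ACPSTV")


def initiator_methionine_cleaved(sequence):
    """Apply the simple rule: Met is removed if the second residue is small."""
    return (
        len(sequence) >= 2
        and sequence[0] == "M"
        and sequence[1] in SMALL_SIDE_CHAINS
    )


def mature_n_terminus(sequence):
    """Return the sequence after the (possible) initiator methionine cleavage."""
    return sequence[1:] if initiator_methionine_cleaved(sequence) else sequence


print(mature_n_terminus("MASK"))  # ASK  (Ala follows the methionine)
print(mature_n_terminus("MKLV"))  # MKLV (Lys is too large, Met is retained)
```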

1.4.2 Nα–terminal acetylation

The first Nα-acetylated protein was discovered in 1958 [Narita, 1958]. It has since been reported that acetylation is a very common modification, occurring on ≈50% of yeast proteins and ≈80% of human proteins [Brown and Roberts, 1976, Arnesen et al., 2009, Polevoda et al., 2009]. This irreversible process is one of the most common covalent modifications. The Nα-terminal acetylation is a cotranslational modification and occurs when between 25 and 70 residues of a nascent protein emerge from the ribosome [Strous et al., 1974, Pestana and Pitot, 1975]. The process is catalyzed by N-acetyltransferases (NAT) and consists in transferring the

4. Although non-AUG codons may be used to initiate translation in Eukaryota and in Prokaryota.

5. The growth 1/gap 1 phase is the first of the four phases of the cell cycle that takes place in eukaryotic cell division.


Figure 7: The N-terminal acetyltransferases (NATs) use Ac-CoA to catalyze the acetylation of α-amino groups on N-terminal residues. This reaction produces a CoA and an N-terminal acetylated polypeptide. Another case of peptide acetylation takes place on the ε-amino group of internal lysine side chains; it is catalyzed by histone acetyltransferases (HATs) and internal lysine acetyltransferases (KATs). The acetyl functional groups are drawn in orange.

acetyl group from acetyl-coenzyme A (Ac-CoA) to the α-amino group of the first amino acid of the protein [Gautschi et al., 2003, Pestana and Pitot, 1975, Polevoda and Sherman, 2003, Polevoda et al., 2008, 2009]. NATs catalyze this transfer and the reaction releases a coenzyme A (CoA) and an N-terminal acetylated polypeptide (figure 7).

Three NATs have been identified and are thought to be responsible for the majority of N-terminal acetylation events, accounting for ∼85% of acetylated proteins [Arnesen et al., 2009]. These NATs are named NatA, NatB and NatC. The rest is believed to be catalyzed by three other NATs, named NatD, NatE and NatF. Thus, at the time of writing, six NATs have been identified. However, the substrates of the NATs are still only partially known. The first three NATs (NatA, B, C) are well conserved from yeast to humans and are characterized based on their supposed substrate specificity [Polevoda et al., 1999]. NatA is the enzyme with the most supposed substrates (six) and the only one that acetylates a non-methionine residue at the start of the polypeptide (i.e. after initiator methionine cleavage) [Polevoda and Sherman, 2003]. NatB and NatC acetylate a first-residue methionine and may be associated with the ribosome. NatB seems to prefer polar substrates and NatC hydrophobic substrates.
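As a rough illustration of the distinction drawn above between the three major NATs, the following Python sketch maps a mature N-terminus to candidate enzymes. The function name is hypothetical, and the sketch relies only on whether the first residue is a methionine: it does not model the polar or hydrophobic preferences of NatB and NatC, it ignores NatD, NatE and NatF, and a candidate enzyme does not imply that the protein is actually acetylated.

```python
def candidate_major_nats(mature_sequence):
    """Return coarse candidates among NatA, NatB and NatC.

    Hypothetical helper: `mature_sequence` is the protein sequence
    after any initiator methionine cleavage, in one-letter codes.
    """
    seq = mature_sequence.strip().upper()
    if not seq:
        return []
    # Per the text, NatB and NatC act on a first-residue methionine.
    if seq[0] == "M":
        return ["NatB", "NatC"]
    # Per the text, NatA is the only one of the three acting on a
    # non-methionine first residue. This is a candidate only: the
    # six supposed NatA substrates are not enumerated here.
    return ["NatA"]
```

With this sketch, `candidate_major_nats("SGRK")` returns `["NatA"]` and `candidate_major_nats("MDEL")` returns `["NatB", "NatC"]`.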

The other three NATs (NatD, E, F) may not be as widespread among organisms as the first three. For example, NatD activity has been described in yeast, where it seems to acetylate only the N-termini of histones H2A and H4 (Ser-Gly-Gly and Ser-Gly-Arg), but no such activity has been observed in higher eukaryotes [Hole et al., 2011]. For NatE, only in vitro activity has been described in human, and evidence of in vivo activity is lacking [Evjenth et al., 2009].

NatF seems to be responsible for the increased occurrence of Nα-acetylated proteins from lower to higher eukaryotes. Because NatF has only been found in higher Eukaryota, its presence could explain the higher acetylation rate observed in those organisms. Indeed, NatF shares substrates with the previously described NATs [Van Damme et al., 2011].
