
Algorithm 5 Random metabolic network partitions

Require: A graph G = (V, E) with at least k vertices, and the number of desired clusters k
Ensure: k sets of vertices P1, P2, . . . , Pk whose union is V and whose pairwise intersections are empty

P1, P2, . . . , Pk ← ∅
for all v ∈ V do
    i ← integer drawn uniformly from [1, k]
    Pi ← Pi ∪ {v}
end for
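As an illustration, Algorithm 5 can be sketched in a few lines of Python. This is only a minimal sketch: the function name `random_partition`, the `seed` parameter and the reaction labels are assumptions made for the example and do not come from the original implementation. As in the pseudocode, some parts may remain empty.

```python
import random


def random_partition(vertices, k, seed=None):
    """Assign each vertex to one of k parts, uniformly at random."""
    vertices = list(vertices)
    if k < 1 or len(vertices) < k:
        raise ValueError("need k >= 1 and at least k vertices")
    rng = random.Random(seed)           # seed is only there for reproducibility
    parts = [set() for _ in range(k)]   # P_1, ..., P_k start empty
    for v in vertices:
        i = rng.randrange(k)            # uniform index in 0..k-1
        parts[i].add(v)                 # P_i <- P_i U {v}
    return parts


# Hypothetical vertex labels, used only for the example.
reactions = ["r1", "r2", "r3", "r4", "r5", "r6"]
parts = random_partition(reactions, k=3, seed=0)
```

Because every vertex is placed in exactly one part, the parts are pairwise disjoint and their union is V by construction.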

The comparison is illustrated in figure 59 and table 21. Although our metric does not produce a perfect classifier, it gives better results than the other approaches. With spectral clustering, the scores obtained are very close to those based on the Schuster splitting. We also tried other linkage criteria (Ward and single), and the results were lower than those obtained with UPGMA (figure 60). This experiment confirms two things: our approach still produces meaningful groups in a genome-scale network, and extreme pathways better explain the co-expression of genes. Indeed, the other approaches are only based on the vertices' degrees.

6. A classifier that produces points close to a true-positive rate of 0 and a false-positive rate of 1 is also a good classifier, but it predicts the opposite. To obtain a good classifier we only need to invert its outcome.
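The remark on inverted classifiers can be checked on a toy example. The sketch below uses made-up labels and a hypothetical helper `roc_point`; points are written as (false-positive rate, true-positive rate). It only illustrates that inverting the outcome mirrors the ROC point.

```python
def roc_point(truth, predicted):
    """Return (false-positive rate, true-positive rate) for 0/1 labels."""
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
    positives = sum(truth)
    negatives = len(truth) - positives
    return fp / negatives, tp / positives


# Made-up labels: 1 = intra-operonic pair, 0 = any other pair.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [0, 0, 0, 1, 1, 1, 1, 0]     # a classifier that is mostly wrong
inverted = [1 - p for p in predicted]    # the same classifier, outcome inverted

fpr, tpr = roc_point(truth, predicted)
inv_fpr, inv_tpr = roc_point(truth, inverted)
```

On this example the original classifier sits below the diagonal y = x, and the inverted one sits at the mirrored point (1 - fpr, 1 - tpr), above the diagonal.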

[Figure 59: ROC plot. Axes: true-positive rate (sensitivity) and false-positive rate (1 - specificity), each from 0.0 to 1.0. Curves: Hierarchical with EP, Schuster criterion, Gagneur criterion, Spectral clustering, Random clustering; reference line y = x.]

Figure 59: Receiver operating characteristic (ROC) curve for the detection of intra-operonic pairs in E. coli. The areas under the curve (AUC) are provided in table 21. The parameter that allows the production of different scores is the cut in the dendrogram.
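For readers unfamiliar with the AUC, the following sketch shows one common way to compute it from a list of ROC points, namely the trapezoidal rule. The function name and the points are hypothetical; in the setting of figure 59, each point would correspond to one cut in the dendrogram. The thesis does not state which integration rule was used, so this is an assumption.

```python
def auc_trapezoid(points):
    """Area under a ROC curve given (fpr, tpr) points, by the trapezoidal rule."""
    # Anchor the curve at the two trivial classifiers (0, 0) and (1, 1).
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical ROC points, one per dendrogram cut.
roc = [(0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
auc = auc_trapezoid(roc)
diagonal = auc_trapezoid([])   # only the anchors: the line y = x
```

With no intermediate point the curve reduces to the diagonal y = x, whose area is 0.5; a useful classifier should score above that value.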

[Figure 60: ROC plot. Axes: false-positive rate and true-positive rate, each from 0.0 to 1.0. Curves: UPGMA, Single, Ward.]

Figure 60: Receiver operating characteristic (ROC) curve for the detection of intra-operonic pairs in E. coli. Three different linkage criteria are compared: UPGMA, single and Ward. As illustrated in this plot, the UPGMA linkage is the best criterion.
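To make the difference between linkage criteria concrete, the sketch below computes the distance between two clusters under single linkage (closest pair) and under UPGMA (average over all pairs). The helper name and the distances are invented for the example. Ward's criterion is left out because it is usually defined from the within-cluster variance rather than from pairwise distances alone.

```python
def cluster_distance(a, b, dist, linkage="average"):
    """Distance between clusters a and b for a given linkage criterion."""
    pairwise = [dist[frozenset((u, v))] for u in a for v in b]
    if linkage == "single":      # distance of the closest pair
        return min(pairwise)
    if linkage == "average":     # UPGMA: mean over all cross-cluster pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError("unsupported linkage: " + linkage)


# Invented pairwise distances between four reactions.
dist = {
    frozenset(("r1", "r3")): 0.2,
    frozenset(("r1", "r4")): 0.8,
    frozenset(("r2", "r3")): 0.6,
    frozenset(("r2", "r4")): 1.0,
}
a, b = {"r1", "r2"}, {"r3", "r4"}
d_single = cluster_distance(a, b, dist, "single")
d_upgma = cluster_distance(a, b, dist, "average")
```

Single linkage only sees the closest pair, while UPGMA takes every cross-cluster pair into account, so the two criteria can merge clusters in a different order.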

4.7 Cluster analysis of the E. coli metabolism

4.7.2 Exploring gene pairs in E. coli

We are now interested in the pairs clustered together that do not form intra-operonic pairs. To narrow our search, we focus only on the clusters that group fully similar reactions (i.e. reactions at a distance of zero with our metric).

We then study some of these pairs and look for an already known biological link between the genes of the pair. The pairs found in the hierarchy are summarized in table 22.
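The filtering step described above can be sketched as follows. This is a simplified illustration under strong assumptions: the distance is supposed to be already available per gene pair (in the thesis it is defined on reactions), and the gene names, operon labels and distances are placeholders.

```python
from itertools import combinations


def zero_distance_pairs(distance, operon_of):
    """Gene pairs at distance 0 that do not belong to the same operon."""
    pairs = []
    for g1, g2 in combinations(sorted(operon_of), 2):
        d = distance.get(frozenset((g1, g2)))   # None if the pair was not scored
        same_operon = operon_of[g1] == operon_of[g2]
        if d == 0 and not same_operon:
            pairs.append((g1, g2))
    return pairs


# Placeholder data, not taken from the E. coli network.
operon_of = {"geneA": "op1", "geneB": "op1", "geneC": "op2", "geneD": "op3"}
distance = {
    frozenset(("geneA", "geneB")): 0.0,   # distance 0, but intra-operonic
    frozenset(("geneA", "geneC")): 0.0,   # distance 0, different operons
    frozenset(("geneC", "geneD")): 0.4,   # different operons, not fully similar
}
candidates = zero_distance_pairs(distance, operon_of)
```

Only the pair at distance zero whose genes lie in different operons is retained as a candidate for further study.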

First, we found two pairs that are related by a common transcription factor.

We do not take into account genes co-regulated by CRP, FNR or IHF, as those transcription factors regulate 551, 308 and 248 genes respectively (totaling 1107). This accounts for ≈26% of all genes directly regulated by a transcription factor in E. coli. The pair purA and purB is co-regulated by PurR, and the pair glmS and nagB is co-regulated by NagC.

The following pairs are also interesting: gpp with spoT, ppx with relA, gpp with relA, and ppx with spoT. These genes encode products that are part of the so-called stringent response. In E. coli, the stringent response is a sophisticated system that regulates several cell processes in response to harsh environments or conditions. It downregulates proliferative processes such as, among others, cell division, DNA replication, and ribosome, protein and nucleotide synthesis. It upregulates stress response processes, such as amino acid synthesis, fatty acid synthesis and antitoxin systems. The system is potentiated by ppGpp, a nucleotide that acts as an alarmone (or stress signal) in bacteria [Braeken et al., 2006, Kanjee et al., 2012]. This molecule is produced by relA and spoT. The first one produces it from GTP and responds to amino acid starvation:

$$\mathrm{ATP} + \mathrm{GTP} \rightleftharpoons \mathrm{AMP} + \mathrm{ppGpp} \tag{76}$$

The other one from GDP:

$$\mathrm{ppGpp} + \mathrm{H_2O} \rightleftharpoons \mathrm{GDP} + \mathrm{Diphosphate} \tag{77}$$

and responds, among others, to fatty acid starvation, carbon source starvation, phosphorus limitation and oxidative stress. It should also be noted that spoT also hydrolyzes alarmones when the amino acid balance is restored.

ppGpp activates relA and inhibits dksA. ppGpp also inhibits ppx and binds to dksA to produce a dksA-ppGpp complex. This complex in turn inhibits spoT. However, the mechanism leading to SpoT activation is not known [Braeken et al., 2006]. In addition, the nucleoside diphosphate kinase (ndk product) enhances the formation of GTP and ppGpp during starvation conditions in P. aeruginosa [van Delden et al., 2001]. As illustrated in figure 61, the reaction catalyzed by the ndk product is the next reaction added to the cluster.

Also, hpt and purA are grouped with the stringent response cluster when the partition is composed of 118 clusters. The cluster composition is {gpp, spoT, relA, ppx, gmk, ndk, pnp, deoD, gsk, hpt, purA, purB, pykA, pykF} (14 genes), and all these gene products are part of the purine metabolism (according to the KEGG pathway classification). Moreover, the purA and hpt products have affinity for (p)ppGpp. Those genes are also paired with the genes in table 23, whose products also interact with (p)ppGpp. The interaction of proteins with ppGpp provides a central regulatory framework for many different types of processes. Table 23 shows other gene products that have (p)ppGpp affinity and are paired with hpt. We recall that some genes can be found in different clusters because of the reverse reaction

Table 22: Discovered pairs in the hierarchy. This table only shows the pairs that have a distance of 0 in the hierarchy. The first two columns are the names of the genes that compose the pair. Dist (bp) is the distance between the genes on the chromosome counted in base pairs and Dist (gene) is the number of genes between the two genes of the pair. The “++/--” column indicates whether the genes are on the same strand (but not the strand itself). Both distances are only provided for information purposes. TF indicates whether a transcription factor regulates both genes. Dist (EP) and H. are respectively our distance and the height at which the pair is formed in the dendrogram.

Gene   Gene   Dist (bp)   Dist (gene)   ++/--   TF    Dist (EP)   H.
gpp    spoT   138237      118           no      no    0           0
ppx    relA   284761      256           no      no    0           0
gpp    relA   1049094     934           yes     no    0           0
ppx    spoT   1195744     1072          yes     no    0           0
gshA   gshB   275439      239           no      no    0           0
mgsA   aldA   461217      437           no      no    0           0
sufS   cysK   771886      696           no      no    0           0
cysE   cysK   1248361     1120          no      no    0           0
pgl    gnd    1300281     1214          no      no    0           0
tdk    surE   1575369     1424          no      no    0           0
ggt    pepN   2045451     1765          no      no    0           0
gloB   dld    1987403     1813          no      no    0           0
cdh    cdsA   726644      623           yes     no    0           0
sufS   cysM   778149      703           yes     no    0           0
tdk    ushA   786961      718           yes     no    0           0
ggt    pepA   897617      770           yes     no    0           0
ggt    pepB   928723      850           yes     no    0           0
trxB   nrdD   1110077     972           yes     no    0           0
tdk    yfbR   1114718     1006          yes     no    0           0
gloB   ldhA   1207072     1099          yes     no    0           0
cysE   cysM   1242158     1113          yes     no    0           0
ggt    pepD   1309088     1120          yes     no    0           0
tdk    yjjG   1325856     1192          yes     no    0           0
idi    ispH   1634316     1449          yes     no    0           0
thiL   rdgB   1980012     1741          yes     no    0           0
cdh    ynbB   2009193     1782          yes     no    0           0
purA   purB   1426283     1267          no      yes   0           0
glmS   nagB   1430795     1242          yes     yes   0           0


[Figure 61: top panel, dendrogram over the reactions R00336, R00429, R03409 and R00330 with a height axis from 0.0 to 0.4; bottom panel, network with the nodes ppGpp, pppGpp, GDP, GTP, dksA and dksA-ppGpp, the labels spoT (R00336), relA (R00429) and ppx/gpp (R03409), and a legend distinguishing Binding and Reaction.]

Figure 61: At the top of the figure, the part of the dendrogram corresponding to the genes of interest (i.e. gpp, spoT, ppx and relA). At the bottom, the network extracted according to the hierarchical clustering. Interestingly, this corresponds to the so-called stringent response. The plain arrows represent regulation and the dashed arrows the chemical reactions.

Table 23: Grouping in the same module for pairs of genes encoding proteins that interact with ppGpp. The number of clusters is the result of a cut in the dendrogram at the specific height.

Gene   Gene   Dist. (EP)   Height   Number of clusters
guaC   hpt    1            0.33     326
purA   hpt    1            0.83     174
guaC   gpt    0.92         0.73     122
guaC   apt    0.92         0.73     122

catalyzed by the product. Such cases are illustrated in the application to the human red blood cell in section 4.6.

Finally, we add that for the tightly related pairs sufS, cysK and cysM, the three gene products are PLP-dependent proteins. Pyridoxal 5'-phosphate (PLP) is the active form of vitamin B6, a coenzyme (see footnote 7). We also notice that other PLP-dependent proteins are paired with sufS in the hierarchy (table 24).

4.8 Conclusion

The goal of this chapter was to describe a new method for detecting biologically meaningful modules in a metabolic network that are unbiased (in the sense of Papin et al. [2004a]). We have proposed a new distance that measures the similarity between a pair of reactions based on the extreme pathways flowing through the reactions. However, because of the definition of the distance, its computation may be intractable. Thus we

7. A coenzyme enhances the catalytic activity of an enzyme (apoenzyme). Often the coenzyme is necessary for the catalytic activity.

Table 24: Grouping in the same module for pairs of genes that encode PLP-dependent proteins. The number of clusters is the result of a cut in the dendrogram at the specific height.

Gene   Gene   Dist. (EP)   Height   Number of clusters
sufS   cysK   0            0        416
sufS   cysM   0            0        416
sufS   gabT   0.88         0.6      152
sufS   gadB   0.88         0.72     179

proposed a Monte Carlo approach to approximate the distance and we have empirically shown the quality of the approximation. We also showed that the metric can be approximated even for large networks. Thus the approach is applicable to genome-scale metabolic networks.

We then applied our method to real biological networks in order to assess the interest of the proposed approach from a biological point of view. We have shown that our metric combined with a hierarchical clustering produces functional modules that have a biochemical interpretation. It also allowed us to provide a module-based description of the metabolism of the human red blood cell, in an undirected and in a directed form. We also tried to perturb the metabolism by inhibiting enzymes in silico, to see whether conclusions can be drawn from the discovered functional modules by comparing them to those of a normal erythrocyte. In the case of the PK and G6PDH deficiencies in human erythrocytes, we were able to draw conclusions that were supported by experimental evidence. We were able to notice a cell reorganization by inspecting the dendrogram and to study the modules with a simpler representation through meta-reactions. This allowed us to infer the function of the modules and to understand the consequences of the reorganization of the cell metabolism caused by the pathology. We are aware that, with the human erythrocyte model, similar conclusions can be derived by inspecting the network or by studying the extreme pathways. But because of the size of the extreme pathway set, the pathways are usually not inspected directly but rather processed before being analyzed. Hence we stress that we did not inspect the complete network or the extreme pathway set to draw our conclusions.

Indeed, one important feature of our approach is that it avoids the complete computation of the extreme pathways (or of any other generating set, like flux or elementary modes).

Finally, we wanted to show that our metric produces valid results on a genome-scale network. We therefore used the E. coli network and successfully tried to match the operonic structure. We also tried to analyze new tightly coupled gene pairs that are formed at the beginning of the hierarchy and are not part of an intra-operonic gene pair. Some of these pairs, and the clusters in which they were grouped, were of some biological significance (e.g. the stringent response system).

This method, like other methods based on metabolic networks, depends on the quality of the reconstructed network. If some interactions are missing, the metric may fail to produce several important pairs in the hierarchy. Indeed, we have seen with the human erythrocytes that the deletion of one reaction may strongly perturb the global organization of the hierarchy. Also, clustering approaches based on generating sets suffer from a major drawback compared to approaches based on topological features of the network, because several reactions are total outliers. They share no similarities with other reactions,


because no extreme pathways use them. Those reactions are usually removed during the pruning operation (or redundancy elimination). Thus those reactions are not clustered and are meaninglessly added to the top of the hierarchy. It is possible that the efficiency of the pruning is due to an incomplete reconstruction of the network or a bad setup of the exchange fluxes. From this point of view, our approach may be less convenient to use depending on what part of the metabolism is analyzed. Nevertheless, we are convinced that this metric combined with a hierarchical clustering method produces biological modules that are meaningful. Validating such an approach is very difficult, and in this work we have focused on E. coli because it has been extensively studied. This organism has served as a validation of our metric by allowing us to rediscover already known biological facts. We think we have empirically demonstrated the power of our metric and that it can be used to build new biological hypotheses, such as complex formation, binding to a specific molecule or co-regulation.

Part III

Sequence Analysis

5 Background in post-translational modifications classification

In this chapter we briefly recall what the problem of classification is and its potential issues. Computer scientists may skip this part as it is considered basic knowledge. The chapter concludes with a quick review of post-translational modifications and Nα-terminal acetylation prediction.

5.1 Classification

In machine learning, or more precisely in supervised learning, the problem of classification is to identify the relationship between instances and classes.

Each instance is a vector of m attributes and a class is a member of a finite set of cardinality k. For example, in a medical application, we can classify a patient into a group corresponding to their health risk level: low-risk, medium-risk and high-risk (the classes), according to a list of attributes that describe the patient instance: sex, height, weight, age, blood pressure (this example is inspired by Clarke et al. [2009]). We suppose that a function relating those attributes and the health risk level exists. Thus the classification problem is to find a function that relates the attributes of a patient to the risk level. In machine learning, we need a so-called training set, which is a set of instances and their corresponding classes. In the medical application example, a training set is composed of a set of patients along with their health risk levels. The learning algorithm then tries to find the function that relates the patients to their health risk levels. A goal of the algorithm is to also provide a good classification for instances that were not part of the training set. This is called generalization.

More formally, we suppose that there exists a function f that relates the m attributes to one of the k possible classes:

$$f : X^m \mapsto C \tag{78}$$

with $X = \{x_1, x_2, \ldots, x_n\}$ the instance set and $C = \{c_1, c_2, \ldots, c_k\}$ the class set. An instance $x_i$ is represented by the vector of attributes $x_i = \{x_{1i}, x_{2i}, \ldots, x_{mi}\}$. The learning algorithm uses the so-called training set, which is a set of examples of the form $\{x_i, c_j\}$ (an instance with its corresponding class), to find the function $g \approx f$:

$$g : X^m \mapsto C \tag{79}$$

The function g is sought such that it minimizes a given error function on the outputs $\hat{c}_i = g(x_i)$ over all training instances (i.e. for all i). Thus the error function indicates how well g fits the training set. For example, in the case of a binary classification we may use:

$$\sum_i |g(x_i) - c_i| \tag{80}$$

where the classes $c_i$ and the outcomes $g(x_i)$ can be replaced by zero or one.
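As a concrete reading of the error function (80), the sketch below counts the misclassified training instances of a binary classifier. The attribute vectors, the classes and the threshold rule `g` are invented for the illustration.

```python
def training_error(g, instances, classes):
    """Sum over all training instances of |g(x_i) - c_i|, with classes in {0, 1}."""
    return sum(abs(g(x) - c) for x, c in zip(instances, classes))


# Invented training set: (age, systolic blood pressure) -> 1 = high risk, 0 = low risk.
instances = [(35, 118), (62, 150), (48, 135), (71, 160)]
classes = [0, 1, 0, 1]


def g(x):
    """Hypothetical candidate function: high risk when the pressure is at least 130."""
    return 1 if x[1] >= 130 else 0


error = training_error(g, instances, classes)
```

With classes coded as zero or one, each term is 0 for a correct prediction and 1 for a wrong one, so the sum is simply the number of training errors (here the third patient is misclassified).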

The function g is usually taken from a space of possible functions G. Indeed, to approximate the function, it is required to make an ansatz on f, as we cannot search for g in the space of all possible functions. For example, G can be the space of linear functions or the set of all decision trees. We also speak of model selection. Usually it is wise to choose the simplest model (Occam's razor or lex parsimoniae). But other motivations can drive the choice of the model or the function. In this work, we select the model that produces the most readable classifier. By readable we mean that we are interested not only in the quality of the prediction but also in understanding how the model uses the attribute vector to produce good predictions. Later in the text we speak of white box models, in contrast to black box models, which are very difficult to read or understand.

A potential problem with supervised learning algorithms is overlearning. As stated before, the algorithm uses a training set to infer the function g. However, there are cases where the algorithm identifies a relationship that is only present in the training set. The algorithm may adjust to attribute values of the training data that are not part of the relationship between the instances and their classes. This occurs, for example, when the training set is too small. This phenomenon is called overfitting. It can be quantified with the loss function: the loss is well minimized on the training set but not on a separate validation set. A very simple example can intuitively illustrate this issue. Suppose that the instances are again patients and the classes their cardiovascular disease risk level (only two classes: low and high risk). Let's say that the following attributes are used to represent a patient: age, shirt color, weight and blood pressure. It is more or less known that age, weight and blood pressure are correlated with cardiovascular disease, but not the color of the patient's shirt (see footnote 1). Now let's assume that, by chance, almost all the high-risk patients in the training set are wearing red shirts. There is a non-null probability that the learning algorithm uses this random feature to predict patients with a high cardiovascular disease risk. Even if the function g produces a loss close to zero on the training set, there is a high probability that it will later be used or tested on a patient who is not wearing a red shirt. In this case the patient could be predicted as having a low risk of disease only because he is not wearing a red shirt.
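The red shirt example can be turned into a small numerical illustration. All patients, attribute values and the two candidate rules below are invented; the point is only to show a loss that is zero on the training set and large on a validation set.

```python
def error_rate(rule, patients):
    """Fraction of patients whose predicted risk differs from the recorded one."""
    wrong = sum(1 for attrs, risk in patients if rule(attrs) != risk)
    return wrong / len(patients)


# Invented data: (shirt color, systolic blood pressure) -> risk level.
training = [
    (("red", 160), "high"), (("red", 155), "high"), (("red", 170), "high"),
    (("blue", 115), "low"), (("green", 120), "low"), (("white", 110), "low"),
]
validation = [
    (("blue", 165), "high"), (("red", 112), "low"),
    (("green", 158), "high"), (("white", 118), "low"),
]


def shirt_rule(attrs):
    """Overfitted rule: relies on the chance association with red shirts."""
    return "high" if attrs[0] == "red" else "low"


def pressure_rule(attrs):
    """Rule based on an attribute assumed to be genuinely related to the risk."""
    return "high" if attrs[1] >= 140 else "low"


shirt_train = error_rate(shirt_rule, training)
shirt_valid = error_rate(shirt_rule, validation)
pressure_valid = error_rate(pressure_rule, validation)
```

Both rules fit the training set perfectly, but only the rule based on blood pressure keeps a low error on the validation patients; the gap between training and validation error is the signature of overfitting.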

It is possible to evaluate how a learner will generalize to an independent data set by using k-fold cross-validation. This process is widely used to evaluate the generalization of a classifier, allowing us to estimate the average generalization error of a machine learning method [Hastie et al., 2001]. A given dataset D is split into k subsets such that their union is D:

$$D = D_1 \cup D_2 \cup \ldots \cup D_k \tag{81}$$

and they are pairwise disjoint:

$$D_i \cap D_j = \emptyset \quad \text{for all } i \neq j. \tag{82}$$

We add that the $D_i$'s are stratified, that is to say each $D_i$ has the same class distribution. Then we build k folds. A fold $F_i$ is composed of a learning set $L_i$ and a test set $T_i$ such that:

$$F_i = \left( L_i = \bigcup_{j \neq i} D_j,\; T_i = D_i \right) \tag{83}$$

1. This assertion is provided without any reference and comes from general knowledge. However, I apologize to the reader if, in the near future, the color of the shirt turns out to be relevant in detecting people at high risk of cardiovascular disease.


and for each fold $F_i$ the algorithm is trained on $L_i$ and evaluated on $T_i$. Finally, the classification results of the folds $F_i$ are aggregated and provide a more accurate estimate of the model performance. Usual choices for k are 5 or 10.
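The construction of the folds can be sketched as follows. This is a minimal, deterministic illustration: the function name is an assumption, instances are identified by their index, and the subsets are built by dealing the instances of each class in turn, without the shuffling that is usually applied first in practice.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Return k folds as (learning_indices, test_indices) pairs."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    subsets = [[] for _ in range(k)]                  # D_1, ..., D_k
    for indices in by_class.values():
        for position, index in enumerate(indices):
            subsets[position % k].append(index)       # deal each class round-robin
    folds = []
    for i in range(k):
        test = sorted(subsets[i])                     # T_i = D_i
        learning = sorted(                            # L_i = union of D_j, j != i
            idx for j, d in enumerate(subsets) if j != i for idx in d
        )
        folds.append((learning, test))
    return folds


# Toy labels: six low-risk and three high-risk instances.
labels = ["low"] * 6 + ["high"] * 3
folds = stratified_folds(labels, k=3)
```

Each instance appears in exactly one test set, and every test set keeps the class proportions of the whole dataset (here two low-risk instances for one high-risk instance).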

Machine learning is a very broad and complex field, so it would be pointless to attempt a more complete introduction in this document. The interested and motivated reader is encouraged to read Murphy [2012] or Hastie et al. [2001] to deepen their knowledge of the field.

5.2 Prediction of Nα-terminal acetylation

Numerous predictors for post-translational modifications have been developed, based on different machine learning models. For example, artificial neural networks have been widely used to predict various post-translational modifications, such as phosphorylation [Blom et al., 1999], N-terminal myristoylation [Bologna et al., 2004] and C-mannosylation [Julenius, 2007]. More recently, random forests have been successfully used to predict post-translational

