Extreme pathways similarity approximation algorithm


Algorithm 4 Extreme pathways similarity approximation algorithm

Require: A bipartite graph G = (V, E), with n reactions identified by a unique index
Ensure: A matrix D containing all pairwise distances between chemical reactions

R_ij ← [ ], an empty list for each reaction pair {i, j}
while the stop criterion is not met do
    S ← a random metabolic subnetwork extracted from G
    P ← compute the extreme pathways matrix from S
    for all reaction pairs {i, j} in S do
        d_ij ← compute the extreme pathways similarity
        add d_ij to the list R_ij
    end for
end while
D ← a matrix filled with zeros ∈ R^(n×n)
D_ij ← compute the mean of R_ij for all i, j
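The loop of Algorithm 4 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the subnetwork sampler, the extreme pathways computation and the pair measure are passed in as functions, and the toy versions used here (`toy_sample`, `toy_pathways`, `jaccard_distance`, `TOY_PATHWAYS`) are hypothetical stand-ins, not the forest-fire-like sampling, the real extreme pathways computation or the metric defined earlier in the document. The stop criterion is assumed to be a fixed number of samples.

```python
import random
from itertools import combinations
from statistics import mean


def approximate_distances(reactions, sample_subnetwork, extreme_pathways,
                          pair_measure, n_samples, rng):
    """Average a per-pair measure over random subnetworks (Algorithm 4 sketch)."""
    index = {r: k for k, r in enumerate(reactions)}
    observed = {}  # R_ij: values collected for each unordered pair {i, j}
    for _ in range(n_samples):  # assumed stop criterion: fixed sample count
        sample = sample_subnetwork(rng)       # S
        pathways = extreme_pathways(sample)   # P
        for i, j in combinations(sorted(sample), 2):
            observed.setdefault((i, j), []).append(pair_measure(pathways, i, j))
    n = len(reactions)
    D = [[0.0] * n for _ in range(n)]  # pairs never sampled together stay at 0
    for (i, j), values in observed.items():
        D[index[i]][index[j]] = D[index[j]][index[i]] = mean(values)
    return D


# Hypothetical toy data: "pathways" are plain sets of reaction names.
REACTIONS = ["r1", "r2", "r3", "r4"]
TOY_PATHWAYS = [{"r1", "r2"}, {"r2", "r3"}, {"r1", "r2", "r3"}, {"r4"}]


def toy_sample(rng, size=3):
    # Uniform sampling, standing in for the subnetwork extraction.
    return set(rng.sample(REACTIONS, size))


def toy_pathways(sample):
    # Keep only the toy pathways fully contained in the sample.
    return [p for p in TOY_PATHWAYS if p <= sample]


def jaccard_distance(pathways, i, j):
    # Assumed measure: Jaccard distance between the pathway sets of i and j.
    with_i = {k for k, p in enumerate(pathways) if i in p}
    with_j = {k for k, p in enumerate(pathways) if j in p}
    union = with_i | with_j
    if not union:
        return 1.0
    return 1.0 - len(with_i & with_j) / len(union)


D = approximate_distances(REACTIONS, toy_sample, toy_pathways,
                          jaccard_distance, n_samples=200,
                          rng=random.Random(0))
```

Keeping the three steps as parameters mirrors the structure of the algorithm: only the sampler and the number of samples change when the two parameters studied below are scanned.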

of the rest of the system³. But this is something we wanted to avoid, as we wanted fast network sampling and distance estimation. Nevertheless, this is not an issue, as we have observed empirically that the approximation converges to the real matrix.

The quality of the estimation obtained with this approach is now demonstrated empirically. The approximation algorithm has two parameters, both coming from the metabolic subnetwork extraction step: the number and the size of the metabolic subnetworks. To select suitable parameters, several values were scanned and the quality of the approximation obtained with each of them was computed. Intuitively, the more subnetworks are sampled and the bigger they are, the better the approximation should be. Increasing the number of subnetworks increases the probability of picking a specific reaction, and thus of approximating its distance to other reactions. Increasing the size of the subnetworks increases the probability of picking a specific pair of reactions. Hence increasing both values should lead to a better approximation of the distance.
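The intuition above can be made concrete with a small calculation. The sketch below assumes, for simplicity only, that each subnetwork is a uniform random subset of the reactions; the sampling actually used in the document is not uniform, so the numbers are indicative at best. The function name `pair_sampling_probability` is hypothetical.

```python
def pair_sampling_probability(n_reactions, sample_size, n_samples):
    """Chance that a given pair of reactions shares at least one sample.

    Assumes uniform sampling without replacement and at least two reactions.
    """
    s = min(sample_size, n_reactions)
    # Probability that both reactions of the pair fall in one sample.
    p_single = s * (s - 1) / (n_reactions * (n_reactions - 1))
    # Probability that this happens in at least one of the samples.
    return 1.0 - (1.0 - p_single) ** n_samples


# Both parameters raise the probability: compare a few settings.
for size in (10, 50):
    for count in (1, 100):
        print(size, count, pair_sampling_probability(300, size, count))
```

Under this simplified model the single-sample probability grows roughly with the square of the sample size, whereas additional samples only compound it, which is one way to read why the sample size has such a strong effect.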

To experimentally measure the quality of the approximation of the extreme pathways distance and select the best pair of parameters, n metabolic subnetworks are extracted from the complete E. coli metabolic network with the forest-fire-like method (section 3.8.1). The reason is that computing the error of the approximation requires the exact distance, which cannot be obtained for a genome-scale reconstructed network because its extreme pathways cannot be computed. Then, for each of those subnetworks, the approximation algorithm is applied with the following sets of parameters: the subnetwork sizes 10, 25, 50, 100, 200 and 300, and the numbers of subnetworks used to compute the distance between pairs of reactions 1, 50, 100, 250, 500 and 1000.
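The error between an exact distance matrix and its approximation can be measured with the Frobenius norm of their difference. A minimal sketch, assuming both matrices are given as lists of rows of equal shape (the function name `frobenius_error` is hypothetical):

```python
import math


def frobenius_error(exact, approx):
    """Frobenius norm of (exact - approx) for two same-shaped matrices."""
    return math.sqrt(sum((e - a) ** 2
                         for row_e, row_a in zip(exact, approx)
                         for e, a in zip(row_e, row_a)))


exact = [[0.0, 1.0], [1.0, 0.0]]
approx = [[0.0, 0.5], [0.5, 0.0]]
error = frobenius_error(exact, approx)  # square root of 0.25 + 0.25
```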

The results in figure 46 show the Frobenius norm of the difference between the real distance matrix and the approximated matrix. They first show that if enough samples are drawn from the network as described in section 3.4, the approximated matrix converges to the exact matrix. They also show that the rate of convergence depends on the size of the sample. The effect of the sample size is better understood with figure 46b. If the sample contains almost all reactions, it obviously has a high probability of already containing all the extreme pathways, thus producing a very good approximation

3. It can also be solved by partitioning the rest of the system into several subsystems, solving each subsystem and then computing the proper constraints.

[Figure 46a plot. x-axis: number of subnetwork samples (1 to 1000); y-axis: Euclidean norm; one curve per sample size (10, 25, 50, 100, 200 and 300).]

(a) Measure of the Frobenius norm ‖D − D̃‖_F, with D being the reactions' extreme pathways distance matrix and D̃ being the computed approximation of D.

[Figure 46b plot. x-axis: number of subnetwork samples (1 to 1000); y-axis: fraction of common reactions; one curve per sample size (10, 25, 50, 100, 200 and 300).]

(b) Measure of the fraction of common reactions taken from all the subnetworks for each approximation versus the complete metabolic network.

Figure 46: Measure of the quality of the extreme pathways distance approximations. In this figure, a sample is a subnetwork extracted from the complete metabolic network and whose extreme pathways have been computed. Measures are made for the sample sizes 10, 25, 50, 100, 200 and 300 and for the numbers of samples 1, 50, 100, 250, 500 and 1000. The experiment was repeated 100 times and the plots show the mean.


Table 15: The percentage of network components represented by a subnetwork of size n relative to the network it is extracted from. Columns: Reactions, Compounds, Vertices.

Table 16: The standard deviations for each pair of parameters used in the subnetwork sampling. They correspond to the deviation of the values presented in figure 46a.

of the distance matrix. The plot in figure 46b also helps to understand why increasing the number of samples increases the quality of the approximation. Indeed, the more samples are drawn, the higher the probability that the samples cover all the network's reactions. Pair distances that had not yet been computed are then computed, which increases the quality of the estimated distances. The drawback of samples that are too big is that the extreme pathways computation of a sample may take too much time, or may even be intractable. The drawback of samples that are too small is that the approximation may take a long time to converge. Worse, it may never converge, as some pairs of reactions may never be captured in the same sample. The intuitive reason for the success of this approximation is that it is very unlikely to have a large number of extreme pathways that contain reactions far from each other in the network. Such long extreme pathways therefore account for only a very small similarity between such reactions. Most of the extreme pathways containing a given pair of reactions are probably confined to a small subsystem and are caught when the sample size is not too small.
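The coverage argument can be checked directly on a set of samples. The sketch below computes the fraction of reactions that appear in at least one sample, and the fraction of reaction pairs that appear together in at least one sample. Both function names are hypothetical, and samples are assumed to be given as sets of reaction identifiers.

```python
from itertools import combinations


def covered_reaction_fraction(samples, all_reactions):
    """Fraction of the network's reactions present in at least one sample."""
    universe = set(all_reactions)
    covered = set().union(*samples) & universe
    return len(covered) / len(universe)


def covered_pair_fraction(samples, all_reactions):
    """Fraction of reaction pairs that share at least one sample."""
    universe = sorted(set(all_reactions))
    seen = set()
    for sample in samples:
        seen.update(combinations(sorted(set(sample) & set(universe)), 2))
    total = len(universe) * (len(universe) - 1) // 2
    return len(seen) / total


samples = [{"a", "b"}, {"b", "c"}]
reaction_cover = covered_reaction_fraction(samples, ["a", "b", "c", "d"])
pair_cover = covered_pair_fraction(samples, ["a", "b", "c", "d"])
```

A pair that is never covered keeps no estimate at all, which is the situation described above where the approximation cannot converge with samples that are too small.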

In the next sections we assess the quality of our metric and its approximation by analyzing real metabolic networks. The first one is the human red blood cell (erythrocyte) metabolism. As this network is small, no approximation is used because we are able to compute all the extreme pathways. We then apply our metric to the E. coli metabolism and propose a quantitative way to assess the quality of our metric and its approximation.

4.6 Red blood cell functional modules analysis

Red blood cells (or erythrocytes) contain hemoglobin, which binds oxygen and allows its transport to all tissues. Red blood cells require energy to maintain the following functions: glycolysis, synthesis of metabolites such as glutathione, purine metabolism, protection from oxidative denaturation and maintenance of the electrolyte gradient between plasma and cytoplasm.

The human erythrocyte is anucleate once mature⁴. Hence it produces its energy only by anaerobic glucose degradation. These features allow the erythrocyte's metabolism to be represented by a simple network, illustrated in figure 47. This representation can appear more complex than those seen in the literature or in textbooks [Price et al., 2003, van Wijk and van Solinge, 2005, Kanehisa and Goto, 2000] because we decided not to artificially duplicate chemicals (e.g. F6P and GA3P). The purpose is to highlight the complexity of manually identifying modules or pathways. Indeed, in those schemas some chemicals are duplicated in order to clearly show the different modules.

This has the disadvantage of providing a view biased toward the historically discovered pathways. We add that in the Embden-Meyerhof pathway (EM pathway or EMP), the reactions catalyzed by HX and PFK each consume one mole of ATP and the reactions catalyzed by PGK and PK each produce two moles of ATP. In the pentose phosphate pathway (PPP), we also add that two molecules of nicotinamide adenine dinucleotide phosphate (NADP⁺) are reduced to NADPH by the reactions catalyzed by G6PDH and PDGH (figure 47). The abbreviations used are listed in table 18 (enzymes) and table 17 (chemicals). The red blood cell metabolism has already been analyzed in several publications to test computational analyses of metabolism, especially in the context of extreme pathways [Wiback and Palsson, 2002, Price et al., 2003, 2004].
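The ATP figures quoted above give a simple balance for the EM pathway. The short sketch below only adds up the stated values, under the assumption that all the flux passes through PGK (that is, none is diverted through the Rapoport-Luebering shunt); the variable names are illustrative.

```python
# ATP produced (+) or consumed (-) per mole of glucose, as stated in the text.
# Assumption: the whole flux goes through PGK, with no diversion to the shunt.
atp_per_glucose = {"HX": -1, "PFK": -1, "PGK": +2, "PK": +2}

net_atp = sum(atp_per_glucose.values())
print("net ATP per mole of glucose:", net_atp)
```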

We decompose this metabolic network to produce modules with the distance described above and then inspect them to assess the quality or correctness of the clustering. The goal is a qualitative evaluation of our metric.

First we try to detect functional modules by enzymes rather than by reactions. Indeed, an enzyme may catalyze a reaction in both directions. If a reaction is reversible, in the extreme pathways framework it is replaced by two irreversible reactions. Thus we have two possibilities to cluster reactions.

(i) We can consider that the set of extreme pathways of a reversible reaction is the union of the extreme pathways sets of the reaction and of its reverse. This can be seen as a clustering by the enzymes catalyzing the reactions. (ii) We can consider the reaction and its reverse as two distinct reactions. This variant takes the direction of the fluxes into account and produces a finer clustering. We then use those sets to compute the extreme pathways metric. The hierarchical clustering is shown in figure 48(a).
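Variant (i) amounts to merging the pathway sets of the two directions of each reversible reaction before computing the metric. A minimal sketch, assuming that each irreversible reaction is mapped to the set of indices of the extreme pathways it takes part in, and that the two directions of a reversible reaction carry the hypothetical suffixes `_fwd` and `_rev`:

```python
def merge_by_enzyme(pathways_of):
    """Union the pathway sets of both directions of each reversible reaction."""
    merged = {}
    for reaction, pathway_ids in pathways_of.items():
        if reaction.endswith(("_fwd", "_rev")):
            enzyme = reaction.rsplit("_", 1)[0]  # drop the direction suffix
        else:
            enzyme = reaction  # irreversible reaction: keep the name as is
        merged.setdefault(enzyme, set()).update(pathway_ids)
    return merged


# Illustrative pathway indices only, not taken from the erythrocyte network.
directed_sets = {"PGI_fwd": {1, 2}, "PGI_rev": {3}, "PFK": {1}}
undirected_sets = merge_by_enzyme(directed_sets)
```

Variant (ii) corresponds to using `directed_sets` unchanged, so the same metric and clustering code can serve both cases.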

Despite the fact that we are using an unbiased method to produce unbiased modules (in the sense of Papin et al. [2004a]), the hierarchical clustering produces groups that are part of the classic or historical pathways. The first observation is the split into two modules: the glucose degradation and the nucleotide metabolism. These modules are labeled ① and ② in figure 48. From this point and for the rest of the paragraph, the circled numbers refer to the modules labeled in figure 48. Then, if we focus on the glucose catabolism ①, the hierarchy shows several known groups: the entrance ④ to the PPP from the nucleotide metabolism, and the glucose catabolism and the Rapoport-Luebering shunt (① and ③). The splits yield the following modules: the Rapoport-Luebering shunt, the majority of the Embden-Meyerhof pathway (⑤ and ⑦), the oxidative phase of the PPP ⑥, and the non-oxidative phase of the PPP. We also see the three parts of the

4. The gas transport requires erythrocytes to pass through microcapillaries and therefore constrains their size.


Figure 47: Network representation of the erythrocyte's metabolism. The dashed border represents the boundaries of the system (the cell membrane). The gray boxed names are the enzymes that catalyze the reactions. The system uses glucose (Glu) as input, and 23DPG and HX as outputs. Adenine (ADE), inosine (INO), pyruvate (Pyr) and lactate (Lac) act as both inputs and outputs of the system. The currency metabolites are H, CO₂, H₂O, NH₃ and Pi. PGK and PK each produce two moles of ATP. G6PDH and PDGH reduce two molecules of NADP⁺ to NADPH.

Table 17: List of chemical abbreviations used in the human erythrocyte metabolic network.

Abbreviation   Chemical
13DPG          1,3-diphosphoglycerate
23DPG          2,3-diphosphoglycerate
2PG            2-phosphoglycerate
3PG            3-phosphoglycerate
6PGC           Phosphogluconate
6PGL           Phosphogluconolactone
ADE            Adenine
ADO            Adenosine
AMP            Adenosine monophosphate
DHAP           Dihydroxyacetone phosphate
E4P            Erythrose 4-phosphate
F6P            Fructose 6-phosphate
FDP            Fructose 1,6-bisphosphate
G6P            Glucose 6-phosphate
GA3P           Glyceraldehyde 3-phosphate
Glu            Glucose
HX             Hypoxanthine
IMP            Inosine monophosphate
INO            Inosine
Lac            Lactate
PEP            Phosphoenolpyruvate
PRPP           5-phosphoribosyl 1-pyrophosphate
Pyr            Pyruvate
R1P            Ribose 1-phosphate
R5P            Ribose 5-phosphate
RL5P           Ribulose 5-phosphate
S7P            Sedoheptulose 7-phosphate
X5P            Xylulose 5-phosphate


Table 18: List of enzyme abbreviations used in the human erythrocyte metabolic network.

Abbreviation   Enzyme
ADA            Adenosine deaminase
AdPRT          Adenine phosphoribosyl transferase
ALD            Aldolase
AMPase         Adenosine monophosphate phosphohydrolase
AMPDA          Adenosine monophosphate deaminase
DPGase         Diphosphoglycerate phosphatase
DPGM           Diphosphoglyceromutase
EN             Enolase
G6PDH          Glucose-6-phosphate dehydrogenase
HGPRT          Hypoxanthine guanine phosphoribosyl transferase
HX             Hexokinase
IMPase         Inosine monophosphatase
LDH            Lactate dehydrogenase
PDGH           6-phosphogluconate dehydrogenase
PFK            Phosphofructokinase
PGI            Phosphoglucoisomerase
PGK            Phosphoglycerate kinase
PGL            6-phosphogluconolactonase
PGM            Phosphoglyceromutase
PK             Pyruvate kinase
PNPase         Purine nucleoside phosphorylase
PRM            Phosphoribomutase
PRPPSyn        Phosphoribosyl pyrophosphate synthetase
R5PI           Ribose-5-phosphate isomerase
TA             Transaldolase
TKI            Transketolase
TKII           Transketolase
TPI            Triose phosphate isomerase
Xu5PE          Xylulose-5-phosphate epimerase

[Figure 48 panels. (a) Dendrogram of the enzymes, height axis from 0.0 to 0.8, with the modules labeled 1 to 8. (b) Functional view titled "Undirected functional modules in erythrocyte", with the modules PPP (oxy), PPP (non-oxy), Gly (low), Gly (mid), R/L Shunt and Nucleobases, and the chemicals Glu, G6P, 23DPG, Pyr, Lac, R5P, INO, ADE, HX and ADO.]

Figure 48: Undirected hierarchical clustering of the erythrocyte metabolism (a) and description of the metabolism through functional modules (b). The hierarchy shows a clear separation between the glucose catabolism and the nucleotide metabolism. Several modules can then be extracted from the glucose catabolism: a large part of the EM pathway, the PPP (even split into an oxidative and a non-oxidative phase) and the Rapoport-Luebering shunt. Those modules are labeled with circled numbers in the hierarchy and in the functional view of the cell.


Table 19: List of outliers in the hierarchy for the considered human red blood cell states: healthy, G6PDH deficiency and PK deficiency.

State              Outliers
Healthy            EN (rev), PNPase (rev), PGL (rev), PRM (rev), GAPDH (rev), TPI (rev), ADA, LDH (fwd), ALD (rev), PGM (rev), LDH (rev), PGK (rev)
G6PDH deficiency   LDH (fwd), PGI (rev), EN (rev), PNPase (rev), PDGH, PGL (fwd), ALD (rev), PGL (rev), PRM (rev), GAPDH (rev), PGM (rev), TPI (rev), ADA, LDH (rev)
PK deficiency      LDH (fwd), ADA, PGK (fwd), GAPDH (rev), PNPase (rev), TPI (rev), EN (fwd), ALD (rev), EN (rev), PGL (rev), LDH (rev), PGM (fwd), PGM (rev), PRM (rev)

EM pathway correspond to the three stages of glycolysis as presented in Berg et al. [2002]: (i) phosphorylation of the glucose (ATP consumption), (ii) cleavage of the six-carbon sugar into two three-carbon sugars and (iii) oxidation of the three-carbon fragments, which produces ATP. The Rapoport-Luebering shunt was also isolated as a module; it is an important path used by the cell to regulate oxygen affinity [MacDonald, 1977]. Figure 48(b) shows a modular view of the mature erythrocyte metabolism produced with the computed hierarchy (with a manually chosen cut).

However, the metric has the advantage of producing directed clusters since, for a reversible reaction, we have to cluster two irreversible reactions. This is an interesting feature of our metric, as it allows us to describe in which direction conversions are processed by a module. To illustrate this, we applied it to the red blood cell metabolism. It is very important that the reader realizes that the hypotheses drawn from the following experiments that are not backed by a published study remain purely in silico and have not been observed in vivo or in vitro. Moreover, they may require further investigation.

The first observation is that the tightly coupled reactions produce modules similar to those obtained in the undirected case (figure 48). However, the general organization of the modules is different. Indeed, the non-oxidative PPP module is split into two functional modules that appear in well-separated groups in the hierarchy. One balances a pool of F6P to R5P and RL5P, and the other balances a pool of RL5P and R5P to F6P (figure 49). This separation indicates that they are less likely to function together to sustain the steady state (i.e. working like a cycle).

To study the modules and obtain a simpler view of the metabolism, we applied the following procedure. First we produced a hierarchical clustering and removed the outliers from the hierarchy. The outliers for the cases considered are provided in table 19. For a given cut, we kept the clusters that contain more than two tightly coupled reactions. Each selected cluster is then considered a functional module. Finally, to simplify each module, we applied a network reduction with meta-reactions (see chapter 3, section 3.3). To do this we keep all the input fluxes, output fluxes and currency metabolites. Also, for every source or sink in the module (see chapter 3, section 3.2), a free exchange flux (i.e. input and output) is added for the corresponding chemical.
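The selection step of this procedure (drop the outliers, then keep the clusters with more than two reactions) can be sketched as below. The sketch assumes the hierarchy has already been cut, so that each reaction carries a cluster identifier, and filters the outliers afterwards rather than removing them from the hierarchy itself. The function name `select_modules` and the cluster assignment in the example are illustrative only; the network reduction with meta-reactions is not covered.

```python
from collections import defaultdict


def select_modules(cluster_of, outliers, min_size=3):
    """Keep clusters holding at least min_size non-outlier reactions."""
    groups = defaultdict(set)
    for reaction, cluster_id in cluster_of.items():
        if reaction not in outliers:  # outliers are ignored entirely
            groups[cluster_id].add(reaction)
    # "More than two reactions" corresponds to min_size=3.
    return [groups[c] for c in sorted(groups) if len(groups[c]) >= min_size]


# Illustrative cluster assignment, not the one of figure 50.
cluster_of = {"HK": 1, "PFK": 1, "PGI_fwd": 1,
              "ADA": 2,
              "TKI_fwd": 3, "TKII_fwd": 3}
modules = select_modules(cluster_of, outliers={"ADA"})
```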

With this procedure, the hierarchy shows two large functional modules in figure 50. One describes how glucose-6-phosphate (G6P) and hypoxanthine (HX) are balanced with ribulose-5-phosphate (RL5P) and adenosine

[Figure 49 diagrams: three copies of the non-oxidative PPP network (chemicals RL5P, R5P, X5P, S7P, E4P, GA3P and F6P; enzymes Xu5PE, TKI, TKII and TA), with the two directed modules labeled FWD and REV.]

Figure 49: Separation of the non-oxidative PPP undirected module into two directed modules. The forward module (FWD) describes the reactions that are likely to function together to balance a pool of RL5P and R5P to F6P. The reverse module (REV) does the opposite and balances a pool of F6P to RL5P and R5P.


[Figure 50 panels. (a) Dendrogram of the directed reactions (suffixes _fwd and _rew), height axis from 0.0 to 1.0, titled "Directed clustering of healthy erythrocyte". (b) and (c) Meta-reaction views of the two biggest modules, involving Glu, G6P, F6P, GA3P, Pyr, 23DPG, RL5P, R5P, PRPP, ADE, ADO, AMP, IMP, HX, INO, ATP, ADP, NADP⁺ and NADPH.]

Figure 50: The directed hierarchical clustering producing functional modules (a). Only the modules obtained from the two biggest clusters are represented with meta-reactions in (b) and (c). The colored box around a module highlights its corresponding subtree in the dendrogram. It should be noted that not all the reactions producing or consuming the components and chemicals are drawn in schemas (b) and (c). The arrows pointing to void products or coming from void substrates indicate whether a chemical can exit or enter the system.

(ADO) with inosine (INO) (figure 50(c)). This module suggests that one of the functions of the nucleotides is to help balance the concentration of the G6P that is not converted through the EM pathway. It should be noted that this module does not use the oxidative phase of the PPP, as no NADPH is produced when those chemicals are balanced.

The other module describes the catabolism of glucose (figure 50(b)). The glucose concentration is balanced by the production of pyruvate (Pyr) and 2,3-diphosphoglycerate (23DPG) in the Rapoport-Luebering shunt (R/L). It also yields the production of NADPH and a net production of two adenosine triphosphate (ATP) molecules. The module also shows that it balances the concentration of inosine (INO), when available in the cell environment, by producing hypoxanthine or by passing through the production of ribose-5-phosphate (R5P). The latter is then processed by the PPP to pyruvate or 23DPG. This is an interesting observation, as it functionally implies that when inosine is present in the cell environment, it is more likely to produce R5P and hypoxanthine and to function with the EM pathway instead of being converted back to adenosine. Hence the inosine is likely balanced with 23DPG and pyruvate.

Although we find it surprising that those reactions are more likely to be activated together, the action of inosine has received much attention in blood banking. It appears to improve erythrocyte preservation, as it allows the production of ATP without using an ATP to phosphorylate glucose [Gabrio et al., 1956, D’Alessandro et al., 2013]. To conclude this rough analysis, the dendrogram shows that the EM and the PP pathways are more coupled with each other than with the reactions in the nucleotide pathways. This is known to be the way glucose is processed under normal physiological circumstances. However, despite these encouraging observations, we were not able to observe the fact that under normal conditions the PPP only accounts for 8% of glucose metabolism, the rest being metabolized through the EM pathway. This is not reflected by the metric in the dendrogram. Only when the cell is under oxidative stress is 90% of the glucose metabolized through the PPP. This is because the EM pathway is tightly coupled with the PPP and, without specific regulation, the reactions of both pathways function together.

To assess the potential information or knowledge that our metric can provide, we apply the previous procedure to study the two most frequent enzymopathies in human erythrocytes: glucose-6-phosphate dehydrogenase deficiency and pyruvate kinase deficiency. To simulate an enzymopathy, we delete the vertices that correspond to the enzymatic reactions and then proceed to a module detection. Then we compare the modules of a healthy

From the document: Analysis of large biological data: metabolic network modularization and prediction of N-terminal acetylation (pages 108-130)