Identification of metabolic network models from incomplete high-throughput datasets

(1)

HAL Id: inria-00561999

https://hal.inria.fr/inria-00561999

Submitted on 2 Feb 2011

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

incomplete high-throughput datasets

Sara Berthoumieux, Matteo Brilli, Hidde de Jong, Kahn Daniel, Eugenio

Cinquemani

To cite this version:

Sara Berthoumieux, Matteo Brilli, Hidde de Jong, Kahn Daniel, Eugenio Cinquemani. Identification

of metabolic network models from incomplete high-throughput datasets. [Research Report] RR-7524,

INRIA. 2011. �inria-00561999�

(2)

a p p o r t

d e r e c h e r c h e

N

0

2

4

9 -6

3

9

9 IS

R

N

IN

R

IA

/R

R

--7

5

2

4 --F

R

+

E

N

G

Computational Biology and Bioinformatics

Identification of metabolic network models from

incomplete high-throughput datasets

Sara Berthoumieux — Matteo Brilli — Hidde de Jong — Daniel Kahn — Eugenio

Cinquemani

N° 7524

(3)

(4)

Centre de recherche INRIA Grenoble – Rhône-Alpes

Sara Berthoumieux

∗

,Matteo Brilli, Hidde de Jong ,DanielKahn , EugenioCinquemani

∗

Theme: ComputationalBiologyandBioinformati s

ComputationalS ien esforBiology,Medi ine andtheEnvironment Équipes-ProjetsIbisandBamboo

Rapportdere her he n°7524Janvier201132pages

Abstra t: High-throughputmeasurementte hniquesformetabolismandgene expression provide a wealth of information for the identi ation of metaboli network models. Yet, missing observations s attered overthe datasetrestri t thenumberofee tivelyavailabledatapointsandmake lassi alregression te h-niquesina urateorinappli able. Thoroughexploitationofthedataby identi- ationte hniquesthatexpli itly opewithmissingobservationsisthereforeof majorimportan e.

Wedevelopamaximum-likelihoodapproa hfortheestimation ofunknown parameters of metaboli network models that relies on the integration of sta-tisti al priorsto ompensateforthemissing data. Inthe ontextof thelinlog metaboli modeling framework,weimplementtheidenti ationmethod byan Expe tation Maximization (EM) algorithm and bya simplerdire t numeri al optimizationmethod. Weevaluateperforman eofourmethodsby omparison toexistingapproa hes,andshowthatourEMmethodprovidesthebestresults overavarietyofsimulateds enarios. WethenapplytheEMalgorithmtoareal problem, the identi ation of amodel for the Es heri hia oli entral arbon metabolism, basedon hallengingexperimental datafrom theliterature. This leadstopromisingresultsandallowsustohighlight riti alidenti ationissues. Key-words: metaboli network,parameterestimation,systemidenti ation, Expe tationMaximizationalgorithm,in ompletehigh-throughputdatasets, pro-teomi s,metabolomi s,kineti modeling,linlogmodels, entral arbonmetabolism ofEs heri hia oli

∗

(5)

données à haut-débit

Résumé : Les te hniques a tuelles de mesures à haut-débit pour le méta-bolisme et l'expression génique fournissent de très nombreuses données pour l'identi ation demodèlesderéseauxmétaboliques. Cependant, l'existen ede donnéesmanquantestoutaulongdujeudedonnéesrestreintlenombreee tif dedonnéesdisponiblesetrendleste hniques lassiquesderégressionimpré ises ouinappli ables. Ilestdon primordiald'utiliserdeste hniquesd'identi ation qui tiennent ompteexpli itementde es observationsmanquantes.

Nous développons une appro he basée sur le maximum de vraisemblan e pour l'estimation de paramètres de modèles de réseaux métaboliques. Elle repose sur l'intégration de distributions a priori pour ompenser les données manquantes. Nous implémentons ette méthode d'identi ation dansle adre delamodélisationmétabolismeparleformalismelinlogàl'aided'unalgorithme EM(Expe tation-Maximization)et d'uneméthodeplusdire te d'optimisation numérique. Nous évaluonsla performan e de nos méthodes en les omparant à des appro hes existantes et nous montrons que notre méthode EM produit lesmeilleursrésultatssurdiérentss énariossimulés. Nousappliquonsensuite l'algorithme EM à un problème réel, l'identi ation d'un modèle du métabo-lisme entraldu arbone hezEs heri hia oli, baséesur unimportant jeu de donnéesexpérimentalesdelalittérature. Lesrésultatsobtenussontprometteurs etnouspermettentdemettreenéviden e ertainsaspe ts ritiquesdesjeuxde donnéespourl'identi ation.

Mots- lés : réseaux métaboliques, estimation de paramètres, identi ation de systèmes, algorithme Expe tation Maximization, jeu in omplet dedonnées haut-débits, protéomique, métabolomique,modélisation inétique,modèle lin-log,métabolisme entral du arbone hezEs heri hia oli

(6)

1 Introdu tion

Tofurther ourunderstandingof the ellularpro essesshaping theresponse of mi robial ellsto hangesintheirenvironmentrequiresthestudyofthe intera -tionsbetweengeneexpressionandmetabolism. Inre entyearshigh-throughput datasets omprisingsimultaneousmeasurementsofmetabolism(uxes, metabo-lite on entrations) and gene expression (protein and mRNA on entrations) havebe omeavailable[Hardimanet al.,2007,Ishiietal.,2007℄. Thesedatasets provideari hstoreofinformationformodelingthedynami softhebio hemi al rea tion systemsunderlying ellular pro esses. In parti ular, they promiseto relievewhat is urrentlyabottlene kfor modeling in systemsbiology, obtain-ingreliableestimatesofparametervaluesin kineti models[Ashyraliyevet al., 2009,Crampin,2006℄.

Notwithstandingtheseexperimentaladvan es,parameterestimationremains aparti ularly hallengingproblem,amongotherthingsduetoin omplete knowl-edge of the mole ular me hanisms, noisy and partial observations, heteroge-neous experimental methods and onditions, and the large size of networks [Maru iet al., 2011℄. As a onsequen e, the models may not be identi-able, may not generalize to new situations due to overtting, and nonlinear rate fun tions may make them umbersome to analyze. This has led to the proposal of simplied kineti modeling frameworks, in luding linlog kineti s [VisserandHeijnen , 2003℄, loglin kineti s [HatzimanikatisandBailey, 1997℄, power-law kineti s [Savageau, 1976℄, and more re ently, onvenien e kineti s [LiebermeisterandKlipp,2006℄.

Linlogmodelsareaparti ularlyinteresting hoi eformodelingmetabolism [Heijnen , 2005, VisserandHeijnen , 2003℄. Simulation studies on the level of both individual enzymati rea tions [Heijnen, 2005℄ and metaboli networks [Costaetal., 2010, Hadli h etal., 2009, Visseret al., 2004℄ have shown that theyprovidereasonableapproximationsof lassi alenzymati ratelaws. More-over, a re ent genome-s ale linlog model of yeast metabolism, parametrized usingpreviously-publishedkineti models, hasbeenshownabletoidentify key stepsinthenetwork,thatis,rea tionsexertingmost ontroloverglu ose trans-portandbiomassprodu tion[Smallboneet al.,2010℄.

A majoradvantageof linlog models is that, when measurementsof uxes, enzyme on entrations, and metabolite on entrations are available, the pa-rameterestimationproblemredu estomultiplelinearregression. However,the performan eofregressionapproa hesqui klydegradesin thepresen eof miss-ing data,as isoften the asein high-throughputdatasets due toexperimental limitationsorinstrumentfailures.

Inorder to deal with this problem, weproposein this paper a maximum-likelihood method for the identi ation of linlog models of metabolism from in omplete datasets. The spe i ontributions of thepaperare twofold. On thetheoreti alside,wedevelopamethodfortheoptimizationofthelikelihood basedonExpe tationMaximization(EM)[Dempsteret al.,1977℄. Themethod is onstru tedforlinlogmodels,butismoregenerallyappli abletoother regres-sionproblems. Inparti ular,wederiveanalyti alexpressionsfortheexpe tation stepthatarewell-suitedfornumeri almaximization.Thisguaranteesthe appli- abilityoftheapproa hevenwhenmodelinglargenetworks. Weshowbymeans ofsimulationexperimentsonsyntheti datathatourapproa houtperformsboth regressionandareferen emethodfromstatisti alliteraturefordealingwith

(7)

in- omplete data, multiple imputation [Rubin , 1976, 1996℄. In omparison with earlierworkontreatingin ompletehigh-throughputdatasets[Obaetal.,2003, S holzet al.,2005℄,ouraimisnottoestimatethemissingvalues,butratherto improvetheestimationof themodelparametersfrom thein ompletedatasets. Thisisadierentproblemthatne essitatesthedevelopmentofnovelmethods. On the biologi al side, we apply the method to a linlog model of entral metabolism in E. oli, onsisting of some 16 variables. We estimate the 100 parametersofthismodelfromahigh-throughputdatasetpublishedinthe liter-ature[Ishiietal.,2007℄. Thedata onsistsofmeasurementsofmetaboli uxes andmetaboliteandenzymelevelsinglu ose-limited hemostatunder29 dier-ent onditionssu hawild-typestrainandsingle-genemutantstrainsordierent dilution rates. Standardlinear regressionis di ult to apply in this asedue tomissingdata,whi hdisqualiesfor7rea tionstoomanydatapoints,leaving adatasetofsize inferiortothenumberofparameterstoestimate. Appli ation of our approa h allows one to ompute reasonable estimates for most of the identiablemodelparametersevenwhenregressionisinappli able.

2 Parameter estimation in linlog models

Thedynami sofmetaboli networksaredes ribedbykineti modelshavingthe formofsystemsofordinarydierentialequations(ODEs)[Heinri handS huster, 1996℄:

˙x = N · v(x, u, e)

(1)

where

x ∈ R

n

+

denotesthe ve torof(nonnegative)internal metabolite on en-trations,

u ∈ R

p

+

theve torof externalmetabolite on entrations,

e ∈ R

m

+

the ve torofenzyme on entrations,and

v : R

n+p+m

+

→ R

m

theve torofrea tion ratefun tions.

N ∈ Z

n×m

isastoi hiometrymatrix.

Therea tionrates

v

arenonlinearandgenerally omplexfun tions of

x

,

u

, and

e

,withmanykineti parametersthataredi ulttoreliablyestimatefrom the data. This has motivated the use of approximate ratefun tions, like the linear-logarithmi (linlog) fun tions onsidered in this paper [Heijnen , 2005, VisserandHeijnen , 2003℄. The linlog approximation expresses the rea tion rates as proportional to the enzyme on entrations and to a linear fun tion ofthelogarithmsofinternalandexternalmetabolite on entrations. Thisleads totherateequation

v(x, u, e) =

diag

(e) · a + B

x

_{· ln(x) + B}

u

_{· ln(u)}

(2) wherethelogarithmofave tormeanstheve toroflogarithmsofitselements. For on iseness,inthesequelweshalldropthedependen eof

v

on

(x, u, e)

from thenotation. Anin-depthdis ussionoflinlogmodelsand omparisonwithother approximativeratefun tions anbefoundinthereviewby[Heijnen ,2005℄.

We are interested in the estimation of the (generally unknown) parame-ters

a ∈ R

m

,

B

x

_{∈ R}

m×n

and

B

u

_{∈ R}

m×p

from

q

experimental datapoints

(v

(k)

_{, x}

(k)

_{, u}

(k)

_{, e}

(k)

₎

,

k = 1, . . . , q

. That is, the data used for parameter esti-mation are parallel measurements of enzyme and metabolite levels as well as metaboli uxes. Noti ethatinpra ti erea tionratesaremeasuredat (quasi-)steadystate. Thedatapoints

(v

(k)

_{, x}

(k)

_{, u}

(k)

_{, e}

(k)

₎

(8)

experimental onditions,forinstan edierentdilutionratesin ontinuous ul-turesordierentmutantstrains. Forthepurposeofparameterestimation,itis onvenienttorewrite(2)in theform ofaregressionmodel:

v

e

T

= [1 ln(x)

T

_ln(u)

T

_{] ·}





a

T

(B

x

₎

T

(B

u

₎

T





(3)

where the ratio of two ve tors (here

v/e

) denotes elementwise division. Let us use an upperbar to denote the mean of aquantity overits

q

experimental observations,forinstan e:

v/e = (1/q)

P

q

k=1

v

(k)

/e

(k)

. Bythelinearityof (3), itholdsthat

v

e

= [1 ln(x)

T

ln(u)

T

] ·





a

T

(B

x

₎

T

(B

u

₎

T





(4)

Thisallows(3)tobereformulatedasamean-removedmodel

v

e

−

v

e

T

=

ln(x) − ln(x)

ln(u) − ln(u)

T

· (B

x

₎

T

(B

u

₎

T

(5)

andweobtainthefollowingparameterestimationproblem: Problem 1 Given thedata matri es







v

(1)

e

(1)

−

v

_e

T

. . .

v

(q)

e

(q)

−

v

_e

T







|

{z

}

,W

,







ln(x

(1)

_{) − ln(x)}

T

_ln(u

(1)

_{) − ln(u)}

T

. . . . . .

ln(x

(q)

_{) − ln(x)}

T

_ln(u

(q)

_{) − ln(u)}

T







|

{z

}

,Y

nd parameters

C

,

B

x

_B

u

T

solvingthe regressionproblem

W = Y · C + ε

(6)

where

ε ∈ R

q×m

ismeasurement noise on

W

.

Noti ethattheparameterve tor

a

nolongerappearsintheregression prob-lem, butanestimateof it anbere overedfrom estimatesof

C =

B

x

_B

u

T

bywayofEq.(4).

Inthe remainderof the paper, we makethe assumption that ea h olumn

ε

·i

of

ε

followsaGaussiandistribution,indi atedby

ε

·i

∼ N (0, Σ

ε

i

)

,where

Σ

ε

i

is diagonal,i.e. the measurementerrorsin dierentexperimentsare mutually un orrelated. Wefurtherassumethat

ε

.i

isindependentof

ε

.j

for

i 6= j

. Then, Problem 1 an be subdivided into

m

independent subproblems, one for ea h rea tion

i

:

w

·i

= Y · c

·i

+ ε

·i

,

(7) where

w

·i

and

c

·i

arethe

i

-th olumnsof

W

and

C

,respe tively.

Thevalues ofthe parametermatri es

B

x

and

B

u

admit an interesting bi-ologi al interpretation. Noti ethat one animmediately ndvalues

x

0 ∈ R

n

+

,

(9)

u

0 ∈ R

p

+

,

e

0 ∈ R

m

+

and

v

0 ∈ R

m

su h that

v

0 /e

0 = v/e

,

ln x

0 = ln(x)

, and

ln u

0 = ln(u)

. As a onsequen e,Eq. (5) anbe rearrangedinto the ommon relativeformulationoflinlogmodels,

v

e

=

diag

v

0 e

0 1 + B

x

0 ln

x

0 + B

u

0 ln

u

0

(8)

where

1

is an

m × 1

ve torofones,

(v

0 , x

0 , u

0 , e

0 )

isaso- alledreferen estate [Heijnen ,2005℄and

B

x

0

,

B

u

0

arematri esofelasti ity onstants,where

B

0 x

=

diag

e

0 v

0 · B

x

,

B

u

0 =

diag

e

0 v

0 · B

u

.

(9)

Theelasti ities,introdu edinthe ontextofMetaboli ControlAnalysis(MCA) [Heinri handS huster,1996℄,des ribethenormalizedlo alresponse ofthe re-a tionratesto hangesin metabolite on entrations. Theinterestis thatthey anthusbeimmediately omputedfrom thevaluesof

B

x

and

B

u

foundbythe solutionofProblem1,andtheequality

e

0 /v

0 = 1/(v/e)

.

Althoughstraightforwardin theory,solving theregressionproblem (6) en- ounterstwo ompli ationsinpra ti e.

1. Sin ethe measurementsare arried outat (quasi-)steadystate, wehave

N · v(x, u, e) = 0

. Thisintrodu esdependen iesamongthedataandthus redu estheinformation ontentofthedatamatrix

Y

,inthesensethat

Y

be omesrankde ient. Likein earlierwork[Nikerel etal.,2006℄, weuse standardapproa hestosolvethisproblem. WenotablyrelyonPrin ipal ComponentAnalysis(PCA)[Jollie,1986,Nikerelet al.,2006℄appliedto thedatamatrix

Y

toredu ethemodelorder,i.e.,thenumberof indepen-dentparameters,andensurewell-posednessoftheregressionproblem(see AppendixAforte hni aldetails). Insummary,weuseSingularValue De- omposition(SVD),ate hniquede omposingthedatamatrixinto domi-nantandmarginal omponentsa ordingtoavarian e riterion. Forthe purposeoflinearregression,this orrespondstode omposing the param-eterve torintoaredu ednumberof omponentsthat anbedetermined with ertainty based on the data, while the remaining omponents are poorlydetermined,i.e., theyare`nonidentiable',andaredis ardedwith negligibleee t on the t. We note in passing that the olumns of

W

and

Y

arezero-mean,animportantrequirementforthe orre tnessofthe outlinedanalysis.

2. The high-throughput datasets ontain a substantial amount of missing values,due toexperimentallimitationsorinstrumentfailures. If,forany givenrea tion,weonlyusedthedatapointsinwhi hallrelevantmetabolite on entrations,enzyme on entrations,andmetaboli uxesplayingarole inthat rea tionareavailable,then alargeamountofdatawouldhaveto be thrownaway. Inpra ti e,we would run theriskthat theparameters annotbereliablyidentied. Thedevelopmentofamethodthatis apable ofmaximallyexploitingtheinformation ontainedinin ompletedatasets forsolvingProblem 1is the main subje t of thepaper and willbe fully developedinthelaterse tions.

(10)

3 Likelihood-based identi ation of linlog models from missing data

Foreveryrea tion

i

, weare on ernedwiththeproblem ofestimating the un-known parameters

c

·i

of the model given in (7) in the ase where some en-tries of

Y

are unknown. We address the estimation problem by a likelihood-maximizationapproa h, whi h isknown to yield optimal(unbiased and mini-mum varian e)estimates for ourproblem setting in the ase where

Y

is fully known. As theproblemis identi al forallrea tions

i

, in theremainder ofthe se tionwewilldropforsimpli ityindex

·i

fromthenotation.

Let

I

bethesetofindi es(row, olumn) orrespondingtotheknownentries of

Y

,i.e.,

(j, k) ∈ I

ifandonlyif

Y

j,k

isavailable. Itis onvenienttointrodu e thede omposition

Y = ˘

Y + e

Y

,where

˘

Y

j,k

=

(

Y

j,k

,

if

(j, k) ∈ I ,

0,

otherwise

;

e

Y

j,k

=

(

0,

if

(j, k) ∈ I ,

Y

j,k

,

otherwise

.

Matrix

Y

˘

isfullydetermined: On emeasurements

y

˘

of

Y

˘

are olle ted,wetreat

˘

Y = ˘

y

as xed parameters of the regressionproblem. Matrix

Y

e

olle ts the unknownentriesof

Y

. Wemodelthesemissingdataasunobservedindependent randomvariables,whosepriordistributionsen odeourgeneri knowledgeabout them. Assumingthatthea-priori distributionsarenotknown(worst ase),we deneaGaussianpriorforea hquantitythatismissinginanexperimentbased on the measurements of the same quantity available from other experiments. Forevery

(j, k) 6∈ I

and

Y

j,k

= {Y

j

′

,k

: (j

′

_{, k) ∈ I }}

(assumednonempty),we let







e

Y

j,k

∼ N (µ

j,k

, σ

j,k

2 ),

µ

j,k

=

mean

(Y

j,k

),

σ

j,k

=

std

(Y

j,k

).

(10)

We annowformulatetheestimationproblem.

Problem 2 Given measurements

W = w

and

Y = ˘

˘

y

, ompute the estimate

ˆ

c = arg max

c

log L (c)

, with

L

(c) = f

W

|˘

y,c

(w)

, where, for any

c

,

f

W|˘

y,c

(·)

is the probability density fun tion of

W

given

Y = ˘

˘

y

orresponding to model (7) (10) .

Notethat

L

(c)

isalikelihoodfun tionforalinearmodelwithmissingdata, in the sense that it is dened with respe t to available data

Y

˘

only. One an express

L

(c)

bymarginalization,

log L (c) = log

Z

f

W

|˘

y,˜

y,c

(w)f

Y

e

|˘

y,c

(˜

y)d˜

y,

(11)

where

f

W

|˘

y,˜

y,c

(·)

is the standardlikelihood fun tion for model (7) given

Y =

˘

y

and

y

˜

, with

y

˜

varying over all possible values of

Y

e

, and

f

e

Y

|˘

y,c

= f

Y

e

|˘

y

is determined bythe prior(10). Theexpli itsolution tothe integralisreported inAppendixB. Adire tapproa htosolvingProblem2istomaximize(11)by numeri al optimization. However,the fun tion is not onvexin

c

, when e its

(11)

dire t optimization isproneto endupin lo al minimaand requiresthe useof globaloptimizationstrategies.

Alternatively,weproposetota kleProblem2byanExpe tation-Maximization (EM)algorithm[Dempsteret al.,1977℄. EMprovidesageneralmethodologyfor theoptimization ofalikelihoodfun tion with missinginformation. It isbased onaniterativetwo-steppro edurethat,fortheproblemathand,weimplement as follows. Let us dene the random variable

Z = e

Y · c

, so that model (7) be omes

W = ˘

Y · c + Z + ε

. Note that

Z ∼ N (µ

y,c

˘

, Σ

y,c

˘

)

, where for any given

c

,meanandvarian e anbederivedfrom(10). Let

ˆ

c

0

beaninitialguess of

c

. Atevery iteration

ℓ = 1, 2, 3, . . .

, ompute an updatedestimate

ˆ

c

ℓ

from theestimate

ˆ

c

ℓ−1

atthepreviousiterationbyperformingthefollowingEMsteps: Expe tation: Compute

Q(c|ˆc

ℓ−1

) = E[log f

Z,W|˘

y,c

(Z, w)|˘y, ˆc

ℓ−1

, w]

=

Z

log f

Z,W

|˘

y,c

(z, w)f

Z|˘

y,ˆ

c

ℓ−1

,w

(z)dz;

(12) Maximization: Solve

ˆ

c

ℓ

= arg max

c

Q(c|ˆc

ℓ−1

_).

(13) In (12),

f

Z,W

|˘

y,c

is the joint probability density fun tion of

Z

and

W

given

˘

Y = ˘

y

and

c

, while

f

Z|˘

y,ˆ

c

ℓ−1

,w

is the probability density fun tion of

Z

given

˘

Y = ˘

y

,

W = w

and

ˆ

c

ℓ−1

. Infa t,thisquantityisindependentof

w

and anbe omputedbyourdenition (10).

It anbeproventhat,at everyiteration

ℓ

, theEMalgorithmin reasesthe valueof

L

(ˆ

c

ℓ

₎

,andeventually onvergestoamaximumof

L

[Littleand Rubin, 2002℄. While this is notne essarily aglobal maximum, EM has proven ee -tivein many appli ations[Graham,2009,Hortonand Kleinman,2007℄. A key propertyisthat onvergen etoamaximumisa hievedevenif(13)isnotsolved exa tly: Itsu esthat

ˆ

c

ℓ

issu hthat

Q(ˆ

c

ℓ

_|ˆc

ℓ−1

_{) ≥ Q(ˆc}

ℓ−1

_|ˆc

ℓ−1

₎

,whi hiseasily a hieved even by alo al optimization algorithm. In pra ti e, we anuse the expli itexpressionof

L

inProblem2forstoppingtheiterations,e.g.,whenthe relativeimprovementon

L

fallsbelowaspe iedthreshold

τ > 0

:

|L (ˆc

ℓ

) − L (ˆc

ℓ−1

)|/|L (ˆc

ℓ

)| ≤ τ.

To ompletethe implementation ofthealgorithm, onemust express

Q(c|ˆc

ℓ−1

₎

in aform onvenientformaximization. As explainedin Appendix B, wewere abletoexpress(12)asanexpli itfun tion of

c

foranygiven

ˆ

c

ℓ−1

. In ompa t form:

Q(c|ˆc

ℓ−1

) ∝ −KL(f

c

||f

ˆ

c

ℓ−1

) − H(f

ˆ

c

ℓ−1

) + log(κ

f

c

),

(14) where

f

c

standsforaGaussiandistributionwithvarian e

Σ

f

c

= [Σ

−1

ε

+ Σ

−1

˘

y,c

]

−1

and mean

µ

f

c

= Σ

f

c

· (Σ

−1

ε

· (w − ˘y · c) + Σ

−1

˘

y,c

· µ

y,c

˘

)

,

κ

f

c

is a fun tion de-pendingon

c

via

µ

f

c

and

Σ

f

c

,and theproportionalityfa torthat wedropped (indi ated bythe presen e of

∝

in pla e of

=

)depends on

ˆ

c

ℓ−1

but noton

c

. Finally,

KL(·||·)

and

H(·)

are theKullba k-Leiblerdistan e between distribu-tions and the entropy of a distribution, respe tively, for whi h, in the Gaus-sian ase at hand, expli it formulas are available [CoverandThomas, 2006,

(12)

StoorvogelandvanS huppen,1996℄. Aslightte hni al ompli a yisneededin ase

Σ

y,c

˘

issingular(seeAppendix Bforallthemathemati aldetails).

Theavailability of the losed-form expression (14)allows us to implement EM e iently, i.e., with an expli itmaximization problem that is solved nu-meri allyat alliterations. On etheparameterestimatesare obtained,several methods from the literature an be used to assess the a ura yof the results by inferring onden e intervals. Examples are randomized methods su h as bootstrapping[Manly, 1997℄andtheprolelikelihoodmethod by[Raueet al., 2009℄. Thismethodderives onden eintervalsusingathresholdonafun tion alled theprolelikelihood. Inourappli ation, this isobtainedseparately for ea h parameter

c

j

by re-maximization of (11) with respe t to all parameters

c

k6=j

,forallvalues

c

j

inaneighborhoodof

c

ˆ

j

.

4 Validation on syntheti data

Beforeapplying theEM algorithmtoa tual biologi alidenti ation problems, wetesttheperforman e ofthemethod onsimulateddata. Forthis purpose,a syntheti modelhasbeendeveloped,asimplied variantof thelinlogmodelof E. oli entralmetabolismstudiedinSe .5below. Themodel,intheform(2), ontains17variables,representinginternalandexternalmetabolitesinvolvedin 25rea tions,and78parameters(seeAppendixCforthemodelequations). We generatedatamatri es

Y

from thismodelbymeansofsimulation,fordierent per entagesofmissingdataandexperimentalnoise. Usingthemodelstru ture andthesimulateddata,wesolveProblem1forea hrea tionindependently,as des ribedin Se .3.

Inorder to assess the added valueof our spe i implementation of likeli-hood optimization, we rst ompare theperforman e of the EM algorithm of Se .3withthedire t maximizationoftheloglikelihood(11)implementedwith ageneral-purposeMatlab optimizationroutine. Thismethod willbereferred toasMaxLLin thesequel.

Se ond,we omparethelikelihood-basedidenti ationapproa heswith stan-dardmethods,notablylinearregression(referredtoasRg)andthe ommonly-used multipleimputation (MI)method[Rubin, 1976,1996℄. Regressionis per-formedbasedonfulldatasetsonly,i.e.,itdoesnot onsideran experimentally-determined datapoint

(v

(k)

_{, x}

(k)

_{, u}

(k)

_{, e}

(k)

₎

when at least one of the measure-mentsismissing. MI isbasedonimputation ofmissingdatabyrandomdraws of themissing values, i.e., non-zeroelementsof

Y

˜

, from thea-priori distribu-tion dened in (10). Both methods thus exploit onlypart of the information ontainedinanin ompletedatasetandprovidealowerlimitforquantifyingthe performan eofthemethodsproposedin Se .3.

Third, we ompare the results of EM with the least-squares identi ation of the model on omplete datasets (a method referred to as RgF). Though inappli abletorealdatawithmissingmeasurements,themethodisstatisti ally optimal. Hen e,itprovidesus anupperperforman e bound that anbe used to assesstherole ofmissing datain performan edegradation,separatelyfrom theroleofnoise.

Mostof thehigh-throughput datasets available in the literaturehavebeen obtainedwhen metabolismis at (quasi-)steady-state. Inorder to mimi avail-ableexperimentaldataas loselyaspossible,simulateddataobtainedfromthe

(13)

syntheti model should therefore be steady-state data. We generated steady statesof(1)-(2) ,andre ordedthe orrespondingmetabolite on entrationsand metaboli ux valuesfor 30 dierent onditions, ea h onsisting of a random hangeintheenzyme on entrationwithrespe ttoareferen evalue.

We omparedperforman eofthevemethodsdes ribedabove(EM,MaxLL, MI,Rg,RgF)ondatasetswithdierentamountsofmissingdata(40%and75%) for themetabolite on entrations andnoise levels(10% and 20%) for

w

. The onlydieren ewiththedatasetusedforthereferen emethod RgFisthat the latter has no missing data. A noiselevelof 10%means that the distribution usedto generatethenoisehasastandarddeviationequalto10%ofthevalues in

w

. Theper entagesof missingdata inthesimulationstudy are omparable tothoseobservedinpra ti e(Se .5and[Ishiiet al.,2007℄). Foreverydierent ombinationofmissingdataper entageandnoiselevel,adatasetwasgenerated byhomogeneouslydistributingmissingdataamong olumnsof

Y

,theindi esfor ea h olumn being hosenat random. Foreverysimulateds enario,randomly generatednoisewasaddedto

w

inthedataset.

For all the above s enarios, identi ation of ea h rea tion was addressed separately, in a ordan e with the dis ussion of Se . 2. For every rea tion, werst tested the identiability of the syntheti linlog model by PCA of the full datamatrixY. Inoursimulation,9rea tionsoutof the25 omposing the modelweredete tedashavingnonidentiableparameters. Forthoserea tions, identi ationofaredu edordermodel

w = Y

∗

· c

∗

+ ε

(15)

wasperformedin pla e oftheidenti ationof theoriginalmodel.

Y

∗

_{∈ R}

q×r

, with

r ≤ n+p

,isaredu ed-orderdatamatrixobtainedbylineartransformation of

Y

and

c

∗

_{∈ R}

r

isaparameterve tor,smallerthan

c

,thatis`identiable',in thesensethat itiswelldeterminedbythedata(seeAppendixC).

WeimplementedthedierentparameterestimationalgorithmsinMatlab, usingthels ovfun tionfortheregression-basedmethodsandfminsear hfor globaloptimizationinMaxLLandthemaximizationstepinEM.BothEMand MaxLLrequireaninitialguessoftheparameterstobespe ied. Weproposed 10 dierent initial parameterve tors, in luding the estimation obtained with the baseline method Rg where available. In order to draw statisti s for the estimation performan e, ea hofthevealgorithms wasappliedon100Monte Carlorepetitionsoftheidenti ationproblem. The ompleteperforman etest over all methods, onditions and 100 repetitions took about 7 h 40 min in Matlab 7.4.0onaLinuxPCworkstation(1862MhZ,2GbRAM).

Themostinformativeresultsfromallidenti ationmethodsaresummarized byboxplotsoftheratiooftheestimatedparametervalues

c

overthereferen e parametervalues

c

ref

usedto simulate the data, the losertheratio to

1

, the better the estimates. Ensemble statisti s are drawn for all parameters orre-sponding to the samerea tion. Fig. 1 is dedi ated to the s enario with 40% missing data and 10%noise,whereas Fig.2reportson 75%missing data and 20%noise. Completeresultsforallrea tionsunderall onditions anbefound in AppendixC.

Sin e the individual rea tions of the model involveonly a small subset of metabolites,ea hofthe

m

identi ationsubproblems onsistsoftheestimation ofalimitednumberofparameters,mostly2or3. Forthe asewith40%missing

(14)

Figure 1: Statisti s of estimated parameter values for datasets with 40% of missing dataand 10%noise. Theresultsareshownasboxplotsof theratioof theestimated parametervalues

c

andreferen e parametervalues

c

ref

. Statis-ti shavebeen omputedforea hofthe5methodsfrom100datasets. Forea h method,theredlinedisplaysthemedianandthelowerandupperbluelines rep-resentthelowerandupperquartile values,respe tively. Whiskersextendfrom ea hendoftheboxtothemostextremevalueswithin1.5timesthe interquar-tile range from the ends of the box and outliers are shown with red rosses. ThetestedalgorithmsareExpe tationMaximization(EM),dire toptimization of loglikelihood (MaxLL), multiple imputation (MI), regressiononin omplete datasets (Rg) andregressionon ompletedatasets (RgF). (AD)Boxplotsfor rea tions3,4,11and 18of thenetwork,respe tively.

(15)

Figure 2: Statisti s of estimated parameter values for datasets with 75% of missingdataand20%noise.Thegraphi alnotationsarethesameasforFig. 1. (AF)Boxplotsforrea tions3,13,17,22,19and22ofthenetwork,respe tively.

(16)

data,Rg anthereforebeperformedinallrunsforeveryrea tionofthemodel. On the ontrary, with

75%

missing data, regression annot be applied to 6 rea tionswhi hisapparentfromtheabsen eoftheRgstatisti sfor2rea tions in Fig.2.

In omparisonwiththeother methods, multiple imputation(MI)givesthe worstresults(largest bias)in 3outofthe4rea tionsshownin Fig. 1, andin 5out of6rea tionsin Fig. 2. In rea tions11 of Fig. 1and22 of Fig. 2,the relativelysmallbiasesarea ompaniedbyanestimationun ertaintywiderthan forEMandMaxLL.This ouldbeexplainedbyarestri teduseofinformation ontainedinthedistributionofmissingdata. Indeed,MIonly onsidersrandom draws from the distribution while EM and MaxLL are based on all possible valuestakenbymissingdata throughintegrationofthedistribution.

Analysis of Fig. 1 reveals that, for 40% missing data and 10% noise, the performan e of EM and MaxLL is almost identi al and similar to that of re-gression(RgandRgF),withlimitedimprovementsonRg,i.e., slightlysmaller variability. Insome ases, su h asfor rea tions 11 and 18, their performan e approa hestheoptimal, unattainable bound providedby RgF, i.e., theyhave similarbiasandvariability.

Performan e improvements of likelihood-based methods over Rg be ome more signi ant when identi ation is performed on the dataset with higher per entage of missing data and larger noise. Fig. 2A-D show results for re-a tions where Rgwasappli able. Both EM and MaxLL substantiallyredu e estimation variabilityin rea tions3,17 and 22. Atthesametime, due to the largeramountofmissingdata,performan elosswithrespe ttoRgFismore sig-ni ant. Turnedanotherway,thisshowsthea ura ythat ouldbere overed werealldatasets omplete.

Fig. 2E-F showtheresultswhen Rgfailsto produ eestimatesand annot beusedtoinitializeEMandMaxLLoptimization. Still,EMprovidesestimates of therightorder of magnitudeand, for the ase ofFig. 2E, of therightsign in at least 75%of the runs (box entirely above

0

), while the median has the rightsignandisreasonably loseto1. Theestimationofthesignprovidedby MaxLLislessreliable(box rossing

0

).

Overall,we on ludethattheEM-basedapproa hprovidesthemosta urate estimatesunder allsimulated onditions. Wewill thereforeapply thismethod to theidenti ationofthelinlogmodel ofana tualmetaboli network froma published high-throughputdataset.

5 Appli ation to entral metabolism in E. oli Thenetworkof entral arbonmetabolisminEs heri hia oli hasbeenstudied foralongtimefromdierentperspe tives,whi hmakesitanidealmodelsystem forourpurpose. Aratherpre iseideaofthestru tureofthenetworkexists, sev-eral kineti modelsof thenetworkdynami s areavailable ([Bettenbro ket al., 2005,Kotteetal.,2010℄andreferen estherein),andre entlyahigh-throughput dataset ontaining the required information for solving Problem 1 has been published [Ishiietal., 2007℄. The network we onsider here gathersenzymes, metabolites and rea tions that make up the bulk of E. oli entral arbon metabolism, in luding gly olysis, the pentose-phosphate pathway, the

(17)

tri ar-boxyli a id y leandanapleroti rea tionssu hasglyoxylateshuntand PEP- arboxylase(Fig.3).

Thedatasetused for identi ation ofthis network wasobtained by exper-iments with 24 single-gene disruptants that were grown at a xed dilution rate of 0.2 h

−1

in glu ose-limited hemostat and on wild-type ells at 5 dif-ferentdilution rates[Ishiietal., 2007℄. Theauthors olle teddata using mul-tiple high-throughput te hniques, in parti ular DNA mi roarray analysis and two-dimensional dierential gel ele trophoresis (2D-DIGE) for genes and pro-teins, apillary ele trophoresistime-of-ightmass spe trometry (CE-TOFMS) formetabolites,andmetaboli uxanalysis.Theythusobtainedadataset on-sistingofmetabolite on entrations,mRNAandprotein on entrationsforthe enzymes, and metaboli uxes under 29 dierent experimental onditions. A largenumber ofdierentmetabolites were measured in theexperiments, with missing data in varying amounts, from 0to 80% of the observations, 28%on averageforthemetabolites onsideredbelow.

Fromtherea tionslistedin[Ishiiet al.,2007℄, wehave onstru tedalinlog modeloftheform(2),with

n = 16

internalmetabolites,

q = 7

external metabo-litesandmeasured ofa tors,and

m = 31

rea tions(seeAppendixD). Ea hof therea tionsis atalyzedbyasingleenzyme,whi hmaya tuallystandfor sev-eralenzymesinthe aseofisoenzymes,enzyme omplexes,orlumpedrea tions. Rea tionshavebeensimpliedorlumpedtogetherwhenasharedmetabolitehas not beenmeasured, whi h pre ludesestimation ofthe orresponding elements in the parameter matri es

B

x

and

B

u

. In omparison with an earlier linlog model of E. oli entral arbon metabolism [Visseretal., 2004℄, weextended thes opeto in lude thetri arboxyli a id y leandthe glyoxylateshunt,but dueto theabove-mentionedsimpli ationsourmodelis more oarse-grained.

Anidentiabilityanalysis wasperformedbyseveralroundsofmissing data imputationusingthea-priori distributiondened inEq.(10)andPCA,whi h ledinea h asetothesameresult: 7outof31rea tionsweredete tedashaving nonidentiableparameters. Forthoserea tions,themodelhasbeenredu edas des ribedinEq.(15)usingadatamatrix

Y

ompletedbythemeans

µ

j,k

ofthe a-priori distributions. Foreveryindividual rea tion,the redu edmodel hasa parameterve tor

c

∗

thatisnowentirelyidentiable.

Apartfromthedistributionofthea-priori missingdata, givenbyEq.(10), appli ationoftheEMrequiresinformationaboutthedistributionof

ε

,theerror on theratios of uxesand enzyme on entrations. The Ishiidataset provides several repli ameasurementsforareferen eexperimental ondition: wild-type ellsgrownonglu ose-limited hemostatwithadilutionrateof0.2h

−1

. These data were used for the omputation of the varian e of

ε

. Running the EM methodonthemodelandthedatatookabout220susingtheimplementationof Se .4.Inordertoassessthea ura yoftheestimated

B

x

and

B

u

,we omputed forea hparametera95% onden einterval,bymeansoftheprolelikelihood method outlinedin Se .3. The omputationof the onden e intervalsforall parametersrequiredabout23min.

Contraryto thesimulationstudies reported in Se .4, areferen e or`real' model for the evaluation of the results doesnot exist in this ase. However, a-priori bio hemi alknowledgeonthesignsof theelasti itiesisavailable,i.e., elasti ities are positive for substrates and negative for produ ts. This infor-mation anbe omparedwith theestimated signsoftheelasti ities, andtheir onden eintervals, omputedfrom theparametermatri esusingtherelations

(18)

Figure 3: S heme of Es heri hia oli entral arbon metabolism. This map, showing metabolites (bold fonts) and genes (itali ) is adapted from [Ishiiet al., 2007℄. Abbreviations of metabolites are glu ose (Gl ), glu ose 6-phosphate(G6P),fru tose6-phosphate(F6P),fru tose1-6-biphosphate(FBP), dihydroxya etone phosphate (DHAP), gly eraldehyde 3-phosphate (G3P), 3-phosphogly erate (3PG), phosphoenolpyruvate (PEP), pyruvate (Pyr), 6-phosphoglu onate(6PG),2-keto-3-deoxy-6-phospho-glu onate(2KDPG), ribu-lose 5-phosphate (Ru5P), ribose 5-phosphate (R5P), xylulose 5-phosphate (X5P), sedoheptulose 7-phosphate (S7P), erythrose 4-phosphate (E4P), ox-aloa etate (OAA), itrate (Cit), iso itrate (IsoCit), 2-keto-glutarate (2KG), su inate-CoA(Su oA), su inate(Su ), fumarate(Fum), malate(Mal), gly-oxylate(Glyox),a etyl-CoA(A oA),a etylphosphate(A p)anda etate(A e). Cofa torsimpa tingtherea tionsarenotshown. Thegenenamesareseparated by a ommain the ase ofisoenzymes, bya olon forenzyme omplexes, and by a semi olon when the enzymes atalyze rea tions that have been lumped togetherin themodel.

(19)

in Eq. (9) . The results are shown in Table 1. Similar unshown results are obtainedbymeansoftheMaxLLmethod.

WeobservethattheEMmethodobtainsestimatesforallrea tions,in luding the7 aseswheretheinsu ientamountofdatamaderegressionnotappli able. However, 27 of the 100 non-zero elasti ities of the model are not identiable from this dataset. Moreover,outoftheremaining 73elasti ityestimates,half of them havesigns that are not statisti ally signi ant, in the sense that the 95% onden e interval straddles0. This ismost likely due tothe magnitude of noise in metabolite on entrations being omparable to the magnitude of relevantinformation,asforexampleforPEPwherethestandarddeviationover all experimental onditions equals the standarddeviation of the repli ates in asingle ondition(

0.06

mMvs

0.05

mM). Thispre ludestheestimation ofan unambiguoussign.

Oftheelasti itieswithstatisti allysigni antsigns,20outof37are orre t, inthesensethattheyhavetheexpe tedpositiveornegativesign. Theremaining elasti ities,distributedover11rea tions,arein orre tlyestimated. Letusnow dis uss what we believe are potential sour es of these errors, an information that ouldbeusedtosingleouterroneousestimatesa-priori.

Werstnotethatfor5ofthese11rea tions(GapA;Pgk,GltA,PrpC,Mdh, Edd;EdaandPta;A kA,A kB,seeTable1),onlyveryfew ompletedatapoints are available (between 1 and 5) and regressionmostly fails in these ases. In addition, allofthese rea tionsinvolveat leastone metabolite missingin more than70%oftheexperimental onditions. The ombinationofveryfew omplete datapointsandahighper entageofmissingmetabolitemeasurementsobviously makesmodelidenti ationextremelydi ultanditisfairtosaythatherewe rea h the limitof theappli ability of our method, orof any method for that matter,due tothela kofdata.

Se ond,4rea tionsareknowntooperate losetoequilibrium: Pgi,FbaA,FbaB, TpiAandGpmA,GpmB;Eno[Visseret al.,2004℄. Theoreti ally,theserea tions arenotidentiable,astheirelasti itiesarenotindependent[Visseret al.,2004℄, but PCA did notdete t this. Most likely, this is due to the above-mentioned noise in metabolite on entrations,whi h de reases their orrelations. A au-tious,preemptivestrategywouldbetoredu ethemodelforanyrea tionknown tobe losetoequilibriumandeliminatethe orrespondingdependentvariables. The errors in the signs of some elasti ities in the remaining 2 rea tions (PtsG andPfkA,PfkB) are less straightforwardto explain. It is unlikelythat they anbe attributed to theEM method, giventhat regressionis appli able herewitharelativelylargenumberof ompletedatapointsavailable(

14

and

18

, respe tively) andgivesthesameresults. Alternatively,they maybeexplained by a modeling error or a hidden variable, for instan e an unknown ofa tor, biasingtheestimationresults. Itisalsopossiblethattheapproximationsofthe linlogmodelarenotsuitableforthese rea tions,forinstan ebe ausethereare large variations in metabolite on entrations between onditions, driving the systemfarfromthereferen estate.

Insummary,EMgivesreasonableresultsforafairly ompli atedmodelon a hallenging dataset. Even though some puzzling issues remain, we believe thatthese anbesafelyattributedtotheinherentdi ultyoftheidenti ation problem.

(20)

ation of metab oli mo dels fr om in omplete datasets 17

Table 1: Elasti ity matrix

[B

0 B

0 ]

estimated by EMfrom the data ofIshiiet al.[2007℄ for thelinlog model of E. oli entral arbon metabolism (the olumns of the matrix have been permuted for readability). Unidentiable elasti ities are shown in grey, un ertain elasti ities,i.e., havingasign thatis notsigni antwith95% onden e, in yellow, and orre tly/in orre tlyidentied elasti ities,i.e., havingasignthatissigni antwith95% onden e,ingreen/red. AbbreviationsareasinFig. 3. Someof the ofa torsaremodeledas ratiosofmetabolite on entrations,e.g. ATP/ADP.Rea tion27,labeled

µ

,isaphenomenologi alrea tionforbiomassprodu tion. The lastrowindi ates theper entageof missingdatapermetaboliteand theright-most olumn displaystheamountof omplete datapoints availableforea hrea tion. Forrea tionslabeledwith

∗

,regressionwasnotabletoprodu eanyresult.

`

``

`

``

`

``

Enzyme Metabolite

Gl PEP G6P Pyr F6P FBP DHAP 3PG A oA oA

6PG Ru5P R5P S7P 2KG Su Fum Mal ATP ADP

Cit NADPH NADP

NADH NAD

FAD A e # omplete datapoints 1 PtsG 0.29 -0.88 0.79 1.88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 2 Pgi 0 0 -0.33 0 0.23 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 27 3 PfkA,PfkB 0 -0.13 0 0 -0.11 -0.23 0 0 0 0 0 0 0 0 0 0 0 0.5 0 0 0 0 0 18 4 FbaA,FbaB 0 0 0 0 0 -0.29 0.07 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 5 TpiA 0 0 0 0 0 0 -0.07 0.22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 12 6 GapA;Pgk 0 0 0 0 0 -0.18 0 -0.05 0 0 0 0 0 0 0 0 0 0.32 0 0 -0.17 0 0 5 7 GpmA,GpmB;Eno 0 0.26 0 0 0 0 0 -0.12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24 8 PykA,PykF 0 0.19 0 0.43 0 -0.14 0 0 0 0 0 0 0 0 0 0 0 0.11 0 0 0 0 0 11 9 A eE:A eF:LpdA 0 0 0 0.05 0 0 0 0 -0.05 0 0 0 0 0 0 0 0 0 0 0 0.27 0 0 6 10 Zwf;Pgl 0 0 -0.1 0 0 0 0 0 0 -0.21 0 0 0 0 0 0 0 0 0 -1.38 0 0 0 2 11 Gnd 0 0 0 0 0 0 0 0 0 0.33 -0.08 0 0 0 0 0 0 0 0 -0.76 0 0 0 2

∗

12 Rpe 0 0 0 0 0 0 0 0 0 0 0.46 0 -0.39 0 0 0 0 0 0 0 0 0 0 28 13 RpiA,RpiB 0 0 -0.64 0 0 0 0 0 0 0 0.26 -0.1 0 0 0 0 0 0 0 0 0 0 0 18 14 TktA 0 0 0 0 0 0 0 0 0 0 0 0.01 -0.1 0 0 0 0 0 0 0 0 0 0 18 15 TalA,TalB 0 0 0 0 -0.25 0 0 0 0 0 0 0 0.29 0 0 0 0 0 0 0 0 0 0 27 16 TktB 0 0 0 0 -0.58 0 0 0 0 0 0.35 0 0 0 0 0 0 0 0 0 0 0 0 27 17 GltA,PrpC 0 0 0 0 0 0 0 0 0.22 0 0 0 0 0.01 0 0 0 0 0.49 0 -0.02 0 0 1

∗

18 A nA,A nB 0 0 0 0 0 0 0 0 0 0 0 0 0 0.99 0 0 0 0 0.55 0 0 0 0 5 19 I dA 0 0 0 0 0 0 0 0 0.07 0 0 0 0 0.002 0 0 0 0 0 0.001 0 0 0 4 20 Su A:Su B:LpdA;Su C:Su D 0 0 0 0 0 0 0 0 0 0 0 0 0 1.26 0.3 0 0 0 0 0 -0.48 0 0 2

∗

21 SdhA:SdhB:SdhC:SdhD 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.08 -0.44 0 0 0 0 0 0.4 0 21 22 FumA,FumB,FumC 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.46 0.1 0 0 0 0 0 0 25 23 Mdh 0 0.15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -0.01 0 0.31 0 -0.12 0 0 3

∗

24 Pp ;P kA 0 0.29 0 0 0 -0.13 0 0 0 0 0 0 0 0 0 0 0.1 -1.15 0.21 0 0 0 0 9 25 MaeB,Sf A 0 0 0 0.08 0 0 0 0 -0.2 0 0 0 0 0 0 0 0.26 0 0 -0.003 0.04 0 0 2

∗

26 A eA;A eB 0 0 0 0 0 0 0 0 -0.11 0 0 0 0 0 -0.01 0 -0.02 0 0 0 0 0 0 25 27

µ

0 0.16 -0.11 0.06 -0.04 0 0 0.11 0.1 0 0 0.09 0 -0.06 0 0 0 -0.44 0 0.05 -0.01 0 0 0

∗

28 Edd;Eda 0 0 0 2.47 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 5 29 Pta;A kA,A kB 0 0 0 0.0015 0 0 0 0 -0.33 0 0 0 0 0 0 0 0 -1.42 0 -0.26 -0.83 0 2.1 2

∗

30 LdhA 0 0 0 2.67 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -0.11 0 0 6 31 AdhE 0 0 0 0 0 0 0 0 0.91 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28 %MissingData 3 17 0 48 7 34 59 10 3 72 3 38 3 59 3 14 14 0 62 79 79 17 17 ° 7524

(21)

6 Dis ussion

In this work we have addressed the problem of estimating parametersof ap-proximatemodelsof metaboli networks fromin ompletedatasets. Even with the largestdatasets available at present, su h asin [Ishiiet al.,2007℄, the ab-sen eor orruptionofalargenumberofmeasurementsmayredu etheee tive numberofdatapointstoahandfulofexperimental onditions,thusmaking sim-pleregressionte hniquesinee tiveoreveninappli able. Makingfulluseofall theavailable datais thereforeessentialto renderidenti ation well-posed and improvethequalityoftheestimatedmodels.

Tothisaim,wehaveproposedamaximum-likelihoodmethodforthe identi- ationoflinlogmetaboli networkmodelsthat ompensatesforthemissingdata bythe useofstatisti al priors. Wedeveloped analgorithm thatattains maxi-mizationofthelikelihoodbasedonExpe tationMaximization,awella epted paradigmfor thenumeri aloptimization oflikehood fun tions in thepresen e of unobservedvariables. A simplerimplementation based on dire t likelihood maximization viageneral-purpose numeri aloptimization algorithms wasalso onsidered andfoundslightlylesspowerful. Theperforman eofEMwas om-paredto that ofanexisting method of referen e,namelymultiple imputation, and to worst- ase and best- ase s enariosgiven byleast-squares regressionon thesole ompletedatapointsandon ompletedatasets,respe tively. Weshowed thatEMoutperformsmultipleimputationbyawidemargin. In omparisonwith worst- aseregression,itredu estheestimationvariabilityandisabletoprodu e reasonableestimationresultsevenwhenregressiononin ompletedatasetsis in-appli able. Italsoapproa hestheidealperforman e ofregressionon omplete datasets forlowratesofmissingdata,regardlessofnoise.

Basedonthesendings,weappliedEMtotheidenti ationofalinlogmodel forthe entral arbonmetabolismintheba teriumE. oli fromthe experimen-tal datapresentedbyIshiietal.[2007℄. Evenwith thelargeamountof in om-pletedatapoints, due to thedi ultyof experimentallymeasuringmetabolite on entrations,EMwasabletoestimatemany ofthemodelparameters (elas-ti ies)inagreementwiththe urrentunderstandingofthesystem. Thisiseven true for rea tions where the redu ed number of omplete datapoints impairs theappli abilityofleastsquaresregression. Ontheotherhand,the hallenging qualityofthedata shedslightontheperforman elimitsofthemethod, whi h tends to fail whenlargemeasurementnoisemakestheestimation ofsmall pa-rametersstatisti allyunreliable,whenthesamevariable annotbemeasuredin most onditions,orwhenrea tionsoperatenearequilibrium.

Overall,resultsfromthesimulationsandtheappli ationonrealdatashowed that our EM approa h is able to make the most of in omplete, noisy high-throughput datasets for the estimation of parameters in approximate kineti models. Inthefuture,weexpe ttoimproveperforman ebydevelopinga num-berofte hni alpoints,in ludingapproximateanalyti /dedi atednumeri al so-lutionsfortheEMmaximizationsteps,therenementoftheidentiability anal-ysis via SVD of in ompletedata matri es [Brand, 2002℄, and a moredetailed modelingofmeasurementnoise. Itisworthnotingthat, whilethemethodhas been developed on linlogmodels, it is more generallyappli able to any other metaboli network model that an be put in aform linear in the parameters by straightforwardmanipulations, su h as generated massa tion models that provideadvantageswhensomemetabolite on entrationsapproa h0[Savageau,

(22)

1976, delRosarioetal., 2008℄. In addition, estimated parameters of approxi-matemetaboli models,su haselasti itiesoflinlogmodels,provideusefulhints fortheidenti ationofmoredetailed nonlinearkineti models.

Fromabroaderperspe tive,theappli ationoftheEMmethodto aunique multi-omi s datasetforE. oli arbonmetabolismallowedus to isolateissues that are riti alfor theappropriateexploitation ofthe datafor parameter es-timation,andthat mayneedtobetakeninto a ountduringthedesignofthe experiments. Onesu hissueisthatahighper entageofmissingdataforsome oftheindividualvariables,evenatarelativelylowaverageper entageoverthe entiredataset, wasfound to bedetrimental to the identi ation results. This mayinuen e samplingstrategies, expe ially for metabolites that are di ult tomeasure. Anotherissueistheidentiabilityproblems ausedbysteady-state measurements,that annotalwaysberesolvedbygeneti mutation orby vary-ing physiologi al onditions. Fromthis perspe tivetime-resolvedobservations ofthenetworkdynami s arrygreatpromise[Hardimanetal.,2007℄,although mu hmoredemanding experimentally.

Appendi es

A Identiability analysis

First of all, we note that in model (7), for ea h rea tion

i

, only a subset of metabolites isinvolved. Thatis, onlyasubsetof theentries ofve tor

c

.i

need tobeidentied,whilethevaluesoftheremainingentriesarexedto

0

. Letus all

n

c

i

with

0 < n

c

i

< (n + p)

the ee tivenumberof parametersto identify. Astraightforwardreformulationoftheregressionmodelisthefollowing:

w

.i

= Y

′

· c

′

+ ε

.i

(16)

with

c

′

_{∈ R}

n

_ci

a ve tor olle ting the nonzero values of

c

.i

and

Y

′

∈ R

q×n

_ci

amatrix omposedof the orresponding olumnsof

Y

,i.e., themetabolites in-volvedin rea tion

i

. Bearingthis in mind, forsimpli ity, we willdrop index

i

inthesequelwriting

n

c

inpla eof

n

c

i

sti kingtotheusualnotation

w = Y ·c+ε

. To deal with identiability issues, we use Prin ipal Component Analysis (PCA) on the model (16) [Jollie, 1986, Nikereletal., 2006℄. To dete t non-identiableparameters,wede omposethedatamatrix

Y

usingSingularValue De omposition (SVD):

Y = U · diag(s

1 , s

2 , ..., s

n

c

) · V

T

(17) with

U ∈ R

q×q

and

V ∈ R

n

c

×n

c

orthonormalmatri esand

s

1 ≥ · · · ≥ s

n

c

≥ 0

thesingularvaluesof

Y

. Notethatthesumofsquaredsingularvaluesisequal to the square ofthe Frobenius norm of

Y

,

kY k

2 ₌

P

k

P

j

Y

j,k

2

, that is, in an equivalentstatisti alinterpretation,tothesumofthevarian esofallmetabolite on entrationsover

q

experiments(re allthatea h olumnof

Y

haszeromean bydenition).

Inpresen eofdependen iesamongdata,thereexistsanindex

r

with

1 ≤ r <

n

c

su hthat

s

r+1

= · · · = s

n

(23)

anytwove tors

w

and

c

su hthat

w = Y ·c

,thereexistsan

(n

c

−r)

-dimensional ve torspa e

K

Y

⊆ R

n−r

(thekernelof

Y

)su hthat

w = Y · (c + k

Y

)

alsoholds forany

k

Y

∈ K

Y

. For thepurpose ofidenti ation, thisimpliesthat

c

annot beuniquelyre onstru ted from the data. (The ase where

(s

r+1

, · · · , s

n

c

)

are onlyapproximately

0

willbedis ussedlaterin thisse tion.)

We rely on the observation that

K

Y

=

range

(V

r+1:n

c

)

, the ve tor spa e generated bythelast

n

c

− r

olumns of

V

. Inorder to formulate aregression problemwithawell-denedsolution,werewritemodel(16)intermsofaredu ed datamatrix

Y

∗

_{∈ R}

q×r

andaredu edparameterve tor

c

∗

_{∈ R}

r

asfollows:

w = Y

∗

_{· c}

∗

_{+ ε}

Y

∗

_{= Y · V}

1:r

(18) where

V

1:r

∈ R

n

c

×r

is the matrix obtained by extra ting the rst

r

olumns of

V

. Sin e

Y

∗

is full- olumn rank,

Y

∗

_{· c}

∗

is a linear ombination in

c

∗

of independentdatave tors. Thisensuresthatthesolutiontotheregression(18) isunique,hen ewe all

c

∗

`identiable'. Givenauniquesolution

c

∗

totheregression(18),thespa eof undistinguish-ablesolutionsfortheoriginalparameterve tor

c

in(16) anthenbedenedas follows:

c ∈ {V

1:r

· c

∗

+ k

Y

, k

Y

∈ K

Y

}

(19) Depending on thestru ture ofthe orthonormalmatrix

V

, wemaybeable to isolatesomeentries of

c

that anbeuniquely determinedfrom theredu ed model,that is,from theestimatesof

c

∗

. Thishappenswhenallelementsofat least one row of

V

r+1:n

c

are equal to

0

. Indeed, this is the riterion we used in Se . 5toisolateidentiableparametersinnonidentiablerea tions,su has rea tions17and19in Table1.

In pra ti e, singular values are rarely exa tly

0

, even in presen e of data dependen ies. This anbeduetoseveral auses,in ludingmeasurementnoise, numeri alroundos,et . Still,forsome

r < n

c

, valuesof

(s

r+1

, · · · , s

n

c

)

su- iently lose to

0

anmaketheestimates of

c

solvingregressionpoorly deter-mined. Thus a riteriontodis ardthosesingularvaluesneedstobedened.

Inthispaper,weapproximatebyzeroallsingularvalueswhosetotal ontri-bution tothe varian eof

Y

,i.e., tothe `informativity' ofthemetabolite data, isunder somesuitablethreshold

λ

,that is,wedene

r = min

(

t ∈ [1..(n

c

− 1)] ,

P

t

k=1

s

2 k

P

n

c

k=1

s

2 k

≥ λ

)

(20) and set

s

r+1

= · · · = s

n

c

= 0

. By this approximation, from (17) weobtaina datamatrix

Y

ofrank

r < n

c

. ThePCAmethoddes ribedbeforethenapplies. Theresultsdis ussedin Se .4wereobtainedwith

λ = 0.99

. Intheappli ation to the biologi al model of Se . 5 , the hoi e of

λ

was driven by the level of measurementnoise orrupting the data matrix

Y

. This wasquantied based on thevarian e ofthe data over theavailable repli as in a xedexperimental ondition[Ishiietal., 2007℄. Depending ontherea tion,the valueof

λ

ranged from

0.94

to

0.999

.

(24)

B Likelihood-based identi ation of linlog mod-els

Werelyonthenotationof Se . 3,i.e., wefo usonasinglerea tionand drop index

i

fromthenotation. Theloglikelihoodofthemodelis:

log L (c) = log

Z

f

W

|˘

y,˜

y,c

(w)f

Y

e

|˘

y,c

(˜

y)d˜

y

(21)

For onvenien e, werewrite(21)in termsoftherandomvariable

Z = e

Y · c

introdu edinSe . 3sothatitbe omes:

log L (c) = log

Z

f

W

|˘

y,z,c

(w)f

Z|˘

y,c

(z)dz.

(22)

Here

f

W

|˘

y,z,c

(·)

istheGaussianlikelihoodfun tionofmodel(7),equivalently rewritten as

W = ˘

Y · c + Z + ε

,given

Y = ˘

˘

y

and

Z = z

, with

z

varying over allpossiblevaluesof

Z

,and

f

Z|˘

y,c

istheGaussian priorof

Z = e

Y · c

following from(10). Theexpressionsof

f

W

|˘

y,z,c

and

f

Z|˘

y,c

arethus







f

W

|˘

y,z,c

(w) =

√

1 det(2πΣ

ε

)

exp(−

1

2 [w − ˘

Y · c − z]

T

Σ

−1

ε

[w − ˘

Y · c − z]),

f

Z|˘

y,c

(z) =

√

_det(2πΣ

1 ˘

y,c

)

exp(−

1

2 [z − µ

y,c

˘

]

T

Σ

−1

˘

y,c

[z − µ

y,c

˘

]),

(23) with

µ

y,c

˘

= M · c

, where theentry

M

j,k

of matrix

M

is the mean

µ

j,k

of the distributionof

Y

e

j,k

, and

Σ

y,c

˘

isthevarian ematrixof therandomvariable

Z

. Bytheindependen e assumptionson

Y

e

,itturnsoutthat

Σ

y,c

˘

= diag

n

c

X

k=1

c

2 k

· [σ

1,k

2 · · · σ

q,k

2 ]

T

!

.

(24) where

σ

j,k

isdenedin (10) .

Assumeforthemomentthat

Σ

y,c

˘

is invertible. Dening

f

p

c

(z) = f

W

|˘

y,z,c

(w) · f

Z|˘

y,c

(z),

(25) aftersimplebuttedious al ulations,weobtain

f

p

c

(z) = κ

f

c

· f

c

(z)

(26) with

f

c

thedensityfun tionofaGaussiandistribution

N

(µ

f

c

, Σ

f

c

)

and











Σ

f

c

= [Σ

−1

ε

+ Σ

−1

y,c

˘

]

−1

µ

f

c

= Σ

f

c

· [Σ

−1

ε

· (w − ˘

Y · c) + Σ

−1

y,c

˘

· µ

y,c

˘

]

κ

f

c

=

exp(−

1

2 [w− ˘

Y

·c−µ

_√

y,c

˘

]

T

·[Σ

ε

+Σ

y,c

˘

]

−1

·[w− ˘

Y

·c−µ

y,c

˘

])

det(2π[Σ

ε

+Σ

y,c

˘

])

.

(27)

Theproportionalityfa tor

κ

f

c

doesnotdepend ontheintegration variable

z

,soit an betakenoutoftheintegraland(22) anberewrittenasfollows:

log L (c) = log(κ

f

c

) + log

Z

f

c

(z)dz

(25)

Theintegralof anormalizedGaussian density fun tion being

1

, we nally haveananalyti alexpressionfortheloglikelihood:

log L (c) = log(κ

f

c

)

. The above results are used in the expe tation step of the EM algorithm. Re allthedenition

Q(c|ˆc

ℓ−1

) =

Z

log(f

Z,W

|˘

y,c

(z, w))f

Z|˘

y,ˆ

c

ℓ−1

_,w

(z)dz.

(29)

TheBayestheoremallowsustorewrite(29)asfollows:

Q(c|ˆc

ℓ−1

) =

Z

log(f

W

|˘

y,z,c

(w)f

Z|˘

y,c

(z))

f

W

|˘

y,z,ˆ

c

ℓ−1

(w)f

_Z|˘

_y,ˆ

_c

ℓ−1

(z)

f

W

|˘

y,ˆ

c

ℓ−1

(w)

dz.

(30)

Fun tion

f

W

|˘

y,ˆ

c

ℓ−1

(w)

does notdepend on

z

so it an be takenout of the integral. Moreover,thisfun tiondoesnotdependon

c

soitwillhavenoimpa t on the maximizationstep of EM. Thus, we an ignore this fun tion from the omputationoftheexpe tation fun tionabove.

Usingdenitions(25)and(26) ,we anrewrite(30)inthefollowingway:

Q(c|ˆc

ℓ−1

) ∝

Z

κ

f

_ˆ

_cℓ−1

f

ˆ

c

ℓ−1

(z) log(κ

f

c

f

c

(z))dz

∝

Z

f

ˆ

c

ℓ−1

(z) log(κ

f

c

f

c

(z))dz.

(31)

Wehavedroppedthe onstantfa tor

κ

f

ˆ

cℓ−1

asitdoesnotdepend on

c

and thusdoesnotinuen ethemaximizationstepofEM.Byrepla ing

log(κ

f

c

f

c

(z))

by

log(κ

f

c

f

c

(z)f

c

ˆ

ℓ−1

(z)/f

ˆ

c

ℓ−1

(z))

andseparatingtheintegrandinasumofterms, we an rewrite(31)as

−Q(c|ˆc

ℓ−1

) ∝

Z

f

ˆ

c

ℓ−1

(z) log

f

c

ˆ

ℓ−1

(z)

f

c

(z)

dz−

Z

f

ˆ

c

ℓ−1

(z) log(f

_c

_ˆ

ℓ−1

(z))dz−log(κ

f

c

).

(32) Were ognizein therstterm thedenition of the Kullba k-Leibler diver-gen e

KL(f

c

||f

ˆ

c

ℓ−1

)

betweenthetwoprobabilitydistributions

f

c

and

f

ˆ

c

ℓ−1

and in the se ond term the entropy

H(f

ˆ

c

ℓ−1

)

of

f

ˆ

c

ℓ−1

[CoverandThomas, 2006, StoorvogelandvanS huppen, 1996℄. ForGaussiandistributions, these anbe written expli itelyas

KL(f

c

||f

ˆ

c

ℓ−1

) =

1

2 (log

det(Σ

f

c

)

det(Σ

f

_cℓ−1

_ˆ

)

!

+ T r(Σ

−1

_f

_c

Σ

f

cℓ−1

ˆ

)

+ [µ

f

c

− µ

f

ˆ

cℓ−1

]

T

_Σ

−1

f

c

[µ

f

c

− µ

f

cℓ−1

ˆ

]),

(33) where

T r(. . .)

standsfortra eand

H(f

ˆ

c

ℓ−1

) = log

q

det(2πeΣ

f

_ˆ

_cℓ−1

)

.

(34)

Tosummarize,togetherwith(27) ,thisgivesustheexpli itformula

(26)

whi hweemployin ourimplementationofEM.

In more generality, for some values of

c

,

Σ

y,c

˘

may be singular or poorly onditioned. Toavoidthis ir umstan e,we anadaptourpro edureasfollows. We onsiderade omposition

W = ˘

Y · c + Z + (ε

′

+ ε

′′

) = ˘

Y · c + (Z + ε

′

) + ε

′′

(36) where

ε

′

and

ε

′′

areindependentzero-meanGaussianrandomve torssu hthat

Σ

ε

′

, V ar(ε

′

) = αΣ

_ε

and

Σ

ε

′′

, V ar(ε

′′

_{) = (1 −α)Σ}

ε

,with

α ∈ (0, 1)

atunable parameter. Sin e

Σ

ε

> 0

byassumption, it followsthat

Σ

ε

′

> 0

and

Σ

ε

′′

> 0

. Moreover,

Σ

ε

= Σ

ε

′

+ Σ

ε

′′

, i.e., thestatisti sof

ε

andof

ε

′

_{+ ε}

′′

areidenti al. Sin e

V ar(Z + ε

′

_{) = Σ}

˘

y,c

+ Σ

ε

′

> 0

, if we interpret

Z + ε

′

as the unknown observations(inpla eof

Z

)and

ε

′′

asthemodelnoise(inpla eof

ε

),weensure thatthevarian eofthe`missingdata'isinvertible. Thus,inpra ti e,weapply allformulasdevelopedabovewith

Σ

y,c

˘

+ Σ

ε

′

in pla e of

Σ

y,c

˘

and

Σ

ε

′′

in pla e of

Σ

ε

.

Theee tofthespe i hoi eof

α

isunderinvestigation. Inthiswork,we took

α = 0.2

,avaluethatleadstogoodresultsin pra ti e.

C Validation on syntheti data

Themodel usedfor omparingpeforman eoftheidenti ationalgorithmsisa redu edsyntheti linlogmodeloftheE. oli entral arbonmetabolismnetwork (Fig. 4). This network ontains 17 variables, des ribing internal and external metabolites,and25rea tions,summarizedinTable2andTable3,respe tively. Thelinlogmodelhastheform ofEq.(1)-(2).

Index Name Symbol

1 Pyruvate PYR 2 Phosphoenol-pyruvate PEP 3 Gly eraldehyde-3-phosphate G3P 4 Fru tose-6-phosphate F6P 5 Glu ose-6-phosphate G6P 6 3-phosphogly erate 3PG 7 Dihydroxya etonephosphate DHAP 8 Ribulose-5-phosphate Ru5P 9 Ribose-5-phosphate R5P 10 6-phosphoglu onate 6PG 11 Erythrose-4-phosphate E4P 12 Xylulose-5-phosphate X5P 13 2-phosphogly erate 2PG 14 1,3-diphosphosphogly erate 1,3DP 15 Fru tose-1,6-bisphosphate F1,6P 16 2-keto-3-deoxy-6-phosphoglu onate KDPG 17 Sedoheptulose-7-phosphate S7P Table2: Metabolitesin ludedinthesyntheti linlogmodel.

(27)

Index Name 1 Phosphotransferasesystem 2 Glu ose-6-phosphateisomerase 3 Glu ose-6-phosphatedehydrogenase 4 Phosphofru tokinase 5 Transaldolase 6 Transketolasea 7 Transketolaseb 8 Aldolase 9 Gly eraldehyde-3-phosphatedehydrogenase 10 Triosephosphateisomerase 11 Gly erol-3-phosphatedehydrogenase 12 Phosphogly eratekinase 13 Serinesynthesis 14 Phosphogly eratemutase 15 Enolase 16 Pyruvatekinase 17 PEP arboxylase 18 Pyruvatesynthesis 19 6-Phosphoglu onatedehydrogenase 20 Ribose-phosphateisomerase 21 Ribulose-phosphateepimerase 22 Ribose-phosphatepyrophosphokinase 23 Phosphoglu onatedehydratase 24 KDPGaldolase 25 Fru tosebisphosphatase