HAL Id: inria-00561999
https://hal.inria.fr/inria-00561999
Submitted on 2 Feb 2011
Identification of metabolic network models from incomplete high-throughput datasets
Sara Berthoumieux, Matteo Brilli, Hidde de Jong, Kahn Daniel, Eugenio
Cinquemani
To cite this version:
Sara Berthoumieux, Matteo Brilli, Hidde de Jong, Kahn Daniel, Eugenio Cinquemani. Identification
of metabolic network models from incomplete high-throughput datasets. [Research Report] RR-7524,
INRIA. 2011. ⟨inria-00561999⟩
Research report (rapport de recherche)
ISSN 0249-6399
ISRN INRIA/RR--7524--FR+ENG
Computational Biology and Bioinformatics
Identification of metabolic network models from
incomplete high-throughput datasets
Sara Berthoumieux, Matteo Brilli, Hidde de Jong, Daniel Kahn, Eugenio Cinquemani
N° 7524
Centre de recherche INRIA Grenoble – Rhône-Alpes
Theme: Computational Biology and Bioinformatics
Computational Sciences for Biology, Medicine and the Environment
Project-teams Ibis and Bamboo
Research report no. 7524, January 2011, 32 pages
Abstract: High-throughput measurement techniques for metabolism and gene expression provide a wealth of information for the identification of metabolic network models. Yet, missing observations scattered over the dataset restrict the number of effectively available datapoints and make classical regression techniques inaccurate or inapplicable. Thorough exploitation of the data by identification techniques that explicitly cope with missing observations is therefore of major importance.

We develop a maximum-likelihood approach for the estimation of unknown parameters of metabolic network models that relies on the integration of statistical priors to compensate for the missing data. In the context of the linlog metabolic modeling framework, we implement the identification method by an Expectation Maximization (EM) algorithm and by a simpler direct numerical optimization method. We evaluate the performance of our methods by comparison to existing approaches, and show that our EM method provides the best results over a variety of simulated scenarios. We then apply the EM algorithm to a real problem, the identification of a model for the Escherichia coli central carbon metabolism, based on challenging experimental data from the literature. This leads to promising results and allows us to highlight critical identification issues.

Key-words: metabolic network, parameter estimation, system identification, Expectation Maximization algorithm, incomplete high-throughput datasets, proteomics, metabolomics, kinetic modeling, linlog models, central carbon metabolism of Escherichia coli
1 Introduction
Furthering our understanding of the cellular processes that shape the response of microbial cells to changes in their environment requires the study of the interactions between gene expression and metabolism. In recent years, high-throughput datasets comprising simultaneous measurements of metabolism (fluxes, metabolite concentrations) and gene expression (protein and mRNA concentrations) have become available [Hardiman et al., 2007, Ishii et al., 2007]. These datasets provide a rich store of information for modeling the dynamics of the biochemical reaction systems underlying cellular processes. In particular, they promise to relieve what is currently a bottleneck for modeling in systems biology: obtaining reliable estimates of parameter values in kinetic models [Ashyraliyev et al., 2009, Crampin, 2006].
Notwithstanding these experimental advances, parameter estimation remains a particularly challenging problem, among other things due to incomplete knowledge of the molecular mechanisms, noisy and partial observations, heterogeneous experimental methods and conditions, and the large size of networks [Marucci et al., 2011]. As a consequence, the models may not be identifiable, may not generalize to new situations due to overfitting, and nonlinear rate functions may make them cumbersome to analyze. This has led to the proposal of simplified kinetic modeling frameworks, including linlog kinetics [Visser and Heijnen, 2003], loglin kinetics [Hatzimanikatis and Bailey, 1997], power-law kinetics [Savageau, 1976], and more recently, convenience kinetics [Liebermeister and Klipp, 2006].
Linlog models are a particularly interesting choice for modeling metabolism [Heijnen, 2005, Visser and Heijnen, 2003]. Simulation studies on the level of both individual enzymatic reactions [Heijnen, 2005] and metabolic networks [Costa et al., 2010, Hadlich et al., 2009, Visser et al., 2004] have shown that they provide reasonable approximations of classical enzymatic rate laws. Moreover, a recent genome-scale linlog model of yeast metabolism, parametrized using previously published kinetic models, has been shown to be able to identify key steps in the network, that is, reactions exerting most control over glucose transport and biomass production [Smallbone et al., 2010].
A major advantage of linlog models is that, when measurements of fluxes, enzyme concentrations, and metabolite concentrations are available, the parameter estimation problem reduces to multiple linear regression. However, the performance of regression approaches quickly degrades in the presence of missing data, as is often the case in high-throughput datasets due to experimental limitations or instrument failures.
In order to deal with this problem, we propose in this paper a maximum-likelihood method for the identification of linlog models of metabolism from incomplete datasets. The specific contributions of the paper are twofold. On the theoretical side, we develop a method for the optimization of the likelihood based on Expectation Maximization (EM) [Dempster et al., 1977]. The method is constructed for linlog models, but is more generally applicable to other regression problems. In particular, we derive analytical expressions for the expectation step that are well-suited for numerical maximization. This guarantees the applicability of the approach even when modeling large networks. We show by means of simulation experiments on synthetic data that our approach outperforms both regression and a reference method from the statistical literature for dealing with incomplete data, multiple imputation [Rubin, 1976, 1996]. In comparison with earlier work on treating incomplete high-throughput datasets [Oba et al., 2003, Scholz et al., 2005], our aim is not to estimate the missing values, but rather to improve the estimation of the model parameters from the incomplete datasets. This is a different problem that necessitates the development of novel methods.

On the biological side, we apply the method to a linlog model of central metabolism in E. coli, consisting of some 16 variables. We estimate the 100 parameters of this model from a high-throughput dataset published in the literature [Ishii et al., 2007]. The data consist of measurements of metabolic fluxes and of metabolite and enzyme levels in a glucose-limited chemostat under 29 different conditions, such as a wild-type strain and single-gene mutant strains, or different dilution rates. Standard linear regression is difficult to apply in this case because of missing data: for 7 reactions, so many datapoints are disqualified that the remaining dataset is smaller than the number of parameters to estimate. Application of our approach allows one to compute reasonable estimates for most of the identifiable model parameters even when regression is inapplicable.
2 Parameter estimation in linlog models
The dynamics of metabolic networks are described by kinetic models having the form of systems of ordinary differential equations (ODEs) [Heinrich and Schuster, 1996]:

\[ \dot{x} = N \cdot v(x, u, e), \tag{1} \]

where $x \in \mathbb{R}^{n}_{+}$ denotes the vector of (nonnegative) internal metabolite concentrations, $u \in \mathbb{R}^{p}_{+}$ the vector of external metabolite concentrations, $e \in \mathbb{R}^{m}_{+}$ the vector of enzyme concentrations, and $v : \mathbb{R}^{n+p+m}_{+} \to \mathbb{R}^{m}$ the vector of reaction rate functions. $N \in \mathbb{Z}^{n \times m}$ is a stoichiometry matrix.
The reaction rates $v$ are nonlinear and generally complex functions of $x$, $u$, and $e$, with many kinetic parameters that are difficult to reliably estimate from the data. This has motivated the use of approximate rate functions, like the linear-logarithmic (linlog) functions considered in this paper [Heijnen, 2005, Visser and Heijnen, 2003]. The linlog approximation expresses the reaction rates as proportional to the enzyme concentrations and to a linear function of the logarithms of internal and external metabolite concentrations. This leads to the rate equation

\[ v(x, u, e) = \mathrm{diag}(e) \cdot \bigl( a + B^{x} \cdot \ln(x) + B^{u} \cdot \ln(u) \bigr), \tag{2} \]

where the logarithm of a vector means the vector of logarithms of its elements. For conciseness, in the sequel we shall drop the dependence of $v$ on $(x, u, e)$ from the notation. An in-depth discussion of linlog models and a comparison with other approximative rate functions can be found in the review by Heijnen [2005].

We are interested in the estimation of the (generally unknown) parameters $a \in \mathbb{R}^{m}$, $B^{x} \in \mathbb{R}^{m \times n}$ and $B^{u} \in \mathbb{R}^{m \times p}$ from $q$ experimental datapoints $(v^{(k)}, x^{(k)}, u^{(k)}, e^{(k)})$, $k = 1, \ldots, q$. That is, the data used for parameter estimation are parallel measurements of enzyme and metabolite levels as well as metabolic fluxes. Notice that in practice reaction rates are measured at (quasi-)steady state. The datapoints $(v^{(k)}, x^{(k)}, u^{(k)}, e^{(k)})$ are obtained under different experimental conditions, for instance different dilution rates in continuous cultures or different mutant strains. For the purpose of parameter estimation, it is convenient to rewrite (2) in the form of a regression model:
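\[ \left( \frac{v}{e} \right)^{T} = \begin{bmatrix} 1 & \ln(x)^{T} & \ln(u)^{T} \end{bmatrix} \cdot \begin{bmatrix} a^{T} \\ (B^{x})^{T} \\ (B^{u})^{T} \end{bmatrix}, \tag{3} \]

where the ratio of two vectors (here $v/e$) denotes elementwise division. Let us use an upper bar to denote the mean of a quantity over its $q$ experimental observations, for instance $\overline{v/e} = (1/q) \sum_{k=1}^{q} v^{(k)}/e^{(k)}$. By the linearity of (3), it holds that

\[ \left( \overline{\frac{v}{e}} \right)^{T} = \begin{bmatrix} 1 & \overline{\ln(x)}^{T} & \overline{\ln(u)}^{T} \end{bmatrix} \cdot \begin{bmatrix} a^{T} \\ (B^{x})^{T} \\ (B^{u})^{T} \end{bmatrix}. \tag{4} \]

This allows (3) to be reformulated as a mean-removed model

\[ \left( \frac{v}{e} - \overline{\frac{v}{e}} \right)^{T} = \begin{bmatrix} \ln(x) - \overline{\ln(x)} \\ \ln(u) - \overline{\ln(u)} \end{bmatrix}^{T} \cdot \begin{bmatrix} (B^{x})^{T} \\ (B^{u})^{T} \end{bmatrix}, \tag{5} \]

and we obtain the following parameter estimation problem:

Problem 1 Given the data matrices

\[ W \triangleq \begin{bmatrix} \left( \frac{v^{(1)}}{e^{(1)}} - \overline{\frac{v}{e}} \right)^{T} \\ \vdots \\ \left( \frac{v^{(q)}}{e^{(q)}} - \overline{\frac{v}{e}} \right)^{T} \end{bmatrix}, \qquad Y \triangleq \begin{bmatrix} \bigl( \ln(x^{(1)}) - \overline{\ln(x)} \bigr)^{T} & \bigl( \ln(u^{(1)}) - \overline{\ln(u)} \bigr)^{T} \\ \vdots & \vdots \\ \bigl( \ln(x^{(q)}) - \overline{\ln(x)} \bigr)^{T} & \bigl( \ln(u^{(q)}) - \overline{\ln(u)} \bigr)^{T} \end{bmatrix}, \]

find parameters $C \triangleq \begin{bmatrix} B^{x} & B^{u} \end{bmatrix}^{T}$ solving the regression problem

\[ W = Y \cdot C + \varepsilon, \tag{6} \]

where $\varepsilon \in \mathbb{R}^{q \times m}$ is measurement noise on $W$.

Notice that the parameter vector $a$ no longer appears in the regression problem, but an estimate of it can be recovered from estimates of $C = \begin{bmatrix} B^{x} & B^{u} \end{bmatrix}^{T}$ by way of Eq. (4).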
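To make the regression formulation concrete, here is a minimal sketch in plain Python (not the authors' Matlab implementation) for a single reaction with one internal and one external metabolite. It simulates noise-free data with the linlog rate equation (2), builds the mean-removed quantities of Eq. (5), solves the two-parameter least-squares problem through its normal equations, and recovers $a$ from Eq. (4). All function names and numerical values are hypothetical.

```python
import math

def linlog_rate(e, x, u, a, bx, bu):
    """Eq. (2) for one reaction with scalar x and u (illustrative helper)."""
    return e * (a + bx * math.log(x) + bu * math.log(u))

def fit_single_reaction(v, e, x, u):
    """Least-squares estimate of (a, bx, bu) from the mean-removed model (5)."""
    mean = lambda s: sum(s) / len(s)
    r = [vk / ek for vk, ek in zip(v, e)]          # flux/enzyme ratios v/e
    lx = [math.log(xk) for xk in x]
    lu = [math.log(uk) for uk in u]
    r_bar, lx_bar, lu_bar = mean(r), mean(lx), mean(lu)
    w = [rk - r_bar for rk in r]                   # one column of W
    y1 = [t - lx_bar for t in lx]                  # first column of Y
    y2 = [t - lu_bar for t in lu]                  # second column of Y
    # Normal equations (Y^T Y) c = Y^T w, solved by Cramer's rule (2 x 2 case).
    s11 = sum(p * p for p in y1)
    s12 = sum(p * q for p, q in zip(y1, y2))
    s22 = sum(q * q for q in y2)
    t1 = sum(p * wk for p, wk in zip(y1, w))
    t2 = sum(q * wk for q, wk in zip(y2, w))
    det = s11 * s22 - s12 * s12
    bx = (t1 * s22 - t2 * s12) / det
    bu = (s11 * t2 - s12 * t1) / det
    # Eq. (4): the mean of v/e equals a + bx*mean(ln x) + bu*mean(ln u).
    a = r_bar - bx * lx_bar - bu * lu_bar
    return a, bx, bu

# Hypothetical noise-free data from q = 5 experimental conditions.
A_TRUE, BX_TRUE, BU_TRUE = 1.0, -0.5, 0.8
x = [1.0, 2.0, 4.0, 3.0, 0.5]
u = [2.0, 1.0, 3.0, 0.7, 1.5]
e = [1.0, 1.2, 0.8, 1.5, 1.1]
v = [linlog_rate(ek, xk, uk, A_TRUE, BX_TRUE, BU_TRUE)
     for ek, xk, uk in zip(e, x, u)]
a_hat, bx_hat, bu_hat = fit_single_reaction(v, e, x, u)
```

With noise-free data the estimates coincide with the generating values up to rounding; with noisy or incomplete data this plain regression is the baseline that the rest of the paper improves upon.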
In the remainder of the paper, we make the assumption that each column $\varepsilon_{\cdot i}$ of $\varepsilon$ follows a Gaussian distribution, indicated by $\varepsilon_{\cdot i} \sim \mathcal{N}(0, \Sigma_{\varepsilon_i})$, where $\Sigma_{\varepsilon_i}$ is diagonal, i.e., the measurement errors in different experiments are mutually uncorrelated. We further assume that $\varepsilon_{\cdot i}$ is independent of $\varepsilon_{\cdot j}$ for $i \neq j$. Then, Problem 1 can be subdivided into $m$ independent subproblems, one for each reaction $i$:

\[ w_{\cdot i} = Y \cdot c_{\cdot i} + \varepsilon_{\cdot i}, \tag{7} \]

where $w_{\cdot i}$ and $c_{\cdot i}$ are the $i$-th columns of $W$ and $C$, respectively.

The values of the parameter matrices $B^{x}$ and $B^{u}$ admit an interesting biological interpretation. Notice that one can immediately find values $x_0 \in \mathbb{R}^{n}_{+}$, $u_0 \in \mathbb{R}^{p}_{+}$, $e_0 \in \mathbb{R}^{m}_{+}$ and $v_0 \in \mathbb{R}^{m}$ such that $v_0/e_0 = \overline{v/e}$, $\ln x_0 = \overline{\ln(x)}$, and $\ln u_0 = \overline{\ln(u)}$. As a consequence, Eq. (5) can be rearranged into the common relative formulation of linlog models,

\[ \frac{v}{e} = \mathrm{diag}\!\left( \frac{v_0}{e_0} \right) \cdot \left( \mathbf{1} + B^{x}_{0} \ln\frac{x}{x_0} + B^{u}_{0} \ln\frac{u}{u_0} \right), \tag{8} \]

where $\mathbf{1}$ is an $m \times 1$ vector of ones, $(v_0, x_0, u_0, e_0)$ is a so-called reference state [Heijnen, 2005] and $B^{x}_{0}$, $B^{u}_{0}$ are matrices of elasticity constants, where

\[ B^{x}_{0} = \mathrm{diag}\!\left( \frac{e_0}{v_0} \right) \cdot B^{x}, \qquad B^{u}_{0} = \mathrm{diag}\!\left( \frac{e_0}{v_0} \right) \cdot B^{u}. \tag{9} \]

The elasticities, introduced in the context of Metabolic Control Analysis (MCA) [Heinrich and Schuster, 1996], describe the normalized local response of the reaction rates to changes in metabolite concentrations. The interest is that they can thus be immediately computed from the values of $B^{x}$ and $B^{u}$ found by the solution of Problem 1, and the equality $e_0/v_0 = 1/\overline{(v/e)}$.

Although straightforward in theory, solving the regression problem (6) encounters two complications in practice.
1. Since the measurements are carried out at (quasi-)steady state, we have $N \cdot v(x, u, e) = 0$. This introduces dependencies among the data and thus reduces the information content of the data matrix $Y$, in the sense that $Y$ becomes rank deficient. As in earlier work [Nikerel et al., 2006], we use standard approaches to solve this problem. We notably rely on Principal Component Analysis (PCA) [Jolliffe, 1986, Nikerel et al., 2006] applied to the data matrix $Y$ to reduce the model order, i.e., the number of independent parameters, and ensure well-posedness of the regression problem (see Appendix A for technical details). In summary, we use Singular Value Decomposition (SVD), a technique decomposing the data matrix into dominant and marginal components according to a variance criterion. For the purpose of linear regression, this corresponds to decomposing the parameter vector into a reduced number of components that can be determined with certainty based on the data, while the remaining components are poorly determined, i.e., they are `nonidentifiable', and are discarded with negligible effect on the fit. We note in passing that the columns of $W$ and $Y$ are zero-mean, an important requirement for the correctness of the outlined analysis.

2. The high-throughput datasets contain a substantial amount of missing values, due to experimental limitations or instrument failures. If, for any given reaction, we only used the datapoints in which all relevant metabolite concentrations, enzyme concentrations, and metabolic fluxes playing a role in that reaction are available, then a large amount of data would have to be thrown away. In practice, we would run the risk that the parameters cannot be reliably identified. The development of a method that is capable of maximally exploiting the information contained in incomplete datasets for solving Problem 1 is the main subject of the paper and will be fully developed in the later sections.
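The cost of discarding incomplete datapoints, as described in the second complication, can be illustrated with a small sketch. The table below is hypothetical (6 experiments, 3 measured quantities, `None` marking a missing value) and is not taken from any dataset discussed in the paper.

```python
def complete_cases(rows):
    """Keep only the datapoints (rows) in which no entry is missing (None)."""
    return [row for row in rows if all(value is not None for value in row)]

# Hypothetical measurements: rows are experiments, columns are quantities.
data = [
    [0.9, 1.2, None],
    [1.1, None, 0.4],
    [1.0, 1.3, 0.5],
    [None, 1.1, 0.6],
    [1.2, 1.0, 0.7],
    [0.8, None, None],
]

kept = complete_cases(data)
n_missing = sum(value is None for row in data for value in row)
frac_entries_missing = n_missing / (len(data) * len(data[0]))
frac_rows_lost = 1 - len(kept) / len(data)
```

In this toy table, 5 of 18 entries (about 28%) are missing, yet 4 of the 6 datapoints become unusable for a regression that requires complete rows.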
3 Likelihood-based identification of linlog models from missing data
For every reaction $i$, we are concerned with the problem of estimating the unknown parameters $c_{\cdot i}$ of the model given in (7) in the case where some entries of $Y$ are unknown. We address the estimation problem by a likelihood-maximization approach, which is known to yield optimal (unbiased and minimum variance) estimates for our problem setting in the case where $Y$ is fully known. As the problem is identical for all reactions $i$, in the remainder of the section we will drop for simplicity the index $\cdot i$ from the notation.

Let $\mathcal{I}$ be the set of indices (row, column) corresponding to the known entries of $Y$, i.e., $(j, k) \in \mathcal{I}$ if and only if $Y_{j,k}$ is available. It is convenient to introduce the decomposition $Y = \breve{Y} + \tilde{Y}$, where

\[ \breve{Y}_{j,k} = \begin{cases} Y_{j,k}, & \text{if } (j,k) \in \mathcal{I}, \\ 0, & \text{otherwise;} \end{cases} \qquad \tilde{Y}_{j,k} = \begin{cases} 0, & \text{if } (j,k) \in \mathcal{I}, \\ Y_{j,k}, & \text{otherwise.} \end{cases} \]

Matrix $\breve{Y}$ is fully determined: once measurements $\breve{y}$ of $\breve{Y}$ are collected, we treat $\breve{Y} = \breve{y}$ as fixed parameters of the regression problem. Matrix $\tilde{Y}$ collects the unknown entries of $Y$. We model these missing data as unobserved independent random variables, whose prior distributions encode our generic knowledge about them. Assuming that the a-priori distributions are not known (worst case), we define a Gaussian prior for each quantity that is missing in an experiment based on the measurements of the same quantity available from other experiments. For every $(j, k) \notin \mathcal{I}$ and $\mathcal{Y}_{j,k} = \{ Y_{j',k} : (j', k) \in \mathcal{I} \}$ (assumed nonempty), we let

\[ \tilde{Y}_{j,k} \sim \mathcal{N}(\mu_{j,k}, \sigma_{j,k}^{2}), \qquad \mu_{j,k} = \mathrm{mean}(\mathcal{Y}_{j,k}), \qquad \sigma_{j,k} = \mathrm{std}(\mathcal{Y}_{j,k}). \tag{10} \]

We can now formulate the estimation problem.
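The prior (10) is simple to compute from an incomplete data matrix. The sketch below is a minimal illustration under two assumptions that the text does not settle: missing entries are encoded as `None`, and `std` is taken as the sample standard deviation (`statistics.stdev`), which requires at least two observed values per column. The function name is hypothetical.

```python
import statistics

def gaussian_priors(Y):
    """Map each missing position (j, k) of Y to (mu, sigma) as in Eq. (10),
    computed from the observed entries of the same column k."""
    priors = {}
    for k in range(len(Y[0])):
        missing_rows = [j for j, row in enumerate(Y) if row[k] is None]
        if not missing_rows:
            continue
        observed = [row[k] for row in Y if row[k] is not None]
        if len(observed) < 2:
            raise ValueError(f"column {k}: too few observed values for a prior")
        mu = statistics.mean(observed)
        sigma = statistics.stdev(observed)  # assumption: sample standard deviation
        for j in missing_rows:
            priors[(j, k)] = (mu, sigma)
    return priors

# Hypothetical 4 x 2 data matrix with two missing entries.
Y_example = [
    [1.0, 2.0],
    [None, 4.0],
    [3.0, None],
    [5.0, 6.0],
]
priors_example = gaussian_priors(Y_example)
```

Every missing entry of a given column receives the same mean and standard deviation, while the entries themselves are modeled as independent random variables.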
Problem 2 Given measurements $W = w$ and $\breve{Y} = \breve{y}$, compute the estimate

\[ \hat{c} = \arg\max_{c} \log \mathcal{L}(c), \qquad \text{with } \mathcal{L}(c) = f_{W \mid \breve{y}, c}(w), \]

where, for any $c$, $f_{W \mid \breve{y}, c}(\cdot)$ is the probability density function of $W$ given $\breve{Y} = \breve{y}$ corresponding to model (7)-(10).

Note that $\mathcal{L}(c)$ is a likelihood function for a linear model with missing data, in the sense that it is defined with respect to the available data $\breve{Y}$ only. One can express $\mathcal{L}(c)$ by marginalization,

\[ \log \mathcal{L}(c) = \log \int f_{W \mid \breve{y}, \tilde{y}, c}(w) \, f_{\tilde{Y} \mid \breve{y}, c}(\tilde{y}) \, d\tilde{y}, \tag{11} \]

where $f_{W \mid \breve{y}, \tilde{y}, c}(\cdot)$ is the standard likelihood function for model (7) given $\breve{Y} = \breve{y}$ and $\tilde{y}$, with $\tilde{y}$ varying over all possible values of $\tilde{Y}$, and $f_{\tilde{Y} \mid \breve{y}, c} = f_{\tilde{Y} \mid \breve{y}}$ is determined by the prior (10). The explicit solution to the integral is reported in Appendix B. A direct approach to solving Problem 2 is to maximize (11) by numerical optimization. However, the function is not concave in $c$, whence its direct maximization is prone to end up in local optima and requires the use of global optimization strategies.
Alternatively, we propose to tackle Problem 2 by an Expectation-Maximization (EM) algorithm [Dempster et al., 1977]. EM provides a general methodology for the optimization of a likelihood function with missing information. It is based on an iterative two-step procedure that, for the problem at hand, we implement as follows. Let us define the random variable $Z = \tilde{Y} \cdot c$, so that model (7) becomes $W = \breve{Y} \cdot c + Z + \varepsilon$. Note that $Z \sim \mathcal{N}(\mu_{\breve{y},c}, \Sigma_{\breve{y},c})$, where for any given $c$, mean and variance can be derived from (10). Let $\hat{c}^{0}$ be an initial guess of $c$. At every iteration $\ell = 1, 2, 3, \ldots$, compute an updated estimate $\hat{c}^{\ell}$ from the estimate $\hat{c}^{\ell-1}$ at the previous iteration by performing the following EM steps:

Expectation: Compute

\[ Q(c \mid \hat{c}^{\ell-1}) = E\bigl[ \log f_{Z,W \mid \breve{y}, c}(Z, w) \mid \breve{y}, \hat{c}^{\ell-1}, w \bigr] = \int \log f_{Z,W \mid \breve{y}, c}(z, w) \, f_{Z \mid \breve{y}, \hat{c}^{\ell-1}, w}(z) \, dz; \tag{12} \]

Maximization: Solve

\[ \hat{c}^{\ell} = \arg\max_{c} Q(c \mid \hat{c}^{\ell-1}). \tag{13} \]

In (12), $f_{Z,W \mid \breve{y}, c}$ is the joint probability density function of $Z$ and $W$ given $\breve{Y} = \breve{y}$ and $c$, while $f_{Z \mid \breve{y}, \hat{c}^{\ell-1}, w}$ is the probability density function of $Z$ given $\breve{Y} = \breve{y}$, $W = w$ and $\hat{c}^{\ell-1}$. In fact, this quantity is independent of $w$ and can be computed by our definition (10).

It can be proven that, at every iteration $\ell$, the EM algorithm increases the value of $\mathcal{L}(\hat{c}^{\ell})$, and eventually converges to a maximum of $\mathcal{L}$ [Little and Rubin, 2002]. While this is not necessarily a global maximum, EM has proven effective in many applications [Graham, 2009, Horton and Kleinman, 2007]. A key property is that convergence to a maximum is achieved even if (13) is not solved exactly: it suffices that $\hat{c}^{\ell}$ is such that $Q(\hat{c}^{\ell} \mid \hat{c}^{\ell-1}) \geq Q(\hat{c}^{\ell-1} \mid \hat{c}^{\ell-1})$, which is easily achieved even by a local optimization algorithm. In practice, we can use the explicit expression of $\mathcal{L}$ in Problem 2 for stopping the iterations, e.g., when the relative improvement on $\mathcal{L}$ falls below a specified threshold $\tau > 0$:

\[ \frac{|\mathcal{L}(\hat{c}^{\ell}) - \mathcal{L}(\hat{c}^{\ell-1})|}{|\mathcal{L}(\hat{c}^{\ell})|} \leq \tau. \]
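The stopping rule can be wrapped around any iterative update. The sketch below shows only the iteration skeleton and the relative-improvement test; the update and objective used in the demonstration are toy stand-ins chosen so that the code runs on its own, not the EM steps (12)-(13). All names are hypothetical.

```python
import math

def iterate_until_converged(update, objective, c0, tau=1e-6, max_iter=1000):
    """Repeat c <- update(c) until the relative change of objective(c)
    falls below tau, or until max_iter iterations have been performed."""
    c_prev, f_prev = c0, objective(c0)
    for iteration in range(1, max_iter + 1):
        c_new = update(c_prev)
        f_new = objective(c_new)
        # Same test as |L_new - L_prev| / |L_new| <= tau, written without division.
        if abs(f_new - f_prev) <= tau * abs(f_new):
            return c_new, iteration
        c_prev, f_prev = c_new, f_new
    return c_prev, max_iter

# Toy stand-ins (NOT the EM steps): a positive objective peaked at c = 2 and
# an update that halves the distance to the maximizer at every iteration.
toy_objective = lambda c: math.exp(-(c - 2.0) ** 2)
toy_update = lambda c: c + 0.5 * (2.0 - c)

c_hat, n_iter = iterate_until_converged(toy_update, toy_objective, c0=0.0)
```

In an actual EM implementation, `update` would perform one expectation step followed by a (possibly inexact) maximization step, and `objective` would evaluate the likelihood of Problem 2.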
To complete the implementation of the algorithm, one must express $Q(c \mid \hat{c}^{\ell-1})$ in a form convenient for maximization. As explained in Appendix B, we were able to express (12) as an explicit function of $c$ for any given $\hat{c}^{\ell-1}$. In compact form:

\[ Q(c \mid \hat{c}^{\ell-1}) \propto -KL(f_{c} \,\|\, f_{\hat{c}^{\ell-1}}) - H(f_{\hat{c}^{\ell-1}}) + \log(\kappa_{f_c}), \tag{14} \]

where $f_c$ stands for a Gaussian distribution with variance $\Sigma_{f_c} = [\Sigma_{\varepsilon}^{-1} + \Sigma_{\breve{y},c}^{-1}]^{-1}$ and mean $\mu_{f_c} = \Sigma_{f_c} \cdot (\Sigma_{\varepsilon}^{-1} \cdot (w - \breve{y} \cdot c) + \Sigma_{\breve{y},c}^{-1} \cdot \mu_{\breve{y},c})$, $\kappa_{f_c}$ is a function depending on $c$ via $\mu_{f_c}$ and $\Sigma_{f_c}$, and the proportionality factor that we dropped (indicated by the presence of $\propto$ in place of $=$) depends on $\hat{c}^{\ell-1}$ but not on $c$. Finally, $KL(\cdot \| \cdot)$ and $H(\cdot)$ are the Kullback-Leibler distance between distributions and the entropy of a distribution, respectively, for which, in the Gaussian case at hand, explicit formulas are available [Cover and Thomas, 2006, Stoorvogel and van Schuppen, 1996]. A slight technical complication arises in case $\Sigma_{\breve{y},c}$ is singular (see Appendix B for all the mathematical details).

The availability of the closed-form expression (14) allows us to implement EM efficiently, i.e., with an explicit maximization problem that is solved numerically at all iterations. Once the parameter estimates are obtained, several methods from the literature can be used to assess the accuracy of the results by inferring confidence intervals. Examples are randomized methods such as bootstrapping [Manly, 1997] and the profile likelihood method by [Raue et al., 2009]. This method derives confidence intervals using a threshold on a function called the profile likelihood. In our application, this is obtained separately for each parameter $c_j$ by re-maximization of (11) with respect to all parameters $c_{k \neq j}$, for all values $c_j$ in a neighborhood of $\hat{c}_j$.

4 Validation on synthetic data
Before applying the EM algorithm to actual biological identification problems, we test the performance of the method on simulated data. For this purpose, a synthetic model has been developed, a simplified variant of the linlog model of E. coli central metabolism studied in Sec. 5 below. The model, in the form (2), contains 17 variables, representing internal and external metabolites involved in 25 reactions, and 78 parameters (see Appendix C for the model equations). We generate data matrices $Y$ from this model by means of simulation, for different percentages of missing data and experimental noise. Using the model structure and the simulated data, we solve Problem 1 for each reaction independently, as described in Sec. 3.

In order to assess the added value of our specific implementation of likelihood optimization, we first compare the performance of the EM algorithm of Sec. 3 with the direct maximization of the log-likelihood (11) implemented with a general-purpose Matlab optimization routine. This method will be referred to as MaxLL in the sequel.
Second, we compare the likelihood-based identification approaches with standard methods, notably linear regression (referred to as Rg) and the commonly used multiple imputation (MI) method [Rubin, 1976, 1996]. Regression is performed based on complete datapoints only, i.e., it does not consider an experimentally determined datapoint $(v^{(k)}, x^{(k)}, u^{(k)}, e^{(k)})$ when at least one of the measurements is missing. MI is based on imputation of missing data by random draws of the missing values, i.e., non-zero elements of $\tilde{Y}$, from the a-priori distribution defined in (10). Both methods thus exploit only part of the information contained in an incomplete dataset and provide a lower limit for quantifying the performance of the methods proposed in Sec. 3.

Third, we compare the results of EM with the least-squares identification of the model on complete datasets (a method referred to as RgF). Though inapplicable to real data with missing measurements, the method is statistically optimal. Hence, it provides us with an upper performance bound that can be used to assess the role of missing data in performance degradation, separately from the role of noise.
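The two baseline strategies for incomplete data can be sketched for the simplest case of a single mean-removed regressor, $w = y \cdot c + \varepsilon$. This is an illustration only: the pooling of MI estimates by simple averaging, the sample standard deviation used for the prior, the number of draws, and all names are assumptions of the sketch rather than details given in the text.

```python
import random

def ols_slope(y, w):
    """Least-squares slope of w on y for a model without intercept."""
    return sum(a * b for a, b in zip(y, w)) / sum(a * a for a in y)

def complete_case_slope(y, w):
    """Rg-style baseline: drop every datapoint whose regressor is missing."""
    pairs = [(a, b) for a, b in zip(y, w) if a is not None]
    return ols_slope([a for a, _ in pairs], [b for _, b in pairs])

def multiple_imputation_slope(y, w, n_draws=20, seed=0):
    """MI-style baseline: fill missing regressors with draws from a Gaussian
    prior built from the observed values, regress on each completed dataset,
    and average the estimates (assumed pooling rule)."""
    rng = random.Random(seed)
    observed = [a for a in y if a is not None]
    mu = sum(observed) / len(observed)
    sigma = (sum((a - mu) ** 2 for a in observed) / (len(observed) - 1)) ** 0.5
    estimates = []
    for _ in range(n_draws):
        completed = [a if a is not None else rng.gauss(mu, sigma) for a in y]
        estimates.append(ols_slope(completed, w))
    return sum(estimates) / len(estimates)

# Hypothetical noise-free data with true slope 2.
w_obs = [-2.0, 0.0, 2.0, 4.0, -4.0]
y_full = [-1.0, 0.0, 1.0, 2.0, -2.0]
y_gaps = [-1.0, None, 1.0, 2.0, None]

slope_full = multiple_imputation_slope(y_full, w_obs)
slope_rg = complete_case_slope(y_gaps, w_obs)
slope_mi = multiple_imputation_slope(y_gaps, w_obs)
```

When nothing is missing, every imputed dataset is identical and MI reduces to ordinary regression. With gaps, the imputed values are unrelated to the corresponding $w$, which is one way to see why random draws alone use the prior less effectively than integration over it.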
Most of the high-throughput datasets available in the literature have been obtained when metabolism is at (quasi-)steady state. In order to mimic available experimental data as closely as possible, simulated data obtained from the synthetic model should therefore be steady-state data. We generated steady states of (1)-(2), and recorded the corresponding metabolite concentrations and metabolic flux values for 30 different conditions, each consisting of a random change in the enzyme concentration with respect to a reference value.
We compared the performance of the five methods described above (EM, MaxLL, MI, Rg, RgF) on datasets with different amounts of missing data (40% and 75%) for the metabolite concentrations and different noise levels (10% and 20%) for $w$. The only difference with the dataset used for the reference method RgF is that the latter has no missing data. A noise level of 10% means that the distribution used to generate the noise has a standard deviation equal to 10% of the values in $w$. The percentages of missing data in the simulation study are comparable to those observed in practice (Sec. 5 and [Ishii et al., 2007]). For every combination of missing data percentage and noise level, a dataset was generated by homogeneously distributing missing data among the columns of $Y$, the indices for each column being chosen at random. For every simulated scenario, randomly generated noise was added to $w$ in the dataset.

For all the above scenarios, identification of each reaction was addressed separately, in accordance with the discussion of Sec. 2. For every reaction, we first tested the identifiability of the synthetic linlog model by PCA of the full data matrix $Y$. In our simulation, 9 reactions out of the 25 composing the model were detected as having nonidentifiable parameters. For those reactions, identification of a reduced-order model

\[ w = Y^{*} \cdot c^{*} + \varepsilon \tag{15} \]

was performed in place of the identification of the original model. $Y^{*} \in \mathbb{R}^{q \times r}$, with $r \leq n + p$, is a reduced-order data matrix obtained by linear transformation of $Y$ and $c^{*} \in \mathbb{R}^{r}$ is a parameter vector, smaller than $c$, that is `identifiable', in the sense that it is well determined by the data (see Appendix C).

We implemented the different parameter estimation algorithms in Matlab, using the lscov function for the regression-based methods and fminsearch for global optimization in MaxLL and the maximization step in EM. Both EM and MaxLL require an initial guess of the parameters to be specified. We proposed 10 different initial parameter vectors, including the estimate obtained with the baseline method Rg where available. In order to draw statistics for the estimation performance, each of the five algorithms was applied on 100 Monte Carlo repetitions of the identification problem. The complete performance test over all methods, conditions and 100 repetitions took about 7 h 40 min in Matlab 7.4.0 on a Linux PC workstation (1862 MHz, 2 GB RAM).
The most informative results from all identification methods are summarized by boxplots of the ratio of the estimated parameter values $c$ over the reference parameter values $c_{\mathrm{ref}}$ used to simulate the data: the closer the ratio to 1, the better the estimates. Ensemble statistics are drawn for all parameters corresponding to the same reaction. Fig. 1 is dedicated to the scenario with 40% missing data and 10% noise, whereas Fig. 2 reports on 75% missing data and 20% noise. Complete results for all reactions under all conditions can be found in Appendix C.

Since the individual reactions of the model involve only a small subset of metabolites, each of the $m$ identification subproblems consists of the estimation of a limited number of parameters, mostly 2 or 3. For the case with 40% missing data, Rg can therefore be performed in all runs for every reaction of the model. On the contrary, with 75% missing data, regression cannot be applied to 6 reactions, which is apparent from the absence of the Rg statistics for 2 reactions in Fig. 2.

Figure 1: Statistics of estimated parameter values for datasets with 40% of missing data and 10% noise. The results are shown as boxplots of the ratio of the estimated parameter values $c$ and reference parameter values $c_{\mathrm{ref}}$. Statistics have been computed for each of the 5 methods from 100 datasets. For each method, the red line displays the median and the lower and upper blue lines represent the lower and upper quartile values, respectively. Whiskers extend from each end of the box to the most extreme values within 1.5 times the interquartile range from the ends of the box and outliers are shown with red crosses. The tested algorithms are Expectation Maximization (EM), direct optimization of log-likelihood (MaxLL), multiple imputation (MI), regression on incomplete datasets (Rg) and regression on complete datasets (RgF). (A-D) Boxplots for reactions 3, 4, 11 and 18 of the network, respectively.

Figure 2: Statistics of estimated parameter values for datasets with 75% of missing data and 20% noise. The graphical notations are the same as for Fig. 1. (A-F) Boxplots for reactions 3, 13, 17, 22, 19 and 22 of the network, respectively.

In comparison with the other methods, multiple imputation (MI) gives the worst results (largest bias) in 3 out of the 4 reactions shown in Fig. 1, and in 5 out of 6 reactions in Fig. 2. In reactions 11 of Fig. 1 and 22 of Fig. 2, the relatively small biases are accompanied by an estimation uncertainty wider than for EM and MaxLL. This could be explained by a restricted use of the information contained in the distribution of missing data. Indeed, MI only considers random draws from the distribution while EM and MaxLL are based on all possible values taken by missing data through integration of the distribution.
Analysis of Fig. 1 reveals that, for 40% missing data and 10% noise, the performance of EM and MaxLL is almost identical and similar to that of regression (Rg and RgF), with limited improvements on Rg, i.e., slightly smaller variability. In some cases, such as for reactions 11 and 18, their performance approaches the optimal, unattainable bound provided by RgF, i.e., they have similar bias and variability.
Performance improvements of likelihood-based methods over Rg become more significant when identification is performed on the dataset with a higher percentage of missing data and larger noise. Fig. 2A-D show results for reactions where Rg was applicable. Both EM and MaxLL substantially reduce estimation variability in reactions 3, 17 and 22. At the same time, due to the larger amount of missing data, the performance loss with respect to RgF is more significant. Put another way, this shows the accuracy that could be recovered were all datasets complete.
Fig. 2E-F show the results when Rg fails to produce estimates and cannot be used to initialize EM and MaxLL optimization. Still, EM provides estimates of the right order of magnitude and, for the case of Fig. 2E, of the right sign in at least 75% of the runs (box entirely above 0), while the median has the right sign and is reasonably close to 1. The estimation of the sign provided by MaxLL is less reliable (box crossing 0).

Overall, we conclude that the EM-based approach provides the most accurate estimates under all simulated conditions. We will therefore apply this method to the identification of the linlog model of an actual metabolic network from a published high-throughput dataset.
5 Application to central metabolism in E. coli

The network of central carbon metabolism in Escherichia coli has been studied for a long time from different perspectives, which makes it an ideal model system for our purpose. A rather precise idea of the structure of the network exists, several kinetic models of the network dynamics are available ([Bettenbrock et al., 2005, Kotte et al., 2010] and references therein), and recently a high-throughput dataset containing the required information for solving Problem 1 has been published [Ishii et al., 2007]. The network we consider here gathers enzymes, metabolites and reactions that make up the bulk of E. coli central carbon metabolism, including glycolysis, the pentose-phosphate pathway, the tricarboxylic acid cycle and anaplerotic reactions such as the glyoxylate shunt and PEP-carboxylase (Fig. 3).
The dataset used for identification of this network was obtained by experiments with 24 single-gene disruptants that were grown at a fixed dilution rate of 0.2 h$^{-1}$ in a glucose-limited chemostat, and on wild-type cells at 5 different dilution rates [Ishii et al., 2007]. The authors collected data using multiple high-throughput techniques, in particular DNA microarray analysis and two-dimensional differential gel electrophoresis (2D-DIGE) for genes and proteins, capillary electrophoresis time-of-flight mass spectrometry (CE-TOFMS) for metabolites, and metabolic flux analysis. They thus obtained a dataset consisting of metabolite concentrations, mRNA and protein concentrations for the enzymes, and metabolic fluxes under 29 different experimental conditions. A large number of different metabolites were measured in the experiments, with missing data in varying amounts, from 0 to 80% of the observations, 28% on average for the metabolites considered below.
From the reactions listed in [Ishii et al., 2007], we have constructed a linlog model of the form (2), with $n = 16$ internal metabolites, $p = 7$ external metabolites and measured cofactors, and $m = 31$ reactions (see Appendix D). Each of the reactions is catalyzed by a single enzyme, which may actually stand for several enzymes in the case of isoenzymes, enzyme complexes, or lumped reactions. Reactions have been simplified or lumped together when a shared metabolite has not been measured, which precludes estimation of the corresponding elements in the parameter matrices $B^{x}$ and $B^{u}$. In comparison with an earlier linlog model of E. coli central carbon metabolism [Visser et al., 2004], we extended the scope to include the tricarboxylic acid cycle and the glyoxylate shunt, but due to the above-mentioned simplifications our model is more coarse-grained.
An identifiability analysis was performed by several rounds of missing data imputation using the a-priori distribution defined in Eq. (10) and PCA, which led in each case to the same result: 7 out of 31 reactions were detected as having nonidentifiable parameters. For those reactions, the model has been reduced as described in Eq. (15) using a data matrix $Y$ completed by the means $\mu_{j,k}$ of the a-priori distributions. For every individual reaction, the reduced model has a parameter vector $c^*$ that is now entirely identifiable.
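
As an illustration of the identifiability check described in Appendix A, the following minimal Python sketch computes the retained rank from the cumulative squared singular values and flags the parameters that remain uniquely determined. It is not the implementation used for the results reported here: the function name `reduce_model`, the tolerance `tol` and the toy data are assumptions made for the example, and the full-rank case $r = n_c$ is allowed for convenience.

```python
import numpy as np

def reduce_model(Y, lam=0.99, tol=1e-8):
    """PCA-style reduction of w = Y c + eps (hypothetical helper).

    Returns the retained rank r, the reduced data matrix Y* = Y V[:, :r],
    the matrix V[:, :r], and a boolean mask of the entries of c that stay
    uniquely determined (zero rows in the discarded columns of V).
    """
    # full_matrices=True so that V is n_c x n_c even when q < n_c
    U, s, Vt = np.linalg.svd(Y, full_matrices=True)
    V = Vt.T
    # Cumulative fraction of the total variance carried by s_1..s_t
    frac = np.cumsum(s**2) / np.sum(s**2)
    # Smallest t (1-based) with frac[t-1] >= lam, capped at len(s)
    r = min(int(np.searchsorted(frac, lam)) + 1, len(s))
    V_keep, V_null = V[:, :r], V[:, r:]
    Y_star = Y @ V_keep
    # An entry of c is uniquely determined if its row of V_null is zero
    determined = np.all(np.abs(V_null) < tol, axis=1)
    return r, Y_star, V_keep, determined

# Toy data (assumed): the third column is exactly twice the second one
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 20))
Y = np.column_stack([a, b, 2.0 * b])
Y -= Y.mean(axis=0)  # columns have zero mean, as assumed in Appendix A
r, Y_star, V_keep, determined = reduce_model(Y)
print(r, determined)
```

In this toy case the kernel of $Y$ is spanned by a vector with a zero first entry, so only the first parameter is flagged as uniquely determined, while the other two can only be estimated through the reduced parameter vector.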
Apart from the a-priori distribution of the missing data, given by Eq. (10), application of the EM requires information about the distribution of $\varepsilon$, the error on the ratios of fluxes and enzyme concentrations. The Ishii dataset provides several replica measurements for a reference experimental condition: wild-type cells grown on glucose-limited chemostat with a dilution rate of 0.2 h$^{-1}$. These data were used for the computation of the variance of $\varepsilon$. Running the EM method on the model and the data took about 220 s using the implementation of Sec. 4. In order to assess the accuracy of the estimated $B^x$ and $B^u$, we computed for each parameter a 95% confidence interval, by means of the profile likelihood method outlined in Sec. 3. The computation of the confidence intervals for all parameters required about 23 min.
Contrary to the simulation studies reported in Sec. 4, a reference or `real' model for the evaluation of the results does not exist in this case. However, a-priori biochemical knowledge on the signs of the elasticities is available, i.e., elasticities are positive for substrates and negative for products. This information can be compared with the estimated signs of the elasticities, and their confidence intervals, computed from the parameter matrices using the relations in Eq. (9). The results are shown in Table 1. Similar results (not shown) are obtained by means of the MaxLL method.

Figure 3: Scheme of Escherichia coli central carbon metabolism. This map, showing metabolites (bold fonts) and genes (italic), is adapted from [Ishii et al., 2007]. Abbreviations of metabolites are glucose (Glc), glucose 6-phosphate (G6P), fructose 6-phosphate (F6P), fructose 1,6-bisphosphate (FBP), dihydroxyacetone phosphate (DHAP), glyceraldehyde 3-phosphate (G3P), 3-phosphoglycerate (3PG), phosphoenolpyruvate (PEP), pyruvate (Pyr), 6-phosphogluconate (6PG), 2-keto-3-deoxy-6-phospho-gluconate (2KDPG), ribulose 5-phosphate (Ru5P), ribose 5-phosphate (R5P), xylulose 5-phosphate (X5P), sedoheptulose 7-phosphate (S7P), erythrose 4-phosphate (E4P), oxaloacetate (OAA), citrate (Cit), isocitrate (IsoCit), 2-keto-glutarate (2KG), succinyl-CoA (SucCoA), succinate (Suc), fumarate (Fum), malate (Mal), glyoxylate (Glyox), acetyl-CoA (AcCoA), acetyl phosphate (Acp) and acetate (Ace). Cofactors impacting the reactions are not shown. The gene names are separated by a comma in the case of isoenzymes, by a colon for enzyme complexes, and by a semicolon when the enzymes catalyze reactions that have been lumped together in the model.
We observe that the EM method obtains estimates for all reactions, including the 7 cases where the insufficient amount of data made regression not applicable. However, 27 of the 100 non-zero elasticities of the model are not identifiable from this dataset. Moreover, out of the remaining 73 elasticity estimates, half have signs that are not statistically significant, in the sense that the 95% confidence interval straddles 0. This is most likely due to the magnitude of noise in metabolite concentrations being comparable to the magnitude of relevant information, as for example for PEP, where the standard deviation over all experimental conditions is nearly equal to the standard deviation of the replicates in a single condition (0.06 mM vs 0.05 mM). This precludes the estimation of an unambiguous sign.

Of the elasticities with statistically significant signs, 20 out of 37 are correct, in the sense that they have the expected positive or negative sign. The remaining elasticities, distributed over 11 reactions, are incorrectly estimated. Let us now discuss what we believe are potential sources of these errors, information that could be used to single out erroneous estimates a-priori.
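
The sign classification used in this discussion can be written down in a few lines. The sketch below is only an illustration: the function name `classify_sign` and the example intervals are hypothetical (they are not values from Table 1), while the tallies at the end are the counts quoted in the text.

```python
def classify_sign(ci_low, ci_high, expected_sign):
    """Label one elasticity from its 95% confidence interval.

    expected_sign is +1 for a substrate and -1 for a product.
    """
    if ci_low <= 0.0 <= ci_high:
        return "uncertain"            # interval straddles 0
    estimated_sign = 1 if ci_low > 0.0 else -1
    return "correct" if estimated_sign == expected_sign else "incorrect"

# Hypothetical intervals, for illustration only
examples = [(-0.10, 0.25, +1), (0.05, 0.40, +1), (0.02, 0.30, -1)]
labels = [classify_sign(lo, hi, sign) for lo, hi, sign in examples]

# Tallies quoted in the text
n_total, n_unidentifiable, n_significant, n_correct = 100, 27, 37, 20
n_estimated = n_total - n_unidentifiable   # 73 identifiable elasticities
n_uncertain = n_estimated - n_significant  # 36 with a CI straddling 0
n_incorrect = n_significant - n_correct    # 17 with the unexpected sign
print(labels, n_estimated, n_uncertain, n_incorrect)
```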
We first note that for 5 of these 11 reactions (GapA;Pgk, GltA,PrpC, Mdh, Edd;Eda and Pta;AckA,AckB, see Table 1), only very few complete datapoints are available (between 1 and 5) and regression mostly fails in these cases. In addition, all of these reactions involve at least one metabolite missing in more than 70% of the experimental conditions. The combination of very few complete datapoints and a high percentage of missing metabolite measurements obviously makes model identification extremely difficult, and it is fair to say that here we reach the limit of the applicability of our method, or of any method for that matter, due to the lack of data.
Second, 4 reactions are known to operate close to equilibrium: Pgi, FbaA,FbaB, TpiA and GpmA,GpmB;Eno [Visser et al., 2004]. Theoretically, these reactions are not identifiable, as their elasticities are not independent [Visser et al., 2004], but PCA did not detect this. Most likely, this is due to the above-mentioned noise in metabolite concentrations, which decreases their correlations. A cautious, preemptive strategy would be to reduce the model for any reaction known to be close to equilibrium and eliminate the corresponding dependent variables.

The errors in the signs of some elasticities in the remaining 2 reactions (PtsG and PfkA,PfkB) are less straightforward to explain. It is unlikely that they can be attributed to the EM method, given that regression is applicable here with a relatively large number of complete datapoints available (14 and 18, respectively) and gives the same results. Alternatively, they may be explained by a modeling error or a hidden variable, for instance an unknown cofactor, biasing the estimation results. It is also possible that the approximations of the linlog model are not suitable for these reactions, for instance because there are large variations in metabolite concentrations between conditions, driving the system far from the reference state.

In summary, EM gives reasonable results for a fairly complicated model on a challenging dataset. Even though some puzzling issues remain, we believe that these can be safely attributed to the inherent difficulty of the identification problem.
Table 1: Elasticity matrix $[B^x_0 \; B^u_0]$ estimated by EM from the data of Ishii et al. [2007] for the linlog model of E. coli central carbon metabolism (the columns of the matrix have been permuted for readability). Unidentifiable elasticities are shown in grey, uncertain elasticities, i.e., having a sign that is not significant with 95% confidence, in yellow, and correctly/incorrectly identified elasticities, i.e., having a sign that is significant with 95% confidence, in green/red. Abbreviations are as in Fig. 3. Some of the cofactors are modeled as ratios of metabolite concentrations, e.g. ATP/ADP. Reaction 27, labeled $\mu$, is a phenomenological reaction for biomass production. The last row indicates the percentage of missing data per metabolite and the right-most column displays the amount of complete datapoints available for each reaction. For reactions labeled with $*$, regression was not able to produce any result.
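| # | Enzyme | Glc | PEP | G6P | Pyr | F6P | FBP | DHAP | 3PG | AcCoA/CoA | 6PG | Ru5P | R5P | S7P | 2KG | Suc | Fum | Mal | ATP/ADP | Cit | NADPH/NADP | NADH/NAD | FAD | Ace | # complete datapoints |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PtsG | 0.29 | -0.88 | 0.79 | 1.88 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 |
| 2 | Pgi | 0 | 0 | -0.33 | 0 | 0.23 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 27 |
| 3 | PfkA,PfkB | 0 | -0.13 | 0 | 0 | -0.11 | -0.23 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5 | 0 | 0 | 0 | 0 | 0 | 18 |
| 4 | FbaA,FbaB | 0 | 0 | 0 | 0 | 0 | -0.29 | 0.07 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
| 5 | TpiA | 0 | 0 | 0 | 0 | 0 | 0 | -0.07 | 0.22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 |
| 6 | GapA;Pgk | 0 | 0 | 0 | 0 | 0 | -0.18 | 0 | -0.05 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.32 | 0 | 0 | -0.17 | 0 | 0 | 5 |
| 7 | GpmA,GpmB;Eno | 0 | 0.26 | 0 | 0 | 0 | 0 | 0 | -0.12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 24 |
| 8 | PykA,PykF | 0 | 0.19 | 0 | 0.43 | 0 | -0.14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.11 | 0 | 0 | 0 | 0 | 0 | 11 |
| 9 | AceE:AceF:LpdA | 0 | 0 | 0 | 0.05 | 0 | 0 | 0 | 0 | -0.05 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.27 | 0 | 0 | 6 |
| 10 | Zwf;Pgl | 0 | 0 | -0.1 | 0 | 0 | 0 | 0 | 0 | 0 | -0.21 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1.38 | 0 | 0 | 0 | 2 |
| 11 | Gnd | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | -0.08 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.76 | 0 | 0 | 0 | 2* |
| 12 | Rpe | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.46 | 0 | -0.39 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 28 |
| 13 | RpiA,RpiB | 0 | 0 | -0.64 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.26 | -0.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 |
| 14 | TktA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.01 | -0.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 |
| 15 | TalA,TalB | 0 | 0 | 0 | 0 | -0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.29 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 27 |
| 16 | TktB | 0 | 0 | 0 | 0 | -0.58 | 0 | 0 | 0 | 0 | 0 | 0.35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 27 |
| 17 | GltA,PrpC | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.22 | 0 | 0 | 0 | 0 | 0.01 | 0 | 0 | 0 | 0 | 0.49 | 0 | -0.02 | 0 | 0 | 1* |
| 18 | AcnA,AcnB | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.99 | 0 | 0 | 0 | 0 | 0.55 | 0 | 0 | 0 | 0 | 5 |
| 19 | IcdA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.07 | 0 | 0 | 0 | 0 | 0.002 | 0 | 0 | 0 | 0 | 0 | 0.001 | 0 | 0 | 0 | 4 |
| 20 | SucA:SucB:LpdA;SucC:SucD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.26 | 0.3 | 0 | 0 | 0 | 0 | 0 | -0.48 | 0 | 0 | 2* |
| 21 | SdhA:SdhB:SdhC:SdhD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.08 | -0.44 | 0 | 0 | 0 | 0 | 0 | 0.4 | 0 | 21 |
| 22 | FumA,FumB,FumC | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.46 | 0.1 | 0 | 0 | 0 | 0 | 0 | 0 | 25 |
| 23 | Mdh | 0 | 0.15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.01 | 0 | 0.31 | 0 | -0.12 | 0 | 0 | 3* |
| 24 | Ppc;PckA | 0 | 0.29 | 0 | 0 | 0 | -0.13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1 | -1.15 | 0.21 | 0 | 0 | 0 | 0 | 9 |
| 25 | MaeB,SfcA | 0 | 0 | 0 | 0.08 | 0 | 0 | 0 | 0 | -0.2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.26 | 0 | 0 | -0.003 | 0.04 | 0 | 0 | 2* |
| 26 | AceA;AceB | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.11 | 0 | 0 | 0 | 0 | 0 | -0.01 | 0 | -0.02 | 0 | 0 | 0 | 0 | 0 | 0 | 25 |
| 27 | $\mu$ | 0 | 0.16 | -0.11 | 0.06 | -0.04 | 0 | 0 | 0.11 | 0.1 | 0 | 0 | 0.09 | 0 | -0.06 | 0 | 0 | 0 | -0.44 | 0 | 0.05 | -0.01 | 0 | 0 | 0* |
| 28 | Edd;Eda | 0 | 0 | 0 | 2.47 | 0 | 0 | 0 | 0 | 0 | -1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 29 | Pta;AckA,AckB | 0 | 0 | 0 | 0.0015 | 0 | 0 | 0 | 0 | -0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1.42 | 0 | -0.26 | -0.83 | 0 | 2.1 | 2* |
| 30 | LdhA | 0 | 0 | 0 | 2.67 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.11 | 0 | 0 | 6 |
| 31 | AdhE | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.91 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 28 |
| | % Missing Data | 3 | 17 | 0 | 48 | 7 | 34 | 59 | 10 | 3 | 72 | 3 | 38 | 3 | 59 | 3 | 14 | 14 | 0 | 62 | 79 | 79 | 17 | 17 | |

6 Discussion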
In this work we have addressed the problem of estimating parameters of approximate models of metabolic networks from incomplete datasets. Even with the largest datasets available at present, such as in [Ishii et al., 2007], the absence or corruption of a large number of measurements may reduce the effective number of datapoints to a handful of experimental conditions, thus making simple regression techniques ineffective or even inapplicable. Making full use of all the available data is therefore essential to render identification well-posed and improve the quality of the estimated models.
To this aim, we have proposed a maximum-likelihood method for the identification of linlog metabolic network models that compensates for the missing data by the use of statistical priors. We developed an algorithm that attains maximization of the likelihood based on Expectation Maximization, a well-accepted paradigm for the numerical optimization of likelihood functions in the presence of unobserved variables. A simpler implementation based on direct likelihood maximization via general-purpose numerical optimization algorithms was also considered and found slightly less powerful. The performance of EM was compared to that of an existing method of reference, namely multiple imputation, and to worst-case and best-case scenarios given by least-squares regression on the sole complete datapoints and on complete datasets, respectively. We showed that EM outperforms multiple imputation by a wide margin. In comparison with worst-case regression, it reduces the estimation variability and is able to produce reasonable estimation results even when regression on incomplete datasets is inapplicable. It also approaches the ideal performance of regression on complete datasets for low rates of missing data, regardless of noise.
Based on these findings, we applied EM to the identification of a linlog model for the central carbon metabolism in the bacterium E. coli from the experimental data presented by Ishii et al. [2007]. Even with the large amount of incomplete datapoints, due to the difficulty of experimentally measuring metabolite concentrations, EM was able to estimate many of the model parameters (elasticities) in agreement with the current understanding of the system. This is even true for reactions where the reduced number of complete datapoints impairs the applicability of least-squares regression. On the other hand, the challenging quality of the data sheds light on the performance limits of the method, which tends to fail when large measurement noise makes the estimation of small parameters statistically unreliable, when the same variable cannot be measured in most conditions, or when reactions operate near equilibrium.
Overall, results from the simulations and the application on real data showed that our EM approach is able to make the most of incomplete, noisy high-throughput datasets for the estimation of parameters in approximate kinetic models. In the future, we expect to improve performance by developing a number of technical points, including approximate analytic/dedicated numerical solutions for the EM maximization steps, the refinement of the identifiability analysis via SVD of incomplete data matrices [Brand, 2002], and a more detailed modeling of measurement noise. It is worth noting that, while the method has been developed on linlog models, it is more generally applicable to any other metabolic network model that can be put in a form linear in the parameters by straightforward manipulations, such as generalized mass action models that provide advantages when some metabolite concentrations approach 0 [Savageau, 1976, del Rosario et al., 2008]. In addition, estimated parameters of approximate metabolic models, such as elasticities of linlog models, provide useful hints for the identification of more detailed nonlinear kinetic models.
From a broader perspective, the application of the EM method to a unique multi-omics dataset for E. coli carbon metabolism allowed us to isolate issues that are critical for the appropriate exploitation of the data for parameter estimation, and that may need to be taken into account during the design of the experiments. One such issue is that a high percentage of missing data for some of the individual variables, even at a relatively low average percentage over the entire dataset, was found to be detrimental to the identification results. This may influence sampling strategies, especially for metabolites that are difficult to measure. Another issue is the identifiability problems caused by steady-state measurements, which cannot always be resolved by genetic mutation or by varying physiological conditions. From this perspective, time-resolved observations of the network dynamics carry great promise [Hardiman et al., 2007], although they are much more demanding experimentally.
Appendices

A Identifiability analysis
First of all, we note that in model (7), for each reaction $i$, only a subset of metabolites is involved. That is, only a subset of the entries of vector $c_{\cdot i}$ need to be identified, while the values of the remaining entries are fixed to $0$. Let us call $n_{c_i}$, with $0 < n_{c_i} < (n + p)$, the effective number of parameters to identify. A straightforward reformulation of the regression model is the following:

$$w_{\cdot i} = Y' \cdot c' + \varepsilon_{\cdot i} \tag{16}$$

with $c' \in \mathbb{R}^{n_{c_i}}$ a vector collecting the nonzero values of $c_{\cdot i}$ and $Y' \in \mathbb{R}^{q \times n_{c_i}}$ a matrix composed of the corresponding columns of $Y$, i.e., the metabolites involved in reaction $i$. Bearing this in mind, for simplicity, we will drop index $i$ in the sequel, writing $n_c$ in place of $n_{c_i}$ and sticking to the usual notation $w = Y \cdot c + \varepsilon$.

To deal with identifiability issues, we use Principal Component Analysis (PCA) on the model (16) [Jolliffe, 1986, Nikerel et al., 2006]. To detect non-identifiable parameters, we decompose the data matrix $Y$ using Singular Value Decomposition (SVD):

$$Y = U \cdot \mathrm{diag}(s_1, s_2, \ldots, s_{n_c}) \cdot V^T \tag{17}$$

with $U \in \mathbb{R}^{q \times q}$ and $V \in \mathbb{R}^{n_c \times n_c}$ orthonormal matrices and $s_1 \geq \cdots \geq s_{n_c} \geq 0$ the singular values of $Y$. Note that the sum of squared singular values is equal to the square of the Frobenius norm of $Y$, $\|Y\|^2 = \sum_k \sum_j Y_{j,k}^2$, that is, in an equivalent statistical interpretation, to the sum of the variances of all metabolite concentrations over $q$ experiments (recall that each column of $Y$ has zero mean by definition).

In presence of dependencies among data, there exists an index $r$ with $1 \leq r < n_c$ such that $s_{r+1} = \cdots = s_{n_c} = 0$. Then, for any two vectors $w$ and $c$ such that $w = Y \cdot c$, there exists an $(n_c - r)$-dimensional vector space $K_Y \subseteq \mathbb{R}^{n_c}$ (the kernel of $Y$) such that $w = Y \cdot (c + k_Y)$ also holds for any $k_Y \in K_Y$. For the purpose of identification, this implies that $c$ cannot be uniquely reconstructed from the data. (The case where $(s_{r+1}, \cdots, s_{n_c})$ are only approximately $0$ will be discussed later in this section.)

We rely on the observation that $K_Y = \mathrm{range}(V_{r+1:n_c})$, the vector space generated by the last $n_c - r$ columns of $V$. In order to formulate a regression problem with a well-defined solution, we rewrite model (16) in terms of a reduced data matrix $Y^* \in \mathbb{R}^{q \times r}$ and a reduced parameter vector $c^* \in \mathbb{R}^{r}$ as follows:

$$w = Y^* \cdot c^* + \varepsilon, \qquad Y^* = Y \cdot V_{1:r} \tag{18}$$

where $V_{1:r} \in \mathbb{R}^{n_c \times r}$ is the matrix obtained by extracting the first $r$ columns of $V$. Since $Y^*$ is full-column rank, $Y^* \cdot c^*$ is a linear combination in $c^*$ of independent data vectors. This ensures that the solution to the regression (18) is unique, hence we call $c^*$ `identifiable'.

Given a unique solution $c^*$ to the regression (18), the space of undistinguishable solutions for the original parameter vector $c$ in (16) can then be defined as follows:

$$c \in \{V_{1:r} \cdot c^* + k_Y,\; k_Y \in K_Y\} \tag{19}$$

Depending on the structure of the orthonormal matrix $V$, we may be able to isolate some entries of $c$ that can be uniquely determined from the reduced model, that is, from the estimates of $c^*$. This happens when all elements of at least one row of $V_{r+1:n_c}$ are equal to $0$. Indeed, this is the criterion we used in Sec. 5 to isolate identifiable parameters in nonidentifiable reactions, such as reactions 17 and 19 in Table 1.

In practice, singular values are rarely exactly $0$, even in presence of data dependencies. This can be due to several causes, including measurement noise, numerical roundoffs, etc. Still, for some $r < n_c$, values of $(s_{r+1}, \cdots, s_{n_c})$ sufficiently close to $0$ can make the estimates of $c$ solving regression poorly determined. Thus a criterion to discard those singular values needs to be defined.

In this paper, we approximate by zero all singular values whose total contribution to the variance of $Y$, i.e., to the `informativity' of the metabolite data, is under a fraction $1 - \lambda$ of the total, for some suitable threshold $\lambda$. That is, we define

$$r = \min\left\{ t \in [1..(n_c - 1)] \;:\; \frac{\sum_{k=1}^{t} s_k^2}{\sum_{k=1}^{n_c} s_k^2} \geq \lambda \right\} \tag{20}$$

and set $s_{r+1} = \cdots = s_{n_c} = 0$. By this approximation, from (17) we obtain a data matrix $Y$ of rank $r < n_c$. The PCA method described before then applies. The results discussed in Sec. 4 were obtained with $\lambda = 0.99$. In the application to the biological model of Sec. 5, the choice of $\lambda$ was driven by the level of measurement noise corrupting the data matrix $Y$. This was quantified based on the variance of the data over the available replicas in a fixed experimental condition [Ishii et al., 2007]. Depending on the reaction, the value of $\lambda$ ranged from $0.94$ to $0.999$.

B Likelihood-based identification of linlog models
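We rely on the notation of Sec. 3, i.e., we focus on a single reaction and drop index $i$ from the notation. The log likelihood of the model is:

$$\log L(c) = \log \int f_{W|\breve{y},\tilde{y},c}(w)\, f_{\tilde{Y}|\breve{y},c}(\tilde{y})\, d\tilde{y} \tag{21}$$

For convenience, we rewrite (21) in terms of the random variable $Z = \tilde{Y} \cdot c$ introduced in Sec. 3 so that it becomes:

$$\log L(c) = \log \int f_{W|\breve{y},z,c}(w)\, f_{Z|\breve{y},c}(z)\, dz. \tag{22}$$

Here $f_{W|\breve{y},z,c}(\cdot)$ is the Gaussian likelihood function of model (7), equivalently rewritten as $W = \breve{Y} \cdot c + Z + \varepsilon$, given $\breve{Y} = \breve{y}$ and $Z = z$, with $z$ varying over all possible values of $Z$, and $f_{Z|\breve{y},c}$ is the Gaussian prior of $Z = \tilde{Y} \cdot c$ following from (10). The expressions of $f_{W|\breve{y},z,c}$ and $f_{Z|\breve{y},c}$ are thus

$$\begin{aligned}
f_{W|\breve{y},z,c}(w) &= \frac{1}{\sqrt{\det(2\pi\Sigma_\varepsilon)}} \exp\left(-\frac{1}{2}\,[w - \breve{Y} \cdot c - z]^T\, \Sigma_\varepsilon^{-1}\, [w - \breve{Y} \cdot c - z]\right), \\
f_{Z|\breve{y},c}(z) &= \frac{1}{\sqrt{\det(2\pi\Sigma_{\breve{y},c})}} \exp\left(-\frac{1}{2}\,[z - \mu_{\breve{y},c}]^T\, \Sigma_{\breve{y},c}^{-1}\, [z - \mu_{\breve{y},c}]\right),
\end{aligned} \tag{23}$$

with $\mu_{\breve{y},c} = M \cdot c$, where the entry $M_{j,k}$ of matrix $M$ is the mean $\mu_{j,k}$ of the distribution of $\tilde{Y}_{j,k}$, and $\Sigma_{\breve{y},c}$ is the variance matrix of the random variable $Z$. By the independence assumptions on $\tilde{Y}$, it turns out that

$$\Sigma_{\breve{y},c} = \mathrm{diag}\left( \sum_{k=1}^{n_c} c_k^2 \cdot [\sigma_{1,k}^2 \cdots \sigma_{q,k}^2]^T \right) \tag{24}$$

where $\sigma_{j,k}$ is defined in (10).

Assume for the moment that $\Sigma_{\breve{y},c}$ is invertible. Defining

$$f^p_c(z) = f_{W|\breve{y},z,c}(w) \cdot f_{Z|\breve{y},c}(z), \tag{25}$$

after simple but tedious calculations, we obtain

$$f^p_c(z) = \kappa_{f_c} \cdot f_c(z) \tag{26}$$

with $f_c$ the density function of a Gaussian distribution $\mathcal{N}(\mu_{f_c}, \Sigma_{f_c})$ and

$$\begin{aligned}
\Sigma_{f_c} &= [\Sigma_\varepsilon^{-1} + \Sigma_{\breve{y},c}^{-1}]^{-1} \\
\mu_{f_c} &= \Sigma_{f_c} \cdot [\Sigma_\varepsilon^{-1} \cdot (w - \breve{Y} \cdot c) + \Sigma_{\breve{y},c}^{-1} \cdot \mu_{\breve{y},c}] \\
\kappa_{f_c} &= \frac{\exp\left(-\frac{1}{2}\,[w - \breve{Y} \cdot c - \mu_{\breve{y},c}]^T \cdot [\Sigma_\varepsilon + \Sigma_{\breve{y},c}]^{-1} \cdot [w - \breve{Y} \cdot c - \mu_{\breve{y},c}]\right)}{\sqrt{\det(2\pi[\Sigma_\varepsilon + \Sigma_{\breve{y},c}])}}.
\end{aligned} \tag{27}$$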
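
The factorization in (25)-(27) can be checked numerically. The following Python sketch evaluates both sides of (26) at one point $z$; it is an illustration only, and the helper names `gaussian_pdf` and `product_terms`, as well as all numerical values, are assumptions introduced for the example. The argument `resid` stands for $w - \breve{Y} \cdot c$.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of N(mean, cov) at x, normalized by sqrt(det(2*pi*cov))."""
    d = x - mean
    _, logdet = np.linalg.slogdet(2.0 * np.pi * cov)
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d) - 0.5 * logdet)

def product_terms(resid, mu_z, cov_eps, cov_z):
    """Return (kappa, mu_f, cov_f) as in Eq. (27), with resid = w - Y_obs c."""
    prec_eps = np.linalg.inv(cov_eps)
    prec_z = np.linalg.inv(cov_z)
    cov_f = np.linalg.inv(prec_eps + prec_z)
    mu_f = cov_f @ (prec_eps @ resid + prec_z @ mu_z)
    # kappa is the N(mu_z, cov_eps + cov_z) density evaluated at resid
    kappa = gaussian_pdf(resid, mu_z, cov_eps + cov_z)
    return kappa, mu_f, cov_f

# Made-up numbers for q = 3 experiments
resid = np.array([0.3, -0.2, 0.1])
mu_z = np.array([0.1, 0.0, -0.1])
cov_eps = np.diag([0.1, 0.2, 0.3])
cov_z = np.diag([0.5, 0.4, 0.3])
z = np.array([0.2, 0.1, 0.0])

kappa, mu_f, cov_f = product_terms(resid, mu_z, cov_eps, cov_z)
lhs = gaussian_pdf(resid, z, cov_eps) * gaussian_pdf(z, mu_z, cov_z)  # Eq. (25)
rhs = kappa * gaussian_pdf(z, mu_f, cov_f)                            # Eq. (26)
log_likelihood = np.log(kappa)  # closed-form log L(c) once z is integrated out
print(np.isclose(lhs, rhs))
```

Since $f_c$ integrates to one, the logarithm of `kappa` is the value of the log likelihood for the chosen parameter vector, which is what makes a direct maximization tractable.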
The proportionality factor $\kappa_{f_c}$ does not depend on the integration variable $z$, so it can be taken out of the integral and (22) can be rewritten as follows:

$$\log L(c) = \log(\kappa_{f_c}) + \log \int f_c(z)\, dz$$

The integral of a normalized Gaussian density function being $1$, we finally have an analytical expression for the log likelihood:

$$\log L(c) = \log(\kappa_{f_c}).$$

The above results are used in the expectation step of the EM algorithm. Recall the definition

$$Q(c|\hat{c}^{\ell-1}) = \int \log(f_{Z,W|\breve{y},c}(z, w))\, f_{Z|\breve{y},\hat{c}^{\ell-1},w}(z)\, dz. \tag{29}$$

The Bayes theorem allows us to rewrite (29) as follows:

$$Q(c|\hat{c}^{\ell-1}) = \int \log(f_{W|\breve{y},z,c}(w)\, f_{Z|\breve{y},c}(z))\, \frac{f_{W|\breve{y},z,\hat{c}^{\ell-1}}(w)\, f_{Z|\breve{y},\hat{c}^{\ell-1}}(z)}{f_{W|\breve{y},\hat{c}^{\ell-1}}(w)}\, dz. \tag{30}$$

Function $f_{W|\breve{y},\hat{c}^{\ell-1}}(w)$ does not depend on $z$ so it can be taken out of the integral. Moreover, this function does not depend on $c$ so it will have no impact on the maximization step of EM. Thus, we can ignore this function from the computation of the expectation function above.

Using definitions (25) and (26), we can rewrite (30) in the following way:

$$Q(c|\hat{c}^{\ell-1}) \propto \int \kappa_{f_{\hat{c}^{\ell-1}}}\, f_{\hat{c}^{\ell-1}}(z)\, \log(\kappa_{f_c} f_c(z))\, dz \propto \int f_{\hat{c}^{\ell-1}}(z)\, \log(\kappa_{f_c} f_c(z))\, dz. \tag{31}$$

We have dropped the constant factor $\kappa_{f_{\hat{c}^{\ell-1}}}$ as it does not depend on $c$ and thus does not influence the maximization step of EM.

By replacing $\log(\kappa_{f_c} f_c(z))$ by $\log(\kappa_{f_c} f_c(z) f_{\hat{c}^{\ell-1}}(z) / f_{\hat{c}^{\ell-1}}(z))$ and separating the integrand in a sum of terms, we can rewrite (31) as

$$-Q(c|\hat{c}^{\ell-1}) \propto \int f_{\hat{c}^{\ell-1}}(z)\, \log\frac{f_{\hat{c}^{\ell-1}}(z)}{f_c(z)}\, dz - \int f_{\hat{c}^{\ell-1}}(z)\, \log(f_{\hat{c}^{\ell-1}}(z))\, dz - \log(\kappa_{f_c}). \tag{32}$$

We recognize in the first term the definition of the Kullback-Leibler divergence $KL(f_c \| f_{\hat{c}^{\ell-1}})$ between the two probability distributions $f_c$ and $f_{\hat{c}^{\ell-1}}$ and in the second term the entropy $H(f_{\hat{c}^{\ell-1}})$ of $f_{\hat{c}^{\ell-1}}$ [Cover and Thomas, 2006, Stoorvogel and van Schuppen, 1996]. For Gaussian distributions, these can be written explicitly as

$$KL(f_c \| f_{\hat{c}^{\ell-1}}) = \frac{1}{2}\left( \log\left(\frac{\det(\Sigma_{f_c})}{\det(\Sigma_{f_{\hat{c}^{\ell-1}}})}\right) + Tr(\Sigma_{f_c}^{-1}\, \Sigma_{f_{\hat{c}^{\ell-1}}}) + [\mu_{f_c} - \mu_{f_{\hat{c}^{\ell-1}}}]^T\, \Sigma_{f_c}^{-1}\, [\mu_{f_c} - \mu_{f_{\hat{c}^{\ell-1}}}] \right), \tag{33}$$

where $Tr(\ldots)$ stands for trace and

$$H(f_{\hat{c}^{\ell-1}}) = \log\sqrt{\det(2\pi e\, \Sigma_{f_{\hat{c}^{\ell-1}}})}. \tag{34}$$

To summarize, substituting (33) and (34), together with (27), into (32) gives us the explicit formula which we employ in our implementation of EM.
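
As an illustration of how the terms in (32)-(34) combine, the following Python sketch evaluates the right-hand side of (32) for given Gaussian parameters. It is a sketch under stated assumptions, not the authors' code: the function names `kl_term`, `entropy` and `neg_q` are hypothetical, and the numerical values are made up. Note that (33) as printed does not subtract the dimension $q$ that appears in the textbook Gaussian Kullback-Leibler divergence; the difference is a constant that does not depend on $c$, so it does not change the maximizer.

```python
import numpy as np

def kl_term(mu_c, cov_c, mu_prev, cov_prev):
    """Right-hand side of Eq. (33) as printed (textbook KL plus q/2)."""
    d = mu_c - mu_prev
    _, logdet_c = np.linalg.slogdet(cov_c)
    _, logdet_prev = np.linalg.slogdet(cov_prev)
    prec_c = np.linalg.inv(cov_c)
    return 0.5 * (logdet_c - logdet_prev
                  + np.trace(prec_c @ cov_prev)
                  + d @ prec_c @ d)

def entropy(cov):
    """Gaussian entropy, Eq. (34): log sqrt(det(2*pi*e*cov))."""
    _, logdet = np.linalg.slogdet(2.0 * np.pi * np.e * cov)
    return 0.5 * logdet

def neg_q(log_kappa_c, mu_c, cov_c, mu_prev, cov_prev):
    """-Q(c | c_prev) up to terms constant in c, following Eq. (32)."""
    return kl_term(mu_c, cov_c, mu_prev, cov_prev) + entropy(cov_prev) - log_kappa_c

# Made-up Gaussian parameters (q = 2)
mu = np.array([0.1, -0.2])
cov = np.array([[0.5, 0.1], [0.1, 0.3]])
same = kl_term(mu, cov, mu, cov)          # equals q/2 for identical Gaussians
shifted = kl_term(mu + 0.5, cov, mu, cov)  # larger once the means differ
print(same, shifted > same)
```

In an EM iteration, `neg_q` would be minimized over $c$, with `mu_c`, `cov_c` and `log_kappa_c` recomputed from (27) for each candidate $c$ and the previous-iterate quantities held fixed.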
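In more generality, for some values of $c$, $\Sigma_{\breve{y},c}$ may be singular or poorly conditioned. To avoid this circumstance, we can adapt our procedure as follows. We consider a decomposition

$$W = \breve{Y} \cdot c + Z + (\varepsilon' + \varepsilon'') = \breve{Y} \cdot c + (Z + \varepsilon') + \varepsilon'' \tag{36}$$

where $\varepsilon'$ and $\varepsilon''$ are independent zero-mean Gaussian random vectors such that $\Sigma_{\varepsilon'} \triangleq Var(\varepsilon') = \alpha\Sigma_\varepsilon$ and $\Sigma_{\varepsilon''} \triangleq Var(\varepsilon'') = (1 - \alpha)\Sigma_\varepsilon$, with $\alpha \in (0, 1)$ a tunable parameter. Since $\Sigma_\varepsilon > 0$ by assumption, it follows that $\Sigma_{\varepsilon'} > 0$ and $\Sigma_{\varepsilon''} > 0$. Moreover, $\Sigma_\varepsilon = \Sigma_{\varepsilon'} + \Sigma_{\varepsilon''}$, i.e., the statistics of $\varepsilon$ and of $\varepsilon' + \varepsilon''$ are identical. Since $Var(Z + \varepsilon') = \Sigma_{\breve{y},c} + \Sigma_{\varepsilon'} > 0$, if we interpret $Z + \varepsilon'$ as the unknown observations (in place of $Z$) and $\varepsilon''$ as the model noise (in place of $\varepsilon$), we ensure that the variance of the `missing data' is invertible. Thus, in practice, we apply all formulas developed above with $\Sigma_{\breve{y},c} + \Sigma_{\varepsilon'}$ in place of $\Sigma_{\breve{y},c}$ and $\Sigma_{\varepsilon''}$ in place of $\Sigma_\varepsilon$.

The effect of the specific choice of $\alpha$ is under investigation. In this work, we took $\alpha = 0.2$, a value that leads to good results in practice.

C Validation on synthetic data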
The model used for comparing performance of the identification algorithms is a reduced synthetic linlog model of the E. coli central carbon metabolism network (Fig. 4). This network contains 17 variables, describing internal and external metabolites, and 25 reactions, summarized in Table 2 and Table 3, respectively. The linlog model has the form of Eq. (1)-(2).
| Index | Name | Symbol |
|---|---|---|
| 1 | Pyruvate | PYR |
| 2 | Phosphoenol-pyruvate | PEP |
| 3 | Glyceraldehyde-3-phosphate | G3P |
| 4 | Fructose-6-phosphate | F6P |
| 5 | Glucose-6-phosphate | G6P |
| 6 | 3-phosphoglycerate | 3PG |
| 7 | Dihydroxyacetone phosphate | DHAP |
| 8 | Ribulose-5-phosphate | Ru5P |
| 9 | Ribose-5-phosphate | R5P |
| 10 | 6-phosphogluconate | 6PG |
| 11 | Erythrose-4-phosphate | E4P |
| 12 | Xylulose-5-phosphate | X5P |
| 13 | 2-phosphoglycerate | 2PG |
| 14 | 1,3-diphosphoglycerate | 1,3DP |
| 15 | Fructose-1,6-bisphosphate | F1,6P |
| 16 | 2-keto-3-deoxy-6-phosphogluconate | KDPG |
| 17 | Sedoheptulose-7-phosphate | S7P |

Table 2: Metabolites included in the synthetic linlog model.
| Index | Name |
|---|---|
| 1 | Phosphotransferase system |
| 2 | Glucose-6-phosphate isomerase |
| 3 | Glucose-6-phosphate dehydrogenase |
| 4 | Phosphofructokinase |
| 5 | Transaldolase |
| 6 | Transketolase a |
| 7 | Transketolase b |
| 8 | Aldolase |
| 9 | Glyceraldehyde-3-phosphate dehydrogenase |
| 10 | Triosephosphate isomerase |
| 11 | Glycerol-3-phosphate dehydrogenase |
| 12 | Phosphoglycerate kinase |
| 13 | Serine synthesis |
| 14 | Phosphoglycerate mutase |
| 15 | Enolase |
| 16 | Pyruvate kinase |
| 17 | PEP carboxylase |
| 18 | Pyruvate synthesis |
| 19 | 6-Phosphogluconate dehydrogenase |
| 20 | Ribose-phosphate isomerase |
| 21 | Ribulose-phosphate epimerase |
| 22 | Ribose-phosphate pyrophosphokinase |
| 23 | Phosphogluconate dehydratase |
| 24 | KDPG aldolase |
| 25 | Fructose bisphosphatase |