Contents lists available atScienceDirect
International Journal of Approximate Reasoning
www.elsevier.com/locate/ijar
Bayesian selection of graphical regulatory models
✩Silvia Liverania,∗,Jim Q. Smithb
aDepartmentofMathematics,BrunelUniversityLondon,UK bDepartmentofStatistics,UniversityofWarwick,Coventry,UK
a r t i c l e i n f o a b s t r a c t
Articlehistory:
Received15October2015
Receivedinrevisedform5May2016 Accepted31May2016
Availableonline15June2016
Keywords:
Clustering Bayesianinference Causality
Time-coursemicroarrayexperiments
We define a new class of coloured graphical models, called regulatory graphs. These graphs have their own distinctive formal semantics and can directly represent typical qualitativehypothesesaboutregulatoryprocesseslikethosedescribedbyvariousbiological mechanisms.Theyadmitanembellishmentintoclassesofprobabilisticstatisticalmodels and sostandard Bayesian methodsofmodel selection canbeused tochoose promising candidate explanations of regulation. Regulation is modelled by the existence of a deterministicrelationshipbetweenthelongitudinalseriesofobservationslabelled bythe receivingvertexandthedonatingone.Thisclasscontainslongitudinalclustermodelsasa degenerategraph.Edgecoloursdirectlydistinguishimportant featuresofthemechanism likeinhibitionandexcitationandgraphs areoftencyclic.Withappropriatedistributional assumptions,becausetheregulatory relationshipsmapontoeachotherthroughagroup structure,it ispossible todefineaconditionalconjugate analysis.Thismeansthateven whenthe model spaceishuge itisneverthelessfeasible, usingaBayesian MAPsearch, to a discover regulatory network with a high Bayes Factor score. We also show that, liketheclassofBayesianNetworks,regulatorygraphsalsoadmitaformalbutdistinctive causalalgebra.Thetopologyofthegraphthenrepresentscollectionsofhypothesesabout thepredicted effectofcontrollingthe processbytearingoutmessagepassers orforcing themtotransmit certainsignals. We illustrateourmethodsonamicroarrayexperiment measuring the expression of thousands of genes as a longitudinal series where the scientificinterestliesinthecircadianregulationoftheseplants.
©2016TheAuthor(s).PublishedbyElsevierInc.Thisisanopenaccessarticleunderthe CCBY-NC-NDlicense(http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction
There has been a recentexplosion in the development of newtechniques in exploratory dataanalysis forhuge data setsfuelled,forexample,byintenseactivitiesinthestudyofgenomesequencesandDNAmicroarrayanalyses.Examplesof such exploratorytoolsarethefastBayesian methods[7,21,8],designedtocluster tensofthousandsoflongitudinalseries ofmeasurements.Theoutputs fromthesemethods,especially thedisplaysofthecluster means(seee.g. Fig. 1)ofa MAP estimatedmodel, haveproved to be particularlyhelpfulto biologistswhen they explore hypotheses aboutwhichsets of genesmightbepassingamessagetoanother.
HowevertheseBayesianclustermethodsbeginwiththeassumptionthatclustersexpressindependentlyofoneanother.
Thisisclearly notusuallyacrediblehypothesisforregulation.Forunderahypothesisthatoneclusterregulatesanotherit
✩ ThispaperispartofthevirtualspecialissueonStatisticalandcomputationalmethodsforgenomicsandintegrativegenomics,editedbyPaolaRancoita andCassioCampos.
* Correspondingauthor.
E-mailaddress:[email protected](S. Liverani).
http://dx.doi.org/10.1016/j.ijar.2016.05.007
0888-613X/©2016TheAuthor(s).PublishedbyElsevierInc.ThisisanopenaccessarticleundertheCCBY-NC-NDlicense (http://creativecommons.org/licenses/by-nc-nd/4.0/).
istypicallybelievedthattheshapeoftheprofileofthereceivingclusterwillberelatedtotheprofileofthesendingcluster inarelativelypredictableway.
Various authorshavethereforemodelledthelikely dependencesbetweenthetime coursesofgeneexpressionprofiles usingavarietyofstandardgraphicalmodels:oftenvariantsoftheBayesianNetwork(BN)[9].Thus[12],acknowledgingthe consequentcomputationalandmethodologicallimitationsoftheirtechniques,useddynamicBNstomodelsubsetsofseries like those we model here. Albeitin a non-exploratory context[19] analysed variations ofa baseline model representing knowndependencestructures.Othergraphicalstudiesofregulatorymodelsinclude[4].
Althoughthesegraphicalmodelsareclearlyveryuseful,particularlyatlaterstagesoftheanalysisofagivenprocess,they arehardlyeverintegratedintoanearlyexploratorydataanalysis[2,13,20].Furthermorewearguebelowthatthesemantics oftheBNisnotwellmatchedtomodellingdependencesinducedbyregulation.Ratherthanrepresentingconditionalinde- pendence relationships,ideallywe wouldlikeourgraphtobe explicitlyregulatory:i.e.thatadirectededgefromaparent node v1isconnectedtoachildv2 ifandonlyifthemodelembodiesthehypothesisthattheobjectv1regulates v2.Clearly thisisnotingeneraltrueofaBN.Inparticularthereisnowayitsacyclicgraphcouldembodyanycyclicalregulation,the mostimportanttypeofrelationshipinourrunningexample.
InthispaperwedevelopanewgraphicalmodelfordepictingregulatoryrelationshipswithinaGaussianfamily,building on aclassofBayesianclustermodels thathavebeenwidelyandsuccessfullyimplemented [3,7].Themodelclassembeds the class ofBayesian longitudinalcluster models asaspecial case. Weare aware ofno similarwork in theliterature, in scopenordesign.Wedemonstratehowthemethodprovides:
1. afeasibleexploratorytechniqueforregulatorynetworksoflongitudinalhigh-dimensionaldata;
2. aformalsemanticsthatembodiesmostoftheusefulpropertiesoftheBN;
3. aframeworkforaconditionalconjugateanalysis;
4. aformal graphicalrepresentationofa MAPmodelwhichcan representthetypesofdependencesthe scientistmight hypothesise; thisgraphicalrepresentation willbe ina formwhich closelycorresponds tothe lessformal graphs she mightcurrentlyusetodepictthatmodel’sregulatorymechanism;
5. anassociatedcausalalgebraenabling thescientist,undercertainassumptions, togeneratepredictionsoftheeffectof certaintypesofcontrolonthesystem.
Thisworkismotivatedbyourcollaborationwithbiologistssoourrunningexampleconcernsthestudyofthestatistical modelsoftheregulationmechanismofthecircadianclockinplants.However,theapplicabilityofthisnewclassofmodels extendsbeyondgeneticsandbioinformatics.
InSection 1.1we introducethedatausedinthepaperandinSection2webriefly reviewtheformofBayesian cluster analysis which we plan to generalise. Then in Section 3 we use a group action which is hypothesised to embody the relationshipoftheprofileshapeofareceivingclusterandtheprofileofitsregulatingcluster.Thisgroupdefinesequivalence classes ofclusterscalled supraclusters.These supraclusterswill define the disconnectedcomponents ofthe graphof our process.InSection4wediscussgraphicalrepresentationsforsupraclusters.Wecompleteourmethodologicaldevelopment in Section 5wherewe providea causalinterpretation forourgraphs ofsupraclusters.We thendemonstrate ourmethod usingtwoexamplesinSection6:asimulatedscenarioandarealsystem.Inthiswayweillustratenotonlythatourmethods arefeasibleandevocativebutalsohowtheoutputgraphsprovideahelpfulandfamiliarframeworkthroughwhichresults canbefedbacktothescientist.
1.1. Data
Wedemonstratethatourmethodsuccessfullyidentifiesdependencebetweenclustersofgenes intwoexamples:a sim- ulatedrunningexampleandamoredetailedanalysisofagenuinemicroarrayexperimentontheplantArabidopsisthaliana.
These experiments were designedto detect genes whose expression levels, andhencefunctionality, might be connected withcircadianrhythms.Theaimwastoinvestigatethepossibleidentityandroleofthosegenesinvolvedintheregulation ofthecircadianclockofaplant. Forourrunningexamplewesimulated5clustersof30geneseach,ofthetype discussed in[5].HerethegeneexpressionwasmeasuredatT=13 timepointsovertwodays.Constantwhitelightwasshoneonthe plantsfor26hoursbeforethefirstmicroarraywastaken,withsampleseveryfourhours.Thus,therearetwocyclesofdata foreachoftheArabidopsismicroarraychip.Thedata yzi wassimulatedfrom
yi=λziB(θz1)Rziβ+εi
wherethesubscriptzi= jifindividualibelongstocluster jwith j=1,. . . ,5 and εi∼Normal(0,4).ThematrixB(θzi)isa Fourierbasisfunction(seeEquation(2)formoredetails)andRzi isarotationmatrix,thatis,ablockdiagonalmatrixwhere eachblockisamatrixM(k,θzi)as
M(k, θzi)=
cos(kθ ) −sin(kθ ) sin(kθ ) cos(kθ )
withkcorrespondingtotheindexoftheFouriercoefficients.Thegeneratingvaluesofλzi,θzi andβ aregiveninSection6.
Forthesecondexampleweanalysedthisdatasetinfull (22,810genes whoseexpressionsareeach observedover13time point)aswellasotherdatasets,aswediscussinSection6.
2. Clustermodelsforlongitudinaldata
ThestudyofthefamilyofregulatorygraphsproposedinthispaperbeganwithourworkonBayesianclustermodels.In theapplicationsoftheproposedmethodologythereare N –usuallytensofthousands–longitudinalsetsofmeasurements oftheactivityofunitstakenatparticulardiscretepointsovertime.Tounderstandtheunderlyingprocessesitistherefore importanttolearnwhichoftheseprocessesarecopiesofeachother,asevidencedbythefactthat theirprofilesappearto bereplicates.Inthiswaywecanreducethetensofthousandsoftimeseriesintoamuchsmallernumberof K clusters.
TypicallyinourapplicationsK dependsonhowwesetthehyperparametersofourmodels.Howeverforourrealdataset thereareusuallyaboutonehundredlongitudinalclusters:astilllargebutmoremanageablenumber.Eachoftheselongitu- dinalclustersisthentreatedasacandidateregulator.Theunderlyingdependencestructurecanberepresentedgraphically anditcanbefurtherelaboratedtodescribehypothesesabouttheeffectsofthetypesofcontrolseludedtoabove.
Wefirstneedtosetupsomenotation.LetC{C1,C2, . . .CK}denoteapartitionofindices{1,2, . . . ,N}, N≥K,where Nk denotes the cardinalityof cluster Ck.Let Yi, i=1,2,. . . ,N, denotea set of N T-vectors ofreal valuedobservations of the continuous time development ofa unit at T given time points – henceforth calleda profile. Let DC, the T ×Nj matrixofvaluesbedefined by DC= {Yi:i∈C}forsomesubset C of{1,2, . . . ,N}.Letthedistributionof{Y1,Y2, . . . ,YN} be parametrised by θ∈, andfixed hyperparameters ϕ∈.Let θ=(θ1,θ2, . . .θK) whereθk,for k=1,2,. . . ,K isthe parametervectorspecific tothek-thclusterwithNk elements.NotethatthenK
k=1Nk=N.ThefirstBayesianmodelwe reviewisonewhichadmitsnodependenceorcausalstructurebetweenanyoftheobjectsofinterest–theclusterprofiles.
Definition1.ABayesclustermodelC ofasetofprofiles{Y1,Y2, . . . ,YN}assumesthat:
1. profilesindifferentclustersandtheirassociatedparametersareindependentgiventhehyperparametersi.e.
kK=1 DCk,θk
|ϕ
2. withineachclustertheprofilesYi|θk,ϕareidenticallydistributed,i=1,2,. . . ,Nk,k=1,2,. . . ,K andconditionallyon eachcluster’sparametervectorθkandthehyperparameters ϕ theseobservationsaretakenindependently:i.e.
Ni=k1Yi|θk,ϕ
TherenowareseveralsuchBayesianclustermodelswhichhavebeensuccessfullyandusefullyimplemented[7,16,5,17].
ManytaketheformthatforclusterCk thetimeseriesofobservationsismodelled byalinearregression
Y(ik)=X(k)β(k)+ε(ik) (1)
fork=1,. . . ,K andi=1,. . . ,Nk whereβ(k)∈Rp is thevector ofparameters with p≤T, whereε(ik) are independent identicallydistributedmeasurementerrorswithvariance σk2 andthedesignmatrix,orbasisfunction,X(k),amatrixofsize NkT×p,is customisedto the hypothesisedunderlying process. Inour running examplewhere interest liesin regulators actingoveratwentyfourhourcycle,anappropriatebasisfunctionisFourier.Sothen,forexamplewhen piseven,
y(itk)=β1(k)+
p/2
i=1
β2i(k)cos(2πt(2i)/T)+
p/2
i=1
β2i(k+)1sin(2πt(2i)/T)+εit(k). (2) Here theparameters θk associatedwithcluster Ck are thecorresponding regressionparameters βk andan errorvariance parameterlinkedtothewithinclusterdiversity.
Withintheseclassesitisoftenpossible,usinganappropriate conjugateanalysis,toconstructascorefunctionbasedon thecorrespondingBayesFactor–aMAPscorewhichcanbederivedinclosedformconditionalonthehyperparametersfor eachpartitionoftheprofiles.ThusiferrorsandpriorparameterdistributionsareassumedGaussian,andthenoisevariance aprioriinverseGammadistributedthenconditionalonafixednoise-to-signalratiomatrix,themarginallikelihoodofeach possibleclusterpartition isaproduct ofmultivariate t densities.Forthedetails ofhow thesemethods workseee.g.[3,7, 17]and[10].Ithasnowbeenestablishedthat theseandrelatedmethodologiescansuccessfullyandfeasiblyidentifywell supportedmodelswithinthesetypicallyhugesearchspaces.
Themethodsdiscussed inthissection arewidelyusedintheliterature onclusteringofmicroarray experiments.Fig. 1 depictstheexpressionprofileswithin afewtypicalclusterstogether withgraphsoftheclusters’ meanFouriercoefficients intheanalysisgivenin[5],oneofthemanyanalysesperformedusingthismethodology.
Ourexperience usingthese methodshave demonstratedthat withthe carefulsettingof hyperparametersthe outputs ofsuch modelsare remarkablyrobust,both formallyandpractically, whenit comesto theestimationofparameters that mightbelinkedtoregulation[5,17,10,11].
Fig. 1.Fouroftheclustersobtainedclusteringthedatain[5],ourrealdataapplication.Thebluelinerepresentstheposteriormean,thethickercoloured linesarewellknowngenesandthebarplotontheleftaretheposteriorestimatesoftheFouriercoefficients.(Forinterpretationofthereferencestocolour inthisfigurelegend,thereaderisreferredtothewebversionofthisarticle.)
3. Supraclusteringandregulatorymodels
Thistype ofcluster analysiscan beextended tomodeldirectlythedependences betweenclustersifaregulatory rela- tionship betweenthem canbespecifiedasfollows.Atransmissionfunctionisdefinedwhichmapstheparametersofone unit’s profileon tothe parameters oftheprofile it regulates.These functionsarethen coded asa groupof transmission matrices,customisedtomirrorthetypesofdependencesinducedbyaparticularhypothesisedregulatingmechanism.
In this section we fix the value of the cluster hyperparameter ϕ∈ and for clarity suppressthis parameter in the indexing.Then forthetransitionhyperparametersφ∈,therearetwo groupsusedtomap theshapeofoneprofileonto another(L∗)andonecluster’sparametersontoanother(L).
Forsomeintegerm letL∗
L∗(φ):φ∈⊆Rm beafamilyofT×T transitionmatricesusedtodefinea regulatory relationshipfromaprofileYi1 ofagivenuniti1toasecondunitwiththeprofileYi2,wherethetransitionhyperparameter φ indexeseach matrixin thisfamily.AssociatedwithL∗ isa set L
L(φ):φ∈⊆Rm of(p+1)×(p+1)matrices mappingtheparametersθj1,i∈Cj1 totheparametersoftheregulatedunitθj2,i2∈Cj2,suchthat
θj2L(φj1→j2)θj1
ThefamilyoftransitionfunctionsLparametrisedbyφj1→j2mustfirstbeelicitedfromthedomainexpertsincethismust be determinedfromthose featuresinthedata shebelieveswillbeindicativeofa regulatoryrelationship.Inour running examplesuch relationshipswerecommunicatedtous:“Thetime profileofexpressionofaregulatedgeneusually appears tobeashorttranslationoftheprofileoftheregulatingclusterofgenes,usuallyrescaledandsometimes,ifinhibitedrather than excited, reversed”.This determineda 3parametertransitionfunction witha parametercapturingthephase change, another expressing apositive rescaling anda third representinginhibition, which correspondedtoa simple signchange.
In thispaper, bothforsimplicity andbecausethe subclassseemed tofit well, we confinedourselves tosearch over this parameterclass,althoughotherofourmorecomplexmodelscouldalsomodeldampingeffects.Ofcourseinothersettings andwithdifferentinterpretationsofbasisfunctionsthegroupoftransformationscouldbeverydifferent.
For thepurposes of thispaperwe will assume, asinour runningexample, that both L∗ and Lforma group under matrixmultiplication.Writei2i1 wherei2∈Cj2,i1∈Cj1 andCj1,Cj2∈C iff∃φj1→j2∈suchthat