• Aucun résultat trouvé

Bayesian selection of graphical regulatory models International Journal of Approximate Reasoning

N/A
N/A
Protected

Academic year: 2021

Partager "Bayesian selection of graphical regulatory models International Journal of Approximate Reasoning"

Copied!
18
0
0

Texte intégral

(1)

Contents lists available atScienceDirect

International Journal of Approximate Reasoning

www.elsevier.com/locate/ijar

Bayesian selection of graphical regulatory models

Silvia Liverania,,Jim Q. Smithb

aDepartmentofMathematics,BrunelUniversityLondon,UK bDepartmentofStatistics,UniversityofWarwick,Coventry,UK

a r t i c l e i n f o a b s t r a c t

Articlehistory:

Received15October2015

Receivedinrevisedform5May2016 Accepted31May2016

Availableonline15June2016

Keywords:

Clustering Bayesianinference Causality

Time-coursemicroarrayexperiments

We define a new class of coloured graphical models, called regulatory graphs. These graphs have their own distinctive formal semantics and can directly represent typical qualitativehypothesesaboutregulatoryprocesseslikethosedescribedbyvariousbiological mechanisms.Theyadmitanembellishmentintoclassesofprobabilisticstatisticalmodels and sostandard Bayesian methodsofmodel selection canbeused tochoose promising candidate explanations of regulation. Regulation is modelled by the existence of a deterministicrelationshipbetweenthelongitudinalseriesofobservationslabelled bythe receivingvertexandthedonatingone.Thisclasscontainslongitudinalclustermodelsasa degenerategraph.Edgecoloursdirectlydistinguishimportant featuresofthemechanism likeinhibitionandexcitationandgraphs areoftencyclic.Withappropriatedistributional assumptions,becausetheregulatory relationshipsmapontoeachotherthroughagroup structure,it ispossible todefineaconditionalconjugate analysis.Thismeansthateven whenthe model spaceishuge itisneverthelessfeasible, usingaBayesian MAPsearch, to a discover regulatory network with a high Bayes Factor score. We also show that, liketheclassofBayesianNetworks,regulatorygraphsalsoadmitaformalbutdistinctive causalalgebra.Thetopologyofthegraphthenrepresentscollectionsofhypothesesabout thepredicted effectofcontrollingthe processbytearingoutmessagepassers orforcing themtotransmit certainsignals. We illustrateourmethodsonamicroarrayexperiment measuring the expression of thousands of genes as a longitudinal series where the scientificinterestliesinthecircadianregulationoftheseplants.

©2016TheAuthor(s).PublishedbyElsevierInc.Thisisanopenaccessarticleunderthe CCBY-NC-NDlicense(http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

There has been a recentexplosion in the development of newtechniques in exploratory dataanalysis forhuge data setsfuelled,forexample,byintenseactivitiesinthestudyofgenomesequencesandDNAmicroarrayanalyses.Examplesof such exploratorytoolsarethefastBayesian methods[7,21,8],designedtocluster tensofthousandsoflongitudinalseries ofmeasurements.Theoutputs fromthesemethods,especially thedisplaysofthecluster means(seee.g. Fig. 1)ofa MAP estimatedmodel, haveproved to be particularlyhelpfulto biologistswhen they explore hypotheses aboutwhichsets of genesmightbepassingamessagetoanother.

HowevertheseBayesianclustermethodsbeginwiththeassumptionthatclustersexpressindependentlyofoneanother.

Thisisclearly notusuallyacrediblehypothesisforregulation.Forunderahypothesisthatoneclusterregulatesanotherit

ThispaperispartofthevirtualspecialissueonStatisticalandcomputationalmethodsforgenomicsandintegrativegenomics,editedbyPaolaRancoita andCassioCampos.

* Correspondingauthor.

E-mailaddress:[email protected](S. Liverani).

http://dx.doi.org/10.1016/j.ijar.2016.05.007

0888-613X/©2016TheAuthor(s).PublishedbyElsevierInc.ThisisanopenaccessarticleundertheCCBY-NC-NDlicense (http://creativecommons.org/licenses/by-nc-nd/4.0/).

(2)

istypicallybelievedthattheshapeoftheprofileofthereceivingclusterwillberelatedtotheprofileofthesendingcluster inarelativelypredictableway.

Various authorshavethereforemodelledthelikely dependencesbetweenthetime coursesofgeneexpressionprofiles usingavarietyofstandardgraphicalmodels:oftenvariantsoftheBayesianNetwork(BN)[9].Thus[12],acknowledgingthe consequentcomputationalandmethodologicallimitationsoftheirtechniques,useddynamicBNstomodelsubsetsofseries like those we model here. Albeitin a non-exploratory context[19] analysed variations ofa baseline model representing knowndependencestructures.Othergraphicalstudiesofregulatorymodelsinclude[4].

Althoughthesegraphicalmodelsareclearlyveryuseful,particularlyatlaterstagesoftheanalysisofagivenprocess,they arehardlyeverintegratedintoanearlyexploratorydataanalysis[2,13,20].Furthermorewearguebelowthatthesemantics oftheBNisnotwellmatchedtomodellingdependencesinducedbyregulation.Ratherthanrepresentingconditionalinde- pendence relationships,ideallywe wouldlikeourgraphtobe explicitlyregulatory:i.e.thatadirectededgefromaparent node v1isconnectedtoachildv2 ifandonlyifthemodelembodiesthehypothesisthattheobjectv1regulates v2.Clearly thisisnotingeneraltrueofaBN.Inparticularthereisnowayitsacyclicgraphcouldembodyanycyclicalregulation,the mostimportanttypeofrelationshipinourrunningexample.

InthispaperwedevelopanewgraphicalmodelfordepictingregulatoryrelationshipswithinaGaussianfamily,building on aclassofBayesianclustermodels thathavebeenwidelyandsuccessfullyimplemented [3,7].Themodelclassembeds the class ofBayesian longitudinalcluster models asaspecial case. Weare aware ofno similarwork in theliterature, in scopenordesign.Wedemonstratehowthemethodprovides:

1. afeasibleexploratorytechniqueforregulatorynetworksoflongitudinalhigh-dimensionaldata;

2. aformalsemanticsthatembodiesmostoftheusefulpropertiesoftheBN;

3. aframeworkforaconditionalconjugateanalysis;

4. aformal graphicalrepresentationofa MAPmodelwhichcan representthetypesofdependencesthe scientistmight hypothesise; thisgraphicalrepresentation willbe ina formwhich closelycorresponds tothe lessformal graphs she mightcurrentlyusetodepictthatmodel’sregulatorymechanism;

5. anassociatedcausalalgebraenabling thescientist,undercertainassumptions, togeneratepredictionsoftheeffectof certaintypesofcontrolonthesystem.

Thisworkismotivatedbyourcollaborationwithbiologistssoourrunningexampleconcernsthestudyofthestatistical modelsoftheregulationmechanismofthecircadianclockinplants.However,theapplicabilityofthisnewclassofmodels extendsbeyondgeneticsandbioinformatics.

InSection 1.1we introducethedatausedinthepaperandinSection2webriefly reviewtheformofBayesian cluster analysis which we plan to generalise. Then in Section 3 we use a group action which is hypothesised to embody the relationshipoftheprofileshapeofareceivingclusterandtheprofileofitsregulatingcluster.Thisgroupdefinesequivalence classes ofclusterscalled supraclusters.These supraclusterswill define the disconnectedcomponents ofthe graphof our process.InSection4wediscussgraphicalrepresentationsforsupraclusters.Wecompleteourmethodologicaldevelopment in Section 5wherewe providea causalinterpretation forourgraphs ofsupraclusters.We thendemonstrate ourmethod usingtwoexamplesinSection6:asimulatedscenarioandarealsystem.Inthiswayweillustratenotonlythatourmethods arefeasibleandevocativebutalsohowtheoutputgraphsprovideahelpfulandfamiliarframeworkthroughwhichresults canbefedbacktothescientist.

1.1. Data

Wedemonstratethatourmethodsuccessfullyidentifiesdependencebetweenclustersofgenes intwoexamples:a sim- ulatedrunningexampleandamoredetailedanalysisofagenuinemicroarrayexperimentontheplantArabidopsisthaliana.

These experiments were designedto detect genes whose expression levels, andhencefunctionality, might be connected withcircadianrhythms.Theaimwastoinvestigatethepossibleidentityandroleofthosegenesinvolvedintheregulation ofthecircadianclockofaplant. Forourrunningexamplewesimulated5clustersof30geneseach,ofthetype discussed in[5].HerethegeneexpressionwasmeasuredatT=13 timepointsovertwodays.Constantwhitelightwasshoneonthe plantsfor26hoursbeforethefirstmicroarraywastaken,withsampleseveryfourhours.Thus,therearetwocyclesofdata foreachoftheArabidopsismicroarraychip.Thedata yzi wassimulatedfrom

yi=λziB(θz1)Rziβ+εi

wherethesubscriptzi= jifindividualibelongstocluster jwith j=1,. . . ,5 and εiNormal(0,4).ThematrixBzi)isa Fourierbasisfunction(seeEquation(2)formoredetails)andRzi isarotationmatrix,thatis,ablockdiagonalmatrixwhere eachblockisamatrixM(kzi)as

M(k, θzi)=

cos(kθ )sin(kθ ) sin(kθ ) cos(kθ )

(3)

withkcorrespondingtotheindexoftheFouriercoefficients.Thegeneratingvaluesofλzi,θzi andβ aregiveninSection6.

Forthesecondexampleweanalysedthisdatasetinfull (22,810genes whoseexpressionsareeach observedover13time point)aswellasotherdatasets,aswediscussinSection6.

2. Clustermodelsforlongitudinaldata

ThestudyofthefamilyofregulatorygraphsproposedinthispaperbeganwithourworkonBayesianclustermodels.In theapplicationsoftheproposedmethodologythereare N usuallytensofthousandslongitudinalsetsofmeasurements oftheactivityofunitstakenatparticulardiscretepointsovertime.Tounderstandtheunderlyingprocessesitistherefore importanttolearnwhichoftheseprocessesarecopiesofeachother,asevidencedbythefactthat theirprofilesappearto bereplicates.Inthiswaywecanreducethetensofthousandsoftimeseriesintoamuchsmallernumberof K clusters.

TypicallyinourapplicationsK dependsonhowwesetthehyperparametersofourmodels.Howeverforourrealdataset thereareusuallyaboutonehundredlongitudinalclusters:astilllargebutmoremanageablenumber.Eachoftheselongitu- dinalclustersisthentreatedasacandidateregulator.Theunderlyingdependencestructurecanberepresentedgraphically anditcanbefurtherelaboratedtodescribehypothesesabouttheeffectsofthetypesofcontrolseludedtoabove.

Wefirstneedtosetupsomenotation.LetC{C1,C2, . . .CK}denoteapartitionofindices{1,2, . . . ,N}, NK,where Nk denotes the cardinalityof cluster Ck.Let Yi, i=1,2,. . . ,N, denotea set of N T-vectors ofreal valuedobservations of the continuous time development ofa unit at T given time points henceforth calleda profile. Let DC, the T ×Nj matrixofvaluesbedefined by DC= {Yi:iC}forsomesubset C of{1,2, . . . ,N}.Letthedistributionof{Y1,Y2, . . . ,YN} be parametrised by θ, andfixed hyperparameters ϕ.Let θ=1,θ2, . . .θK) whereθk,for k=1,2,. . . ,K isthe parametervectorspecific tothek-thclusterwithNk elements.NotethatthenK

k=1Nk=N.ThefirstBayesianmodelwe reviewisonewhichadmitsnodependenceorcausalstructurebetweenanyoftheobjectsofinteresttheclusterprofiles.

Definition1.ABayesclustermodelC ofasetofprofiles{Y1,Y2, . . . ,YN}assumesthat:

1. profilesindifferentclustersandtheirassociatedparametersareindependentgiventhehyperparametersi.e.

kK=1 DCkk

|ϕ

2. withineachclustertheprofilesYi|θk,ϕareidenticallydistributed,i=1,2,. . . ,Nk,k=1,2,. . . ,K andconditionallyon eachcluster’sparametervectorθkandthehyperparameters ϕ theseobservationsaretakenindependently:i.e.

Ni=k1Yi|θk,ϕ

TherenowareseveralsuchBayesianclustermodelswhichhavebeensuccessfullyandusefullyimplemented[7,16,5,17].

ManytaketheformthatforclusterCk thetimeseriesofobservationsismodelled byalinearregression

Y(ik)=X(k)β(k)+ε(ik) (1)

fork=1,. . . ,K andi=1,. . . ,Nk whereβ(k)∈Rp is thevector ofparameters with pT, whereε(ik) are independent identicallydistributedmeasurementerrorswithvariance σk2 andthedesignmatrix,orbasisfunction,X(k),amatrixofsize NkT×p,is customisedto the hypothesisedunderlying process. Inour running examplewhere interest liesin regulators actingoveratwentyfourhourcycle,anappropriatebasisfunctionisFourier.Sothen,forexamplewhen piseven,

y(itk)=β1(k)+

p/2

i=1

β2i(k)cos(2πt(2i)/T)+

p/2

i=1

β2i(k+)1sin(2πt(2i)/T)+εit(k). (2) Here theparameters θk associatedwithcluster Ck are thecorresponding regressionparameters βk andan errorvariance parameterlinkedtothewithinclusterdiversity.

Withintheseclassesitisoftenpossible,usinganappropriate conjugateanalysis,toconstructascorefunctionbasedon thecorrespondingBayesFactoraMAPscorewhichcanbederivedinclosedformconditionalonthehyperparametersfor eachpartitionoftheprofiles.ThusiferrorsandpriorparameterdistributionsareassumedGaussian,andthenoisevariance aprioriinverseGammadistributedthenconditionalonafixednoise-to-signalratiomatrix,themarginallikelihoodofeach possibleclusterpartition isaproduct ofmultivariate t densities.Forthedetails ofhow thesemethods workseee.g.[3,7, 17]and[10].Ithasnowbeenestablishedthat theseandrelatedmethodologiescansuccessfullyandfeasiblyidentifywell supportedmodelswithinthesetypicallyhugesearchspaces.

Themethodsdiscussed inthissection arewidelyusedintheliterature onclusteringofmicroarray experiments.Fig. 1 depictstheexpressionprofileswithin afewtypicalclusterstogether withgraphsoftheclusters’ meanFouriercoefficients intheanalysisgivenin[5],oneofthemanyanalysesperformedusingthismethodology.

Ourexperience usingthese methodshave demonstratedthat withthe carefulsettingof hyperparametersthe outputs ofsuch modelsare remarkablyrobust,both formallyandpractically, whenit comesto theestimationofparameters that mightbelinkedtoregulation[5,17,10,11].

(4)

Fig. 1.Fouroftheclustersobtainedclusteringthedatain[5],ourrealdataapplication.Thebluelinerepresentstheposteriormean,thethickercoloured linesarewellknowngenesandthebarplotontheleftaretheposteriorestimatesoftheFouriercoefficients.(Forinterpretationofthereferencestocolour inthisfigurelegend,thereaderisreferredtothewebversionofthisarticle.)

3. Supraclusteringandregulatorymodels

Thistype ofcluster analysiscan beextended tomodeldirectlythedependences betweenclustersifaregulatory rela- tionship betweenthem canbespecifiedasfollows.Atransmissionfunctionisdefinedwhichmapstheparametersofone unit’s profileon tothe parameters oftheprofile it regulates.These functionsarethen coded asa groupof transmission matrices,customisedtomirrorthetypesofdependencesinducedbyaparticularhypothesisedregulatingmechanism.

In this section we fix the value of the cluster hyperparameter ϕ and for clarity suppressthis parameter in the indexing.Then forthetransitionhyperparametersφ,therearetwo groupsusedtomap theshapeofoneprofileonto another(L)andonecluster’sparametersontoanother(L).

Forsomeintegerm letL

L(φ):φ⊆Rm beafamilyofT×T transitionmatricesusedtodefinea regulatory relationshipfromaprofileYi1 ofagivenuniti1toasecondunitwiththeprofileYi2,wherethetransitionhyperparameter φ indexeseach matrixin thisfamily.AssociatedwithL isa set L

L(φ):φ⊆Rm of(p+1)×(p+1)matrices mappingtheparametersθj1,iCj1 totheparametersoftheregulatedunitθj2,i2Cj2,suchthat

θj2Lj1j2j1

ThefamilyoftransitionfunctionsLparametrisedbyφj1j2mustfirstbeelicitedfromthedomainexpertsincethismust be determinedfromthose featuresinthedata shebelieveswillbeindicativeofa regulatoryrelationship.Inour running examplesuch relationshipswerecommunicatedtous:“Thetime profileofexpressionofaregulatedgeneusually appears tobeashorttranslationoftheprofileoftheregulatingclusterofgenes,usuallyrescaledandsometimes,ifinhibitedrather than excited, reversed”.This determineda 3parametertransitionfunction witha parametercapturingthephase change, another expressing apositive rescaling anda third representinginhibition, which correspondedtoa simple signchange.

In thispaper, bothforsimplicity andbecausethe subclassseemed tofit well, we confinedourselves tosearch over this parameterclass,althoughotherofourmorecomplexmodelscouldalsomodeldampingeffects.Ofcourseinothersettings andwithdifferentinterpretationsofbasisfunctionsthegroupoftransformationscouldbeverydifferent.

For thepurposes of thispaperwe will assume, asinour runningexample, that both L and Lforma group under matrixmultiplication.Writei2i1 wherei2Cj2,i1Cj1 andCj1,Cj2C iffφj1j2suchthat

Références

Documents relatifs

But even the most basic document spanners from [25] can express all regular languages; and the core spanners that correspond to AQL can use word equality selections that allow

Under the Bayesian methodology for statistical inverse problems, the posterior distribution encodes the probability of the material parameters given the available

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des

Quantum field theory

The idea of the proof of Theorem 2 is to construct, from most of the restricted partitions of n into parts in A , many unrestricted partitions of n.. In Section 4, we will prove

In Section 2 we study the relations among the characteristic parameters of a given word and those of the words obtained by adding a letter on its right or on its left.. The main

Another result concerns the counting of repetitions. By repetition of length m in a word w we mean any unordered pair of distinct occurrences of the same factor of length m in w.

The analysis of the obtained data shows that a slight deviation of the average surface temperature can significantly affect the population growth and activity of the