Publisher’s version / Version de l'éditeur:
Trends in Genetics, 34, 10, pp. 790-805, 2018-08-22
READ THESE TERMS AND CONDITIONS CAREFULLY BEFORE USING THIS WEBSITE.
https://nrc-publications.canada.ca/eng/copyright
Vous avez des questions? Nous pouvons vous aider. Pour communiquer directement avec un auteur, consultez la première page de la revue dans laquelle son article a été publié afin de trouver ses coordonnées. Si vous n’arrivez pas à les repérer, communiquez avec nous à PublicationsArchive-ArchivesPublications@nrc-cnrc.gc.ca.
Questions? Contact the NRC Publications Archive team at
PublicationsArchive-ArchivesPublications@nrc-cnrc.gc.ca. If you wish to email the authors directly, please see the first page of the publication for their contact information.
NRC Publications Archive
Archives des publications du CNRC
This publication could be one of several versions: author’s original, accepted manuscript or the publisher’s version. / La version de cette publication peut être l’une des suivantes : la version prépublication de l’auteur, la version acceptée du manuscrit ou la version de l’éditeur.
For the publisher’s version, please access the DOI link below./ Pour consulter la version de l’éditeur, utilisez le lien DOI ci-dessous.
https://doi.org/10.1016/j.tig.2018.07.003
Access and use of this website and the material on it are subject to the Terms and Conditions set forth at
Enter the matrix: factorization uncovers knowledge from omics
Stein-O’brien, Genevieve L.; Arora, Raman; Culhane, Aedin C.; Favorov,
Alexander V.; Garmire, Lana X.; Greene, Casey S.; Goff, Loyal A.; Li, Yifeng;
Ngom, Aloune; Ochs, Michael F.; Xu, Yanxun; Fertig, Elana J.
https://publications-cnrc.canada.ca/fra/droits
L’accès à ce site Web et l’utilisation de son contenu sont assujettis aux conditions présentées dans le site LISEZ CES CONDITIONS ATTENTIVEMENT AVANT D’UTILISER CE SITE WEB.
NRC Publications Record / Notice d'Archives des publications de CNRC:
https://nrc-publications.canada.ca/eng/view/object/?id=45003843-121a-4853-8aeb-16267d825c7f
https://publications-cnrc.canada.ca/fra/voir/objet/?id=45003843-121a-4853-8aeb-16267d825c7f
Review
Enter
the
Matrix:
Factorization
Uncovers
Knowledge
from
Omics
Genevieve
L.
Stein-O’Brien,
1,2,3Raman
Arora,
4Aedin
C.
Culhane,
5,6Alexander
V.
Favorov,
1,7Lana
X.
Garmire,
8Casey
S.
Greene,
9,10Loyal
A.
Goff,
2,3Yifeng
Li,
11Aloune
Ngom,
12Michael
F.
Ochs,
13Yanxun
Xu,
14and
Elana
J.
Fertig
1,*
Omicsdatacontainsignalsfromthemolecular,physical,andkineticinter-and
intracellularinteractionsthatcontrolbiologicalsystems.Matrixfactorization(MF)
techniques canreveallow-dimensionalstructure from high-dimensional data that
reflect these interactions. These techniques can uncover new biological
knowledgefromdiversehigh-throughputomicsdatainapplicationsrangingfrom
pathwaydiscoverytotimecourseanalysis.Wereviewexemplaryapplicationsof
MFforsystems-levelanalyses. We discussappropriate applications ofthese
methods,theirlimitations,andfocusontheanalysisofresultstofacilitateoptimal
biologicalinterpretation.TheinferenceofbiologicallyrelevantfeatureswithMF
enables discovery from high-throughput data beyond the limits of current
biologicalknowledge– answering questions fromhigh-dimensionaldatathat
wehavenotyetthoughttoask.
DeterminingtheDimensionsofBiologyfromOmicsData
High-throughputtechnologieshaveusheredinaneraofbigdatainbiology[1,2]and
empow-eredinsilicoexperimentationwhichispoisedtocharacterizecomplexbiologicalprocesses
(CBPs;seeGlossary)[3].Thenaturalrepresentationofhigh-dimensionalbiologicaldataisa
matrixofthemeasuredvalues(expressioncounts,methylationlevels,proteinconcentrations,
etc.)inrowsandindividualsamplesincolumns(Figure1).Columnscorrespondingto
experi-mentalreplicates,orsampleswithsimilarphenotypes,willhavevaluesfromthesame
distri-butionofbiologicalvariation.Relatedstructuresinthedataareobservedbecausetheyshare
oneormoreCBP.TheactivityofCBPsneednotbeidenticalineachsample.Inthesecases,the
valuesofallmolecularcomponentsthatareassociatedwithaCBPwillchangeproportionallyto
therelativeactivityofthatCBP.ThesephenotypesandCBPactivitiesareoftenunknowna
priori, requiring computational techniques for unsupervised learning to discover CBPs
directlyfromthebiologicaldata.
TherelationshipsbetweenCBPsandsimilaritiesbetweensamplesconstrainhigh-dimensional
datasetstohavelow-dimensionalstructure.Thenumberofgenes,proteins,andpathwaysthat
areconcurrentlyactivewithinanycellisconstrainedbyitsenergyandfree-moleculelimitations
[4].OnlyacharacteristicsubsetofCBPswillbeactiveinanycellatagiventime.Thus,fora
datasetwherecolumnsshareCBP,alow-dimensionalstructurecanbeextracted whichis
smallerthaneitherthenumberofrowsorthenumberofcolumns.
Matrixfactorization(MF)isaclassofunsupervisedtechniquesthatprovideasetofprincipled
approachestoparsimoniouslyrevealthelow-dimensionalstructurewhilepreservingasmuch
informationaspossiblefromtheoriginaldata.MFisalsoreferredtoasmatrixdecomposition,
Highlights
MFstechniquesinferlow-dimensional structurefromhigh-dimensionalomics data to enable visualization and inferenceofcomplex biological pro-cesses(CBPs).
DifferentMFsappliedtothesamedata willlearndifferentfactors.Exploratory dataanalysisshouldemploymultiple MFs, whereas a specific biological question should employ a specific MFtailoredtothatproblem. MFslearntwosetsoflow-dimensional representations(ineachmatrixfactor) fromhigh-dimensionaldata:one defin-ingmolecularrelationships(amplitude) and another defining sample-level relationships(pattern).
Data-drivenfunctionalpathways, bio-markers,andepistaticinteractionscan belearnedfromtheamplitudematrix. Clustering,subtypediscovery,insilico microdissection,andtimecourse ana-lysisareallenabledbyanalysisofthe patternmatrix.
MFenablesbothmulti-omicsanalyses andanalysesofsingle-celldata.
1DepartmentofOncology,Divisionof
BiostatisticsandBioinformatics, SidneyKimmelComprehensive CancerCenter,JohnsHopkinsSchool ofMedicine,Baltimore,MD,USA
2DepartmentofNeuroscience,Johns
HopkinsSchoolofMedicine, Baltimore,MD,USA
3McKusick-NathansInstituteof
GeneticMedicine,JohnsHopkins SchoolofMedicine,Baltimore,MD, USA
4DepartmentofComputerScience,
InstituteforDataIntensiveEngineering andScience,JohnsHopkins University,Baltimore,MD,USA
5
DepartmentofBiostatisticsand ComputationalBiology,Dana-Farber CancerInstitute,Boston,MA,USA
6DepartmentofBiostatistics,Harvard
THChanSchoolofPublicHealth, Boston,MA,USA
7VavilovInstituteofGeneralGenetics,
Moscow,Russia
8UniversityofHawaiiCancerCenter,
Honolulu,HI,USA
9DepartmentofSystems
PharmacologyandTranslational Therapeutics,PerelmanSchoolof Medicine,UniversityofPennsylvania, PA,USA
10ChildhoodCancerDataLab,Alex’s
LemonadeStandFoundation,PA, USA
11DigitalTechnologiesResearch
Centre,NationalResearchCouncilof Canada,Ottawa,ON,Canada
12SchoolofComputerScience,
UniversityofWindsor,Windsor,ON, Canada
13DepartmentofMathematicsand
Statistics,TheCollegeofNewJersey, Ewing,NJ,USA
14DepartmentofAppliedMathematics
andStatistics,WhitingSchoolof Engineering,JohnsHopkins University,Baltimore,MD,USA
*Correspondence:
ejfertig@jhmi.edu(E.J.Fertig).
and the corresponding inference problem as deconvolution. Other reviews discuss the
mathematicalandtechnicaldetailsofMFtechniques[5–8]andtheirapplicationstomicroarray
data[9].WefocushereonthebiologicalapplicationsofMFtechniquesandtheinterpretationof
their results since the advent of sequencing technologies. We describe a variety of MF
techniques appliedto high-throughputdata analysis, andcompareand contrast their use
forbiologicalinferencefrombulkandsingle-celldata.
WorkflowforMFAnalysis
Afterdatapreprocessing,mosthigh-throughputmoleculardatasetscanberepresentedasa
matrix in whicheach elementcontains the measurement of a single molecule ina single
Genes Samples ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Alignment and
quanficaon
Raw RNA-Seq
reads
Normalizaon and
log-transformaon
Matrix
factorizaon
Amplitude matrix
(molecular relaonships)Paern matrix
(sample relaonships)NMF
ICA
PCA
Fact or s Samples ... ... ... ... ... Genes Factors ... ... ... ... ...• Gene set discovery • Pathway analysis • Biomarker discovery • Clustering analysis • Subtype/subclone discovery • Timecourse analysis • Dependent signals • Nonnegave values • Independent signals • Maximal separaon of (orthogonal) signal
Figure1.OmicsTechnologiesYieldaDataMatrixThatCanBeInterpretedthroughMF.Thedatamatrixfrom
omicshaseachsampleasacolumnandeachobservedmolecularvalue(expressioncounts,methylationLevels,protein concentrations,etc.) asa row. Thisdata matrixispreprocessed with techniquesspecific toeach measurement technology,andistheninputtoamatrixfactorization(MF)techniqueforanalysis.MFdecomposesthepreprocessed datamatrixintotworelatedmatricesthatrepresentitssourcesofvariation:anamplitudematrixandapatternmatrix.The rowsoftheamplitudematrixquantifythesourcesofvariationamongthemolecularobservations,andthecolumnsofthe patternmatrixquantifythesourcesofvariationamongthesamples.Abbreviations:ICA,independentcomponentanalysis; NMF,non-negativematrixfactorization;PCA,principalcomponentanalysis.
experimentalcondition.IntheexampleofRNAsequencing(RNA-seq),thenumberofshort
readsfromeachgenearesummarizedintogenelevelcounts.Theresultinghigh-dimensional
datasetisformulatedbyrepresentingthesegenelevelcountsforeachsampleasacolumnin
thedatamatrix(Figure1).MFmethodscanthenbeappliedtothesecountmatricestolearn
CBPsfromthedata.ManyMFtechniquesdescribedareforRNA-seqdatathatare
prepro-cessedusinglogtransformation[10]ormodelsofsequencingdepth[11],whileothersdirectly
modelreadcounts[12].
MostexamplesfeaturedinthisreviewarebasedonsuchpreprocessedRNA-seqanalysisof
genecountsorlog-transformedgenecounts.WenotethatapplicationsofMFarenotlimitedto
thisdatamodalityorthispreprocessingpipeline.Forexample,MFhasalsobeenappliedto
define mutational signatures in cancer [13,14], allele combinations in phenotypes [15],
transcriptregulationofgenes[16],anddistributionsoftranscriptlengths[17],as wellasto
discriminate peptides in mass spectrometry proteomics [18]. To apply MF to other data
modalities,they must also beproperlypreprocessed into adatamatrix witha distribution
appropriatetotheMFanalysismethod.
Whenappliedtohigh-throughputomicsdata,MFtechniqueslearntwomatrices:onedescribes
thestructure between features (e.g.,genes) and anotherdescribes the structure between
samples(Figure2,KeyFigure).Wecalltheformerfeature-levelmatrixtheamplitudematrix
andthelattersample-levelmatrixthepatternmatrix.Additionaltermshavebeencoinedfor
theamplitudeandpatternmatricesbasedupontheMFproblemappliedandonthespecific
applicationto high-throughput biological data (Box 1). Thevalues in each column of the
amplitudematrix are continuousweightsdescribingthe relative contribution ofa molecule
ineachinferredfactor.IncaseswherefactorsdistinguishbetweenCBPs,therelativeweightsof
thesemoleculescanbeassociatedwithfunctionalpathways.Thesamemoleculemayhave
high values inmultiplecolumns of theamplitude matrix. Thus,MF techniques are able to
accountforthecumulativeeffectofgenesthatparticipateinmultiplepathways.
Whereaseachcolumnoftheamplitudematrixdescribestherelativecontributionsofmolecules
toafactor,eachrowofthepatternmatrixdescribestherelativecontributionsofsamplestoa
factor(Figure2).Samplegroupscanbelearnedbycomparingtherelativeweightsineachrow
ofthepatternmatrix.ThepatternmatrixfromMFcanalsobebinarizedtoperformclustering
[19,20]orkeptascontinuousvaluestodefinerelationshipsbetweensamples[21–23].Inthe
same way as molecules with high weights within a column of the amplitude matrix are
associatedwithacommonpathway,sampleswithhighweightswithinarowofthepattern
matrixcanbeassumedtoshareacommonphenotypeorCBP.
Theoptimalnumberofcolumnsoftheamplitudematrixandrowsofthepatternmatrixisoften
referredtoasthedimensionalityoftheMFproblem,andlearningthisvalueremainsanopen
problem for the MF research community [24,25]. We also note that MF is not a single
computationalmethod.Instead,thereisawidebodyofliteratureonnumerousMFtechniques
thathavebeenappliedthroughoutcomputationalbiology.Thepropertiesofboththeamplitude
andpatternmatrices,andsubsequentlytheinterpretationoftheirvalues,dependcruciallyon
thespecificMFproblemandthealgorithmselectedforanalysis.
MFTechniques:PCA,ICA,andNMF
There are numerous approaches to MF. The three most prominent MF approaches are
principal component analysis (PCA), independent component analysis (ICA), and
non-negative matrix factorization (NMF). Each of these techniques has a distinct
Glossary
Amplitudematrix:thematrix learnedfromMFthatcontains moleculesinrowsandfactorsin columns.Eachcolumnrepresents therelativecontributionsofthe genestoafactor,andthesecanbe usedtodefineamolecularsignature foraCBP.
Complexbiologicalprocess (CBP):thecoregulationor coordinatedeffectofmultiple molecularspeciesresultinginoneor morephenotypes.Examplescan rangefromactivationofmultiple proteinsinasinglecellularsignaling pathwaytoepistaticregulationof development.
Computationalmicrodissection:a computationalmethodtolearnthe compositionofaheterogeneous sample,forexamplethecelltypesin atissuesample.
Independentcomponentanalysis (ICA):anMFtechniquethatlearns statisticallyindependentfactors. Matrixfactorization(MF):a techniquetoapproximateadata matrixbytheproductoftwo matrices(Box1),oneofwhichwe calltheamplitudematrixandthe otherthepatternmatrix.
Non-negativematrixfactorization (NMF):anMFtechniqueforwhichall elementsoftheamplitudeand patternmatricesaregreaterthanor equaltozero.
Patternmatrix:thematrixlearned fromMFthatcontainsfactorsin rowsandsamplesincolumns.Each rowrepresentstherelative contributionsofthesamplestoa factor,andthesecanbeusedto definetherelativeactivityofCBPsin eachsample.
Principalcomponentanalysis (PCA):anMFtechniquethatlearns orthogonalfactorsorderedbythe relativeamountofvariationofthe datathattheyexplain.
Unsupervisedlearning:
computationaltechniquestodiscover featuresfromhigh-dimensionaldata, withoutrelianceonpriorknowledge oflow-dimensionalcovariates.
mathematicalformulationofadistinctMFproblem(asdescribedinthesupplemental
informa-tiononlineandinotherreviews[5,8,26–29]).
Briefly,PCAfindsdominantsourcesofvariationinhigh-dimensionaldatasets,inferringgenes
thatdistinguishbetweensamples.Maximizing thevariabilitycapturedinspecificfactors,as
opposedtospreadingrelativelyevenlyamongfactors,maymixthesignalfrommultipleCBPsin
KeyFigure
The
Matrix
Product
of
the
Amplitude
and
Pattern
Matrices
Approxi-mates
the
Preprocessed
Input
Data
Matrix
(A)
=
×
SamplesD
ata
FactorsA
mplitude
Factors SamplesP
aern
Rows of connuous or binary weights associate with samples (each sample is a column) capturing relaonships including cell-types/lines, paents, or experimental condions
Columns of genec, epigenecs, or protein weights (each row is a unique molecule) associated with a given sample feature and oen reflecve of co-regulaon.
(A)
Factors
Samples
Associated molecular signatures (metagenes, modules, etc.)
●●
Magnitude of rows of the
‘paern’ matrix scaled from
0 Max(P) (B)
Factors
Molecules of interest
Paerns in groups of samples
Molecules of interest
Molecules of interest, i.e. genes
Figure2.Thenumberofcolumnsoftheamplitudematrixequalsthenumberofrowsinthepatternmatrix,andrepresents
thenumberofdimensionsinthelow-dimensionalrepresentationofthedata.Ideally,apairofonecolumnintheamplitude matrixandthecorrespondingrowofthepatternmatrixrepresentsadistinctsourceofbiological,experimental,and technicalvariationineachsample(calledcomplexbiologicalprocesses,CBPs).(B)Thevaluesinthecolumnofthe amplitudematrixthenrepresenttherelativeweightsofeachmoleculeintheCBP,andthevaluesintherowofthepattern matrixrepresentitsrelativeroleineachsample.Plottingofthevaluesofeachpatternforapre-determinedsample grouping(hereindicatedbyyellow,grey,andblue)inaboxplotasanexampleofavisualizationtechniqueforthepattern matrix.Abbreviation:Max(P),maximumvalueofeachrowofthepatternmatrix.
a single component. Therefore, PCA may conflate processes that sometimes occur and
complicate interpretationof theamplitude matrix fordefining data-drivengenesets orthe
inferenceofspecificCBPs.
ICAandNMFlearndistinctprocessesfromaninputdatamatrixusingdifferenttechniques.ICA
learnsfactorsthatarestatistically independent,resultinginmoreaccurate associationwith
literature-derivedgenesets[30–32].NMFmethodsconstrainallelementsoftheamplitudeand
patternmatricestobegreaterthanorequaltozero[33,34].WhereasthefeaturesinPCAcanbe
rankedbytheextenttowhichtheyexplainthevariationinthedata,thefeaturesinbothICAand
NMFareassumedtohaveequalweight.NMFiswellsuitedtotranscriptionaldata,whichis
typicallynon-negativeitself,andsemi-NMFisalsoapplicabletodatathatcanhavenegative
values.Theassumptionsof NMFmodelboth theadditivenatureof CBPsand parsimony,
generatingsolutionsthatarebiologicallyintuitivetointerpret[35].
The solutions from both ICA and NMF may vary dependingupon the initialization of the
algorithm,leadingtodisparateamplitudeandpatternmatrices.Therefore,itiscrucialtoensure
thatparticularsolutionusedforanalysisprovidesanoptimalandrobustsolutionbeforeusing
theresultsofthefactorizationtointerpretCBPs.WepreviouslyfoundthatBayesiantechniques
tosolveNMFhavemorerobustamplitudematricesthangradient-basedtechniques,andthus
generatemoreaccurateassociationsbetweenthevaluesintheamplitudematrixandfunctional
pathways [5,36]. Other studies have found that gradient-based approaches have similar
computationalperformancetotheirBayesiancounterparts[37],andnewtechniquesarebeing
developedtoenhancethestabilityofthefactorization[38].Furtherintegratedcomputational/
experimentalinvestigation isnecessarytoassessthebiologicalrelevance ofsolutionsfrom
bothclassesoftechniquesandtheirrobustness.Theseassociationsalsodependcruciallyon
theinput data. Therefore,to learnCBPs from data, MF must beapplied to datasetswith
sufficientmeasurementsoftheexperimentalperturbationsorconditionsrelevanttothespecific
biologicalproblembeingaddressed.
SampleApplicationtoGenotype-TissueExpression(GTEx)ProjectGene ExpressionData ofPostmortemTissues
ApplyingmultipletypesofMFtechniquestothesamedatasetorasingleMFtodistinctsubsets
ofadatasetcanalsofinddistinctsourcesofvariability.SelectingwhichMFmethodtouseto
learnrelevantCBPsfromadatasetisthencrucial,anddevelopingstandardizedmetricsto
Box1.CommonTerminologyintheLiterature
Historically,theindependentdiscoveryofMFinmultiplefieldsincludingmathematics,computerscience,andstatistics createddistinctterminologiesthatareoftenusedinterchangeablyasanalyticalorthologsingenomics.Forexample,the term‘factorization'isoftenusedinterchangeablywith‘decomposition'.Otherterms,suchas‘features',‘components', ‘latentvariable',or‘latentfactors'areusedtorefertorelationshipsbetweeneithermolecularmeasurementsorsamples, dependingonthecontext.
Thespecificterminologyfortheamplitudeandpatternmatricesalsovariesaccordingtothemethod,andarepreferably labeledwithdifferentvariablenames.InPCA,theamplitudematrixisoftencalledthescoreorrotationmatrix,and patternmatrixcalledtheloadings.InICA,theamplitudematrixiscalledtheunmixingmatrixandthepatternmatrixis calledthesourcematrix.InNMF,theamplitudematrixiscommonlycalledtheweightsmatrixandthepatternmatrixis calledthefeaturesmatrix.
Throughoutthispaperweuse‘amplitudematrix'torefertothematrixthatcontainsvectorswhichrepresent relation-shipsbetweenmolecularmeasurements.Othertermsusedintheliteratureforthesemolecularrelationshipsinclude modules,meta-pathways,andsignatures.Similarly,weuse‘patternmatrix'torefertothematrixthatcontainsvectors whichrepresentrelationshipsbetweensamples.Otherliteraturetermsforthesesample-levelrelationshipsinclude patterns,metagenes,eigengenes,sources,andcontrollingfactors.
choosebetweenthemisanactiveareaofresearchincomputationalbiology.Toillustratethe
differencesbetweenMFmethods,weapplyPCA,ICA(CRANpackagefastICA[39]),andNMF
(R/BioconductorpackageCoGAPS[40,41])toasingledatasetfrompostmortemsamplesin
theGTExproject[42](Figure3).Specifically,weselectasubsetofGTExdatacontaining12
braintissuesintheGTExdataforsevenindividualsforwhichwehadpreviouslyperformedonly
NMFanalysis[41](codesareprovidedinthesupplementalinformationonline).Weselectthis
problembecausetheCBPs(tissueandindividual)arereadilyseparatedandareknownapriori,
providingaknowngroundtruthtofacilitatecomparisonofmethods.
FirstweapplyPCAtothissampleGTExdataset(Figure3A).Thecomponentslearnedfrom
PCAcanberankedbytheamountofvariationthattheyexplaininthedata,withthefirsttwo
componentsexplaining89.6%ofthevariationinthisdataset.Theamountofvariationexplained
−0.1 0.1 0.3 0.5 −0.1 0.1 0.3 PC 1 PC 2
PCA colored by Ɵssue type
−0.1 0.1 0.3 0.5 −0.1 0.1 0.3 PC 1 PC 2 PCs
PCA colored by GTeX subject
1 5 9 Propor on of v ar iance 0.0 0.2 0.4 0.6
Scree plot of PCs vs variance
(A) (C) −0.10 −0.05 0 0.05 0.10 0.15 Ammon's hor n Am ygdala Ante rior cingulate Caudate n ucleus Cerebellum Dor solater al pfc Cer vical spinal cord
Frontal cor tex Hypotha lam us Nucleus accu mb ens Pu tam en Substan a nigr a −0.8 −0.6 −0.4 −0.2 0.0 0.2 0.4 0.6 0.8 1.0 NMF blue Iden Ɵ ty pa Ʃ erm 0.0 0.2 0.4 0.6 0.8 1.0 Cerebellum NMF pa Ʃ erm 0.0 0.2 0.4 0.6 0.8 1.0 NMF yellow Iden Ɵ ty pa Ʃ erm
ICA cerebellum paƩern
Ammon's ho rn Am ygdala Anter ior cingulate Caudate n ucleus Cerebellum Dorsolate ral pfc Cer vical spinal cordFrontal cor
tex Hypothalam us Nucleus accumbens Putamen Substan a nigr a
ICA IdenƟty paƩern
(B) Ammon's hor n Am ygdala Anter ior cingulate Caudate nucleus Cerebellum Dorsolater al pfc Cer vical spinal cordFrontal cor
tex Hypothalam us Nucleus accumbens Putamen Substan a nigr a Ammon's hor n Am ygdala Anter ior cingulate Caudate n ucleus Cerebellum Dorsolater al pfc Cer vical spinal cordFrontal cor
tex Hypothalam us Nucleus accumbens Putamen Substan a nigr a
Figure3.ComparisonofPatternMatrixFromMatrixFactorization(MF)inPostmortemTissueSamplesfromGTEx.(A)PCAfindsfactorsinrowsofthe
patternmatrixthatcanberankedbytheamountofvariationthattheyexplaininthedata,asillustratedinascreeplot.PCAanalysestypicallyplotthefirsttwoprincipal components(PCs;rowsofthepatternmatrix)toassesssampleclustering.PointsarecoloredbytissuetypeannotationsfromGTEx(left),whereAmmon’shornrefersto thehippocampus,anddonor(right).InGTExdata,thecerebellum(lightblue)andfirstcervicalspinalcord(yellow)clusterseparatelyfromallotherbraintissues,butno separationbetweenindividualsisobserved.(B)ICAfindsfactorsassociatedwithindependentsourcesofvariation,andthereforecannotberankedinascreeplot.The relativeabsolutevalueofthemagnitudeofeachelementinthepatternmatrixindicatestheextenttowhichthatsamplecontributestothecorrespondingsourceof variation.Thesignofthevaluesindicateover-orunderexpressioninthatfactordependingonthesignofthecorrespondinggeneweightsintheamplitudematrix.Asa result,thevaluescanbeplottedontheyaxisagainstknowncovariatesonthexaxistodirectlyinterprettherelationshipbetweensamples.WhenappliedtoGTEx,we observeonepatternassociatedwithcerebellum,anotherpatternthathaslargepositivevaluesforonedonorandlargenegativevaluesforanotherdonor,andeight otherpatternsassociatedwithothersourcesofvariation(supplementalinformationonline).(C)NMFfindsfactorsthatarebothnon-negativeandnotrankedbyrelative importance,similarlytoICA.Thevalueofthepatternmatrixindicatestheextenttowhicheachsamplecontributestoaninferredsourceofvariationandisassociated withoverexpressionofcorrespondinggeneweightsintheamplitudematrix.ValuesofthepatternmatrixcanbeplottedsimilarlytoICA.WhenappliedtoGTEx,we observeonepatternassociatedwithcerebellum,twomorepatternsassociatedwiththetwodonorsthatwereassignedtoasinglepatterninICA,andsevenother patternsassociatedwithother sourcesofvariation(supplementalinformationonline).Abbreviations:GTEx,Genotype-TissueExpression (GTEx)project;ICA, independentcomponentanalysis;NMF,non-negativematrixfactorization;PCA,principalcomponentanalysis.
bythesecomponentscanbeusedtodeterminetheoptimaldimensionalityforPCA[24].A
typical PCA analysis explores the association between these components and biological
covariatesinlow-dimensionalplotsinwhicheachaxisisdefinedbytheweightsinonerow
ofthepatternmatrix(Figure3A).Similarlytoclusteringanalyses,theseplotscanbeusedto
determinewhichbiologicalfeaturescanbeseparatedfromthedata.Inthisapplication,we
observethatthecerebellum(lightblue)andfirstcervicalspinalcord(yellow)clusterseparately
fromallotherbraintissues(PC1andPC2,orrows1and2ofthepatternmatrix,respectively).
NoseparationbetweenindividualsisobservedinthisPCAanalysis.
IncontrasttoPCA,thecomponentsofICAandNMFcannotberankedbypercentvariation
explained.Instead,eachrowofthepatternmatrix isan equallyimportantCBPinthedata.
Therefore,thesepatternsareplottedindependentlyrelativetobiologicalcovariates,andnot
relativetooneanotherasinPCAanalyses.WhenappliedtothesampleGTExdataset,rowsof
the pattern matrix from both ICA (Figure 3B) and NMF (Figure 3C) also distinguish the
cerebellum, similarly to PCA. Whereas PC1 has large positive values for the cerebellum
andlargenegativevaluesfortheotherbraintissues,bothNMFandICAhavelargeabsolute
valueonlyforthetissueofinterestandarenearzeroforothertissues.Asaresult,bothICAand
NMFprovidetissue-specificpatternsandPCAprovidestissue-segregatingpatterns.Wenote
thatthe magnitude ofall thesepatternsisunit-less, reflectingthe relative weightsofeach
sampleinthepatternandnotameasurablequantity.Becausegeneweightscanbeeither
positiveornegativeinICA,thepatternsweightscanalso.Bycontrast,byconstructiontheNMF
valuesareallnon-negative.Asaresult,thepatterncorrespondingtothecerebellumsamples
fromICAisstillspecifictothattissue,withgenesoverexpressedinthistissuehavingnegative
weightsinthecorrespondingcolumnoftheamplitudematrix,andgenesunderexpressedin
thattissuehavingpositiveweights.Indeed,thegeneweightsinthecolumnoftheamplitude
matrixcorrespondingtotheNMFcerebellumpatternaresignificantlyanti-correlatedwiththe
geneweightsinthe columnof the amplitudematrix corresponding to the ICAcerebellum
pattern(R= 0.72,P<210 16).
Incontrast to PCA, both ICAand NMFalso infer patternsthat distinguish individualswith
commonweightsacrossalltissues.IntheNMFanalysis,eachindividualhasaseparatepattern.
TheICAanalysishasasinglepatternthathaslargepositivevaluesforoneoftheindividuals
distinguishedinoneNMFpattern,andlargenegativevaluesforanotherindividualdistinguished
indifferentNMFpattern.Thisdiscrepancyhighlightsthedifferencebetweeninferring
indepen-dentsourcesofvariationwithICAandNMF.Specifically,ICAmaycombinemultipleCBPsusing
commongeneswhosesignchangesbyexperimentalcondition,whereasNMFwillfindCBPs
thatareadditive,correspondingonlytooverexpressionofgenesinthatcondition.
PCA,ICA,andNMFareequallyvalid,andtheirdistinctformulationgivesrisetothedistinct
featuresobservedinthedata.ApplicationsofmultipletypesofMF techniques,oreventhe
sameMFalgorithmwithdifferentparameters,mayinferseveralCBPsorphenotypeswithina
singledataset,inessenceprovidinganswerstodifferentquestions.Wefurthernotethatthe
specifictechniquesforPCA,ICA,andNMFselectedforanalysisinthisexampleareonlyoneof
amultitudeofvariantsoftechniqueswhichhavebeendevelopedforcomputationalbiology.
ApplyingmultipletypesofMFtechniquestothesamedataset,orasingleMFtodistinctsubsets
ofadataset,canalsofinddistinctsourcesofvariability.However,thegeneralpropertiesof
thesemethodswill remainandbeconsistent withourexample.Briefly,weobserveinthis
examplethatPCAfindssourcesofseparationinthedata,whereasbothICAandNMFfind
independentsourcesofvariation.ICAcanfindbothover-andunderexpressionofgenesina
maybettermodelbothrepressionandactivationthanNMF,butasasideeffectmayhave
greatermixtureofCBPsthanNMF.
Regardlessofthetechniqueselected,theresultswillalsobesensitivetotheinputdata.For
example,adifferentNMF-classalgorithmcalled‘gradeofmembership'(GOM)wasalsoappliedto
alargersetofpostmortemsamplesinGTEx.Thisalgorithmfoundapatternthatcombinedall
samplesfrombrainregionswhenappliedtoalltissuesamplesinGTEx,butseparatedthedistinct
brainregionswhenappliedonlytotissuesamplesfromthebrain.Thus,applyingmultipletypesof
MFtechniquestothesamedataset,orasingleMFtodistinctsubsetsofadataset,canalsofind
distinctsourcesofvariabilitythatareessentialforexploratorydataanalysis.
FurtherExampleApplicationsToRepresentCellTypes,DiseaseSubtypes, PopulationStratification,TissueComposition,andTumorClonalitywiththe PatternMatrix
ExactlyasweobservedintheGTExexample,asinglefactorizationofcomplexdatasetscan
findmultipledistinctsourcesofvariation.Forexample,thepowerofMFtoidentifymultiple
sourcesofvariationwas seenwhen multipletechnicalfactorsfromsampleprocessingand
biologicalfactorswerediscoveredinanICAofgeneexpressionprofilesof198bladdercancer
samples[43].OnefactorinthepatternmatrixofthisanalysisdefinedaCBPassociatedwith
gender.BecauseICAsimultaneouslyaccountsformultiplefactorsinthedataasseparaterows
inthesamematrix,eachrowcanfullydistinguishasinglebiologicalgroupingfromthedata.
AnalysisofasingledatasetwithoneMFalgorithmusingdifferentnumbersoffactorscanreflect
ahierarchyofbiologicalprocesses.Forexample,applyingCoGAPStodatafromasetofhead
andnecktumorsandnormalcontrolsfor arangeofdimensionalitieswasable toseparate
tumorandnormalsampleswhenlimitedtotwopatterns,butfurtherdecomposedthetumor
samplesintothetwodominantclinicalsubtypesofheadandneckcancerwhenidentifyingfive
patternsforthesamedata[44].Thehierarchicalrelationshipbetweenpatternshasbeenused
toassesstherobustnessofpatternstoquantifytheoptimalityofthefactorization[45]andlearn
theoptimaldimensionalityofthefactorization[46].Otheralgorithmsusestatisticalmetricsto
estimatethenumberoffactors[12,47].Whilethesealgorithmsquantifyfittothedata,theymay
disregard thehierarchical nature ofdistinct CBPs learned byfactoring biological datainto
multipledimensions.Thisobservationhighlightsthecomplexityofestimatingthenumberof
factorsforoptimalMFanalysisofbiologicaldata(seeOutstandingQuestions).
Moreover,applicationofdifferentMFalgorithmstothesamedatasetcangivedifferentsample
groupingsthatreflectbiology.Forexample,inpopulationgeneticsagroupinginferredfrom
GWASwhichdistinguishesancestryisequallyvalidtoagroupinginferredfromthesameGWAS
datawhichdistinguishesdiseaserisk.TheapplicationofPCAtoSNPdatafrom3000European
individuals[48]demonstratesinferenceofsamplecorelationshipsusingthepatternmatrix,and
foundthatmuchofthevariationinDNAsequenceisexplainedbythelongitudeandlatitudeof
thecountryoforiginofanindividual.Inaddition,statisticalmodelscanbeformulatedassuming
thattheinheritanceofanindividualarisesfromproportionsofancestryindistinctpopulations
throughgenetic admixture[48].AnMF-based techniquecalledsparse factoranalysis also
distinguishesbetweenthesepopulationsusingGWASdata[49].Theseanalysesdemonstrate
thattheancestryofeachindividualisadominantsourceofvariationinDNAsequence.Atthe
sametime,sourcesofvariationinGWASdataarisefromvariantsthatgiverisetodiseaserisk,
whichcanbesharedamongindividualswithdiversegeneticbackgrounds[50].Inthesameway
asweobservedinourGTExexample,theapplicationofmultipleMFtechniquesisessentialto
MixturesofcelltypesinbiologicalsamplesintroduceafurtherdegreeofcomplexitytoMFanalysis
ofbiological variation in theirmolecular data. Computationalmicrodissectionalgorithms
estimatetheproportionofdistinctcelltypeswithinabulksamplebyapplyingMFtogeneswhose
expressionisuniquelyassociatedwitheachcelltype[51].Subsettingthedatatodifferentgenes
maygiverisetodifferentfactorsthatrepresentdifferentCBPs.Nonetheless, CoGAPSNMF
analysisofdatasubsetsthatwereobtainedbyselectingequallysizedsetsofrandomgenesfound
thatthepatternmatriceswereconsistentforeachrandomgenesetintheexpressiondata[41,52].
TheseresultssuggestthatthedependencyofanMFonthespecificgenesusedforanalysismay
dependontheheterogeneityofthesignalinthedatamatrix.
Cellularandmolecular heterogeneityposesa particularchallengeto MF analysisincancer
genomics.Evenapuretumortissuecancontainnumeroussubclonesowingtothe
accumula-tionofdifferentdrivereventsduringtumorevolution.NewMFtechniqueshavebeendeveloped
toestimatetheproportionofthetumorthatarisesfromeachsubclone[47,53–56].
Assump-tionsabouttheevolutionarymechanismsoftheaccumulationofmolecularalterationscanalso
beencodedinthefactorizationtomodeltheresultingheterogeneityoftheseclones[12,47].
Thesestudiesdemonstrate thatencodingpriorknowledgeinto MFcan focustheresulting
factorstoreflectoneoftheequallyvalidbiologicalgroupingswithinthedata.
FromSnapshotstoMovingPictures:SimplifyingTimecourseAnalysis
EntwinedinthechallengeofdecomposingcelltypesandsubpopulationsisthefactthatCBPs
change over time. High-throughput timecourse datasets are emerging in the literature to
accountforthe dynamics ofbiologicalsystems.Thecentralgoal oftimecourse analysisis
todeterminetheextenttowhichmoleculeschangeovertimeinresponsetoperturbations(e.g.,
developmental time, environmentalfactors, disease processes,or therapeutictreatments).
Associating molecular alterations often relies on specialized bioinformatics techniques for
timecourseanalysis[57,58].MFanalysescannaturallyinferchangesinCBPsovertimewhen
appliedtotimecoursedatabecausethecontinuousweightsforeachsampleinthepattern
matrixcanvaryamongsamplescollectedacrossdistincttimepoints.Therelativeweightsof
rowsofthepatternmatrixcanencodethetimingofregulatorydynamicsdirectlyfromthedata
(Figure4).Nonetheless,mostMFalgorithmsfortimecourseanalysisdonotencodetheknown
timepointsorretaintheirrelativeordering.Methodsthatspecificallyusethesetemporaldataare
currentlyanactiveareaofresearch.
Complex biological process (CBP) captured in sample features, for example:
Samples
Magnitude of rows of the ‘pa
ern’ matrix scaled from
0 Max(P)
P1 P2 P3
Pa ern matrix (NMF, mecourse)
d6 ... d1 ... d6d1 ... d6 Time d1 1 2 3 4 5 6 P1 1 2 3 4 5 6 P2 1 2 3 4 5 6 P3 Days
Pa erns in groups of samples
Group
Figure4.SamplesCorrespondtoTimepoints;theRowsofthePatternMatrixCanBePlottedasaFunction
ofTimeandSampleConditionToInfertheDynamicsofComplexBiologicalProcesses(CBPs).
Abbrevia-tions:d1–d6,days1–6;max(P),maximumvalueofeachrowofthepatternmatrix;NMF,non-negativematrixfactorization; P1–3,patterns1–3.
BothICAand NMFwere found to have signaturescharacterizing the yeastcellcycle and
metabolisminearlytimecoursemicroarrayexperiments[59,60].ThesparseNMFtechniques
usingBayesian methodshadpatternsthatreflectedthesmoothdynamicsofthesephases
[36,59]. This approach has been shown to simultaneously learn pathway inhibition and
transitoryresponsesto chemicalperturbationofcancercells [61]andrelatethechangesin
phospho-proteomictrajectories betweenmultipletherapies [62].Similar analysis ofhealthy
braintissueslearnedthedynamicsoftranscriptionalalterationsthatarecommontotheaging
processinmultipleindividuals[52].MFtechniquesdesignedforcancersubclonesdescribedin
theprevioussectionhavealsobeenappliedtorepeatsamplestolearnthedynamicsofcancer
development, thereby elucidating the molecular mechanisms that give rise to therapeutic
resistanceandmetastasis.Eveniftherearethesamenumberofbiologicalfeatures,therate
or timing of related features in different molecular modalities may be offset [63]. These
discrepanciesbydatamodalitysuggestthatdifferentregulatorymechanismsmaybe
respon-sibleforinitiatingandstabilizingthemalignantphenotype[63].
Data-DrivenGeneSetsfromMFProvideContext-Dependent Coregulated GeneModulesandPathwayAnnotations
Genomicdataareofteninterpretedbyidentifyingmolecularchangesinsetsofgenesannotatedto
functionally related modules orpathways, called gene sets[64,65]. Often the associations
betweengenesetsandfunctionsarebaseduponmanualcurationoftheliterature[66,67].Such
set-level interpretations often lack important contextual information [64,68,69], and cannot
describegenesofunknownfunctionorgenesassociatedwithnewfunctionalmechanisms.
TheamplitudematrixfromMFanalysiscanbeusedbothforliterature-basedgene-setanalysis
andtodefinenewdata-drivengenesignatures(Figure5).Standardgene-setanalysiscanbe
applieddirectlytothevaluesineachcolumnoftheamplitudematrixtoassociatetheinferred
factorswithliterature-curatedsets.New,context-dependentgenesetscanalsobelearned
fromthevaluesintheamplitudematrix.Gene-setannotationsareoftenbinary.Thresholding
techniquesto selectwhichgenesbelongtoapathwayfromtheamplitudematrixforbinary
membershipprovideanoutputsimilartogenesetsindatabases[70,71].Otherstudiesalso
integratethe literature-derivedgene signaturesinthesethresholdsto refine the contextof
pathwaydatabases[36,72].Thegenesderivedfromthesebinarizationscanbeusedasinputs
topathwayanalysesfromdifferentialexpressionstatisticsinindependentdatasets(Figure5,
right),andare analogoustothe hierarchicalclustering-based genemodules[73] andgene
expressionsignaturesfrom publicdomain studies in theMSigDB gene-set database[74].
Anothermeansofbinarizingthedataistofindgenesthataremostuniquelyassociatedwitha
specificpatterntouseasbiomarkersofthecelltypeorprocessassociatedwiththatpattern
[41,75].SelectinggenesbaseduponthesestatisticscanfacilitatevisualizationoftheCBPsin
high-dimensionaldata[41].Whereasbinarizationofgeneswithhighweightscanassociatea
singlegenewithmultipleCBPs,thestatisticsforuniqueassociationslinkagenewithonlyone
CBP.Therefore,thesestatisticsalsodefinespecificgenesthatmaybebiomarkersofthecell
type/stateoraprocess[41](Figure5,right).
Althoughbinarypathwaymodelsaresubstantiallyeasiertointerpret,continuousvaluesfrom
theoriginalfactorizationprovideabettermodeloftheinputdata.Weightedgenesignatures
have been shown to be more robust to noise and missingvalues inthe data [76]. If the
expressionlevelofageneispoorlymeasuredinasample,othergenesinthesamefactorcan
implytheactualexpressionlevelofthe geneinquestion. Byconsideringeach geneinthe
continuoussignaturescanbeassociateddirectlywithothersamplesusingprojectionmethods
[76,77]orprofilecorrespondencemethods[78].
MFEnablesUnbiasedExplorationofSingle-CellData forPhenotypesand MolecularProcesses
MF approachesareanatural choice insingle-cell RNA-seq(scRNA-seq) dataanalysis owing to the
highdimensionalityofthedata,andareusedtoidentifyandremovebatcheffects,summarize
CBPs,andannotatecelltypesinthedata[79–82].WhereasMFanalysisofbulkdatadissects
groupsfromasmallsubsetofsamples,theanalysisinscRNA-seqdataaggregatescellsinto
groupsofcommoncelltypesorCBPs[75,83].Oftentheseanalysesareperformedonasubsetof
thedatacontainingthemostvariablegenes.Newercomputationallyefficientmethodsarebeing
developedtoenablefactorizationoflargeomicsdatasetsforgenome-wideanalysis[84].
Biologi-calknowledgecanbeencodedwithaclassofMFalgorithmsthatsummarizefactorsusinggene
sets[79,85,86].
Features
A priori func onally annotated sets
Molecules of interest, (i.e,. genes)
Molecules of interest, (i.e,. genes)
P < 0.001 P < 0.001
WT
KO
Enrichment analysis can be
done directly or via thresholding
Unique ‘pa
Ʃernmarkers’ can be
used for biological valida
Ɵon
Up Do wn 1.00 0.69 0.57 0.37 0.19 −0.02 −0.17 −0.30 −0.39 −0.51 −0.95 0 4.1 Enr ichment
Features
Projec
Ɵon into new data
Molecules of interest, (i.e,. genes)
New samples
Amplitute matrix
amplitude matrix
Thresholded
Figure5.TheAmplitudeMatrixfromMatrixFactorization(MF)CanBeUsedtoDeriveData-DrivenMolecularSignaturesAssociatedwithaComplex
BiologicalProcess(CBP).ThecolumnsoftheamplitudematrixcontaincontinuousweightsdescribingtherelativecontributionofamoleculetoaCBP(centerpanel;
indicatedbytheorange,purple,andgreenboxes).Theresultingmolecularsignaturecanbeanalyzedinanewdatasettodeterminethesamplesinwhicheach previouslydetectedCBPoccurs,andtherebyassessfunctioninanewexperiment.Thiscomparisonmaybedonebycomparingthecontinuousweightsineachcolumn oftheamplitudematrixdirectlytothenewdataset(left).Theamplitudematrixmayalsobeusedintraditionalgene-setanalysis(right).Traditionalgene-setanalysisusing literaturecuratedgenesetscanbeperformedonthevaluesineachcolumnoftheamplitudematrixtoidentifywhetheraCBPisoccurringintheinputdata.Data-driven genesetscanalsobedefinedfromthismatrixdirectlyusingbinarization,andusedinplaceofliterature-curatedgenesetstoqueryCBPsinanewdataset.Setsdefined frommoleculeswithhighweightsintheamplitudematrixcomprisesignaturesakintomanycuratedgene-setresources,whereasmoleculesthataremostuniquely associatedwithaspecificfactor(purplebox)maybebiomarkers.Abbreviations,KO,knockout;WT,wildtype.
MostMFtechniquesdevelopedforbulkomicsdataassumethatthegeneexpressionchanges
fromCBPsareadditive.ThisassumptionisviolatedinscRNA-seqdata.Onereasonforthe
violationoftheadditiveassumptioninMFistheinabilitytodistinguishtruezerosfrommissing
values.Imputation methods for preprocessing [81,87] ornewer MF algorithms that model
missingdataareessentialforscRNA-seqdata.Branchingoftrajectoriesofcellularstatesand
lackofcell-cyclesynchronizationinscRNA-seqdatafurtherviolatetheadditiveassumptionin
MF.Newnonlinearfactorizationtechniquesarebeingdevelopedtoenhancevisualizationof
trajectorystructuresinsingle-celldata[80,88–90]inthesecases.However,theresultsfrom
these methods cannot be interpreted in the same way as those from MF algorithms. In
particular,thelow-dimensionalsolutionsfromthesemethodsarenotnecessarilyuseableto
reconstructtheoriginalhigh-dimensionaldata.
ConcludingRemarks
MF encompasses a versatile class of techniques with broad applications to unsupervised
clustering,biological pattern discovery,component identification, andprediction. Since MF
wasfirstappliedtomicroarraydataanalysisintheearly2000s[59,91–93],thebreadthofMF
problemsandalgorithmsforhigh-throughputbiologyhasgrownwiththeirbroadapplications.MF
problemsareubiquitousinthecomputationalsciences,withexamplesincludingunsupervised
featurelearning[94–99],clusteringandmetriclearning[100–102],latentDirichletallocation[103],
subspacelearning[104–109],multiviewlearning[110],matrixcompletion[111],multitasklearning
[112],semi-supervisedlearning[113],compressedsensing[114],andsimilarity-basedlearning
[115,116].DimensionreductionofbiologicaldatawithMFhighlightsperspectivesandquestions
thatinvestigatorshavenotyetconsidered,andalsoenablestractableexplorationofotherwise
massivedatasets.Asthesizeofthesedatasetsgrow,itiscrucialtodevelopnewalgorithmsto
solveMFproblemsthatscalewiththeever-increasingsizeofomicsdata[37,41,117,118].MF
algorithmscanalsobeextendedforsimultaneousanalysisofdatafrommultipledatamodalities,
enablinggenomicdataintegration[7,8,119].TechniquesthatextendthisintegratedMF
frame-work,includingBayesiangroupfactoranalysis[8]andtensordecomposition[120–123],canalso
analyzedatasetsacrossdifferentmolecularlevels[124].Developingsuchdata-integration
tech-niquesisanactiveareaofresearchinbothgenomicsandcomputationalsciences.
DifferentclassesoftechniquessolveMF,includinggradient-basedandprobabilisticmethods
(supplementalinformationonline).DistinctMFproblemseachaimtoidentifyspecifictypesof
features.Insomecasesdifferentalgorithmswilllearndistinctfeaturesfromthesamedataset.
Therefore,investigatorsmaybenefitfromapplyingmultipletechniqueswithdifferentproperties,or
bycarefullyconsideringboththedatasetandthequestioninselectingexactlytherighttechnique
forthatquestion.Mostsuchcomparisonsintheliteraturehavebeenmadebyinvestigatorswho
aredevelopingMFmethods.UnbiasedassessmentsoftherelativeperformanceofdifferentMF
algorithmsfordifferentexploratorydataanalysisproblemsareessentialtodeterminetherelative
strengthsandweaknessesofeachmethodfordistinctbiologicalproblems.MFalgorithmscanbe
furthertailoredtothebiologicalproblemofinterestusingmethodsthatalsoencodepriorbiological
knowledgeofthesystemunderlyingthemeasureddataset[125–127].
ThefeaturesMFtechniquesextractareconstrainedbythedatasetusedtotrainthem.These
algorithms cannot learn unmeasured features, nor can they correct for complete overlap
betweentechnical artifactsand biological conditions. Thus, being mindful of experimental
design when selecting datasetsand choosing those that are broad enough to cover the
relevant sourcesof variability isessential. Advances in MF and related techniques will be
essentialforpoweringsystems-levelanalysesfrombigdata(seeOutstandingQuestions).
OutstandingQuestions
Howcantheoptimalnumberoffactors bequantified?Increasingthenumber offactorsinMFimprovesits approxi-mation, but may cause overfitting. Computational metrics that balance thesepropertiescanbeusedto esti-matethenumberoffactors. Biologi-callydrivenmetricsarealsoessential toassesswhetherthisnumber repre-sentsthenumberofCBPs. Howcandifferentfactorizationsbe com-pared?DifferentMFalgorithmsidentify differentpropertiesandCBPsfroma sin-gledataset.Newtechniqueswillbe nec-essary to compare and merge the disparatebutequallyvalidfactorizations. What techniques are best for efficientand optimalfactorization?Largeomics data-sets are common, particularly with single-cell technologies. Fast computational approachestoMFarerequiredforthe analysisofthesedata.Thesealgorithms also require computational criteria to ensurethatthesolutionsarerobust.
Howcanfullynonlinearfactorizationsbe performedonomics-scaledata? Nonlin-earextensionstoMFhavenotable appli-cationstosingle-celldata.Researchers oftenuselow-dimensional representa-tionsofomicsdatalearnedfromlinear MFasinputstononlinearMFtoaccount forbothcomputationalefficiencyandthe underlyingmathematicalassumptionsof thesenonlinearmethods.Breakthroughs intheoriesfortechniquessuchaskernel MF,deepMF,andmanifoldlearningwill enablefullynonlinearfactorizationsfor large-scalesingle-celldata.
Howcanbothgeneregulatory relation-shipsanddistinctsourcesoftechnical variation be encoded in integrative analysis ofdata fromdifferent mea-surement technologies? Different measurementtechnologiesyielddata thatfollowuniquedistributions.CBPs and their timing also vary between typesofmeasures.Integrative techni-quesmustaccountforboththe tech-nicalandbiologicalsourcesofvariation between disparate data modalities, includingsparseormissingdata.Such techniquesarecrucialtolearningthe regulatory relationships that drive CBPsandtomodelingthemultiscale nature of biological systems from omicsdata.
Acknowledgments
WethankOrlyAlter,DavidBerman,J.BrianByrd,MichaelLove,IreneGallegoRomero,LillianFritz-Laylin,Luciane Kagohara,LouiseKlein,CraigMak,MatthewStephens,DanielaWitten,andothermembersof‘NewPISlack’fortheir insightfulfeedback.ThisworkwassupportedbytheNationalInstitutesofHealth(NIH)NationalCancerinstitute(NCI)and the National Libary of Medicine (NLM) (grants NCI 2P30CA006516-52 and 2P50CA101942-11 to A.C.C., NCI R01CA177669E.J.F.,NCIU01CA212007toE.J.F.,NLMR01LM011000toM.F.O.,andNCIP30CA006973),Johns HopkinsUniversityCatalystandDiscoveryAwardstoE.J.F.,aJohnsHopkinsUniversityInstituteforDataIntensive EngineeringandScience(IDIES)AwardtoE.J.F.andR.A.,aJohnsHopkinsSchoolofMedicineSynergyawardtoE.J.F. andL.A.G.,agrantfromTheGordonandBettyMooreFoundation(GBMF4552)toC.S.G.,Alex’sLemonadeStand Foundation’sChildhoodCancerDataLab(C.S.G.),awardK01ES025434fromtheNationalInstituteofEnvironmental HealthSciences(NIEHS)throughfundsprovidedbythetrans-NIHBigDatatoKnowledge(BD2K)initiative(L.X.G.),award P20COBREGM103457fromtheNIH/NationalInstituteofGeneralMedicalSciences(NIGMS;toL.X.G.),R01LM012373 fromtheNLM(L.X.G.),R01HD084633fromtheNationalInstituteofChildHealthandHumanDevelopment(NICHD;toL.X. G.),theDepartmentofDefenseBreastCancerResearchProgram(BCRP;awardBC140682P1,A.C.C.),theNational ScienceandEngineeringCouncilofCanada(NSERC;DGgrantnumberRGPIN-2016-05017,A.N.),theWindsor-Essex CountyCancerCentreFoundation(Seeds4Hope grant814221,A.N.),HopkinsinHealth andBooz AllenHamilton (90056858)toY.X.,theRussianFoundationforBasicResearch(KOMFI17-00-00208)andNIH(NCIP30CA006973) toA.V.F.,andtheNationalResearchCouncilofCanadatoY.L.Thisprojecthasbeenmadepossibleinpartbygrants 2018-183444(E.J.F.),2018-128827(L.A.G.),and2018-182718(C.S.G.)fromtheChanZuckerbergInitiative Donor-AdvisedFund(DAF),anadvisedfundoftheSiliconValleyCommunityFoundation.Theviewsandopinionsof,and endorsementsby,theauthor(s)donotreflectthoseoftheUSArmyortheDepartmentofDefense.
SupplementalInformation
Supplementalinformationassociatedwiththisarticlecanbefound,intheonlineversion,athttps://doi.org/10.1016/j.tig.
2018.07.003.
References
1. Bell, G. etal.(2009) Beyond the data deluge. Science323, 1297–1298
2. Sagoff,M.(2012)Datadelugeandthehumanmicrobiome
project.IssuesSci.Technol.28, http://issues.org/28-4/
sagoff-3/
3. Alter,O.(2006)Discoveryofprinciplesofnaturefrom mathe-maticalmodelingofDNAmicroarraydata.Proc.Natl.Acad.Sci. U.S.A.103,16063–16064
4. Heyn,P.etal.(2015)Intronsandgeneexpression:cellular constraints, transcriptional regulation, and evolutionary conse-quences. Bioessays37, 148–154
5. Ochs, M.F. and Fertig, E.J. (2012) Matrix factorization for transcriptional regulatory network inference. IEEE Symp. Comput. Intell. Bioinforma. Comput. Biol. Proc. 2012, 387–396
6. Abdi, H. etal.(2013) Multiple factor analysis: principal compo-nent analysis for multitable and multiblock data sets. WIREs Comp.Stat.5, 149–179
7. Meng, C. etal.(2016) Dimension reduction techniques for the integrative analysis of multi-omics data. Brief.Bioinform.17, 628–641
8. Li, Y. etal.(2016) A review on machine learning principles for multi-view biological data integration. Brief.Bioinform.19, 325– 340
9. Devarajan,K.(2008)Nonnegativematrixfactorization:an ana-lyticalandinterpretivetoolincomputationalbiology.PLoS Com-put.Biol.4,e1000029
10. Ritchie,M.E.etal.(2015)limmapowersdifferentialexpression analysesforRNA-sequencingandmicroarraystudies.Nucleic AcidsRes.43,e47
11. Anders, S. and Huber, W. (2010) Differential expression analysis for sequence count data. GenomeBiol.11, R106
12. Xie,F. etal. (2017)BayCount:aBayesiandecomposition
method for inferring tumor heterogeneity using RNA-Seq
counts.bioRxivPublishedonlineNovember13,2017.http://
dx.doi.org/10.1101/218511
13. Alexandrov, L.B. etal.(2013) Signatures of mutational pro-cesses in human cancer. Nature500, 415–421
14. Alexandrov,L.B.etal.(2016)Mutationalsignaturesassociated withtobaccosmokinginhumancancer.Science354,618–622
15. Favorov,A.V.etal. (2005)AMarkovchainMonteCarlo techniqueforidentificationofcombinationsofallelicvariants underlyingcomplex diseases inhumans. Genetics171, 2113–2121
16. Zakeri, M. etal.(2017) Improved data-driven likelihood factori-zations for transcript abundance estimation. Bioinformatics33, i142–i151
17. Bertagnolli, N.M. etal.(2013) SVD identifiestranscript length distribution functions from DNA microarray data and reveals evolutionary forces globally affecting GBM metabolism. PLoS One8, e78913
18. Peckner,R.etal.(2017)Specter:lineardeconvolutionasanew
paradigmfortargetedanalysisofdata-independentacquisition
mass spectrometry proteomics. bioRxiv Published online
September8,2017.http://dx.doi.org/10.1101/152744
19. Yeung, K.Y. etal.(2001) Model-based clustering and data trans-formations for gene expression data. Bioinformatics 17, 977–987
20. Jiang, D. etal.(2004) Cluster analysis for gene expression data: asurvey.IEEETrans.Knowl.DataEng.16,1370–1386
21. Venet,D.etal.(2001)Separationofsamplesintotheir con-stituents using gene expression data. Bioinformatics 17, S279–S287
22. Abbas,A.R.etal.(2009)Deconvolutionofbloodmicroarraydata identifiescellularactivationpatternsinsystemiclupus erythe-matosus. PLoSOne4, e6098
23. Erkkilä, T. etal.(2010) Probabilistic analysis of gene expression measurements from heterogeneous tissues. Bioinformatics26, 2571–2577
24. Leek, J.T. (2011) Asymptotic conditional singular value decom-position for high-dimensional genomic data. Biometrics67, 344–352
25. Kelton, C.J. etal.(2015) The estimation of dimensionality in gene expression data using nonnegative matrix factorization. 2015 IEEEInternationalConferenceonBioinformaticsand Biomedi-cine(BIBM)1642–1649
26. Meng, C. etal.(2016) Dimension reduction techniques for the integrative analysis of multi-omics data. Brief.Bioinform.17, 628–641
27. Berry, M.W. etal.(2007) Algorithms and applications for approx-imate nonnegative matrix factorization. Comput.Stat.DataAnal. 52, 155–173
28. Wang, Y.-X. andZhang, Y.-J. (2013) Nonnegative matrix factorization:acomprehensivereview.IEEETrans.Knowl.Data Eng.25,1336–1353
29. Zhou,G.etal.(2014)Nonnegativematrixandtensor factoriza-tions:analgorithmicperspective.IEEESignalProcess.Mag.31, 54–65
30. Lee, S.-I. and Batzoglou, S. (2003) Application of independent component analysis to microarrays. GenomeBiol.4, R76
31. Engreitz, J.M. etal.(2010) Independent component analysis: mining microarray data for fundamental human gene expression modules. J.Biomed.Bioinf.43, 932–944
32. Teschendorff, A.E. etal.(2007) Elucidating the altered transcrip-tional programs in breast cancer using independent component analysis. PLoSComput.Biol.3, e161
33. Ochs, M.F. etal.(1999) A new method for spectral decomposi-tion using a bilinear Bayesian approach. J.Magn.Reson.137, 161–176
34. Lee, D.D. and Seung, H.S. (1999) Learning the parts of objects by non-negative matrix factorization. Nature401, 788–791
35. Moloshok, T.D. etal.(2002) Application of Bayesian decompo-sitionforanalysingmicroarraydata.Bioinformatics18,566–575
36. Kossenkov,A.V.etal.(2007)Determiningtranscriptionfactor activityfrommicroarraydatausingBayesianMarkovchain MonteCarlosampling. Stud.HealthTechnol.Inform.129, 1250–1254
37. Mairal,J.etal.(2010)Onlinelearningformatrixfactorizationand sparse coding. J.Mach.Learn.Res.11, 19–60
38. Wu, S. etal.(2016) Stability-driven nonnegative matrix factori-zation to interpret spatial gene expression and build local gene networks. Proc.Natl.Acad.Sci.U.S.A.113, 4290–4295
39. Hyvärinen, A. and Oja, E. (2000) Independent component anal-ysis: algorithms and applications. NeuralNetw.13, 411–430
40. Fertig, E.J. etal.(2010) CoGAPS: an R/C++ package to identify patterns and biological process activity in transcriptomic data. Bioinformatics26, 2792–2793
41. Stein-O’Brien,G.L. etal.(2017) PatternMarkers & GWCoGAPS for novel data-driven biomarkers via whole transcriptome NMF. Bioinformatics33, 1892–1894
42. Dey, K.K. etal.(2017) Visualizing the structure of RNA-seq expression data using grade of membership models. PLoS Genet.13, e1006599
43. Biton, A. et al. (2014) Independent component analysis uncoversthelandscapeofthebladdertumortranscriptome andrevealsinsights intoluminalandbasal subtypes.Cell Rep.9,1235–1245
44. Fertig,E.J.etal.(2013)Preferentialactivationofthehedgehog pathwaybyepigeneticmodulationsinHPVnegativeHNSCC identifiedwith meta-pathway analysis. PLoSOne8, e78127
45. Bidaut, G. etal.(2010) Interpreting and comparing clustering experiments through graph visualization and ontology statistical enrichment with the ClutrFree Package. In BiomedicalInformatics forCancerResearch(Ochs, M.F., ed.), pp. 315–333,Springer
46. Bidaut, G. etal.(2006) Determination of strongly overlapping signaling activity from microarray data. BMCBioinformatics7, 99
47. Xu, Y. etal.(2015) MAD Bayes for tumor heterogeneity –feature allocation with exponential family sampling. J.Am.Stat.Assoc. 110, 503–514
48. Novembre, J. etal.(2008) Genes mirror geography within Europe. Nature456, 98–101
49. Engelhardt, B.E. and Stephens, M. (2010) Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis. PLoSGenet.6, e1001117
50. McCarthy, M.I. etal.(2008) Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat. Rev.Genet.9, 356–369
51. Hackl, H. etal.(2016) Computational genomics tools for dis-secting tumour-immune cell interactions. Nat.Rev.Genet.17, 441–458
52. Fertig,E.J.etal.(2014)Patternidentificationintime-coursegene expressiondatawiththeCoGAPSmatrixfactorization.Methods Mol.Biol.1101,87–112
53. Nik-Zainal,S.etal.(2012)Thelifehistoryof21breastcancers. Cell149,994–1007
54. Roth, A. etal.(2014) PyClone: statistical inference of clonal population structure in cancer. Nat.Methods11, 396–398
55. Deshwar, A.G. etal.(2015) PhyloWGS: reconstructing subclo-nal composition and evolution from whole-genome sequencing of tumors. GenomeBiol.16, 35
56. Lee, J. etal.(2016) Bayesian inference for intratumour hetero-geneity in mutations and copy number variation. J.R.Stat.Soc. Ser.C.Appl.Stat.65, 547–563
57. Bar-Joseph, Z. etal.(2012) Studying and modelling dynamic biological processes using time-series gene expression data. Nat.Rev.Genet.13, 552–564
58. Liang, Y. and Kelemen, A. (2017) Dynamic modeling and
network approaches for omics time course data: overview of
compu-tationalapproachesandapplications.Brief.Bioinform.Published
onlineApril18,2017.http://dx.doi.org/10.1093/bib/bbx036
59. Moloshok,T.D.etal.(2002)ApplicationofBayesian decom-positionforanalysingmicroarraydata.Bioinformatics18, 566–575
60. Liebermeister,W.(2002)Linearmodesofgeneexpression determinedbyindependentcomponentanalysis.Bioinformatics 18, 51–60
61. Ochs, M.F. et al. (2009) Detection of treatment-Induced changes in signaling pathways in gastrointestinal stromal tumors using transcriptomic data. CancerRes.69, 9125–9132
62. Hill, S.M. etal.(2016) Inferring causal molecular networks: empirical assessment through a community-based effort. Nat. Methods13, 310–318
63. Stein-O’Brien,G.etal.(2017)Integratedtime-courseomics
analysisdistinguishesimmediatetherapeuticresponsefrom
acquiredresistance.bioRxivPublishedonlineAugust1,2017.
http://dx.doi.org/10.1101/136564
64. Khatri, P. etal.(2012) Ten years of pathway analysis: current approaches and outstanding challenges. PLoSComput.Biol.8, e1002375
65. Irizarry, R.A. etal.(2009) Gene set enrichment analysis made simple. Stat.MethodsMed.Res.18, 565–575
66. Bauer-Mehren,A.etal.(2009)Pathwaydatabasesandtoolsfor theirexploitation:benefits,currentlimitationsandchallenges. Mol.Syst.Biol.5,290
67. Tsui,I.F.L.etal.(2007)Publicdatabasesandsoftwarefor thepathwayanalysisofcancergenomes.CancerInform.3, 379–397
68. The GTEx Consortium (2015) The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science348, 648–660
69. Tan, J. etal.(2017) Unsupervised extraction of stable expres-sion signatures from public compendia with an ensemble of neural networks. CellSyst.5, 63–71
70. Kim, J.W. etal.(2017) Decomposing oncogenic transcriptional signatures to generate maps of divergent cellular states. Cell Syst.5, 105–18.e9
71. Fertig, E.J. etal.(2012) Gene expression signatures modulated by epidermal growth factor receptor activation and their relationship to cetuximab resistance in head and neck squa-mous cell carcinoma. BMCGenomics13, 160
72. Fertig, E.J. etal.(2012) Identifying context-specifictranscription factor targets from prior knowledge and gene expression data. IEEETrans.Nanobioscience12, 142–149
73. Segal, E. etal.(2004) A module map showing conditional activity of expression modules in cancer. Nat.Genet.36, 1090–1098
74. Subramanian, A. etal.(2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expressionprofiles.Proc.Natl.Acad.Sci.102,15545–15550
75. Zhu,X.etal.(2017)Detectingheterogeneityinsingle-cell RNA-Seqdatabynon-negativematrixfactorization.PeerJ.5,e2888
76. DeTomaso,D.andYosef,N.(2016)FastProject:atoolfor low-dimensionalanalysisofsingle-cellRNA-Seqdata.BMC Bioin-formatics17, 315
77. Fertig, E.J. etal.(2013) Identifying context-specifictranscription factor targets from prior knowledge and gene expression data. IEEETrans.Nanobiosci.12, 142–149
78. Irizarry, R.A. etal.(2005) Multiple-laboratory comparison of microarray platforms. Nat.Methods2, 345–350
79. Fan, J. etal.(2016) Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat. Methods3, 241–244
80. Trapnell, C. etal. (2014) The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat.Biotechnol.32, 381–386
81. Townes,F.W.etal.(2017)Varying-censoring awarematrix
factorizationforsinglecellRNA-sequencing.bioRxivPublished
onlineJuly21,2017.http://dx.doi.org/10.1101/166736
82. Moon,K.R.etal.(2017)PHATE:adimensionalityreduction
methodforvisualizingtrajectorystructuresinhigh-dimensional
biologicaldata.bioRxivPublishedonlineMarch24,2017.http://
dx.doi.org/10.1101/120378
83. Puram,S.V.etal.(2017)Single-velltranscriptomicanalysisof primary and metastatic tumor ecosystems in head and neck cancer. Cell171, 1611–1624
84. Hübschmann,D.etal.(2017)Decipheringprogramsof
tran-scriptionalregulationbycombineddeconvolutionofmultiple
omics layers. bioRxiv Published online October 8, 2017.
http://dx.doi.org/10.1101/199547
85. Buettner, F. etal.(2017) f-scLVM: scalable and versatile factor analysis for single-cell RNA-seq. GenomeBiol.18, 212
86. Buettner,F.etal.(2016)Scalablelatent-factormodelsappliedto
single-cellRNA-seqdataseparatebiologicaldriversfrom
con-foundingeffects.bioRxivPublishedonlineNovember15,2016.
http://dx.doi.org/10.1101/087775
87. vanDijk,D.etal.(2017)MAGIC:adiffusion-basedimputation
methodreveals gene-geneinteractions in single-cell
RNA-sequencing data. bioRxiv Published online February 25,
2017.http://dx.doi.org/10.1101/111591
88. Risso,D.etal.(2017)ZINB-WaVE:ageneralandflexiblemethodfor
signalextractionfromsingle-cellRNA-seqdata.bioRxivPublished
onlineNovember2,2017.http://dx.doi.org/10.1101/125112
89. Pierson,E.andYau,C.(2015)ZIFA:Dimensionalityreductionfor zero-inflatedsingle-cellgeneexpressionanalysis.GenomeBiol. 16, 241
90. van der, M.L. and Hinton, G. (2008) Visualizing data using t-SNE. J.Mach.LearnRes.9, 2579–2605
91. Alter, O. etal.(2000) Singular value decomposition for genome-wide expression data processing and modeling. Proc.Natl. Acad.Sci.U.S.A.97, 10101–10106
92. Fellenberg, K. etal.(2001) Correspondence analysis applied to microarray data. Proc.Natl.Acad.Sci.U.S. A. 98, 10781–10786
93. Brunet, J.P. etal.(2004) Metagenes and molecular pattern discovery using matrix factorization. Proc.Natl.Acad.Sci. U.S.A.101, 4164–4169
94. Abdi, H. and Williams, L.J. (2010) Principal component analysis. WIREsComp.Stat.2, 433–459
95. Hyvärinen, A. etal.(2004) IndependentComponentAnalysis, John Wiley & Sons
96. Hardoon, D.R. etal.(2004) Canonical correlation analysis: an overview with application to learning methods. Neural.Comput. 16, 2639–2664
97. Schölkopf, B. etal.(1999) Kernel principal component analysis. In AdvancesInKernel Methods– SupportVectorLearning (Schölkopf, B., ed.), MIT Press, pp. 327–000
98. Arora,R.andLivescu,K.(2012)KernelCCAformulti-view learningofacousticfeaturesusingarticulatorymeasurements. InSymposiumonMachineLearninginSpeechandLanguage Processing.MSLP
99. Andrew,G.etal.(2013)Deepcanonicalcorrelationanalysis. Proceedingsofthe30thInternationalConferenceonMachine Learning1247–1255
100.Ding, C. and He, X. (2004) K-means clustering via principal component analysis. Proceedingsofthe21stInternational Con-ferenceonMachineLearning29
101.Arora, R. etal. (2011) Clustering by left-stochastic matrix factorization. Proceedingsofthe28thInternationalConference onMachineLearning28, 761–768
102.Kulis, B. (2013) Metric learning: a survey. Found.TrendsMach. Learn.5, 287–364
103.Pritchard, J.K. etal.(2000) Inference of population structure using multilocus genotype data. Genetics155, 945–959
104.De la Torre, F. and Black, M.J. (2003) A framework for robust subspace learning. Int.J.Comput.Vis.54, 117–142
105.Szeliski,R.(2010)ComputerVision:AlgorithmsandApplications.
http://szeliski.org/Book/drafts/SzeliskiBook_20100903_draft.pdf
106.Candès,E.J.etal.(2011)Robustprincipalcomponentanalysis? J.ACM58,11
107.Arora,R.etal.(2012)StochasticoptimizationforPCAandPLS. In50thAnnualAllertonConferenceonCommunicationControl, andComputing,pp.861–868,Allerton
108.Arora, R. etal.(2013) Stochastic optimization of PCA with capped MSG. In AdvancesinNeuralInformationProcessing Systems(Vol. 26) (Burges, C.J.C., ed.), In pp. 1815–1823, Curran Associates
109.Goes,J.etal.(2014)Robuststochasticprincipalcomponent
analysis. In Proceedings of the Seventeenth International
ConferenceonArtificialIntelligenceandStatistics(Kaski,S.
andCorander,J.,eds),pp.266–274,PMLR
110.Bickel, S. and Scheffer, T. (2004) Multi-view clustering. Proceedingsof theIEEE InternationalConferenceonData Mining19–26
111.Candès, E.J. and Recht, B. (2009) Exact matrix completion via convex optimization. FoundComut.Math.9, 717
112.Argyriou, A. etal.(2007) Multi-task feature learning. Adv.Neural. Inf.Process.Syst.19,41–48
113.Ando,R.K.andZhang,T.(2005)Aframeworkforlearning predictivestructuresfrommultipletasksandunlabeleddata. J.Mach.Learn.Res.6,1817–1853
114.Cleary,B.(2017) Compositemeasurementsandmolecular
compressedsensingforhighlyefficienttranscriptomics.bioRxiv
PublishedonlineJanuary2,2017.http://dx.doi.org/10.1101/
091926
115.Aha, D.W. etal.(1991) Instance-based learning algorithms. Mach.Learn.6, 37–66
116.Arora, R. etal.(2013) Similarity-based clustering by left-stochas-tic matrix factorization. J.Mach.Learn.Res.14, 1715–1746
117.Liao, R. etal.(2014) CloudNMF: a MapReduce implementation of nonnegative matrix factorization for large-scale biological datasets. GenomicsProteomicsBioinf.12, 48–51
118.de Campos, C.P. etal.(2013) Discovering subgroups of patients from DNA copy number data using NMF on compacted matri-ces. PLoSOne8, e79720
119.Huang, S. etal.(2017) More is better: recent progress in multi-omics data integration methods. Front.Genet.8, 84
120.Hore, V. etal.(2016) Tensor decomposition for multiple-tissue gene expression experiments. Nat.Genet.48, 1094–1100
121.Durham, T.J. etal.(2018) PREDICTD parallel epigenomics data Imputation with cloud-based tensor decomposition. Nat. Commun.9, 1402
122.Zhu, Y. etal.(2016) Constructing 3D interaction maps from 1D epigenomes. Nat.Commun.7, 10812
123.Wang,M.etal.(2017)Three-wayclusteringofmulti-tissue
multi-individual gene expression data using constrained tensor
decomposition.bioRxivPublishedonlineDecember5,2017.
http://dx.doi.org/10.1101/229245
124.Kolda, T. and Bader, B. (2009) Tensor decompositions and applications. SIAMRev.51, 455–500
125.Mao, W. et al. (2017) Pathway-level information extractor (PLIER):a
generativemodelforgeneexpressiondata.bioRxivPublished
onlineDecember162017.http://dx.doi.org/10.1101/116061
126.Hofree, M. etal.(2013) Network-based stratificationof tumor mutations. Nat.Methods10, 1108
127.Liao, J.C. etal.(2003) Network component analysis: recon-struction of regulatory signals in biological systems. Proc.Natl. Acad.Sci.U.S.A.100, 15522–15527