Enter the matrix: factorization uncovers knowledge from omics


Publisher's version:

Trends in Genetics, 34, 10, pp. 790-805, 2018-08-22


For the publisher's version, please access the DOI link below.

https://doi.org/10.1016/j.tig.2018.07.003


Stein-O’brien, Genevieve L.; Arora, Raman; Culhane, Aedin C.; Favorov,

Alexander V.; Garmire, Lana X.; Greene, Casey S.; Goff, Loyal A.; Li, Yifeng;

Ngom, Aloune; Ochs, Michael F.; Xu, Yanxun; Fertig, Elana J.


NRC Publications Record:

https://nrc-publications.canada.ca/eng/view/object/?id=45003843-121a-4853-8aeb-16267d825c7f


Review

Enter the Matrix: Factorization Uncovers Knowledge from Omics

Genevieve L. Stein-O'Brien,1,2,3 Raman Arora,4 Aedin C. Culhane,5,6 Alexander V. Favorov,1,7 Lana X. Garmire,8 Casey S. Greene,9,10 Loyal A. Goff,2,3 Yifeng Li,11 Aloune Ngom,12 Michael F. Ochs,13 Yanxun Xu,14 and Elana J. Fertig1,*

Omics data contain signals from the molecular, physical, and kinetic inter- and intracellular interactions that control biological systems. Matrix factorization (MF) techniques can reveal low-dimensional structure from high-dimensional data that reflect these interactions. These techniques can uncover new biological knowledge from diverse high-throughput omics data in applications ranging from pathway discovery to timecourse analysis. We review exemplary applications of MF for systems-level analyses. We discuss appropriate applications of these methods, their limitations, and focus on the analysis of results to facilitate optimal biological interpretation. The inference of biologically relevant features with MF enables discovery from high-throughput data beyond the limits of current biological knowledge, answering questions from high-dimensional data that we have not yet thought to ask.

Determining the Dimensions of Biology from Omics Data

High-throughput technologies have ushered in an era of big data in biology [1,2] and empowered in silico experimentation which is poised to characterize complex biological processes (CBPs; see Glossary) [3]. The natural representation of high-dimensional biological data is a matrix of the measured values (expression counts, methylation levels, protein concentrations, etc.) in rows and individual samples in columns (Figure 1). Columns corresponding to experimental replicates, or samples with similar phenotypes, will have values from the same distribution of biological variation. Related structures in the data are observed because they share one or more CBP. The activity of CBPs need not be identical in each sample. In these cases, the values of all molecular components that are associated with a CBP will change proportionally to the relative activity of that CBP. These phenotypes and CBP activities are often unknown a priori, requiring computational techniques for unsupervised learning to discover CBPs directly from the biological data.

The relationships between CBPs and similarities between samples constrain high-dimensional datasets to have low-dimensional structure. The number of genes, proteins, and pathways that are concurrently active within any cell is constrained by its energy and free-molecule limitations [4]. Only a characteristic subset of CBPs will be active in any cell at a given time. Thus, for a dataset where columns share CBP, a low-dimensional structure can be extracted which is smaller than either the number of rows or the number of columns.

Matrix factorization (MF) is a class of unsupervised techniques that provide a set of principled approaches to parsimoniously reveal the low-dimensional structure while preserving as much information as possible from the original data. MF is also referred to as matrix decomposition, and the corresponding inference problem as deconvolution. Other reviews discuss the mathematical and technical details of MF techniques [5–8] and their applications to microarray data [9]. We focus here on the biological applications of MF techniques and the interpretation of their results since the advent of sequencing technologies. We describe a variety of MF techniques applied to high-throughput data analysis, and compare and contrast their use for biological inference from bulk and single-cell data.

Highlights

MF techniques infer low-dimensional structure from high-dimensional omics data to enable visualization and inference of complex biological processes (CBPs).

Different MFs applied to the same data will learn different factors. Exploratory data analysis should employ multiple MFs, whereas a specific biological question should employ a specific MF tailored to that problem.

MFs learn two sets of low-dimensional representations (in each matrix factor) from high-dimensional data: one defining molecular relationships (amplitude) and another defining sample-level relationships (pattern).

Data-driven functional pathways, biomarkers, and epistatic interactions can be learned from the amplitude matrix.

Clustering, subtype discovery, in silico microdissection, and timecourse analysis are all enabled by analysis of the pattern matrix.

MF enables both multi-omics analyses and analyses of single-cell data.

1 Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD, USA

2 Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, USA

3 McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA

4 Department of Computer Science, Institute for Data Intensive Engineering and Science, Johns Hopkins University, Baltimore, MD, USA

5 Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA

6 Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, MA, USA

7 Vavilov Institute of General Genetics, Moscow, Russia

8 University of Hawaii Cancer Center, Honolulu, HI, USA

9 Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, PA, USA

10 Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, PA, USA

11 Digital Technologies Research Centre, National Research Council of Canada, Ottawa, ON, Canada

12 School of Computer Science, University of Windsor, Windsor, ON, Canada

13 Department of Mathematics and Statistics, The College of New Jersey, Ewing, NJ, USA

14 Department of Applied Mathematics and Statistics, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA

*Correspondence: ejfertig@jhmi.edu (E.J. Fertig).

Workflow for MF Analysis

[Figure 1 schematic: raw RNA-seq reads; alignment and quantification; normalization and log-transformation; data matrix (genes in rows, samples in columns); matrix factorization (NMF, ICA, or PCA); amplitude matrix (genes by factors; molecular relationships) and pattern matrix (factors by samples; sample relationships). Uses listed for the amplitude matrix: gene set discovery, pathway analysis, biomarker discovery. Uses listed for the pattern matrix: clustering analysis, subtype/subclone discovery, timecourse analysis. Method labels: NMF (dependent signals, non-negative values), ICA (independent signals), PCA (maximal separation of orthogonal signal).]

Figure 1. Omics Technologies Yield a Data Matrix That Can Be Interpreted through MF. The data matrix from omics has each sample as a column and each observed molecular value (expression counts, methylation levels, protein concentrations, etc.) as a row. This data matrix is preprocessed with techniques specific to each measurement technology, and is then input to a matrix factorization (MF) technique for analysis. MF decomposes the preprocessed data matrix into two related matrices that represent its sources of variation: an amplitude matrix and a pattern matrix. The rows of the amplitude matrix quantify the sources of variation among the molecular observations, and the columns of the pattern matrix quantify the sources of variation among the samples. Abbreviations: ICA, independent component analysis; NMF, non-negative matrix factorization; PCA, principal component analysis.

After data preprocessing, most high-throughput molecular datasets can be represented as a matrix in which each element contains the measurement of a single molecule in a single experimental condition. In the example of RNA sequencing (RNA-seq), the number of short reads from each gene is summarized into gene-level counts. The resulting high-dimensional dataset is formulated by representing these gene-level counts for each sample as a column in the data matrix (Figure 1). MF methods can then be applied to these count matrices to learn CBPs from the data. Many MF techniques described are for RNA-seq data that are preprocessed using log transformation [10] or models of sequencing depth [11], while others directly model read counts [12].
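As a minimal sketch of the preprocessing step described above, the snippet below scales a toy count matrix (genes in rows, samples in columns) by the sequencing depth of each sample and then applies a log transformation. The counts-per-million scaling, the pseudocount of 1, and the function name `log_cpm` are illustrative assumptions and do not reproduce the pipelines of the methods cited here.

```python
import math

def log_cpm(counts, pseudocount=1.0):
    """Convert a genes x samples count matrix to log2 counts-per-million.

    counts: list of rows (one per gene); each row holds one count per sample.
    pseudocount: hypothetical offset that keeps log2 defined for zero counts.
    """
    n_samples = len(counts[0])
    # Sequencing depth (library size) of each sample is its column sum.
    depth = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [math.log2(row[j] / depth[j] * 1e6 + pseudocount) for j in range(n_samples)]
        for row in counts
    ]

# Toy data: 3 genes (rows) x 2 samples (columns).
# The second sample is sequenced twice as deeply as the first.
counts = [
    [10, 20],
    [0, 0],
    [90, 180],
]
data_matrix = log_cpm(counts)
```

After depth scaling, the two samples have matching values for every gene, so the twofold difference in raw counts is not mistaken for biological signal by a downstream factorization.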

Most examples featured in this review are based on such preprocessed RNA-seq analysis of gene counts or log-transformed gene counts. We note that applications of MF are not limited to this data modality or this preprocessing pipeline. For example, MF has also been applied to define mutational signatures in cancer [13,14], allele combinations in phenotypes [15], transcript regulation of genes [16], and distributions of transcript lengths [17], as well as to discriminate peptides in mass spectrometry proteomics [18]. To apply MF to other data modalities, they must also be properly preprocessed into a data matrix with a distribution appropriate to the MF analysis method.

When applied to high-throughput omics data, MF techniques learn two matrices: one describes the structure between features (e.g., genes) and another describes the structure between samples (Figure 2, Key Figure). We call the former feature-level matrix the amplitude matrix and the latter sample-level matrix the pattern matrix. Additional terms have been coined for the amplitude and pattern matrices based upon the MF problem applied and on the specific application to high-throughput biological data (Box 1). The values in each column of the amplitude matrix are continuous weights describing the relative contribution of a molecule in each inferred factor. In cases where factors distinguish between CBPs, the relative weights of these molecules can be associated with functional pathways. The same molecule may have high values in multiple columns of the amplitude matrix. Thus, MF techniques are able to account for the cumulative effect of genes that participate in multiple pathways.
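The cumulative effect described above can be illustrated with a small, made-up example in which the data matrix is the product of an amplitude matrix and a pattern matrix. All numbers and the helper `matmul` are hypothetical and serve only to show how a gene with weights in two factors accumulates signal from both.

```python
def matmul(a, p):
    """Multiply an (m x k) matrix by a (k x n) matrix given as nested lists."""
    k = len(p)
    return [
        [sum(row[f] * p[f][j] for f in range(k)) for j in range(len(p[0]))]
        for row in a
    ]

# Amplitude matrix: 4 genes (rows) x 2 factors (columns).
# The second gene has a non-zero weight in both factors,
# i.e. it takes part in both processes.
amplitude = [
    [2.0, 0.0],
    [1.0, 1.0],
    [0.0, 3.0],
    [0.0, 1.0],
]

# Pattern matrix: 2 factors (rows) x 3 samples (columns).
pattern = [
    [1.0, 0.0, 1.0],  # factor 1 is active in samples 1 and 3
    [0.0, 1.0, 1.0],  # factor 2 is active in samples 2 and 3
]

# Data matrix: 4 genes (rows) x 3 samples (columns).
data = matmul(amplitude, pattern)
```

In the third sample, where both factors are active, the value of the shared gene is the sum of its contributions from the two factors.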

Whereas each column of the amplitude matrix describes the relative contributions of molecules to a factor, each row of the pattern matrix describes the relative contributions of samples to a factor (Figure 2). Sample groups can be learned by comparing the relative weights in each row of the pattern matrix. The pattern matrix from MF can also be binarized to perform clustering [19,20] or kept as continuous values to define relationships between samples [21–23]. In the same way as molecules with high weights within a column of the amplitude matrix are associated with a common pathway, samples with high weights within a row of the pattern matrix can be assumed to share a common phenotype or CBP.
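One simple way to binarize a pattern matrix for clustering is to assign each sample to the factor in which it has the largest weight. The sketch below assumes this largest-weight rule, which is only one possible choice and not necessarily the procedure of the cited studies; the function names and the toy matrix are hypothetical.

```python
def assign_clusters(pattern):
    """Assign each sample (column) to the factor (row) with the largest weight."""
    n_factors, n_samples = len(pattern), len(pattern[0])
    return [
        max(range(n_factors), key=lambda f: pattern[f][j])
        for j in range(n_samples)
    ]

def binarize(pattern):
    """Return a 0/1 matrix with a single 1 per column marking the assigned factor."""
    labels = assign_clusters(pattern)
    return [
        [1 if labels[j] == f else 0 for j in range(len(labels))]
        for f in range(len(pattern))
    ]

# Hypothetical pattern matrix: 2 factors (rows) x 4 samples (columns).
pattern = [
    [0.9, 0.8, 0.1, 0.2],
    [0.1, 0.3, 0.7, 0.6],
]
labels = assign_clusters(pattern)    # one factor index per sample
binary_pattern = binarize(pattern)   # hard cluster membership
```

Keeping the continuous weights instead of the binary matrix preserves graded relationships, such as the second sample carrying some weight in both factors.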

The optimal number of columns of the amplitude matrix and rows of the pattern matrix is often referred to as the dimensionality of the MF problem, and learning this value remains an open problem for the MF research community [24,25]. We also note that MF is not a single computational method. Instead, there is a wide body of literature on numerous MF techniques that have been applied throughout computational biology. The properties of both the amplitude and pattern matrices, and subsequently the interpretation of their values, depend crucially on the specific MF problem and the algorithm selected for analysis.

MF Techniques: PCA, ICA, and NMF

There are numerous approaches to MF. The three most prominent MF approaches are principal component analysis (PCA), independent component analysis (ICA), and non-negative matrix factorization (NMF). Each of these techniques has a distinct mathematical formulation of a distinct MF problem (as described in the supplemental information online and in other reviews [5,8,26–29]).

Glossary

Amplitude matrix: the matrix learned from MF that contains molecules in rows and factors in columns. Each column represents the relative contributions of the genes to a factor, and these can be used to define a molecular signature for a CBP.

Complex biological process (CBP): the coregulation or coordinated effect of multiple molecular species resulting in one or more phenotypes. Examples can range from activation of multiple proteins in a single cellular signaling pathway to epistatic regulation of development.

Computational microdissection: a computational method to learn the composition of a heterogeneous sample, for example the cell types in a tissue sample.

Independent component analysis (ICA): an MF technique that learns statistically independent factors.

Matrix factorization (MF): a technique to approximate a data matrix by the product of two matrices (Box 1), one of which we call the amplitude matrix and the other the pattern matrix.

Non-negative matrix factorization (NMF): an MF technique for which all elements of the amplitude and pattern matrices are greater than or equal to zero.

Pattern matrix: the matrix learned from MF that contains factors in rows and samples in columns. Each row represents the relative contributions of the samples to a factor, and these can be used to define the relative activity of CBPs in each sample.

Principal component analysis (PCA): an MF technique that learns orthogonal factors ordered by the relative amount of variation of the data that they explain.

Unsupervised learning: computational techniques to discover features from high-dimensional data, without reliance on prior knowledge of low-dimensional covariates.

Briefly, PCA finds dominant sources of variation in high-dimensional datasets, inferring genes that distinguish between samples. Maximizing the variability captured in specific factors, as opposed to spreading relatively evenly among factors, may mix the signal from multiple CBPs in a single component. Therefore, PCA may conflate processes that sometimes occur and complicate interpretation of the amplitude matrix for defining data-driven gene sets or the inference of specific CBPs.

Key Figure: The Matrix Product of the Amplitude and Pattern Matrices Approximates the Preprocessed Input Data Matrix

[Figure 2 schematic. (A) Data (molecules by samples) = Amplitude (molecules by factors) × Pattern (factors by samples). Pattern matrix: rows of continuous or binary weights associated with samples (each sample is a column), capturing relationships including cell types/lines, patients, or experimental conditions. Amplitude matrix: columns of genetic, epigenetic, or protein weights (each row is a unique molecule) associated with a given sample feature and often reflective of co-regulation. (B) Patterns in groups of samples, with the magnitude of rows of the pattern matrix scaled from 0 to Max(P), and the associated molecular signatures (metagenes, modules, etc.) for molecules of interest, i.e., genes.]

Figure 2. (A) The number of columns of the amplitude matrix equals the number of rows in the pattern matrix, and represents the number of dimensions in the low-dimensional representation of the data. Ideally, a pair of one column in the amplitude matrix and the corresponding row of the pattern matrix represents a distinct source of biological, experimental, and technical variation in each sample (called complex biological processes, CBPs). (B) The values in the column of the amplitude matrix then represent the relative weights of each molecule in the CBP, and the values in the row of the pattern matrix represent its relative role in each sample. Plotting of the values of each pattern for a pre-determined sample grouping (here indicated by yellow, grey, and blue) in a boxplot as an example of a visualization technique for the pattern matrix. Abbreviation: Max(P), maximum value of each row of the pattern matrix.

ICA and NMF learn distinct processes from an input data matrix using different techniques. ICA learns factors that are statistically independent, resulting in more accurate association with literature-derived gene sets [30–32]. NMF methods constrain all elements of the amplitude and pattern matrices to be greater than or equal to zero [33,34]. Whereas the features in PCA can be ranked by the extent to which they explain the variation in the data, the features in both ICA and NMF are assumed to have equal weight. NMF is well suited to transcriptional data, which is typically non-negative itself, and semi-NMF is also applicable to data that can have negative values. The assumptions of NMF model both the additive nature of CBPs and parsimony, generating solutions that are biologically intuitive to interpret [35].
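To make the non-negativity constraint concrete, the sketch below factors a small non-negative data matrix with multiplicative updates, a standard gradient-style NMF solver. It is not the Bayesian CoGAPS algorithm applied later in this review; the function names, the toy data, the iteration count, and the small constant `eps` are illustrative assumptions, and plain Python lists are used only to keep the example dependency-free.

```python
import random

def matmul(x, y):
    """Product of two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(x):
    return [list(col) for col in zip(*x)]

def frobenius_error(d, a, p):
    """Root of the summed squared differences between D and A x P."""
    approx = matmul(a, p)
    return sum(
        (d[i][j] - approx[i][j]) ** 2
        for i in range(len(d)) for j in range(len(d[0]))
    ) ** 0.5

def nmf(d, k, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix D (m x n) into A (m x k) and P (k x n)."""
    rng = random.Random(seed)
    m, n = len(d), len(d[0])
    # Strictly positive random start; the result depends on this initialization.
    a = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    p = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(n_iter):
        # P <- P * (A^T D) / (A^T A P), element-wise
        at = transpose(a)
        num, den = matmul(at, d), matmul(matmul(at, a), p)
        p = [[p[f][j] * num[f][j] / (den[f][j] + eps) for j in range(n)]
             for f in range(k)]
        # A <- A * (D P^T) / (A P P^T), element-wise
        pt = transpose(p)
        num, den = matmul(d, pt), matmul(a, matmul(p, pt))
        a = [[a[i][f] * num[i][f] / (den[i][f] + eps) for f in range(k)]
             for i in range(m)]
    return a, p

# Toy non-negative data: 4 genes (rows) x 3 samples (columns).
data = [
    [2.0, 0.0, 2.0],
    [1.0, 1.0, 2.0],
    [0.0, 3.0, 3.0],
    [0.0, 1.0, 1.0],
]
amplitude, pattern = nmf(data, k=2)
error = frobenius_error(data, amplitude, pattern)
```

Because every update multiplies a non-negative entry by a non-negative ratio, both matrices stay non-negative throughout. Changing `seed` changes the starting point and can change the solution.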

The solutions from both ICA and NMF may vary depending upon the initialization of the algorithm, leading to disparate amplitude and pattern matrices. Therefore, it is crucial to ensure that the particular solution used for analysis provides an optimal and robust solution before using the results of the factorization to interpret CBPs. We previously found that Bayesian techniques to solve NMF have more robust amplitude matrices than gradient-based techniques, and thus generate more accurate associations between the values in the amplitude matrix and functional pathways [5,36]. Other studies have found that gradient-based approaches have similar computational performance to their Bayesian counterparts [37], and new techniques are being developed to enhance the stability of the factorization [38]. Further integrated computational/experimental investigation is necessary to assess the biological relevance of solutions from both classes of techniques and their robustness. These associations also depend crucially on the input data. Therefore, to learn CBPs from data, MF must be applied to datasets with sufficient measurements of the experimental perturbations or conditions relevant to the specific biological problem being addressed.

Sample Application to Genotype-Tissue Expression (GTEx) Project Gene Expression Data of Postmortem Tissues

Applying multiple types of MF techniques to the same dataset or a single MF to distinct subsets of a dataset can also find distinct sources of variability. Selecting which MF method to use to learn relevant CBPs from a dataset is then crucial, and developing standardized metrics to choose between them is an active area of research in computational biology. To illustrate the differences between MF methods, we apply PCA, ICA (CRAN package fastICA [39]), and NMF (R/Bioconductor package CoGAPS [40,41]) to a single dataset from postmortem samples in the GTEx project [42] (Figure 3). Specifically, we select a subset of GTEx data containing 12 brain tissues in the GTEx data for seven individuals for which we had previously performed only NMF analysis [41] (codes are provided in the supplemental information online). We select this problem because the CBPs (tissue and individual) are readily separated and are known a priori, providing a known ground truth to facilitate comparison of methods.

Box 1. Common Terminology in the Literature

Historically, the independent discovery of MF in multiple fields including mathematics, computer science, and statistics created distinct terminologies that are often used interchangeably as analytical orthologs in genomics. For example, the term 'factorization' is often used interchangeably with 'decomposition'. Other terms, such as 'features', 'components', 'latent variable', or 'latent factors' are used to refer to relationships between either molecular measurements or samples, depending on the context.

The specific terminology for the amplitude and pattern matrices also varies according to the method, and are preferably labeled with different variable names. In PCA, the amplitude matrix is often called the score or rotation matrix, and pattern matrix called the loadings. In ICA, the amplitude matrix is called the unmixing matrix and the pattern matrix is called the source matrix. In NMF, the amplitude matrix is commonly called the weights matrix and the pattern matrix is called the features matrix.

Throughout this paper we use 'amplitude matrix' to refer to the matrix that contains vectors which represent relationships between molecular measurements. Other terms used in the literature for these molecular relationships include modules, meta-pathways, and signatures. Similarly, we use 'pattern matrix' to refer to the matrix that contains vectors which represent relationships between samples. Other literature terms for these sample-level relationships include patterns, metagenes, eigengenes, sources, and controlling factors.

[Figure 3 panels: (A) PCA colored by tissue type and PCA colored by GTEx subject (PC 1 versus PC 2), with a scree plot of PCs versus proportion of variance; (B) ICA cerebellum pattern and ICA identity pattern; (C) NMF cerebellum pattern and the NMF blue and yellow identity patterns. Panels (B) and (C) are plotted across the 12 tissues: Ammon's horn, amygdala, anterior cingulate, caudate nucleus, cerebellum, dorsolateral pfc, cervical spinal cord, frontal cortex, hypothalamus, nucleus accumbens, putamen, and substantia nigra.]

Figure 3. Comparison of Pattern Matrix From Matrix Factorization (MF) in Postmortem Tissue Samples from GTEx. (A) PCA finds factors in rows of the pattern matrix that can be ranked by the amount of variation that they explain in the data, as illustrated in a scree plot. PCA analyses typically plot the first two principal components (PCs; rows of the pattern matrix) to assess sample clustering. Points are colored by tissue type annotations from GTEx (left), where Ammon's horn refers to the hippocampus, and donor (right). In GTEx data, the cerebellum (light blue) and first cervical spinal cord (yellow) cluster separately from all other brain tissues, but no separation between individuals is observed. (B) ICA finds factors associated with independent sources of variation, and therefore cannot be ranked in a scree plot. The relative absolute value of the magnitude of each element in the pattern matrix indicates the extent to which that sample contributes to the corresponding source of variation. The sign of the values indicates over- or underexpression in that factor depending on the sign of the corresponding gene weights in the amplitude matrix. As a result, the values can be plotted on the y axis against known covariates on the x axis to directly interpret the relationship between samples. When applied to GTEx, we observe one pattern associated with cerebellum, another pattern that has large positive values for one donor and large negative values for another donor, and eight other patterns associated with other sources of variation (supplemental information online). (C) NMF finds factors that are both non-negative and not ranked by relative importance, similarly to ICA. The value of the pattern matrix indicates the extent to which each sample contributes to an inferred source of variation and is associated with overexpression of corresponding gene weights in the amplitude matrix. Values of the pattern matrix can be plotted similarly to ICA. When applied to GTEx, we observe one pattern associated with cerebellum, two more patterns associated with the two donors that were assigned to a single pattern in ICA, and seven other patterns associated with other sources of variation (supplemental information online). Abbreviations: GTEx, Genotype-Tissue Expression (GTEx) project; ICA, independent component analysis; NMF, non-negative matrix factorization; PCA, principal component analysis.

First we apply PCA to this sample GTEx dataset (Figure 3A). The components learned from PCA can be ranked by the amount of variation that they explain in the data, with the first two components explaining 89.6% of the variation in this dataset. The amount of variation explained by these components can be used to determine the optimal dimensionality for PCA [24]. A typical PCA analysis explores the association between these components and biological covariates in low-dimensional plots in which each axis is defined by the weights in one row of the pattern matrix (Figure 3A). Similarly to clustering analyses, these plots can be used to determine which biological features can be separated from the data. In this application, we observe that the cerebellum (light blue) and first cervical spinal cord (yellow) cluster separately from all other brain tissues (PC1 and PC2, or rows 1 and 2 of the pattern matrix, respectively). No separation between individuals is observed in this PCA analysis.
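The ranking of components by variation explained, as summarized in a scree plot, can be sketched on toy data as follows. The example uses power iteration with deflation on the sample-by-sample cross-product matrix, a simple stand-in chosen to avoid external dependencies; it is not the procedure used to produce Figure 3, and the data, function names, and iteration count are hypothetical.

```python
import random

def center_rows(d):
    """Subtract the mean of each gene (row) across samples."""
    centered = []
    for row in d:
        mean = sum(row) / len(row)
        centered.append([x - mean for x in row])
    return centered

def gram(x):
    """Sample-by-sample matrix X^T X of a centered genes x samples matrix."""
    cols = list(zip(*x))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def top_eigenpair(g, n_iter=500, seed=0):
    """Power iteration for the largest eigenvalue of a symmetric matrix."""
    rng = random.Random(seed)
    v = [rng.random() - 0.5 for _ in g]
    for _ in range(n_iter):
        w = [sum(gij * vj for gij, vj in zip(row, v)) for row in g]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            return 0.0, v
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged unit vector.
    value = sum(vi * sum(gij * vj for gij, vj in zip(row, v))
                for vi, row in zip(v, g))
    return value, v

def explained_variance(d, n_components):
    """Proportion of total variation captured by each leading component."""
    g = gram(center_rows(d))
    total = sum(g[i][i] for i in range(len(g)))  # trace = total variation
    ratios = []
    for _ in range(n_components):
        value, v = top_eigenpair(g)
        ratios.append(value / total)
        # Deflate: remove the component just found before the next search.
        g = [[g[i][j] - value * v[i] * v[j] for j in range(len(g))]
             for i in range(len(g))]
    return ratios

# Toy data: 4 genes (rows) x 6 samples (columns); samples 1-3 and 4-6 form two groups.
data = [
    [5.0, 5.0, 5.0, 1.0, 1.0, 1.0],
    [4.0, 5.0, 6.0, 1.0, 2.0, 0.0],
    [1.0, 1.0, 1.0, 5.0, 5.0, 5.0],
    [2.0, 3.0, 2.0, 3.0, 2.0, 3.0],
]
ratios = explained_variance(data, n_components=2)
```

In this toy matrix the first component captures most of the variation because the two sample groups differ strongly in three of the four genes. A sharp drop between successive proportions is the kind of signal a scree plot is used to detect.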

In contrast to PCA, the components of ICA and NMF cannot be ranked by percent variation explained. Instead, each row of the pattern matrix is an equally important CBP in the data. Therefore, these patterns are plotted independently relative to biological covariates, and not relative to one another as in PCA analyses. When applied to the sample GTEx dataset, rows of the pattern matrix from both ICA (Figure 3B) and NMF (Figure 3C) also distinguish the cerebellum, similarly to PCA. Whereas PC1 has large positive values for the cerebellum and large negative values for the other brain tissues, both NMF and ICA have large absolute value only for the tissue of interest and are near zero for other tissues. As a result, both ICA and NMF provide tissue-specific patterns and PCA provides tissue-segregating patterns. We note that the magnitude of all these patterns is unit-less, reflecting the relative weights of each sample in the pattern and not a measurable quantity. Because gene weights can be either positive or negative in ICA, the pattern weights can be as well. By contrast, by construction the NMF values are all non-negative. As a result, the pattern corresponding to the cerebellum samples from ICA is still specific to that tissue, with genes overexpressed in this tissue having negative weights in the corresponding column of the amplitude matrix, and genes underexpressed in that tissue having positive weights. Indeed, the gene weights in the column of the amplitude matrix corresponding to the NMF cerebellum pattern are significantly anti-correlated with the gene weights in the column of the amplitude matrix corresponding to the ICA cerebellum pattern (R = -0.72, P < 2 × 10^-16).
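The sign convention described above follows from the structure of the factorization: each factor contributes the outer product of one amplitude column and one pattern row, so a gene with a negative weight is overexpressed in samples whose pattern weight is also negative. A minimal sketch with made-up numbers (all values and the helper `outer` are hypothetical):

```python
def outer(column, row):
    """Contribution of one factor: amplitude column times pattern row."""
    return [[c * r for r in row] for c in column]

# Hypothetical gene weights (one amplitude column) for three genes.
gene_weights = [-2.0, -1.5, 0.5]
# Hypothetical sample weights (one pattern row); the first two samples
# have large negative values, the others are near zero.
sample_weights = [-1.0, -0.9, 0.0, 0.1]

contribution = outer(gene_weights, sample_weights)

# Flipping the sign of both vectors leaves the contribution unchanged,
# so the overall sign of an ICA factor carries no meaning on its own.
flipped = outer([-g for g in gene_weights], [-s for s in sample_weights])
```

Here the first two genes receive a positive contribution in the first two samples even though both their gene weights and the sample weights are negative. This is why a column of ICA gene weights can be anti-correlated with the matching NMF column while describing the same tissue.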

In contrast to PCA, both ICA and NMF also infer patterns that distinguish individuals with common weights across all tissues. In the NMF analysis, each individual has a separate pattern. The ICA analysis has a single pattern that has large positive values for one of the individuals distinguished in one NMF pattern, and large negative values for another individual distinguished in a different NMF pattern. This discrepancy highlights the difference between inferring independent sources of variation with ICA and NMF. Specifically, ICA may combine multiple CBPs using common genes whose sign changes by experimental condition, whereas NMF will find CBPs that are additive, corresponding only to overexpression of genes in that condition.

PCA, ICA, and NMF are equally valid, and their distinct formulation gives rise to the distinct features observed in the data. Applications of multiple types of MF techniques, or even the same MF algorithm with different parameters, may infer several CBPs or phenotypes within a single dataset, in essence providing answers to different questions. We further note that the specific techniques for PCA, ICA, and NMF selected for analysis in this example are only one of a multitude of variants of techniques which have been developed for computational biology. Applying multiple types of MF techniques to the same dataset, or a single MF to distinct subsets of a dataset, can also find distinct sources of variability. However, the general properties of these methods will remain and be consistent with our example. Briefly, we observe in this example that PCA finds sources of separation in the data, whereas both ICA and NMF find independent sources of variation. ICA can find both over- and underexpression of genes in a [...] may better model both repression and activation than NMF, but as a side effect may have greater mixture of CBPs than NMF.

Regardless of the technique selected, the results will also be sensitive to the input data. For example, a different NMF-class algorithm called 'grade of membership' (GOM) was also applied to a larger set of postmortem samples in GTEx. This algorithm found a pattern that combined all samples from brain regions when applied to all tissue samples in GTEx, but separated the distinct brain regions when applied only to tissue samples from the brain. Thus, applying multiple types of MF techniques to the same dataset, or a single MF to distinct subsets of a dataset, can also find distinct sources of variability that are essential for exploratory data analysis.

Further Example Applications To Represent Cell Types, Disease Subtypes, Population Stratification, Tissue Composition, and Tumor Clonality with the Pattern Matrix

Exactly as we observed in the GTEx example, a single factorization of complex datasets can find multiple distinct sources of variation. For example, the power of MF to identify multiple sources of variation was seen when multiple technical factors from sample processing and biological factors were discovered in an ICA of gene expression profiles of 198 bladder cancer samples [43]. One factor in the pattern matrix of this analysis defined a CBP associated with gender. Because ICA simultaneously accounts for multiple factors in the data as separate rows in the same matrix, each row can fully distinguish a single biological grouping from the data.

Analysis of a single dataset with one MF algorithm using different numbers of factors can reflect a hierarchy of biological processes. For example, applying CoGAPS to data from a set of head and neck tumors and normal controls for a range of dimensionalities was able to separate tumor and normal samples when limited to two patterns, but further decomposed the tumor samples into the two dominant clinical subtypes of head and neck cancer when identifying five patterns for the same data [44]. The hierarchical relationship between patterns has been used to assess the robustness of patterns to quantify the optimality of the factorization [45] and learn the optimal dimensionality of the factorization [46]. Other algorithms use statistical metrics to estimate the number of factors [12,47]. While these algorithms quantify fit to the data, they may disregard the hierarchical nature of distinct CBPs learned by factoring biological data into multiple dimensions. This observation highlights the complexity of estimating the number of factors for optimal MF analysis of biological data (see Outstanding Questions).

Moreover, application of different MF algorithms to the same dataset can give different sample groupings that reflect biology. For example, in population genetics a grouping inferred from GWAS which distinguishes ancestry is as valid as a grouping inferred from the same GWAS data which distinguishes disease risk. The application of PCA to SNP data from 3000 European individuals [48] demonstrates inference of sample corelationships using the pattern matrix, and found that much of the variation in DNA sequence is explained by the longitude and latitude of the country of origin of an individual. In addition, statistical models can be formulated assuming that the inheritance of an individual arises from proportions of ancestry in distinct populations through genetic admixture [48]. An MF-based technique called sparse factor analysis also distinguishes between these populations using GWAS data [49]. These analyses demonstrate that the ancestry of each individual is a dominant source of variation in DNA sequence. At the same time, sources of variation in GWAS data arise from variants that give rise to disease risk, which can be shared among individuals with diverse genetic backgrounds [50]. In the same way as we observed in our GTEx example, the application of multiple MF techniques is essential to [...]

Mixtures of cell types in biological samples introduce a further degree of complexity to MF analysis of biological variation in their molecular data. Computational microdissection algorithms estimate the proportion of distinct cell types within a bulk sample by applying MF to genes whose expression is uniquely associated with each cell type [51]. Subsetting the data to different genes may give rise to different factors that represent different CBPs. Nonetheless, CoGAPS NMF analysis of data subsets that were obtained by selecting equally sized sets of random genes found that the pattern matrices were consistent for each random gene set in the expression data [41,52]. These results suggest that the dependency of an MF on the specific genes used for analysis may depend on the heterogeneity of the signal in the data matrix.
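
The microdissection idea described above can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the algorithm of any cited tool: it assumes that a marker-gene signature matrix is already known (here it is simulated) and estimates cell-type proportions in each bulk sample with non-negative multiplicative updates, that is, an NMF in which the amplitude matrix is held fixed. The names `signature` and `estimate_proportions`, and all simulated values, are hypothetical.

```python
import numpy as np

def estimate_proportions(bulk, signature, n_iter=1000, eps=1e-12):
    """Estimate cell-type proportions with the signature matrix held fixed.

    bulk:      genes x samples matrix of non-negative expression values
    signature: genes x cell types matrix of marker-gene expression (assumed known)
    Returns a cell types x samples matrix whose columns sum to 1.
    """
    n_types, n_samples = signature.shape[1], bulk.shape[1]
    props = np.full((n_types, n_samples), 1.0 / n_types)  # uniform start
    numer = signature.T @ bulk           # constant across iterations
    gram = signature.T @ signature
    for _ in range(n_iter):
        # Lee-Seung style multiplicative update; entries stay non-negative
        props *= numer / (gram @ props + eps)
    return props / props.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
n_types, markers_per_type, n_samples = 3, 10, 8

# Hypothetical signature: each cell type has 10 marker genes expressed at 10,
# with a low background of 1 in the other cell types.
signature = np.ones((n_types * markers_per_type, n_types))
for k in range(n_types):
    signature[k * markers_per_type:(k + 1) * markers_per_type, k] = 10.0

# Simulated bulk samples: known mixing proportions plus small noise
true_props = rng.dirichlet(np.full(n_types, 5.0), size=n_samples).T
noise = rng.normal(0.0, 0.1, size=(signature.shape[0], n_samples))
bulk = np.clip(signature @ true_props + noise, 0.0, None)

est_props = estimate_proportions(bulk, signature)
max_abs_error = np.abs(est_props - true_props).max()
```

In this simulation the markers are nearly exclusive to one cell type, so the estimate is well conditioned. With real data the choice of marker genes matters, which is the dependency on gene subsets discussed above.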

Cellular and molecular heterogeneity poses a particular challenge to MF analysis in cancer genomics. Even a pure tumor tissue can contain numerous subclones owing to the accumulation of different driver events during tumor evolution. New MF techniques have been developed to estimate the proportion of the tumor that arises from each subclone [47,53–56]. Assumptions about the evolutionary mechanisms of the accumulation of molecular alterations can also be encoded in the factorization to model the resulting heterogeneity of these clones [12,47]. These studies demonstrate that encoding prior knowledge into MF can focus the resulting factors to reflect one of the equally valid biological groupings within the data.

From Snapshots to Moving Pictures: Simplifying Timecourse Analysis

Entwined in the challenge of decomposing cell types and subpopulations is the fact that CBPs change over time. High-throughput timecourse datasets are emerging in the literature to account for the dynamics of biological systems. The central goal of timecourse analysis is to determine the extent to which molecules change over time in response to perturbations (e.g., developmental time, environmental factors, disease processes, or therapeutic treatments). Associating molecular alterations often relies on specialized bioinformatics techniques for timecourse analysis [57,58]. MF analyses can naturally infer changes in CBPs over time when applied to timecourse data because the continuous weights for each sample in the pattern matrix can vary among samples collected across distinct timepoints. The relative weights of rows of the pattern matrix can encode the timing of regulatory dynamics directly from the data (Figure 4). Nonetheless, most MF algorithms for timecourse analysis do not encode the known timepoints or retain their relative ordering. Methods that specifically use these temporal data are currently an active area of research.
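
Because the factorization itself does not use the timepoints, the temporal view is typically recovered after the fact by ordering and summarizing the columns of the pattern matrix. The sketch below shows one simple way to do this, assuming a small hypothetical pattern matrix whose columns are in arbitrary sample order and a vector of known collection days. The function name `pattern_timecourse` and all values are illustrative assumptions.

```python
import numpy as np

# Hypothetical pattern matrix (patterns x samples) as an MF algorithm might
# return it: the columns are in arbitrary sample order, not in time order.
days = np.array([3, 1, 2, 1, 3, 2])           # known collection day per sample
pattern_matrix = np.array([
    [0.9, 0.1, 0.5, 0.2, 1.0, 0.4],           # P1: increases over time
    [0.2, 1.0, 0.6, 0.9, 0.1, 0.5],           # P2: decreases over time
])

def pattern_timecourse(pattern_matrix, days):
    """Average each pattern over replicate samples from the same timepoint."""
    timepoints = np.unique(days)               # sorted unique days
    means = np.column_stack([
        pattern_matrix[:, days == d].mean(axis=1) for d in timepoints
    ])
    # Scale each row to its own maximum, as in a 0..max(P) display
    scaled = means / means.max(axis=1, keepdims=True)
    return timepoints, scaled

timepoints, scaled = pattern_timecourse(pattern_matrix, days)
peak_day = timepoints[scaled.argmax(axis=1)]   # day at which each pattern peaks
```

Each row of `scaled` can then be plotted against `timepoints`. Averaging replicates is only one possible summary; it discards the within-timepoint variability that a dedicated temporal model would retain.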

[Figure 4 graphic: pattern matrix (NMF, timecourse) for patterns P1–P3 across samples, with the magnitude of the rows of the 'pattern' matrix scaled from 0 to max(P) and plotted against time (days 1–6); panel labels: complex biological process (CBP) captured in sample features; patterns in groups of samples.]

Figure 4. Samples Correspond to Timepoints; the Rows of the Pattern Matrix Can Be Plotted as a Function of Time and Sample Condition to Infer the Dynamics of Complex Biological Processes (CBPs). Abbreviations: d1–d6, days 1–6; max(P), maximum value of each row of the pattern matrix; NMF, non-negative matrix factorization; P1–3, patterns 1–3.


Both ICA and NMF were found to have signatures characterizing the yeast cell cycle and metabolism in early timecourse microarray experiments [59,60]. The sparse NMF techniques using Bayesian methods had patterns that reflected the smooth dynamics of these phases [36,59]. This approach has been shown to simultaneously learn pathway inhibition and transitory responses to chemical perturbation of cancer cells [61] and relate the changes in phospho-proteomic trajectories between multiple therapies [62]. Similar analysis of healthy brain tissues learned the dynamics of transcriptional alterations that are common to the aging process in multiple individuals [52]. MF techniques designed for cancer subclones described in the previous section have also been applied to repeat samples to learn the dynamics of cancer development, thereby elucidating the molecular mechanisms that give rise to therapeutic resistance and metastasis. Even if there are the same number of biological features, the rate or timing of related features in different molecular modalities may be offset [63]. These discrepancies by data modality suggest that different regulatory mechanisms may be responsible for initiating and stabilizing the malignant phenotype [63].

Data-Driven Gene Sets from MF Provide Context-Dependent Coregulated Gene Modules and Pathway Annotations

Genomic data are often interpreted by identifying molecular changes in sets of genes annotated to functionally related modules or pathways, called gene sets [64,65]. Often the associations between gene sets and functions are based upon manual curation of the literature [66,67]. Such set-level interpretations often lack important contextual information [64,68,69], and cannot describe genes of unknown function or genes associated with new functional mechanisms.

The amplitude matrix from MF analysis can be used both for literature-based gene-set analysis and to define new data-driven gene signatures (Figure 5). Standard gene-set analysis can be applied directly to the values in each column of the amplitude matrix to associate the inferred factors with literature-curated sets. New, context-dependent gene sets can also be learned from the values in the amplitude matrix. Gene-set annotations are often binary. Thresholding techniques to select which genes belong to a pathway from the amplitude matrix for binary membership provide an output similar to gene sets in databases [70,71]. Other studies also integrate the literature-derived gene signatures in these thresholds to refine the context of pathway databases [36,72]. The genes derived from these binarizations can be used as inputs to pathway analyses from differential expression statistics in independent datasets (Figure 5, right), and are analogous to the hierarchical clustering-based gene modules [73] and gene expression signatures from public domain studies in the MSigDB gene-set database [74]. Another means of binarizing the data is to find genes that are most uniquely associated with a specific pattern to use as biomarkers of the cell type or process associated with that pattern [41,75]. Selecting genes based upon these statistics can facilitate visualization of the CBPs in high-dimensional data [41]. Whereas binarization of genes with high weights can associate a single gene with multiple CBPs, the statistics for unique associations link a gene with only one CBP. Therefore, these statistics also define specific genes that may be biomarkers of the cell type/state or a process [41] (Figure 5, right).
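
The contrast between the two binarization strategies can be made concrete with a toy amplitude matrix. The sketch below is a simplified illustration, loosely inspired by the idea of unique association described above; it is not the implementation of any cited package. The function names, the cutoff values, and the distance-to-ideal-profile score are all assumptions chosen for clarity.

```python
import numpy as np

genes = ["g0", "g1", "g2", "g3", "g4"]
# Hypothetical amplitude matrix (genes x patterns)
amplitude = np.array([
    [9.0, 0.1, 0.2],   # g0: specific to pattern 0
    [0.2, 8.0, 0.1],   # g1: specific to pattern 1
    [0.1, 0.3, 7.0],   # g2: specific to pattern 2
    [6.0, 5.0, 0.2],   # g3: high in patterns 0 and 1
    [0.5, 0.4, 0.6],   # g4: low everywhere
])

def threshold_sets(amplitude, genes, cutoff):
    """Binary gene sets: every gene whose weight in a pattern is >= cutoff."""
    return {k: [g for g, w in zip(genes, amplitude[:, k]) if w >= cutoff]
            for k in range(amplitude.shape[1])}

def unique_markers(amplitude, genes, max_dist):
    """Assign each gene to at most one pattern.

    Each row is scaled to a maximum of 1 and compared with the ideal
    'one pattern only' profiles (rows of the identity matrix). A gene is
    kept as a marker of its nearest pattern if the distance is <= max_dist.
    """
    n_patterns = amplitude.shape[1]
    scaled = amplitude / amplitude.max(axis=1, keepdims=True)
    ideal = np.eye(n_patterns)
    dist = np.linalg.norm(scaled[:, None, :] - ideal[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    markers = {k: [] for k in range(n_patterns)}
    for i, k in enumerate(nearest):
        if dist[i, k] <= max_dist:
            markers[int(k)].append(genes[i])
    return markers

high_weight = threshold_sets(amplitude, genes, cutoff=3.0)
markers = unique_markers(amplitude, genes, max_dist=0.5)
```

Here `g3` appears in two of the thresholded sets but in none of the marker sets, because its weight is shared between two patterns. Note that the marker score is scale-free, so in practice a low-weight gene with a one-pattern profile would also pass unless a minimum-weight filter is added.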

Although binary pathway models are substantially easier to interpret, continuous values from the original factorization provide a better model of the input data. Weighted gene signatures have been shown to be more robust to noise and missing values in the data [76]. If the expression level of a gene is poorly measured in a sample, other genes in the same factor can imply the actual expression level of the gene in question. By considering each gene in the


continuous signatures can be associated directly with other samples using projection methods [76,77] or profile correspondence methods [78].
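
One simple form of projection is an ordinary least-squares fit of new samples onto a previously learned amplitude matrix. The sketch below illustrates that idea on simulated data; it is an assumption-laden example rather than the method of any cited reference, and the names `project`, `amplitude`, and `new_data` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_patterns, n_new_samples = 50, 3, 6

# Hypothetical amplitude matrix learned from a first dataset (genes x patterns)
amplitude = rng.gamma(shape=2.0, scale=1.0, size=(n_genes, n_patterns))

# Simulated new dataset over the same genes, generated from known pattern weights
true_weights = rng.uniform(0.0, 1.0, size=(n_patterns, n_new_samples))
new_data = amplitude @ true_weights

def project(amplitude, data):
    """Least-squares estimate of pattern weights for new samples.

    Solves min ||data - amplitude @ weights||_F for weights
    (patterns x samples); the genes must be in the same order in both inputs.
    """
    weights, *_ = np.linalg.lstsq(amplitude, data, rcond=None)
    return weights

projected = project(amplitude, new_data)
```

Because the simulated data are noise-free, the known weights are recovered essentially exactly. With noisy data an unconstrained fit can return negative weights, so a non-negative solver may be preferable when the signature comes from NMF.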

MF Enables Unbiased Exploration of Single-Cell Data for Phenotypes and Molecular Processes

MF approaches are a natural choice in single-cell RNA-seq (scRNA-seq) data analysis owing to the high dimensionality of the data, and are used to identify and remove batch effects, summarize CBPs, and annotate cell types in the data [79–82]. Whereas MF analysis of bulk data dissects groups from a small subset of samples, the analysis in scRNA-seq data aggregates cells into groups of common cell types or CBPs [75,83]. Often these analyses are performed on a subset of the data containing the most variable genes. Newer computationally efficient methods are being developed to enable factorization of large omics datasets for genome-wide analysis [84]. Biological knowledge can be encoded with a class of MF algorithms that summarize factors using gene sets [79,85,86].

[Figure 5 graphic: projection into new data (left; molecules of interest, i.e., genes, by new samples); amplitude matrix (center; features by molecules of interest); thresholded amplitude matrix with a priori functionally annotated sets (right). Panel notes: enrichment analysis can be done directly or via thresholding; unique 'pattern markers' can be used for biological validation (WT versus KO, P < 0.001).]

Figure 5. The Amplitude Matrix from Matrix Factorization (MF) Can Be Used to Derive Data-Driven Molecular Signatures Associated with a Complex Biological Process (CBP). The columns of the amplitude matrix contain continuous weights describing the relative contribution of a molecule to a CBP (center panel; indicated by the orange, purple, and green boxes). The resulting molecular signature can be analyzed in a new dataset to determine the samples in which each previously detected CBP occurs, and thereby assess function in a new experiment. This comparison may be done by comparing the continuous weights in each column of the amplitude matrix directly to the new dataset (left). The amplitude matrix may also be used in traditional gene-set analysis (right). Traditional gene-set analysis using literature-curated gene sets can be performed on the values in each column of the amplitude matrix to identify whether a CBP is occurring in the input data. Data-driven gene sets can also be defined from this matrix directly using binarization, and used in place of literature-curated gene sets to query CBPs in a new dataset. Sets defined from molecules with high weights in the amplitude matrix comprise signatures akin to many curated gene-set resources, whereas molecules that are most uniquely associated with a specific factor (purple box) may be biomarkers. Abbreviations: KO, knockout; WT, wild type.


Most MF techniques developed for bulk omics data assume that the gene expression changes from CBPs are additive. This assumption is violated in scRNA-seq data. One reason for the violation of the additive assumption in MF is the inability to distinguish true zeros from missing values. Imputation methods for preprocessing [81,87] or newer MF algorithms that model missing data are essential for scRNA-seq data. Branching of trajectories of cellular states and lack of cell-cycle synchronization in scRNA-seq data further violate the additive assumption in MF. New nonlinear factorization techniques are being developed to enhance visualization of trajectory structures in single-cell data [80,88–90] in these cases. However, the results from these methods cannot be interpreted in the same way as those from MF algorithms. In particular, the low-dimensional solutions from these methods cannot necessarily be used to reconstruct the original high-dimensional data.

Concluding Remarks

MF encompasses a versatile class of techniques with broad applications to unsupervised clustering, biological pattern discovery, component identification, and prediction. Since MF was first applied to microarray data analysis in the early 2000s [59,91–93], the breadth of MF problems and algorithms for high-throughput biology has grown with their broad applications. MF problems are ubiquitous in the computational sciences, with examples including unsupervised feature learning [94–99], clustering and metric learning [100–102], latent Dirichlet allocation [103], subspace learning [104–109], multiview learning [110], matrix completion [111], multitask learning [112], semi-supervised learning [113], compressed sensing [114], and similarity-based learning [115,116]. Dimension reduction of biological data with MF highlights perspectives and questions that investigators have not yet considered, and also enables tractable exploration of otherwise massive datasets. As the size of these datasets grows, it is crucial to develop new algorithms to solve MF problems that scale with the ever-increasing size of omics data [37,41,117,118]. MF algorithms can also be extended for simultaneous analysis of data from multiple data modalities, enabling genomic data integration [7,8,119]. Techniques that extend this integrated MF framework, including Bayesian group factor analysis [8] and tensor decomposition [120–123], can also analyze datasets across different molecular levels [124]. Developing such data-integration techniques is an active area of research in both genomics and computational sciences.

Different classes of techniques solve MF, including gradient-based and probabilistic methods (supplemental information online). Distinct MF problems each aim to identify specific types of features. In some cases different algorithms will learn distinct features from the same dataset. Therefore, investigators may benefit from applying multiple techniques with different properties, or from carefully considering both the dataset and the question in selecting exactly the right technique for that question. Most such comparisons in the literature have been made by investigators who are developing MF methods. Unbiased assessments of the relative performance of different MF algorithms for different exploratory data analysis problems are essential to determine the relative strengths and weaknesses of each method for distinct biological problems. MF algorithms can be further tailored to the biological problem of interest using methods that also encode prior biological knowledge of the system underlying the measured dataset [125–127].

The features MF techniques extract are constrained by the dataset used to train them. These algorithms cannot learn unmeasured features, nor can they correct for complete overlap between technical artifacts and biological conditions. Thus, being mindful of experimental design when selecting datasets and choosing those that are broad enough to cover the relevant sources of variability is essential. Advances in MF and related techniques will be essential for powering systems-level analyses from big data (see Outstanding Questions).

Outstanding Questions

How can the optimal number of factors be quantified? Increasing the number of factors in MF improves its approximation, but may cause overfitting. Computational metrics that balance these properties can be used to estimate the number of factors. Biologically driven metrics are also essential to assess whether this number represents the number of CBPs.

How can different factorizations be compared? Different MF algorithms identify different properties and CBPs from a single dataset. New techniques will be necessary to compare and merge the disparate but equally valid factorizations.

What techniques are best for efficient and optimal factorization? Large omics datasets are common, particularly with single-cell technologies. Fast computational approaches to MF are required for the analysis of these data. These algorithms also require computational criteria to ensure that the solutions are robust.

How can fully nonlinear factorizations be performed on omics-scale data? Nonlinear extensions to MF have notable applications to single-cell data. Researchers often use low-dimensional representations of omics data learned from linear MF as inputs to nonlinear MF to account for both computational efficiency and the underlying mathematical assumptions of these nonlinear methods. Breakthroughs in theories for techniques such as kernel MF, deep MF, and manifold learning will enable fully nonlinear factorizations for large-scale single-cell data.

How can both gene regulatory relationships and distinct sources of technical variation be encoded in integrative analysis of data from different measurement technologies? Different measurement technologies yield data that follow unique distributions. CBPs and their timing also vary between types of measures. Integrative techniques must account for both the technical and biological sources of variation between disparate data modalities, including sparse or missing data. Such techniques are crucial to learning the regulatory relationships that drive CBPs and to modeling the multiscale nature of biological systems from omics data.
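
The first question, on choosing the number of factors, can be illustrated with a simple computational metric. The sketch below uses a truncated SVD rather than NMF, purely because its approximation error has a closed form; the data are simulated with four underlying factors, and the elbow-style reading of the error curve is one heuristic among many, not a recommended procedure. All names and parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_genes, n_samples, true_rank = 100, 30, 4

# Simulated data with 4 underlying factors plus measurement noise
amplitude = rng.gamma(2.0, 1.0, size=(n_genes, true_rank))
pattern = rng.gamma(2.0, 1.0, size=(true_rank, n_samples))
data = amplitude @ pattern + rng.normal(0.0, 0.05, size=(n_genes, n_samples))

def truncated_svd_error(data, max_factors):
    """Relative Frobenius error of the best rank-k approximation, k = 1..max."""
    singular_values = np.linalg.svd(data, compute_uv=False)
    total = np.sum(singular_values ** 2)
    # By the Eckart-Young theorem, the squared error of the best rank-k
    # approximation is the sum of the discarded squared singular values.
    tail = total - np.cumsum(singular_values ** 2)
    tail = np.clip(tail, 0.0, None)   # guard against rounding below zero
    return np.sqrt(tail[:max_factors] / total)

errors = truncated_svd_error(data, max_factors=8)
gains = -np.diff(errors)   # improvement from adding one more factor
```

The error keeps decreasing as factors are added, but in this simulation the improvement drops sharply after the fourth factor, after which additional factors mostly fit noise. Real omics data rarely show such a clean break, which is why biologically driven metrics remain necessary.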


Acknowledgments

We thank Orly Alter, David Berman, J. Brian Byrd, Michael Love, Irene Gallego Romero, Lillian Fritz-Laylin, Luciane Kagohara, Louise Klein, Craig Mak, Matthew Stephens, Daniela Witten, and other members of ‘New PI Slack’ for their insightful feedback. This work was supported by the National Institutes of Health (NIH) National Cancer Institute (NCI) and the National Library of Medicine (NLM) (grants NCI 2P30CA006516-52 and 2P50CA101942-11 to A.C.C., NCI R01CA177669 to E.J.F., NCI U01CA212007 to E.J.F., NLM R01LM011000 to M.F.O., and NCI P30CA006973), Johns Hopkins University Catalyst and Discovery Awards to E.J.F., a Johns Hopkins University Institute for Data Intensive Engineering and Science (IDIES) Award to E.J.F. and R.A., a Johns Hopkins School of Medicine Synergy award to E.J.F. and L.A.G., a grant from The Gordon and Betty Moore Foundation (GBMF4552) to C.S.G., Alex’s Lemonade Stand Foundation’s Childhood Cancer Data Lab (C.S.G.), award K01ES025434 from the National Institute of Environmental Health Sciences (NIEHS) through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (L.X.G.), award P20 COBRE GM103457 from the NIH/National Institute of General Medical Sciences (NIGMS; to L.X.G.), R01LM012373 from the NLM (L.X.G.), R01HD084633 from the National Institute of Child Health and Human Development (NICHD; to L.X.G.), the Department of Defense Breast Cancer Research Program (BCRP; award BC140682P1, A.C.C.), the Natural Sciences and Engineering Research Council of Canada (NSERC; DG grant number RGPIN-2016-05017, A.N.), the Windsor-Essex County Cancer Centre Foundation (Seeds4Hope grant 814221, A.N.), Hopkins inHealth and Booz Allen Hamilton (90056858) to Y.X., the Russian Foundation for Basic Research (KOMFI 17-00-00208) and NIH (NCI P30CA006973) to A.V.F., and the National Research Council of Canada to Y.L. This project has been made possible in part by grants 2018-183444 (E.J.F.), 2018-128827 (L.A.G.), and 2018-182718 (C.S.G.) from the Chan Zuckerberg Initiative Donor-Advised Fund (DAF), an advised fund of the Silicon Valley Community Foundation. The views and opinions of, and endorsements by, the author(s) do not reflect those of the US Army or the Department of Defense.

Supplemental Information

Supplemental information associated with this article can be found, in the online version, at https://doi.org/10.1016/j.tig.2018.07.003.

References

1. Bell, G. et al. (2009) Beyond the data deluge. Science 323, 1297–1298

2. Sagoff, M. (2012) Data deluge and the human microbiome project. Issues Sci. Technol. 28, http://issues.org/28-4/sagoff-3/

3. Alter, O. (2006) Discovery of principles of nature from mathematical modeling of DNA microarray data. Proc. Natl. Acad. Sci. U. S. A. 103, 16063–16064

4. Heyn, P. et al. (2015) Introns and gene expression: cellular constraints, transcriptional regulation, and evolutionary consequences. Bioessays 37, 148–154

5. Ochs, M.F. and Fertig, E.J. (2012) Matrix factorization for transcriptional regulatory network inference. IEEE Symp. Comput. Intell. Bioinforma. Comput. Biol. Proc. 2012, 387–396

6. Abdi, H. et al. (2013) Multiple factor analysis: principal component analysis for multitable and multiblock data sets. WIREs Comp. Stat. 5, 149–179

7. Meng, C. et al. (2016) Dimension reduction techniques for the integrative analysis of multi-omics data. Brief. Bioinform. 17, 628–641

8. Li, Y. et al. (2016) A review on machine learning principles for multi-view biological data integration. Brief. Bioinform. 19, 325–340

9. Devarajan, K. (2008) Nonnegative matrix factorization: an analytical and interpretive tool in computational biology. PLoS Comput. Biol. 4, e1000029

10. Ritchie, M.E. et al. (2015) limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47

11. Anders, S. and Huber, W. (2010) Differential expression analysis for sequence count data. Genome Biol. 11, R106

12. Xie, F. et al. (2017) BayCount: a Bayesian decomposition method for inferring tumor heterogeneity using RNA-Seq counts. bioRxiv Published online November 13, 2017. http://dx.doi.org/10.1101/218511

13. Alexandrov, L.B. et al. (2013) Signatures of mutational processes in human cancer. Nature 500, 415–421

14. Alexandrov, L.B. et al. (2016) Mutational signatures associated with tobacco smoking in human cancer. Science 354, 618–622

15. Favorov, A.V. et al. (2005) A Markov chain Monte Carlo technique for identification of combinations of allelic variants underlying complex diseases in humans. Genetics 171, 2113–2121

16. Zakeri, M. et al. (2017) Improved data-driven likelihood factorizations for transcript abundance estimation. Bioinformatics 33, i142–i151

17. Bertagnolli, N.M. et al. (2013) SVD identifies transcript length distribution functions from DNA microarray data and reveals evolutionary forces globally affecting GBM metabolism. PLoS One 8, e78913

18. Peckner, R. et al. (2017) Specter: linear deconvolution as a new paradigm for targeted analysis of data-independent acquisition mass spectrometry proteomics. bioRxiv Published online September 8, 2017. http://dx.doi.org/10.1101/152744

19. Yeung, K.Y. et al. (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics 17, 977–987

20. Jiang, D. et al. (2004) Cluster analysis for gene expression data: a survey. IEEE Trans. Knowl. Data Eng. 16, 1370–1386

21. Venet, D. et al. (2001) Separation of samples into their constituents using gene expression data. Bioinformatics 17, S279–S287

22. Abbas, A.R. et al. (2009) Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus. PLoS One 4, e6098

23. Erkkilä, T. et al. (2010) Probabilistic analysis of gene expression measurements from heterogeneous tissues. Bioinformatics 26, 2571–2577


24. Leek, J.T. (2011) Asymptotic conditional singular value decomposition for high-dimensional genomic data. Biometrics 67, 344–352

25. Kelton, C.J. et al. (2015) The estimation of dimensionality in gene expression data using nonnegative matrix factorization. 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 1642–1649

26. Meng, C. et al. (2016) Dimension reduction techniques for the integrative analysis of multi-omics data. Brief. Bioinform. 17, 628–641

27. Berry, M.W. et al. (2007) Algorithms and applications for approximate nonnegative matrix factorization. Comput. Stat. Data Anal. 52, 155–173

28. Wang, Y.-X. and Zhang, Y.-J. (2013) Nonnegative matrix factorization: a comprehensive review. IEEE Trans. Knowl. Data Eng. 25, 1336–1353

29. Zhou, G. et al. (2014) Nonnegative matrix and tensor factorizations: an algorithmic perspective. IEEE Signal Process. Mag. 31, 54–65

30. Lee, S.-I. and Batzoglou, S. (2003) Application of independent component analysis to microarrays. Genome Biol. 4, R76

31. Engreitz, J.M. et al. (2010) Independent component analysis: mining microarray data for fundamental human gene expression modules. J. Biomed. Inform. 43, 932–944

32. Teschendorff, A.E. et al. (2007) Elucidating the altered transcriptional programs in breast cancer using independent component analysis. PLoS Comput. Biol. 3, e161

33. Ochs, M.F. et al. (1999) A new method for spectral decomposition using a bilinear Bayesian approach. J. Magn. Reson. 137, 161–176

34. Lee, D.D. and Seung, H.S. (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791

35. Moloshok, T.D. et al. (2002) Application of Bayesian decomposition for analysing microarray data. Bioinformatics 18, 566–575

36. Kossenkov, A.V. et al. (2007) Determining transcription factor activity from microarray data using Bayesian Markov chain Monte Carlo sampling. Stud. Health Technol. Inform. 129, 1250–1254

37. Mairal, J. et al. (2010) Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res. 11, 19–60

38. Wu, S. et al. (2016) Stability-driven nonnegative matrix factorization to interpret spatial gene expression and build local gene networks. Proc. Natl. Acad. Sci. U. S. A. 113, 4290–4295

39. Hyvärinen, A. and Oja, E. (2000) Independent component analysis: algorithms and applications. Neural Netw. 13, 411–430

40. Fertig, E.J. et al. (2010) CoGAPS: an R/C++ package to identify patterns and biological process activity in transcriptomic data. Bioinformatics 26, 2792–2793

41. Stein-O’Brien, G.L. et al. (2017) PatternMarkers & GWCoGAPS for novel data-driven biomarkers via whole transcriptome NMF. Bioinformatics 33, 1892–1894

42. Dey, K.K. et al. (2017) Visualizing the structure of RNA-seq expression data using grade of membership models. PLoS Genet. 13, e1006599

43. Biton, A. et al. (2014) Independent component analysis uncovers the landscape of the bladder tumor transcriptome and reveals insights into luminal and basal subtypes. Cell Rep. 9, 1235–1245

44. Fertig, E.J. et al. (2013) Preferential activation of the hedgehog pathway by epigenetic modulations in HPV negative HNSCC identified with meta-pathway analysis. PLoS One 8, e78127

45. Bidaut, G. et al. (2010) Interpreting and comparing clustering experiments through graph visualization and ontology statistical enrichment with the ClutrFree Package. In Biomedical Informatics for Cancer Research (Ochs, M.F., ed.), pp. 315–333, Springer

46. Bidaut, G. et al. (2006) Determination of strongly overlapping signaling activity from microarray data. BMC Bioinformatics 7, 99

47. Xu, Y. et al. (2015) MAD Bayes for tumor heterogeneity – feature allocation with exponential family sampling. J. Am. Stat. Assoc. 110, 503–514

48. Novembre, J. et al. (2008) Genes mirror geography within Europe. Nature 456, 98–101

49. Engelhardt, B.E. and Stephens, M. (2010) Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis. PLoS Genet. 6, e1001117

50. McCarthy, M.I. et al. (2008) Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat. Rev. Genet. 9, 356–369

51. Hackl, H. et al. (2016) Computational genomics tools for dissecting tumour-immune cell interactions. Nat. Rev. Genet. 17, 441–458

52. Fertig, E.J. et al. (2014) Pattern identification in time-course gene expression data with the CoGAPS matrix factorization. Methods Mol. Biol. 1101, 87–112

53. Nik-Zainal, S. et al. (2012) The life history of 21 breast cancers. Cell 149, 994–1007

54. Roth, A. et al. (2014) PyClone: statistical inference of clonal population structure in cancer. Nat. Methods 11, 396–398

55. Deshwar, A.G. et al. (2015) PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol. 16, 35

56. Lee, J. et al. (2016) Bayesian inference for intratumour heterogeneity in mutations and copy number variation. J. R. Stat. Soc. Ser. C Appl. Stat. 65, 547–563

57. Bar-Joseph, Z. et al. (2012) Studying and modelling dynamic biological processes using time-series gene expression data. Nat. Rev. Genet. 13, 552–564

58. Liang, Y. and Kelemen, A. (2017) Dynamic modeling and network approaches for omics time course data: overview of computational approaches and applications. Brief. Bioinform. Published online April 18, 2017. http://dx.doi.org/10.1093/bib/bbx036

59. Moloshok, T.D. et al. (2002) Application of Bayesian decomposition for analysing microarray data. Bioinformatics 18, 566–575

60. Liebermeister, W. (2002) Linear modes of gene expression determined by independent component analysis. Bioinformatics 18, 51–60

61. Ochs, M.F. et al. (2009) Detection of treatment-induced changes in signaling pathways in gastrointestinal stromal tumors using transcriptomic data. Cancer Res. 69, 9125–9132

62. Hill, S.M. et al. (2016) Inferring causal molecular networks: empirical assessment through a community-based effort. Nat. Methods 13, 310–318

63. Stein-O’Brien, G. et al. (2017) Integrated time-course omics analysis distinguishes immediate therapeutic response from acquired resistance. bioRxiv Published online August 1, 2017. http://dx.doi.org/10.1101/136564

64. Khatri, P. et al. (2012) Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput. Biol. 8, e1002375

65. Irizarry, R.A. et al. (2009) Gene set enrichment analysis made simple. Stat. Methods Med. Res. 18, 565–575

66. Bauer-Mehren, A. et al. (2009) Pathway databases and tools for their exploitation: benefits, current limitations and challenges. Mol. Syst. Biol. 5, 290

67. Tsui, I.F.L. et al. (2007) Public databases and software for the pathway analysis of cancer genomes. Cancer Inform. 3, 379–397

68. The GTEx Consortium (2015) The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648–660

69. Tan, J. et al. (2017) Unsupervised extraction of stable expression signatures from public compendia with an ensemble of neural networks. Cell Syst. 5, 63–71


70. Kim, J.W. etal.(2017) Decomposing oncogenic transcriptional signatures to generate maps of divergent cellular states. Cell Syst.5, 105–18.e9

71. Fertig, E.J. etal.(2012) Gene expression signatures modulated by epidermal growth factor receptor activation and their relationship to cetuximab resistance in head and neck squa-mous cell carcinoma. BMCGenomics13, 160

72. Fertig, E.J. etal.(2012) Identifying context-speciﬁctranscription factor targets from prior knowledge and gene expression data. IEEETrans.Nanobioscience12, 142–149

73. Segal, E. etal.(2004) A module map showing conditional activity of expression modules in cancer. Nat.Genet.36, 1090–1098

74. Subramanian, A. etal.(2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expressionproﬁles.Proc.Natl.Acad.Sci.102,15545–15550

75. Zhu,X.etal.(2017)Detectingheterogeneityinsingle-cell RNA-Seqdatabynon-negativematrixfactorization.PeerJ.5,e2888

76. DeTomaso,D.andYosef,N.(2016)FastProject:atoolfor low-dimensionalanalysisofsingle-cellRNA-Seqdata.BMC Bioin-formatics17, 315

77. Fertig, E.J. etal.(2013) Identifying context-speciﬁctranscription factor targets from prior knowledge and gene expression data. IEEETrans.Nanobiosci.12, 142–149

78. Irizarry, R.A. etal.(2005) Multiple-laboratory comparison of microarray platforms. Nat.Methods2, 345–350

79. Fan, J. etal.(2016) Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat. Methods3, 241–244

80. Trapnell, C. etal. (2014) The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat.Biotechnol.32, 381–386

81. Townes,F.W.etal.(2017)Varying-censoring awarematrix

factorizationforsinglecellRNA-sequencing.bioRxivPublished

onlineJuly21,2017.http://dx.doi.org/10.1101/166736

82. Moon,K.R.etal.(2017)PHATE:adimensionalityreduction

methodforvisualizingtrajectorystructuresinhigh-dimensional

biologicaldata.bioRxivPublishedonlineMarch24,2017.http://

dx.doi.org/10.1101/120378

83. Puram,S.V.etal.(2017)Single-velltranscriptomicanalysisof primary and metastatic tumor ecosystems in head and neck cancer. Cell171, 1611–1624

84. Hübschmann,D.etal.(2017)Decipheringprogramsof

tran-scriptionalregulationbycombineddeconvolutionofmultiple

omics layers. bioRxiv Published online October 8, 2017.

http://dx.doi.org/10.1101/199547

85. Buettner, F. etal.(2017) f-scLVM: scalable and versatile factor analysis for single-cell RNA-seq. GenomeBiol.18, 212

86. Buettner,F.etal.(2016)Scalablelatent-factormodelsappliedto

single-cellRNA-seqdataseparatebiologicaldriversfrom

con-foundingeffects.bioRxivPublishedonlineNovember15,2016.

http://dx.doi.org/10.1101/087775

87. vanDijk,D.etal.(2017)MAGIC:adiffusion-basedimputation

methodreveals gene-geneinteractions in single-cell

RNA-sequencing data. bioRxiv Published online February 25,

2017.http://dx.doi.org/10.1101/111591

88. Risso,D.etal.(2017)ZINB-WaVE:ageneralandﬂexiblemethodfor

signalextractionfromsingle-cellRNA-seqdata.bioRxivPublished

onlineNovember2,2017.http://dx.doi.org/10.1101/125112

89. Pierson,E.andYau,C.(2015)ZIFA:Dimensionalityreductionfor zero-inﬂatedsingle-cellgeneexpressionanalysis.GenomeBiol. 16, 241

90. van der, M.L. and Hinton, G. (2008) Visualizing data using t-SNE. J.Mach.LearnRes.9, 2579–2605

91. Alter, O. etal.(2000) Singular value decomposition for genome-wide expression data processing and modeling. Proc.Natl. Acad.Sci.U.S.A.97, 10101–10106

92. Fellenberg, K. etal.(2001) Correspondence analysis applied to microarray data. Proc.Natl.Acad.Sci.U.S. A. 98, 10781–10786

93. Brunet, J.P. etal.(2004) Metagenes and molecular pattern discovery using matrix factorization. Proc.Natl.Acad.Sci. U.S.A.101, 4164–4169

94. Abdi, H. and Williams, L.J. (2010) Principal component analysis. WIREsComp.Stat.2, 433–459

95. Hyvärinen, A. etal.(2004) IndependentComponentAnalysis, John Wiley & Sons

96. Hardoon, D.R. etal.(2004) Canonical correlation analysis: an overview with application to learning methods. Neural.Comput. 16, 2639–2664

97. Schölkopf, B. etal.(1999) Kernel principal component analysis. In AdvancesInKernel Methods– SupportVectorLearning (Schölkopf, B., ed.), MIT Press, pp. 327–000

98. Arora,R.andLivescu,K.(2012)KernelCCAformulti-view learningofacousticfeaturesusingarticulatorymeasurements. InSymposiumonMachineLearninginSpeechandLanguage Processing.MSLP

99. Andrew,G.etal.(2013)Deepcanonicalcorrelationanalysis. Proceedingsofthe30thInternationalConferenceonMachine Learning1247–1255

100.Ding, C. and He, X. (2004) K-means clustering via principal component analysis. Proceedingsofthe21stInternational Con-ferenceonMachineLearning29

101.Arora, R. etal. (2011) Clustering by left-stochastic matrix factorization. Proceedingsofthe28thInternationalConference onMachineLearning28, 761–768

102.Kulis, B. (2013) Metric learning: a survey. Found.TrendsMach. Learn.5, 287–364

103.Pritchard, J.K. etal.(2000) Inference of population structure using multilocus genotype data. Genetics155, 945–959

104.De la Torre, F. and Black, M.J. (2003) A framework for robust subspace learning. Int.J.Comput.Vis.54, 117–142

105.Szeliski,R.(2010)ComputerVision:AlgorithmsandApplications.

http://szeliski.org/Book/drafts/SzeliskiBook_20100903_draft.pdf

106.Candès,E.J.etal.(2011)Robustprincipalcomponentanalysis? J.ACM58,11

107.Arora,R.etal.(2012)StochasticoptimizationforPCAandPLS. In50thAnnualAllertonConferenceonCommunicationControl, andComputing,pp.861–868,Allerton

108.Arora, R. etal.(2013) Stochastic optimization of PCA with capped MSG. In AdvancesinNeuralInformationProcessing Systems(Vol. 26) (Burges, C.J.C., ed.), In pp. 1815–1823, Curran Associates

109. Goes, J. et al. (2014) Robust stochastic principal component analysis. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (Kaski, S. and Corander, J., eds), pp. 266–274, PMLR

110. Bickel, S. and Scheffer, T. (2004) Multi-view clustering. Proceedings of the IEEE International Conference on Data Mining 19–26

111. Candès, E.J. and Recht, B. (2009) Exact matrix completion via convex optimization. Found. Comput. Math. 9, 717

112. Argyriou, A. et al. (2007) Multi-task feature learning. Adv. Neural. Inf. Process. Syst. 19, 41–48

113. Ando, R.K. and Zhang, T. (2005) A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res. 6, 1817–1853

114. Cleary, B. (2017) Composite measurements and molecular compressed sensing for highly efficient transcriptomics. bioRxiv Published online January 2, 2017. http://dx.doi.org/10.1101/091926

115. Aha, D.W. et al. (1991) Instance-based learning algorithms. Mach. Learn. 6, 37–66

116. Arora, R. et al. (2013) Similarity-based clustering by left-stochastic matrix factorization. J. Mach. Learn. Res. 14, 1715–1746

117. Liao, R. et al. (2014) CloudNMF: a MapReduce implementation of nonnegative matrix factorization for large-scale biological datasets. Genomics Proteomics Bioinf. 12, 48–51

118. de Campos, C.P. et al. (2013) Discovering subgroups of patients from DNA copy number data using NMF on compacted matrices. PLoS One 8, e79720

119. Huang, S. et al. (2017) More is better: recent progress in multi-omics data integration methods. Front. Genet. 8, 84

120. Hore, V. et al. (2016) Tensor decomposition for multiple-tissue gene expression experiments. Nat. Genet. 48, 1094–1100

121. Durham, T.J. et al. (2018) PREDICTD parallel epigenomics data imputation with cloud-based tensor decomposition. Nat. Commun. 9, 1402

122. Zhu, Y. et al. (2016) Constructing 3D interaction maps from 1D epigenomes. Nat. Commun. 7, 10812

123. Wang, M. et al. (2017) Three-way clustering of multi-tissue multi-individual gene expression data using constrained tensor decomposition. bioRxiv Published online December 5, 2017. http://dx.doi.org/10.1101/229245

124. Kolda, T. and Bader, B. (2009) Tensor decompositions and applications. SIAM Rev. 51, 455–500

125. Mao, W. et al. (2017) Pathway-level information extractor (PLIER): a generative model for gene expression data. bioRxiv Published online December 16, 2017. http://dx.doi.org/10.1101/116061

126. Hofree, M. et al. (2013) Network-based stratification of tumor mutations. Nat. Methods 10, 1108

127. Liao, J.C. et al. (2003) Network component analysis: reconstruction of regulatory signals in biological systems. Proc. Natl. Acad. Sci. U.S.A. 100, 15522–15527