HAL Id: hal-02887755
https://hal.archives-ouvertes.fr/hal-02887755
Submitted on 2 Jul 2020
Matrix cofactorization for joint representation learning and supervised classification: application to hyperspectral image analysis
Adrien Lagrange, Mathieu Fauvel, Stéphane May, José M. Bioucas-Dias, Nicolas Dobigeon
To cite this version:
Adrien Lagrange, Mathieu Fauvel, Stéphane May, José M. Bioucas-Dias, Nicolas Dobigeon.
Matrix cofactorization for joint representation learning and supervised classification: application to hyperspectral image analysis. Neurocomputing, Elsevier, 2020, 385, pp. 132-147.
DOI: 10.1016/j.neucom.2019.12.068. HAL: hal-02887755
Matrix cofactorization for joint representation learning and supervised classification – Application to hyperspectral image analysis
Adrien Lagrange a,∗, Mathieu Fauvel b, Stéphane May c, José Bioucas-Dias e, Nicolas Dobigeon a,d

a University of Toulouse, IRIT/INP-ENSEEIHT Toulouse, BP 7122, Toulouse Cedex 7 31071, France
b CESBIO, University of Toulouse, CNES/CNRS/INRA/IRD/UPS, BPI 2801, Toulouse Cedex 9 31401, France
c CNES, DCT/SI/AP, 18 Avenue Edouard Belin, Toulouse 31400, France
d Institut Universitaire de France, France
e Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Lisbon 1049-001, Portugal
Keywords: Image interpretation; Supervised learning; Representation learning; Hyperspectral images; Non-convex optimization; Matrix cofactorization
Abstract

Supervised classification and representation learning are two widely used classes of methods to analyze multivariate images. Although complementary, these methods have been scarcely considered jointly in a hierarchical modeling. In this paper, a method coupling these two approaches is designed using a matrix cofactorization formulation. Each task is modeled as a matrix factorization problem and a term relating both coding matrices is then introduced to drive an appropriate coupling. The link can be interpreted as a clustering operation over the low-dimensional representation vectors. The attribution vectors of the clustering are then used as feature vectors for the classification task, i.e., the coding vectors of the corresponding factorization problem. A proximal gradient descent algorithm, ensuring convergence to a critical point of the objective function, is then derived to solve the resulting non-convex non-smooth optimization problem. An evaluation of the proposed method is finally conducted both on synthetic and real data in the specific context of hyperspectral image interpretation, unifying two standard analysis techniques, namely unmixing and classification.
1. Introduction
Numerous frameworks have been developed to efficiently analyze the increasing amount of remote sensing images [1,2]. Among those methods, supervised classification has received considerable attention, leading to the development of current state-of-the-art classification methods based on advanced statistical tools, such as convolutional neural networks [3–5], kernel methods [6], random forests [7] or Bayesian models [8]. In the context of remote sensing image classification, these methods aim at retrieving the class of each pixel of the image given a specific class nomenclature. Within
✩ Part of this work has been supported by Centre National d'Études Spatiales (CNES), Occitanie Region, EU FP7 through the ERANETMED JC-WATER program (project ANR-15-NMED-0002-02 MapInvPlnt), by the ANR-3IA Artificial and Natural Intelligence Toulouse Institute (ANITI) and by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme under grant agreement No 681839 (project FACTORY).
∗ Corresponding author.
E-mail addresses: adrien.lagrange@enseeiht.fr (A. Lagrange),
mathieu.fauvel@inra.fr (M. Fauvel), stephane.may@cnes.fr (S. May), bioucas@lx.it.pt (J. Bioucas-Dias), nicolas.dobigeon@enseeiht.fr (N. Dobigeon).
a supervised framework, a set of pixels is assumed to be annotated by an expert and subsequently used as examples through a learning process. Thanks to extensive research efforts of the community, classification methods have become very efficient. Nevertheless, they still face some challenging issues, such as the high dimension of the data, often coupled with the lack of training data [9]. Handling multi-modal and/or composite classes with intrinsic intra-variability is also a recurrent issue [10]: for instance, a class referred to as building can gather very dissimilar samples when metallic and tiled roofs are present in a scene. Besides, the resulting classification remains a high-level interpretation of the scene since it only gives a single class to summarize all information in a given pixel.
Hence, more recent works have emerged in order to provide a richer interpretation [11,12]. In particular, representation learning methods assume that the data results from the composition of a reduced number of elementary patterns. More precisely, the observed measurements can be approximated by mixtures of dictionary elements able to simultaneously capture the variability and redundancy in the dataset. Representation learning can be tackled from different perspectives, in particular known as dictionary learning [13], source separation [14], compressive sensing [15], factor analysis [16], matrix factorization [17] or subspace learning [18]. Various models have been proposed to learn a dedicated representation relevant to the field of interest, differing by specific assumptions and/or constraints. Most of them attempt to identify a dictionary and a mixture function by minimizing a reconstruction error measuring the discrepancy between the chosen model and the dataset. For instance, non-negative matrix factorization (NMF) aims at recovering a linear mixture of non-negative elements with non-negative activation coefficients, leading to additive part-based decompositions of the observations [19,20]. Contrary to a classification task, representation learning methods generally have the great advantage of being unsupervised. However, for particular purposes, they can be specialized to learn a representation suited for a particular task, e.g., classification or regression [21]. Thus, representation learning provides a rich yet compact description of the data whereas supervised classification offers a univocal interpretation based on prior knowledge from experts.
The idea of combining the representation learning and classification tasks has already been considered, mostly to use the representation learning method as a dimensionality reduction step prior to the classification [22], where the low-dimensional representation is used as input features. Nonetheless, some works introduce the idea of performing the two tasks simultaneously [23]. For example, the discriminative K-SVD algorithm associates a linear mixture model to a linear classifier [24]. In the end, the method tries to learn a dictionary well-fitted for the classification task, i.e., the learned representation minimizes the reconstruction error but also ensures a good separability of the classes. More intertwined frameworks can also be considered, such as the one proposed in [25] where elements of the dictionary are class-specific. Joint representation learning and classification can be cast as a cofactorization problem. Both tasks are interpreted as individual factorization problems and constraints between the dictionaries and coding matrices associated with the two problems can then be imposed. These cofactorization-based models have proven to be highly efficient in many application fields, e.g., for text mining [26], music source separation [27], or image analysis [28,29].
However, most of the available methods tend to focus on classification results and generally oppose reconstruction accuracy and discriminative abilities of the models instead of designing a unifying hierarchical structure. Capitalizing on recent advances and a first attempt in [30] in a Bayesian setting, this paper proposes a particular cofactorization method, with a dedicated application to multivariate image analysis. The representation learning and classification tasks are related through the coding matrices of the two factorization problems. A clustering is performed on the low-dimensional representation and the clustering attribution vectors are used as coding vectors for the classification. This novel coupling approach produces a coherent and fully-interpretable hierarchical model. To solve the resulting non-convex non-smooth optimization problem, a proximal alternating linearized minimization (PALM) algorithm is derived, yielding guarantees of convergence to a critical point of the objective function [31].
The main contributions reported in this paper can be summarized as follows. A generic framework is proposed to demonstrate that two ubiquitous image analysis methods, namely supervised classification and representation learning, can be unified into a unique joint cofactorization problem. This framework is instantiated for one particular application in the context of hyperspectral image analysis where supervised classification and spectral unmixing are performed jointly. The proposed method offers a comprehensive and meaningful analysis of the image as well as competitive quantitative results for the two considered tasks.
This paper is organized as follows. Section 2 defines the two factorization problems used to perform representation learning and classification and further discusses the joint cofactorization problem. It also details the optimization scheme developed to solve the resulting non-convex minimization problem. To illustrate the generic framework introduced in the previous section, an application to hyperspectral image analysis is conducted in Section 3 through the dual scope of spectral unmixing and classification. Performance of the proposed framework is illustrated thanks to experiments conducted on synthetic and real data in Section 4. Finally, Section 5 concludes the paper and presents some research perspectives to this work.
2. Proposed generic framework
The representation learning and classification tasks are generically defined as matrix factorization problems in Sections 2.1 and 2.2. To derive a unified cofactorization formulation, a third step consists in drawing the link between these two independent problems. In this work, this coupling is ensured by imposing a consistent structure between the two coding matrices corresponding to the low-dimensional representation and the feature matrices, respectively. As detailed in Section 2.3, it is expressed as a clustering task where the parameters describing the attribution to the clusters are the feature vectors, i.e., the coding matrix resulting from the classification task. Particular instances of these three tasks will be detailed in Section 3 for an application to multiband image analysis.
2.1. Representation learning
The fundamental assumption in representation learning is that the P considered L-dimensional samples, gathered in the matrix Y ∈ ℝ^{L×P}, belong to an R-dimensional subspace such that R ≪ L. The aim is then to recover this manifold, where samples can be expressed as combinations of elementary vectors, herein the columns of the matrix W ∈ ℝ^{L×R}, sometimes referred to as dictionary. These samples can be subsequently represented thanks to the so-called coding matrix H ∈ ℝ^{R×P}. Formally, identifying the dictionary and the coding matrices can be generally expressed as the minimization problem

$$\min_{W,H}\; \mathcal{J}_r(Y \mid \psi(W,H)) + \lambda_w \mathcal{R}_w(W) + \imath_{\mathcal{W}}(W) + \lambda_h \mathcal{R}_h(H) + \imath_{\mathcal{H}}(H) \tag{1}$$

where ψ(·) is a mixture function (e.g., a linear or bilinear operator), J_r(·) is an appropriate cost function, for example derived from a β-divergence [32], R_·(·) denote penalizations weighted by the parameters λ_·, and ı_·(·) are the indicator functions defined here on the respective sets W ⊂ ℝ^{L×R} and H ⊂ ℝ^{R×P}, imposing some constraints on the dictionary and coding matrices.
In the case of the linear embedding adopted in this work, the mixture function writes

$$\psi(W,H) = WH. \tag{2}$$

In this context, the problem (1) can be cast as a factor analysis driven by the cost function J_r(·). Depending on the applicative field, typical data-fitting measures include the Itakura-Saito, the Euclidean and the Kullback–Leibler divergences [32]. Assuming a low-rank model (i.e., R ≤ L), specific choices for the sets H and W lead to various standard factor models. For instance, when W is chosen as the Stiefel manifold, the solution of (1) is given by a principal component analysis (PCA) [33]. When W and H impose nonnegativity of the dictionary and coding matrix elements, the problem is known as nonnegative matrix factorization [19,34]. Within a supervised context, the dictionary W can be chosen thanks to end-user expertise or estimated beforehand. Without loss of generality but for the sake of conciseness, the framework described in this paper assumes that this dictionary is known, possibly overcomplete as proposed in the experimental illustration described in Section 4. In this case, as in many applications, it makes sense to look for a sparse representation of the signal of interest to retrieve its most achievable compact representation [21,35]. Following this strategy, we propose to consider an ℓ1-norm sparsity penalization on the coding vectors, leading to the representation learning task defined by
$$\min_{H}\; \mathcal{J}_r(Y \mid WH) + \lambda_h \|H\|_1 + \imath_{\mathcal{H}}(H) \tag{3}$$

where $\|H\|_1 = \sum_{p=1}^{P} \|h_p\|_1$ with h_p denoting the pth column of H.

2.2. Supervised classification
To clearly define the classification task, let us first introduce some key notations. The index subset of samples with an available ground truth is denoted L while the index subset of unlabeled samples is U, such that L ∩ U = ∅ and L ∪ U = P with P ≜ {1, ..., P}. Classifying the unlabeled samples consists in assigning each of them to one of the C classes. This can be reformulated as the estimation of a C × P matrix C whose columns correspond to unknown C-dimensional attribution vectors $c_p = [c_{1,p}, \ldots, c_{C,p}]^T$. Each vector is made of 0s except for c_{i,p} = 1 when the pth sample is assigned to the ith class.
Numerous classification rules have been proposed in the literature [36]. Most of them rely on a K × P matrix Z = [z_1, ..., z_P] of features z_p (p ∈ P) associated with each sample and derived from the raw data. Within a supervised framework, the attribution matrix C_L and feature matrix Z_L of the labeled data are exploited during the learning step, where ·_L denotes the corresponding submatrix whose columns are indexed by L. For a wide range of classifiers, deriving a classification rule can be achieved by solving the optimization problem

$$\min_{Q}\; \mathcal{J}_c(C_{\mathcal{L}} \mid \phi(Q, Z_{\mathcal{L}})) + \lambda_q \mathcal{R}_q(Q) \tag{4}$$

where Q ∈ ℝ^{C×K} is the set of classifier parameters to be inferred, R_q(·) refers to regularizations imposed on Q and J_c is a cost function measuring the quality of the classification, such as the quadratic loss [24] or cross-entropy [37]. Moreover, in (4), φ(Q, ·) defines an element-wise nonlinear mapping between the features and the class attribution vectors, parametrized by Q, e.g., derived from a sigmoid or a softmax operator. In this work, the classifier is assumed to be linear, which leads to a vector-wise post-nonlinear mapping
$$\phi(Q, Z_{\mathcal{L}}) = \phi(Q Z_{\mathcal{L}}) \tag{5}$$

with

$$\phi(X) = [\phi(x_1), \ldots, \phi(x_P)]. \tag{6}$$

Once the classifier parameters have been estimated by solving (4), the unknown attribution vectors C_U can be subsequently inferred during the testing step by applying the nonlinear transformation to the corresponding predicted features Ẑ_U associated with the unlabeled samples. The obtained outputs are relaxed attribution vectors ĉ_p = φ(Q ẑ_p) (p ∈ U) and the most probable predicted sample class can be computed as argmax_i ĉ_{i,p}.
Under the proposed formulation of the classification task, the learning and testing steps can be conducted simultaneously, a framework usually referred to as semi-supervised, with the beneficial opportunity to introduce additional regularizations and/or constraints on the submatrix of unknown attribution vectors C_U. The initial problem (4) is thus extended to the following one

$$\min_{Q, C_{\mathcal{U}}}\; \mathcal{J}_c(C \mid \phi(QZ)) + \lambda_q \mathcal{R}_q(Q) + \lambda_c \mathcal{R}_c(C) + \imath_{\mathcal{C}}(C_{\mathcal{U}}) \tag{7}$$

where C = [C_L C_U] and C ⊂ ℝ^{C×|U|} denotes a feasible set for the attribution matrix C_U. As discussed above, the cost function J_c(C | Ĉ) measures the actual classification loss, i.e., the discrepancy between the attribution vectors C of the training set and the attribution vectors Ĉ predicted by the classifier. Two particular cases fitting this generic model are provided in Sections 3.2.1 and 3.2.2. The attribution vectors are defined as Ĉ = φ(QẐ) where φ(·) is a nonlinear function applied to the output of a linear classifier. The regularization term R_q(Q) penalizes the parameters of the classifier. A typical example is a quadratic penalization which aims at avoiding overfitting, as conventionally done when optimizing neural networks and generally referred to as weight decay [38]. Finally, the regularization term R_c(C) penalizes the attribution matrix. Typical examples include spatial regularizations such as total variation (TV) when dealing with image classification. The indicator function ı_C(C_U) enforces sum-to-one and non-negativity constraints such that each attribution vector c_p (p ∈ U) can then be interpreted as a vector of probabilities of belonging to each class. In such a case, the feasible set is chosen as $\mathcal{C} = \mathcal{S}_C^{|\mathcal{U}|}$ where

$$\mathcal{S}_C \triangleq \left\{ u \in \mathbb{R}^C \;\middle|\; \forall k,\; u_k \geq 0 \text{ and } \sum_{k=1}^{C} u_k = 1 \right\}. \tag{8}$$

Fig. 1. Structure of the cofactorization model. Variables in blue stand for observations or available external data. Variables in olive green are linked through the clustering task, here formulated as an optimization problem. The variable in a dotted box is assumed to be known or estimated beforehand in this work.
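In a proximal scheme such as the one derived later in Section 2.5, the proximal operator of the indicator of (8) is the Euclidean projection onto the probability simplex, applied column-wise to C_U. As an illustration (not part of the original paper; the function name is ours), here is a minimal Python sketch of the classical sort-and-threshold projection algorithm:

```python
import numpy as np

def project_simplex(u):
    """Euclidean projection of u onto {x : x >= 0, sum(x) = 1},
    via the classical sort-and-threshold algorithm."""
    u = np.asarray(u, dtype=float)
    n = u.size
    s = np.sort(u)[::-1]                                  # decreasing order
    css = np.cumsum(s)
    # largest 0-indexed rho such that s[rho] * (rho + 1) > css[rho] - 1
    rho = np.nonzero(s * np.arange(1, n + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(u - theta, 0.0)

# project each attribution vector c_p (one column of C_U) independently
C_U = np.array([[2.0, -1.0], [0.0, 0.5], [1.0, 0.2]])
C_proj = np.apply_along_axis(project_simplex, 0, C_U)
```

After projection, every column is nonnegative and sums to one, i.e., it is a valid relaxed attribution vector.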
2.3. Coupling representation learning and classification
Up to this point, the representation learning and supervised classification tasks have been formulated as two independent matrix factorization problems given by (3) and (7), respectively. This work proposes to join them by drawing an implicit relation between two factors involved in these two problems. Inspired by hierarchical Bayesian models such as the one proposed in [30], both problems are coupled through the activation matrices H and Z, as illustrated in Fig. 1. More precisely, the coding vectors in H are clustered such that the feature vectors in Z are defined as the attribution vectors to the K clusters. Ideally, the clustering attribution vectors z_p are filled with zeros except for z_{k,p} = 1 when h_p is associated with the kth cluster. Thus, the vectors z_p (p ∈ P) are assumed to be defined on the K-dimensional probability simplex S_K, similarly defined as (8) and ensuring non-negativity and sum-to-one constraints. Many clustering algorithms can be expressed as optimization problems, such as the well-known k-means algorithm and many of its variants [39,40]. Adopting this formulation, and denoting θ the set of parameters of the clustering algorithm, the clustering task can be defined as the minimization problem

$$\min_{Z,\theta}\; \mathcal{J}_g(H, Z; \theta) + \lambda_z \mathcal{R}_z(Z) + \lambda_\theta \mathcal{R}_\theta(\theta) + \imath_{\mathcal{S}_K^P}(Z) + \imath_{\Theta}(\theta) \tag{9}$$

where Θ defines a feasible set for the parameters θ.

Table 1. Overview of notations.

P : Number of observations
L : Dimension of observations
C : Number of classes
K : Number of features/clusters
P = {1, ..., P} : Index set of observations
L ⊂ P : Index set of labeled samples
L_i ⊂ L : Index set of labeled samples in the ith class
U = P \ L : Index set of unlabeled samples
Y ∈ ℝ^{L×P} : Observations
W ∈ ℝ^{L×R} : Dictionary
H ∈ ℝ^{R×P} : Coding matrix
Q ∈ ℝ^{C×K} : Classifier parameters
C_L ∈ ℝ^{C×|L|} : Attribution matrix of labeled data
C_U ∈ ℝ^{C×|U|} : Attribution matrix of unlabeled data
C = [C_L C_U] : Class attribution matrix
Z ∈ ℝ^{K×P} : Cluster attribution matrix
θ ∈ Θ : Clustering parameters
It is worth noting that introducing this coupling term is one of the major novelties of the proposed approach. When considering task-driven dictionary learning methods, it is usual to intertwine the representation learning and the classification tasks by directly imposing H = Z [24,41]. Since these methods generally rely on a linear classifier, one major drawback of such approaches is their inability to deal with non-separable classes in the low-dimensional representation space. In such cases, the underlying model cannot be discriminative and descriptive simultaneously and the resulting tasks become adversarial. When considering the proposed coupling term, the cluster attribution vectors z_p offer the possibility of linearly separating any group of clusters from the others. As a consequence, the model benefits from more flexibility, with both discriminative and descriptive abilities in a more general sense.
2.4. Global cofactorization problem
Unifying the representation learning task (3) and the classification task (7) through the clustering task (9) leads to the following joint cofactorization problem

$$\begin{aligned} \min_{H, Q, C_{\mathcal{U}}, Z, \theta}\; & \lambda_0 \mathcal{J}_r(Y \mid WH) + \lambda_h \|H\|_1 \\ & + \lambda_1 \mathcal{J}_c(C \mid \phi(QZ)) + \lambda_q \mathcal{R}_q(Q) + \lambda_c \mathcal{R}_c(C) \\ & + \lambda_2 \mathcal{J}_g(H, Z; \theta) + \lambda_z \mathcal{R}_z(Z) + \lambda_\theta \mathcal{R}_\theta(\theta) \\ & + \imath_{\mathcal{H}}(H) + \imath_{\mathcal{S}_C^{|\mathcal{U}|}}(C_{\mathcal{U}}) + \imath_{\mathcal{S}_K^P}(Z) + \imath_{\Theta}(\theta) \end{aligned} \tag{10}$$

where λ0, λ1 and λ2 control the respective contributions of each task data-fitting term. All notations and parameter dimensions are summarized in Table 1. A generic algorithmic scheme solving the problem (10) is proposed in the next section.
2.5. Optimization scheme
The minimization problem defined by (10) is not globally convex. To reach a local minimizer, we propose to resort to the proximal alternating linearized minimization (PALM) algorithm introduced in [31]. This algorithm is based on proximal descent steps, which allows non-smooth terms to be handled. Moreover, it is guaranteed to converge to a critical point of the objective function even in the case of non-convex problems. This means that, if the initialization is good enough, it is expected to likely converge to a solution close to the global optimum. To implement PALM, the problem (10) is rewritten in the form of an unconstrained problem expressed as the sum of a smooth coupling term g(·) and separable non-smooth terms f_j(·) (j ∈ {0, ..., 3}) as follows

$$\min_{H, \theta, Z, Q, C_{\mathcal{U}}}\; f_0(H) + f_1(\theta) + f_2(Z) + f_3(C_{\mathcal{U}}) + g(H, \theta, Z, C_{\mathcal{U}}, Q) \tag{11}$$

where
$$f_0(H) = \imath_{\mathcal{H}}(H) + \lambda_h \|H\|_1, \qquad f_1(\theta) = \imath_{\Theta}(\theta),$$
$$f_2(Z) = \imath_{\mathcal{S}_K^P}(Z), \qquad f_3(C_{\mathcal{U}}) = \imath_{\mathcal{S}_C^{|\mathcal{U}|}}(C_{\mathcal{U}})$$
and the coupling function is

$$g(H, \theta, Z, C_{\mathcal{U}}, Q) = \lambda_0 \mathcal{J}_r(Y \mid WH) + \lambda_1 \mathcal{J}_c(C \mid \phi(QZ)) + \lambda_q \mathcal{R}_q(Q) + \lambda_c \mathcal{R}_c(C) + \lambda_2 \mathcal{J}_g(H, Z; \theta) + \lambda_z \mathcal{R}_z(Z) + \lambda_\theta \mathcal{R}_\theta(\theta). \tag{12}$$

To ensure the stated guarantees of PALM, all f_j(·) have to be proper, lower semi-continuous functions f_j : ℝ^{n_j} → (−∞, +∞], which ensures in particular that the associated proximal operators are well-defined. Additionally, sufficient conditions on the coupling function are that g(·) is a C² function (i.e., with continuous first and second derivatives) and that its partial gradients are globally Lipschitz. For example, the partial gradient ∇_H g(H, θ, Z, C_U, Q) should be globally Lipschitz for any fixed θ, Z, C_U, Q, that is

$$\|\nabla_H g(H_1, \theta, Z, C_{\mathcal{U}}, Q) - \nabla_H g(H_2, \theta, Z, C_{\mathcal{U}}, Q)\| \leq L_H(\theta, Z, C_{\mathcal{U}}, Q)\, \|H_1 - H_2\|, \quad \forall H_1, H_2 \in \mathbb{R}^{R \times P} \tag{13}$$

where L_H(θ, Z, C_U, Q), simply denoted L_H hereafter, is the Lipschitz constant. For the sake of conciseness, we refer to [31] for further details.
The main idea of the algorithm is then to update each variable of the problem alternately using a proximal gradient descent step. The overall scheme is summarized in Algorithm 1.

Algorithm 1: PALM.
1: Initialize variables H⁰, θ⁰, Z⁰, C_U⁰ and Q⁰;
2: Set α > 1;
3: while stopping criterion not reached do
4:   H^{k+1} ∈ prox_{f_0}^{αL_H}( H^k − (1/(αL_H)) ∇_H g(H^k, θ^k, Z^k, C_U^k, Q^k) );
5:   θ^{k+1} ∈ prox_{f_1}^{αL_θ}( θ^k − (1/(αL_θ)) ∇_θ g(H^{k+1}, θ^k, Z^k, C_U^k, Q^k) );
6:   Z^{k+1} ∈ prox_{f_2}^{αL_Z}( Z^k − (1/(αL_Z)) ∇_Z g(H^{k+1}, θ^{k+1}, Z^k, C_U^k, Q^k) );
7:   Q^{k+1} = Q^k − (1/(αL_Q)) ∇_Q g(H^{k+1}, θ^{k+1}, Z^{k+1}, C_U^k, Q^k)  (no non-smooth term is associated with Q);
8:   C_U^{k+1} ∈ prox_{f_3}^{αL_{C_U}}( C_U^k − (1/(αL_{C_U})) ∇_{C_U} g(H^{k+1}, θ^{k+1}, Z^{k+1}, C_U^k, Q^{k+1}) );
9: end while
10: return H^end, θ^end, Z^end, Q^end, C_U^end

For a practical implementation, one needs to compute the partial gradients of g(·) explicitly, along with their Lipschitz constants, to perform a gradient descent step, followed by the proximal mapping associated with the non-smooth terms f_j(·). The objective function is then monitored at each iteration and the algorithm is stopped when convergence is reached. Note that, when a specific penalization R_·(·) is non-smooth or non-gradient-Lipschitz, it is possible to move it into the corresponding independent term f_j(·) to ensure the required property of the coupling function g(·). This is for instance the case for the sparse penalization used over H, which has been moved into f_0(·). Nonetheless, as mentioned above, the proximal operator associated with each f_j(·) is needed. Thus, even when the function consists of several terms, a closed-form expression of this operator should be known. Alternatively, one should be able to compose the proximal operators associated with each term of f_j(·) [42].
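To make the update scheme concrete, the following Python sketch (our illustration, not code from the paper) implements the generic PALM iteration for a list of blocks, each equipped with a partial gradient of g, a Lipschitz constant and a proximal operator. It is demonstrated on the single-block toy problem min_H ½‖Y − WH‖²_F + λ‖H‖₁ s.t. H ≥ 0, for which PALM reduces to a nonnegative ISTA:

```python
import numpy as np

def palm(blocks, grads, lips, proxs, alpha=1.1, n_iter=300):
    """Generic PALM skeleton: for each block j, take a gradient step on the
    smooth coupling g with step 1/(alpha * L_j), then apply the proximal
    operator of the non-smooth term f_j.
    grads[j](x) -> partial gradient of g w.r.t. block j at x,
    lips[j](x)  -> Lipschitz constant of that partial gradient,
    proxs[j](v, t) -> prox of t * f_j evaluated at v."""
    x = [b.copy() for b in blocks]
    for _ in range(n_iter):
        for j in range(len(x)):
            L = lips[j](x)
            t = 1.0 / (alpha * L)
            x[j] = proxs[j](x[j] - t * grads[j](x), t)
    return x

# Toy single-block instance: nonnegative sparse coding of Y in dictionary W.
rng = np.random.default_rng(0)
Y, W, lam = rng.random((20, 40)), rng.random((20, 5)), 0.1
grads = [lambda x: W.T @ (W @ x[0] - Y)]             # gradient of 0.5*||Y-WH||^2
lips = [lambda x: np.linalg.norm(W, 2) ** 2]         # its Lipschitz constant
proxs = [lambda v, t: np.maximum(v - lam * t, 0.0)]  # prox of lam*||.||_1 + i_{>=0}
(H,) = palm([np.zeros((5, 40))], grads, lips, proxs)
```

The full model of (10) would plug in the five blocks H, θ, Z, Q, C_U with their respective gradients and proximal operators.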
Fig. 2. Spectral unmixing concept (source US Navy NEMO).
3. Application: hyperspectral image analysis
A general framework has been introduced in the previous section. As an illustration, a particular instance of this generic framework is now considered, where explicit representation learning, classification and clustering tasks are introduced. The specific case of hyperspectral image analysis is considered for this use case example.
Contrary to conventional color imaging, which only captures the reflectance measure for three wavelengths (red, blue, green), hyperspectral imaging makes it possible to measure the reflectance of the observed scene for several hundreds of wavelengths from the visible to the invisible domain. Each pixel of the image can thus be represented as a vector of reflectances, called a spectrum, which characterizes the observed material.
One drawback of hyperspectral images is usually a weaker spatial resolution due to sensor limitations. The direct consequence of this poor spatial resolution is the presence of mixed pixels, i.e., pixels corresponding to areas containing several materials. Observed spectra are in this case the result of a specific mixture of the elementary spectra, called endmembers, associated with the individual materials present in the pixel. The problem of retrieving the proportions of each material in each pixel is referred to as spectral unmixing [11]. This problem can be seen as a specific case of representation learning where the dictionary is composed of the set of endmember spectra and the coding matrix is the so-called abundance matrix containing the proportion of each material in each pixel.
Spectral unmixing is introduced as a representation learning task in Section 3.1. The specific classifier used for this application is then explained in Section 3.2 and finally Section 3.3 presents the clustering adopted to relate the abundance matrix and the classification feature matrix.
3.1. Spectral unmixing
As explained, each pixel of a hyperspectral image is characterized by a reflectance spectrum that physical theory approximates as a combination of endmembers, each corresponding to a specific material, as illustrated in Fig. 2. Formally, in this applicative scenario, the L-dimensional sample y_p denotes the spectrum of the pth pixel of the hyperspectral image (p ∈ P). Each observation vector y_p can be expressed as a function of the endmember matrix W (containing the R elementary spectra) and the abundance vector h_p ∈ ℝ^R with R ≪ L.

In the case of the most commonly adopted linear mixture model, each observation y_p is assumed to be a linear combination of the endmember spectra w_r (r = 1, ..., R) corrupted by some noise, underlying the linear embedding (2). Assuming a quadratic data-fitting term, the cost function associated with the representation learning task in (1) is written

$$\mathcal{J}_r(Y \mid WH) = \frac{1}{2}\|Y - WH\|_F^2. \tag{14}$$

The abundance vector h_p is usually interpreted as a vector of proportions describing the proportion of each elementary component in the pixel. Thus, to derive an additive composition of the observed pixels, a nonnegativity constraint is considered for each element of the abundance matrix H, i.e., H = ℝ_+^{R×P}. In this work, no sum-to-one constraint is considered since it has been argued that leaving this constraint out offers a better adaptation to possible changes of illumination in the scene [43]. Additionally, as the endmember matrix W is the collection of reflectance spectra of the endmembers, it is also expected to be non-negative. When this dictionary needs to be estimated, the resulting problem is a sparse non-negative matrix factorization (NMF) task. When the dictionary is known or estimated beforehand, the resulting optimization problem is the nonnegative sparse coding problem
$$\min_{H}\; \frac{1}{2}\|Y - WH\|_F^2 + \lambda_h \|H\|_1 + \imath_{\mathbb{R}_+^{R \times P}}(H) \tag{15}$$

where the sparsity penalization actually supports the assumption that only a few materials are present in a given pixel.
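Within the PALM scheme, the non-smooth part of (15), i.e., λ_h‖·‖₁ plus the nonnegativity indicator, admits a closed-form proximal operator, a one-sided soft-thresholding. A two-line Python sketch (our illustration):

```python
import numpy as np

def prox_l1_nonneg(H, tau):
    """prox of tau*||.||_1 + indicator of nonnegativity: entries below tau
    are set to zero, the remaining ones are shrunk by tau."""
    return np.maximum(H - tau, 0.0)

H = np.array([[-0.5, 0.2], [0.8, 1.5]])
H_prox = prox_l1_nonneg(H, 0.3)
```

Abundances smaller than the threshold are discarded, which directly encodes the assumption that only a few materials are active in each pixel.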
3.2. Classification
In the considered application, two loss functions associated with the classification problem have been investigated, namely the quadratic loss and the cross-entropy loss. One advantage of these two loss functions is that they can be used in a multi-class classification (i.e., with more than two classes). Moreover, this choice may fulfill the required conditions stated in Section 2.5 to apply PALM since, coupled with an appropriate φ(·) function, both loss costs are smooth and gradient-Lipschitz with respect to each estimated variable.
3.2.1. Quadratic loss
The quadratic loss is the simplest way to perform a classification task and has been extensively used [25,44,45]. It is defined as

$$\mathcal{J}_c(C \mid \hat{C}) = \frac{1}{2}\|CD - \hat{C}D\|_F^2 \tag{16}$$

where Ĉ denotes the estimated attribution matrix. In (16), the P × P matrix D is introduced to weight the contribution of the labeled data with respect to the unlabeled one and to deal with the case of unbalanced classes in the training set. Weights are chosen to be inversely proportional to class frequencies in the input data. The weight matrix is defined as the diagonal matrix D = diag[d_1, ..., d_P] with

$$d_p = \begin{cases} \dfrac{1}{|\mathcal{L}_i|}, & \text{if } p \in \mathcal{L}_i; \\[4pt] \dfrac{1}{|\mathcal{U}|}, & \text{if } p \in \mathcal{U} \end{cases} \tag{17}$$

where L_i denotes the set of indexes of labeled pixels of the ith class (i = 1, ..., C). Thus, considering a linear classifier, the generic classification problem in (7) can be specified for the quadratic loss as

$$\min_{Q, C_{\mathcal{U}}}\; \frac{1}{2}\|CD - QZD\|_F^2 + \lambda_c \mathcal{R}_c(C) + \imath_{\mathcal{S}_C^{|\mathcal{U}|}}(C_{\mathcal{U}}) \tag{18}$$
where no additional constraint nor penalization is applied to the classifier parameters Q. Besides, when samples obey a spatially coherent structure, as is the case when analyzing hyperspectral images, it is often desirable to transfer this structure to the classification map. Such a characteristic can be achieved by considering a spatial regularization R_c(C) applied to the attribution vectors. Following this assumption, this work considers a regularized counterpart of the weighted vectorial total variation (vTV), promoting a spatially piecewise constant behavior of the classification map [46]

$$\|C\|_{\mathrm{vTV}} = \sum_{m,n} \beta_{m,n} \sqrt{\|[\nabla_h C]_{m,n}\|_2^2 + \|[\nabla_v C]_{m,n}\|_2^2 + \epsilon} \tag{19}$$

where (m, n) are the spatial position pixel indexes and [∇_h(·)]_{m,n} and [∇_v(·)]_{m,n} stand for the horizontal and vertical discrete gradient operators evaluated at a given pixel,¹ respectively, i.e.,

$$[\nabla_h C]_{m,n} = c(m+1, n) - c(m, n), \qquad [\nabla_v C]_{m,n} = c(m, n+1) - c(m, n).$$

The weights β_{m,n} can be computed beforehand to adjust the penalization with respect to expected spatial variations of the scene. They can be estimated directly from the image to be analyzed or extracted from a complementary dataset as in [47]. They will be specified during the experiments reported in Section 4. Moreover, the smoothing parameter ε > 0 ensures the gradient-Lipschitz property of the coupling term g(·), as required in Section 2.5.
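The smoothed weighted vTV of (19) can be evaluated with forward differences as in the following Python sketch (our own illustration; the reshaping of C into an (m, n, class) array and the zero-gradient boundary handling are our assumptions):

```python
import numpy as np

def vtv_smoothed(C_img, beta, eps=1e-3):
    """Smoothed weighted vectorial TV of a class-attribution image C_img
    of shape (n_rows, n_cols, n_classes); beta has shape (n_rows, n_cols).
    Forward differences, zero gradient at the image border."""
    dh = np.zeros_like(C_img)
    dv = np.zeros_like(C_img)
    dh[:-1, :, :] = C_img[1:, :, :] - C_img[:-1, :, :]   # [grad_h C]_{m,n}
    dv[:, :-1, :] = C_img[:, 1:, :] - C_img[:, :-1, :]   # [grad_v C]_{m,n}
    mag = np.sqrt((dh ** 2).sum(axis=2) + (dv ** 2).sum(axis=2) + eps)
    return float((beta * mag).sum())
```

On a spatially constant attribution map the penalty reaches its minimum, sqrt(eps) times the sum of the weights, while any class boundary increases it, which is exactly the piecewise-constant prior intended by (19).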
3.2.2. Cross-entropy loss

The quadratic loss has the advantage of being expressed simply, and the associated Lipschitz constants of the partial gradients are trivially obtained. However, this loss function is known to be highly influenced by outliers, which can result in a degraded predictive accuracy [48]. A more sophisticated way to conduct the classification task is to consider a cross-entropy loss

$$\mathcal{J}_c(C \mid \hat{C}) = -\sum_{p \in \mathcal{P}} d_p^2 \sum_{i=1}^{C} c_{i,p} \log\left(\hat{c}_{i,p}\right) \tag{20}$$

combined with a logistic regression, i.e., where the nonlinear mapping (5) is element-wise defined as

$$[\phi(X)]_{i,j} = \frac{1}{1 + \exp(-x_{i,j})} = \mathrm{sigm}(x_{i,j}) \tag{21}$$

with i ∈ {1, ..., C} and j ∈ P. This classifier can actually be interpreted as a one-layer neural network with a sigmoid non-linearity. Cross-entropy loss is indeed a very conventional loss function in the neural network/deep learning community [38]. In the present case, the corresponding optimization problem can be written

$$\min_{Q, C_{\mathcal{U}}}\; -\sum_{p \in \mathcal{P}} d_p^2 \sum_{i=1}^{C} c_{i,p} \log\left(\mathrm{sigm}(q_{i:} z_p)\right) + \lambda_q \mathcal{R}_q(Q) + \lambda_c \|C\|_{\mathrm{vTV}} + \imath_{\mathcal{S}_C^{|\mathcal{U}|}}(C_{\mathcal{U}}) \tag{22}$$

where q_{i:} ∈ ℝ^{1×K} denotes the ith line of the matrix Q. The penalization R_q(Q) is here chosen as R_q(Q) = ½‖Q‖_F², to prevent the loss function from artificially decreasing when ‖q_{i:}‖_2 increases. This regularization has been extensively studied in the neural network literature, where it is referred to as weight decay [38]. In (22), the regularization R_c(C) applied to the attribution matrix is again chosen as a vTV-like penalization (see (19)).
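The weighted cross-entropy objective of (22), without its regularization terms, can be sketched as follows in Python (our illustration; d collects the weights d_p of (17)):

```python
import numpy as np

def sigm(x):
    """Element-wise logistic sigmoid of (21)."""
    return 1.0 / (1.0 + np.exp(-x))

def weighted_cross_entropy(C, Q, Z, d, eps=1e-12):
    """J_c = - sum_p d_p^2 sum_i c_{i,p} log(sigm(q_i: z_p)).
    C: (C x P) attribution matrix, Q: (C x K), Z: (K x P), d: (P,) weights.
    eps guards against log(0) for very confident wrong predictions."""
    C_hat = sigm(Q @ Z)                 # relaxed predicted attributions
    return float(-np.sum(d ** 2 * np.sum(C * np.log(C_hat + eps), axis=0)))

# A confident correct prediction yields a loss close to zero:
Q = np.array([[5.0], [-5.0]])
Z = np.array([[1.0]])
C = np.array([[1.0], [0.0]])
loss = weighted_cross_entropy(C, Q, Z, np.ones(1))   # ~ -log(sigm(5))
```

Unlike the quadratic loss (16), already-well-classified samples contribute almost nothing here, which is what makes the loss less sensitive to confidently separated points and shifts the effort toward uncertain ones.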
3.3. Clustering

For the considered application, the conventional k-means algorithm has been chosen because of its straightforward formulation as an optimization problem. By denoting θ = {B}, with B an R × K matrix collecting the K centroids, the clustering task (9) can be rewritten as the following NMF problem [40]

$$\min_{Z, B}\; \frac{1}{2}\|H - BZ\|_F^2 + \lambda_z \mathcal{R}_z(Z) + \imath_{\mathcal{S}_K^P}(Z) + \imath_{\mathbb{R}_+^{R \times K}}(B) \tag{23}$$

¹ With a slight abuse of notations, c(m, n) refers to the pth column of C where the pth pixel is spatially indexed by (m, n).

where R_z(Z) should promote Z to be composed of orthogonal lines. Combined with the nonnegativity and sum-to-one constraints, it would ensure that z_p is a vector of zeros except for its kth component equal to 1, i.e., meaning that the pth pixel belongs to the kth cluster. However, handling this orthogonality property within the PALM optimization scheme detailed in Section 2.5 is not straightforward, in particular because the proximal operator associated with this penalization cannot be explicitly computed. In this work, we propose to remove this orthogonality constraint since relaxed attribution vectors may be richer feature vectors for the classification task.
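For intuition, the hard-assignment limit of (23), i.e., with binary columns of Z before the relaxation adopted in this paper, is exactly Lloyd's k-means written in matrix form. A compact Python sketch (our illustration, not the relaxed PALM update actually used in the paper):

```python
import numpy as np

def kmeans_nmf(H, K, n_iter=50, seed=0):
    """k-means seen as min ||H - B Z||_F^2 with Z a binary column-stochastic
    attribution matrix (hard assignments).
    H: (R x P) coding matrix, B: (R x K) centroids, Z: (K x P)."""
    rng = np.random.default_rng(seed)
    R, P = H.shape
    B = H[:, rng.choice(P, size=K, replace=False)]       # init centroids on samples
    Z = np.zeros((K, P))
    for _ in range(n_iter):
        # assignment step: z_p = indicator of the nearest centroid
        d2 = ((H[:, None, :] - B[:, :, None]) ** 2).sum(axis=0)   # (K, P)
        Z[:] = 0.0
        Z[d2.argmin(axis=0), np.arange(P)] = 1.0
        # update step: each centroid = mean of its assigned samples
        counts = Z.sum(axis=1)
        nonempty = counts > 0
        B[:, nonempty] = (H @ Z.T)[:, nonempty] / counts[nonempty]
    return B, Z
```

Relaxing Z onto the simplex, as done in the paper, replaces the hard argmin assignment by a proximal gradient step on (23), so that z_p becomes a soft membership vector usable as a classification feature.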
3.4. Multi-objective problem

Based on the quadratic and cross-entropy loss functions considered in the classification task, two distinct global optimization problems are obtained. When considering the quadratic loss of Section 3.2.1, the multi-objective problem (10) writes

$$\begin{aligned} \min_{H, Q, Z, C_{\mathcal{U}}, B}\; & \frac{\lambda_0}{2}\|Y - WH\|_F^2 + \lambda_h \|H\|_1 + \imath_{\mathbb{R}_+^{R \times P}}(H) \\ & + \frac{\lambda_1}{2}\|CD - QZD\|_F^2 + \lambda_c \|C\|_{\mathrm{vTV}} + \imath_{\mathcal{S}_C^{|\mathcal{U}|}}(C_{\mathcal{U}}) \\ & + \frac{\lambda_2}{2}\|H - BZ\|_F^2 + \imath_{\mathcal{S}_K^P}(Z) + \imath_{\mathbb{R}_+^{R \times K}}(B). \end{aligned} \tag{24}$$
Instead, when considering the cross-entropy loss function proposed in Section 3.2.2, the optimization problem (10) is defined as

$$\begin{aligned} \min_{H, Q, Z, C_{\mathcal{U}}, B}\; & \frac{\lambda_0}{2}\|Y - WH\|_F^2 + \lambda_h \|H\|_1 + \imath_{\mathbb{R}_+^{R \times P}}(H) \\ & - \lambda_1 \sum_{p \in \mathcal{P}} d_p^2 \sum_{i=1}^{C} c_{i,p} \log\left(\mathrm{sigm}(q_{i:} z_p)\right) + \frac{\lambda_q}{2}\|Q\|_F^2 + \lambda_c \|C\|_{\mathrm{vTV}} + \imath_{\mathcal{S}_C^{|\mathcal{U}|}}(C_{\mathcal{U}}) \\ & + \frac{\lambda_2}{2}\|H - BZ\|_F^2 + \imath_{\mathcal{S}_K^P}(Z) + \imath_{\mathbb{R}_+^{R \times K}}(B). \end{aligned} \tag{25}$$
Both problems are particular instances of nonnegative matrix co-factorization [27,28]. To summarize, each hyperspectral pixel is first described as a combination of elementary spectra through the representation learning step, aka spectral unmixing. Then, assuming that there exist groups of pixels resulting from the same mixture of materials, a clustering is performed among the abundance vectors. Finally, the attribution vectors to the clusters are used as feature vectors for the classification, supporting the idea that classes are made of a mixture of clusters. For both multi-objective problems (24) and (25), all conditions required for the use of the PALM algorithm described in Section 2.5 are met. Details regarding the two optimization schemes dedicated to these two problems are reported in the Appendix.
3.5. Complexity analysis

Regarding the computational complexity of the proposed Algorithm 1, deriving the gradients shows that it is dominated by matrix product operations. It yields that the algorithm has an overall computational cost in O(NK²P), where N is the number of iterations.
4. Experiments
4.1. Implementation details

Before presenting the experimental results, it is worth clarifying the choices which have been made regarding the practical