HAL Id: hal-02887755
https://hal.archives-ouvertes.fr/hal-02887755
Submitted on 2 Jul 2020
Matrix cofactorization for joint representation learning and supervised classification: application to hyperspectral image analysis
Adrien Lagrange, Mathieu Fauvel, Stéphane May, José M. Bioucas-Dias, Nicolas Dobigeon
To cite this version:
Adrien Lagrange, Mathieu Fauvel, Stéphane May, José M. Bioucas-Dias, Nicolas Dobigeon.
Matrix cofactorization for joint representation learning and supervised classification: application to hyperspectral image analysis. Neurocomputing, Elsevier, 2020, 385, pp. 132-147.
DOI: 10.1016/j.neucom.2019.12.068. HAL: hal-02887755
Matrix cofactorization for joint representation learning and supervised classification – Application to hyperspectral image analysis
Adrien Lagrange a,∗, Mathieu Fauvel b, Stéphane May c, José Bioucas-Dias e, Nicolas Dobigeon a,d

a University of Toulouse, IRIT/INP-ENSEEIHT Toulouse, BP 7122, Toulouse Cedex 7 31071, France
b CESBIO, University of Toulouse, CNES/CNRS/INRA/IRD/UPS, BPI 2801, Toulouse Cedex 9 31401, France
c CNES, DCT/SI/AP, 18 Avenue Edouard Belin, Toulouse 31400, France
d Institut Universitaire de France, France
e Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Lisbon 1049-001, Portugal
Keywords: Image interpretation; Supervised learning; Representation learning; Hyperspectral images; Non-convex optimization; Matrix cofactorization
Abstract

Supervised classification and representation learning are two widely used classes of methods to analyze multivariate images. Although complementary, these methods have been scarcely considered jointly in a hierarchical modeling. In this paper, a method coupling these two approaches is designed using a matrix cofactorization formulation. Each task is modeled as a matrix factorization problem and a term relating both coding matrices is then introduced to drive an appropriate coupling. The link can be interpreted as a clustering operation over the low-dimensional representation vectors. The attribution vectors of the clustering are then used as feature vectors for the classification task, i.e., the coding vectors of the corresponding factorization problem. A proximal gradient descent algorithm, ensuring convergence to a critical point of the objective function, is then derived to solve the resulting non-convex non-smooth optimization problem. An evaluation of the proposed method is finally conducted both on synthetic and real data in the specific context of hyperspectral image interpretation, unifying two standard analysis techniques, namely unmixing and classification.
1. Introduction
Numerous frameworks have been developed to efficiently analyze the increasing amount of remote sensing images [1,2]. Among those methods, supervised classification has received considerable attention, leading to the development of current state-of-the-art classification methods based on advanced statistical tools, such as convolutional neural networks [3–5], kernel methods [6], random forests [7] or Bayesian models [8]. In the context of remote sensing image classification, these methods aim at retrieving the class of each pixel of the image given a specific class nomenclature. Within
✩ Part of this work has been supported by Centre National d'Études Spatiales (CNES), Occitanie Region, EU FP7 through the ERANETMED JC-WATER program (project ANR-15-NMED-0002-02 MapInvPlnt), by the ANR-3IA Artificial and Natural Intelligence Toulouse Institute (ANITI) and by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme under grant agreement No 681839 (project FACTORY).
∗ Corresponding author.
E-mail addresses: adrien.lagrange@enseeiht.fr (A. Lagrange),
mathieu.fauvel@inra.fr (M. Fauvel), stephane.may@cnes.fr (S. May), bioucas@lx.it.pt (J. Bioucas-Dias), nicolas.dobigeon@enseeiht.fr (N. Dobigeon).
a supervised framework, a set of pixels is assumed to be annotated by an expert and subsequently used as examples through a learning process. Thanks to extensive research efforts of the community, classification methods have become very efficient. Nevertheless, they still face some challenging issues, such as the high dimension of the data, often coupled with the lack of training data [9]. Handling multi-modal and/or composite classes with intrinsic intra-variability is also a recurrent issue [10]: for instance, a class referred to as building can gather very dissimilar samples when metallic and tiled roofs are present in a scene. Besides, the resulting classification remains a high-level interpretation of the scene since it only gives a single class to summarize all information in a given pixel.
Hence, more recent works have emerged in order to provide a richer interpretation [11,12]. In particular, representation learning methods assume that the data results from the composition of a reduced number of elementary patterns. More precisely, the observed measurements can be approximated by mixtures of dictionary elements able to simultaneously capture the variability and redundancy in the dataset. Representation learning can be tackled from different perspectives, in particular known as dictionary learning [13], source separation [14], compressive sensing [15], factor analysis [16], matrix factorization [17] or subspace learning [18]. Various models have been proposed to learn a dedicated representation relevant to the field of interest, differing by specific assumptions and/or constraints. Most of them attempt to identify a dictionary and a mixture function by minimizing a reconstruction error measuring the discrepancy between the chosen model and the dataset. For instance, non-negative matrix factorization (NMF) aims at recovering a linear mixture of non-negative elements with non-negative activation coefficients, leading to additive part-based decompositions of the observations [19,20]. Contrary to a classification task, representation learning methods generally have the great advantage of being unsupervised. However, for particular purposes, they can be specialized to learn a representation suited for a particular task, e.g., classification or regression [21]. Thus, representation learning provides a rich yet compact description of the data whereas supervised classification offers a univocal interpretation based on prior knowledge from experts.
The idea of combining the representation learning and classification tasks has already been considered, mostly to use the representation learning method as a dimensionality reduction step prior to the classification [22], where the low-dimensional representation is used as input features. Nonetheless, some works introduce the idea of performing the two tasks simultaneously [23]. For example, the discriminative K-SVD algorithm associates a linear mixture model to a linear classifier [24]. In the end, the method tries to learn a dictionary well-fitted for the classification task, i.e., the learned representation minimizes the reconstruction error but also ensures a good separability of the classes. More intertwined frameworks can also be considered, such as the one proposed in [25] where elements of the dictionary are class-specific. Joint representation learning and classification can be cast as a cofactorization problem. Both tasks are interpreted as individual factorization problems and constraints between the dictionaries and coding matrices associated with the two problems can then be imposed. These cofactorization-based models have proven to be highly efficient in many application fields, e.g., for text mining [26], music source separation [27], or image analysis [28,29].
However, most of the available methods tend to focus on classification results and generally oppose reconstruction accuracy and discriminative abilities of the models instead of designing a unifying hierarchical structure. Capitalizing on recent advances and a first attempt in [30] in a Bayesian setting, this paper proposes a particular cofactorization method, with a dedicated application to multivariate image analysis. The representation learning and classification tasks are related through the coding matrices of the two factorization problems. A clustering is performed on the low-dimensional representation and the clustering attribution vectors are used as coding vectors for the classification. This novel coupling approach produces a coherent and fully-interpretable hierarchical model. To solve the resulting non-convex non-smooth optimization problem, a proximal alternating linearized minimization (PALM) algorithm is derived, yielding guarantees of convergence to a critical point of the objective function [31].
The main contributions reported in this paper can be summarized as follows. A generic framework is proposed to demonstrate that two ubiquitous image analysis methods, namely supervised classification and representation learning, can be unified into a unique joint cofactorization problem. This framework is instantiated for one particular application in the context of hyperspectral image analysis where supervised classification and spectral unmixing are performed jointly. The proposed method offers a comprehensive and meaningful analysis of the image as well as competitive quantitative results for the two considered tasks.
This paper is organized as follows. Section 2 defines the two factorization problems used to perform representation learning and classification and further discusses the joint cofactorization problem. It also details the optimization scheme developed to solve the resulting non-convex minimization problem. To illustrate the generic framework introduced in the previous section, an application to hyperspectral image analysis is conducted in Section 3 through the dual scope of spectral unmixing and classification. Performance of the proposed framework is illustrated thanks to experiments conducted on synthetic and real data in Section 4. Finally, Section 5 concludes the paper and presents some research perspectives to this work.
2. Proposed generic framework
The representation learning and classification tasks are generically defined as matrix factorization problems in Sections 2.1 and 2.2. To derive a unified cofactorization formulation, a third step consists in drawing the link between these two independent problems. In this work, this coupling is ensured by imposing a consistent structure between the two coding matrices corresponding to the low-dimensional representation and the feature matrices, respectively. As detailed in Section 2.3, it is expressed as a clustering task where the parameters describing the attribution to the clusters are the feature vectors, i.e., the coding matrix resulting from the classification task. Particular instances of these three tasks will be detailed in Section 3 for an application to multiband image analysis.
2.1. Representation learning
The fundamental assumption in representation learning is that the P considered L-dimensional samples, gathered in the matrix Y ∈ ℝ^{L×P}, belong to an R-dimensional subspace such that R ≪ L. The aim is then to recover this manifold, where samples can be expressed as combinations of elementary vectors, herein the columns of the matrix W ∈ ℝ^{L×R}, sometimes referred to as dictionary. These samples can be subsequently represented thanks to the so-called coding matrix H ∈ ℝ^{R×P}. Formally, identifying the dictionary and the coding matrices can be generally expressed as the minimization problem

$$\min_{W,H}\; \mathcal{J}_r(Y \mid \psi(W,H)) + \lambda_w \mathcal{R}_w(W) + \imath_{\mathcal{W}}(W) + \lambda_h \mathcal{R}_h(H) + \imath_{\mathcal{H}}(H) \tag{1}$$

where ψ(·) is a mixture function (e.g., a linear or bilinear operator), J_r(·) is an appropriate cost function, for example derived from a β-divergence [32], R_·(·) denote penalizations weighted by the parameters λ_·, and ı_·(·) are the indicator functions defined here on the respective sets W ⊂ ℝ^{L×R} and H ⊂ ℝ^{R×P}, imposing some constraints on the dictionary and coding matrices.
In the case of the linear embedding adopted in this work, the mixture function writes

$$\psi(W,H) = WH. \tag{2}$$

In this context, the problem (1) can be cast as a factor analysis driven by the cost function J_r(·). Depending on the applicative field, typical data-fitting measures include the Itakura-Saito, the Euclidean and the Kullback–Leibler divergences [32]. Assuming a low-rank model (i.e., R ≤ L), specific choices for the sets H and W lead to various standard factor models. For instance, when W is chosen as the Stiefel manifold, the solution of (1) is given by a principal component analysis (PCA) [33]. When W and H impose nonnegativity of the dictionary and coding matrix elements, the problem is known as nonnegative matrix factorization [19,34]. Within a supervised context, the dictionary W can be chosen thanks to end-user expertise or estimated beforehand. Without loss of generality but for the sake of conciseness, the framework described in this paper assumes that this dictionary is known, possibly overcomplete as proposed in the experimental illustration described in Section 4. In this case, as in many applications, it makes sense to look for a sparse representation of the signal of interest to retrieve its most achievable compact representation [21,35]. Following this strategy, we propose to consider an ℓ1-norm sparsity penalization on the coding vectors, leading to the representation learning task defined by
$$\min_{H}\; \mathcal{J}_r(Y \mid WH) + \lambda_h \|H\|_1 + \imath_{\mathcal{H}}(H) \tag{3}$$

where $\|H\|_1 = \sum_{p=1}^{P} \|h_p\|_1$ with h_p denoting the pth column of H.

2.2. Supervised classification
To clearly define the classification task, let us first introduce some key notations. The index subset of samples with an available ground truth is denoted L while the index subset of unlabeled samples is U, such that L ∩ U = ∅ and L ∪ U = P with P ≜ {1, ..., P}. Classifying the unlabeled samples consists in assigning each of them to one of the C classes. This can be reformulated as the estimation of a C × P matrix C whose columns correspond to unknown C-dimensional attribution vectors $c_p = [c_{1,p}, \ldots, c_{C,p}]^T$. Each vector is made of 0s except for c_{i,p} = 1 when the pth sample is assigned to the ith class.
Numerous classification rules have been proposed in the literature [36]. Most of them rely on a K × P matrix Z = [z_1, ..., z_P] of features z_p (p ∈ P) associated with each sample and derived from the raw data. Within a supervised framework, the attribution matrix C_L and feature matrix Z_L of the labeled data are exploited during the learning step, where ·_L denotes the corresponding submatrix whose columns are indexed by L. For a wide range of classifiers, deriving a classification rule can be achieved by solving the optimization problem

$$\min_{Q}\; \mathcal{J}_c(C_{\mathcal{L}} \mid \phi(Q, Z_{\mathcal{L}})) + \lambda_q \mathcal{R}_q(Q) \tag{4}$$

where Q ∈ ℝ^{C×K} is the set of classifier parameters to be inferred, R_q(·) refers to regularizations imposed on Q and J_c is a cost function measuring the quality of the classification, such as the quadratic loss [24] or cross-entropy [37]. Moreover, in (4), φ(Q, ·) defines an element-wise nonlinear mapping between the features and the class attribution vectors, parametrized by Q, e.g., derived from a sigmoid or a softmax operator. In this work, the classifier is assumed to be linear, which leads to a vector-wise post-nonlinear mapping
$$\phi(Q, Z_{\mathcal{L}}) = \phi(Q Z_{\mathcal{L}}) \tag{5}$$

with

$$\phi(X) = [\phi(x_1), \ldots, \phi(x_P)]. \tag{6}$$

Once the classifier parameters have been estimated by solving (4), the unknown attribution vectors C_U can be subsequently inferred during the testing step by applying the nonlinear transformation to the corresponding predicted features Ẑ_U associated with the unlabeled samples. The obtained outputs are relaxed attribution vectors ĉ_p = φ(Q ẑ_p) (p ∈ U) and the most probable predicted sample class can be computed as argmax_i ĉ_{i,p}.
Under the proposed formulation of the classification task, the learning and testing steps can be conducted simultaneously, a framework usually referred to as semi-supervised, with the beneficial opportunity to introduce additional regularizations and/or constraints on the submatrix of unknown attribution vectors C_U. The initial problem (4) is thus extended to the following one

$$\min_{Q, C_{\mathcal{U}}}\; \mathcal{J}_c(C \mid \phi(QZ)) + \lambda_q \mathcal{R}_q(Q) + \lambda_c \mathcal{R}_c(C) + \imath_{\mathcal{C}}(C_{\mathcal{U}}) \tag{7}$$

where C = [C_L C_U] and C ⊂ ℝ^{C×|U|} denotes a feasible set for the attribution matrix C_U. As discussed above, the cost function J_c(C | Ĉ) measures the actual classification loss, i.e., the discrepancy between the attribution vectors C of the training set and the attribution vectors Ĉ predicted by the classifier. Two particular cases fitting this generic model are provided in Sections 3.2.1 and 3.2.2. The attribution vectors are defined as Ĉ = φ(QẐ) where φ(·) is a nonlinear function applied to the output of a linear classifier. The regularization term R_q(Q) penalizes the parameters of the classifier. A typical example is a quadratic penalization which aims at avoiding overfitting, as conventionally done when optimizing neural networks and generally referred to as weight decay [38]. Finally, the regularization term R_c(C) penalizes the attribution matrix. Typical examples include spatial regularizations such as total variation (TV) when dealing with image classification. The indicator function ı_C(C_U) enforces sum-to-one and non-negativity constraints such that each attribution vector c_p (p ∈ U) can then be interpreted as a vector of probabilities of belonging to each class. In such a case, the feasible set is chosen as $\mathcal{C} = \mathcal{S}_C^{|\mathcal{U}|}$ where

$$\mathcal{S}_C \triangleq \left\{ u \in \mathbb{R}^C \;\middle|\; \forall k,\; u_k \geq 0 \text{ and } \sum_{k=1}^{C} u_k = 1 \right\}. \tag{8}$$

Fig. 1. Structure of the cofactorization model. Variables in blue stand for observations or available external data. Variables in olive green are linked through the clustering task, here formulated as an optimization problem. The variable in a dotted box is assumed to be known or estimated beforehand in this work.
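In a proximal scheme such as the one derived later in Section 2.5, the proximal operator of the indicator of (8) is the Euclidean projection onto the probability simplex, applied column-wise to C_U. As an illustration (not part of the original paper; the function name is ours), here is a minimal Python sketch of the classical sort-and-threshold projection algorithm:

```python
import numpy as np

def project_simplex(u):
    """Euclidean projection of u onto {x : x >= 0, sum(x) = 1},
    via the classical sort-and-threshold algorithm."""
    u = np.asarray(u, dtype=float)
    n = u.size
    s = np.sort(u)[::-1]                                  # decreasing order
    css = np.cumsum(s)
    # largest 0-indexed rho such that s[rho] * (rho + 1) > css[rho] - 1
    rho = np.nonzero(s * np.arange(1, n + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(u - theta, 0.0)

# project each attribution vector c_p (one column of C_U) independently
C_U = np.array([[2.0, -1.0], [0.0, 0.5], [1.0, 0.2]])
C_proj = np.apply_along_axis(project_simplex, 0, C_U)
```

After projection, every column is nonnegative and sums to one, i.e., it is a valid relaxed attribution vector.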
2.3. Coupling representation learning and classification
Up to this point, the representation learning and supervised classification tasks have been formulated as two independent matrix factorization problems given by (3) and (7), respectively. This work proposes to join them by drawing an implicit relation between two factors involved in these two problems. Inspired by hierarchical Bayesian models such as the one proposed in [30], both problems are coupled through the activation matrices H and Z, as illustrated in Fig. 1. More precisely, the coding vectors in H are clustered such that the feature vectors in Z are defined as the attribution vectors to the K clusters. Ideally, the clustering attribution vectors z_p are filled with zeros except for z_{k,p} = 1 when h_p is associated with the kth cluster. Thus, the vectors z_p (p ∈ P) are assumed to be defined on the K-dimensional probability simplex S_K, similarly defined as (8) and ensuring non-negativity and sum-to-one constraints. Many clustering algorithms can be expressed as optimization problems, such as the well-known k-means algorithm and many of its variants [39,40]. Adopting this formulation, and denoting θ the set of parameters of the clustering algorithm, the clustering task can be defined as the minimization problem

$$\min_{Z,\theta}\; \mathcal{J}_g(H, Z; \theta) + \lambda_z \mathcal{R}_z(Z) + \lambda_\theta \mathcal{R}_\theta(\theta) + \imath_{\mathcal{S}_K^P}(Z) + \imath_{\Theta}(\theta) \tag{9}$$

where Θ defines a feasible set for the parameters θ.

Table 1. Overview of notations.

P : Number of observations
L : Dimension of observations
C : Number of classes
K : Number of features/clusters
P = {1, ..., P} : Index set of observations
L ⊂ P : Index set of labeled samples
L_i ⊂ L : Index set of labeled samples in the ith class
U = P \ L : Index set of unlabeled samples
Y ∈ ℝ^{L×P} : Observations
W ∈ ℝ^{L×R} : Dictionary
H ∈ ℝ^{R×P} : Coding matrix
Q ∈ ℝ^{C×K} : Classifier parameters
C_L ∈ ℝ^{C×|L|} : Attribution matrix of labeled data
C_U ∈ ℝ^{C×|U|} : Attribution matrix of unlabeled data
C = [C_L C_U] : Class attribution matrix
Z ∈ ℝ^{K×P} : Cluster attribution matrix
θ ∈ Θ : Clustering parameters
It is worth noting that introducing this coupling term is one of the major novelties of the proposed approach. When considering task-driven dictionary learning methods, it is usual to intertwine the representation learning and the classification tasks by directly imposing H = Z [24,41]. Since these methods generally rely on a linear classifier, one major drawback of such approaches is their inability to deal with non-separable classes in the low-dimensional representation space. In such cases, the underlying model cannot be discriminative and descriptive simultaneously and the resulting tasks become adversarial. When considering the proposed coupling term, the cluster attribution vectors z_p offer the possibility of linearly separating any group of clusters from the others. As a consequence, the model benefits from more flexibility, with both discriminative and descriptive abilities in a more general sense.
2.4. Global cofactorization problem
Unifying the representation learning task (3) and the classification task (7) through the clustering task (9) leads to the following joint cofactorization problem

$$\begin{aligned} \min_{H, Q, C_{\mathcal{U}}, Z, \theta}\; & \lambda_0 \mathcal{J}_r(Y \mid WH) + \lambda_h \|H\|_1 \\ & + \lambda_1 \mathcal{J}_c(C \mid \phi(QZ)) + \lambda_q \mathcal{R}_q(Q) + \lambda_c \mathcal{R}_c(C) \\ & + \lambda_2 \mathcal{J}_g(H, Z; \theta) + \lambda_z \mathcal{R}_z(Z) + \lambda_\theta \mathcal{R}_\theta(\theta) \\ & + \imath_{\mathcal{H}}(H) + \imath_{\mathcal{S}_C^{|\mathcal{U}|}}(C_{\mathcal{U}}) + \imath_{\mathcal{S}_K^P}(Z) + \imath_{\Theta}(\theta) \end{aligned} \tag{10}$$

where λ0, λ1 and λ2 control the respective contributions of each task data-fitting term. All notations and parameter dimensions are summarized in Table 1. A generic algorithmic scheme solving the problem (10) is proposed in the next section.
2.5. Optimization scheme
The minimization problem defined by (10) is not globally convex. To reach a local minimizer, we propose to resort to the proximal alternating linearized minimization (PALM) algorithm introduced in [31]. This algorithm is based on proximal descent steps, which allows non-smooth terms to be handled. Moreover, it is guaranteed to converge to a critical point of the objective function even in the case of non-convex problems. This means that, if the initialization is good enough, it is expected to likely converge to a solution close to the global optimum. To implement PALM, the problem (10) is rewritten in the form of an unconstrained problem expressed as the sum of a smooth coupling term g(·) and separable non-smooth terms f_j(·) (j ∈ {0, ..., 3}) as follows

$$\min_{H, \theta, Z, Q, C_{\mathcal{U}}}\; f_0(H) + f_1(\theta) + f_2(Z) + f_3(C_{\mathcal{U}}) + g(H, \theta, Z, C_{\mathcal{U}}, Q) \tag{11}$$

where
$$f_0(H) = \imath_{\mathcal{H}}(H) + \lambda_h \|H\|_1, \qquad f_1(\theta) = \imath_{\Theta}(\theta),$$
$$f_2(Z) = \imath_{\mathcal{S}_K^P}(Z), \qquad f_3(C_{\mathcal{U}}) = \imath_{\mathcal{S}_C^{|\mathcal{U}|}}(C_{\mathcal{U}})$$
and the coupling function is

$$g(H, \theta, Z, C_{\mathcal{U}}, Q) = \lambda_0 \mathcal{J}_r(Y \mid WH) + \lambda_1 \mathcal{J}_c(C \mid \phi(QZ)) + \lambda_q \mathcal{R}_q(Q) + \lambda_c \mathcal{R}_c(C) + \lambda_2 \mathcal{J}_g(H, Z; \theta) + \lambda_z \mathcal{R}_z(Z) + \lambda_\theta \mathcal{R}_\theta(\theta). \tag{12}$$

To ensure the stated guarantees of PALM, all f_j(·) have to be proper, lower semi-continuous functions f_j : ℝ^{n_j} → (−∞, +∞], which ensures in particular that the associated proximal operators are well-defined. Additionally, sufficient conditions on the coupling function are that g(·) is a C² function (i.e., with continuous first and second derivatives) and that its partial gradients are globally Lipschitz. For example, the partial gradient ∇_H g(H, θ, Z, C_U, Q) should be globally Lipschitz for any fixed θ, Z, C_U, Q, that is

$$\|\nabla_H g(H_1, \theta, Z, C_{\mathcal{U}}, Q) - \nabla_H g(H_2, \theta, Z, C_{\mathcal{U}}, Q)\| \leq L_H(\theta, Z, C_{\mathcal{U}}, Q)\, \|H_1 - H_2\|, \quad \forall H_1, H_2 \in \mathbb{R}^{R \times P} \tag{13}$$

where L_H(θ, Z, C_U, Q), simply denoted L_H hereafter, is the Lipschitz constant. For the sake of conciseness, we refer to [31] for further details.
The main idea of the algorithm is then to update each variable of the problem alternately using a proximal gradient descent step. The overall scheme is summarized in Algorithm 1.

Algorithm 1: PALM.
1: Initialize variables H⁰, θ⁰, Z⁰, C_U⁰ and Q⁰;
2: Set α > 1;
3: while stopping criterion not reached do
4:   H^{k+1} ∈ prox_{f_0}^{αL_H}( H^k − (1/(αL_H)) ∇_H g(H^k, θ^k, Z^k, C_U^k, Q^k) );
5:   θ^{k+1} ∈ prox_{f_1}^{αL_θ}( θ^k − (1/(αL_θ)) ∇_θ g(H^{k+1}, θ^k, Z^k, C_U^k, Q^k) );
6:   Z^{k+1} ∈ prox_{f_2}^{αL_Z}( Z^k − (1/(αL_Z)) ∇_Z g(H^{k+1}, θ^{k+1}, Z^k, C_U^k, Q^k) );
7:   Q^{k+1} = Q^k − (1/(αL_Q)) ∇_Q g(H^{k+1}, θ^{k+1}, Z^{k+1}, C_U^k, Q^k)  (no non-smooth term is associated with Q);
8:   C_U^{k+1} ∈ prox_{f_3}^{αL_{C_U}}( C_U^k − (1/(αL_{C_U})) ∇_{C_U} g(H^{k+1}, θ^{k+1}, Z^{k+1}, C_U^k, Q^{k+1}) );
9: end while
10: return H^end, θ^end, Z^end, Q^end, C_U^end

For a practical implementation, one needs to compute the partial gradients of g(·) explicitly, along with their Lipschitz constants, to perform a gradient descent step, followed by the proximal mapping associated with the non-smooth terms f_j(·). The objective function is then monitored at each iteration and the algorithm is stopped when convergence is reached. Note that, when a specific penalization R_·(·) is non-smooth or non-gradient-Lipschitz, it is possible to move it into the corresponding independent term f_j(·) to ensure the required property of the coupling function g(·). This is for instance the case for the sparse penalization used over H, which has been moved into f_0(·). Nonetheless, as mentioned above, the proximal operator associated with each f_j(·) is needed. Thus, even when the function consists of several terms, a closed-form expression of this operator should be known. Alternatively, one should be able to compose the proximal operators associated with each term of f_j(·) [42].
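To make the update scheme concrete, the following Python sketch (our illustration, not code from the paper) implements the generic PALM iteration for a list of blocks, each equipped with a partial gradient of g, a Lipschitz constant and a proximal operator. It is demonstrated on the single-block toy problem min_H ½‖Y − WH‖²_F + λ‖H‖₁ s.t. H ≥ 0, for which PALM reduces to a nonnegative ISTA:

```python
import numpy as np

def palm(blocks, grads, lips, proxs, alpha=1.1, n_iter=300):
    """Generic PALM skeleton: for each block j, take a gradient step on the
    smooth coupling g with step 1/(alpha * L_j), then apply the proximal
    operator of the non-smooth term f_j.
    grads[j](x) -> partial gradient of g w.r.t. block j at x,
    lips[j](x)  -> Lipschitz constant of that partial gradient,
    proxs[j](v, t) -> prox of t * f_j evaluated at v."""
    x = [b.copy() for b in blocks]
    for _ in range(n_iter):
        for j in range(len(x)):
            L = lips[j](x)
            t = 1.0 / (alpha * L)
            x[j] = proxs[j](x[j] - t * grads[j](x), t)
    return x

# Toy single-block instance: nonnegative sparse coding of Y in dictionary W.
rng = np.random.default_rng(0)
Y, W, lam = rng.random((20, 40)), rng.random((20, 5)), 0.1
grads = [lambda x: W.T @ (W @ x[0] - Y)]             # gradient of 0.5*||Y-WH||^2
lips = [lambda x: np.linalg.norm(W, 2) ** 2]         # its Lipschitz constant
proxs = [lambda v, t: np.maximum(v - lam * t, 0.0)]  # prox of lam*||.||_1 + i_{>=0}
(H,) = palm([np.zeros((5, 40))], grads, lips, proxs)
```

The full model of (10) would plug in the five blocks H, θ, Z, Q, C_U with their respective gradients and proximal operators.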
Fig. 2. Spectral unmixing concept (source US Navy NEMO).
3. Application: hyperspectral image analysis
A general framework has been introduced in the previous section. As an illustration, a particular instance of this generic framework is now considered, where explicit representation learning, classification and clustering tasks are introduced. The specific case of hyperspectral image analysis is considered for this use case example.
Contrary to conventional color imaging, which only captures the reflectance measure for three wavelengths (red, blue, green), hyperspectral imaging makes it possible to measure the reflectance of the observed scene for several hundreds of wavelengths from the visible to the invisible domain. Each pixel of the image can thus be represented as a vector of reflectances, called a spectrum, which characterizes the observed material.
One drawback of hyperspectral images is usually a weaker spatial resolution due to sensor limitations. The direct consequence of this poor spatial resolution is the presence of mixed pixels, i.e., pixels corresponding to areas containing several materials. Observed spectra are in this case the result of a specific mixture of the elementary spectra, called endmembers, associated with the individual materials present in the pixel. The problem of retrieving the proportions of each material in each pixel is referred to as spectral unmixing [11]. This problem can be seen as a specific case of representation learning where the dictionary is composed of the set of endmember spectra and the coding matrix is the so-called abundance matrix containing the proportion of each material in each pixel.
Spectral unmixing is introduced as a representation learning task in Section 3.1. The specific classifier used for this application is then explained in Section 3.2 and finally Section 3.3 presents the clustering adopted to relate the abundance matrix and the classification feature matrix.
3.1. Spectral unmixing
As explained, each pixel of a hyperspectral image is characterized by a reflectance spectrum that physical theory approximates as a combination of endmembers, each corresponding to a specific material, as illustrated in Fig. 2. Formally, in this applicative scenario, the L-dimensional sample y_p denotes the spectrum of the pth pixel of the hyperspectral image (p ∈ P). Each observation vector y_p can be expressed as a function of the endmember matrix W (containing the R elementary spectra) and the abundance vector h_p ∈ ℝ^R with R ≪ L.

In the case of the most commonly adopted linear mixture model, each observation y_p is assumed to be a linear combination of the endmember spectra w_r (r = 1, ..., R) corrupted by some noise, underlying the linear embedding (2). Assuming a quadratic data-fitting term, the cost function associated with the representation learning task in (1) is written

$$\mathcal{J}_r(Y \mid WH) = \frac{1}{2}\|Y - WH\|_F^2. \tag{14}$$

The abundance vector h_p is usually interpreted as a vector of proportions describing the proportion of each elementary component in the pixel. Thus, to derive an additive composition of the observed pixels, a nonnegativity constraint is considered for each element of the abundance matrix H, i.e., H = ℝ_+^{R×P}. In this work, no sum-to-one constraint is considered since it has been argued that leaving this constraint out offers a better adaptation to possible changes of illumination in the scene [43]. Additionally, as the endmember matrix W is the collection of reflectance spectra of the endmembers, it is also expected to be non-negative. When this dictionary needs to be estimated, the resulting problem is a sparse non-negative matrix factorization (NMF) task. When the dictionary is known or estimated beforehand, the resulting optimization problem is the nonnegative sparse coding problem
$$\min_{H}\; \frac{1}{2}\|Y - WH\|_F^2 + \lambda_h \|H\|_1 + \imath_{\mathbb{R}_+^{R \times P}}(H) \tag{15}$$

where the sparsity penalization actually supports the assumption that only a few materials are present in a given pixel.
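Within the PALM scheme, the non-smooth part of (15), i.e., λ_h‖·‖₁ plus the nonnegativity indicator, admits a closed-form proximal operator, a one-sided soft-thresholding. A two-line Python sketch (our illustration):

```python
import numpy as np

def prox_l1_nonneg(H, tau):
    """prox of tau*||.||_1 + indicator of nonnegativity: entries below tau
    are set to zero, the remaining ones are shrunk by tau."""
    return np.maximum(H - tau, 0.0)

H = np.array([[-0.5, 0.2], [0.8, 1.5]])
H_prox = prox_l1_nonneg(H, 0.3)
```

Abundances smaller than the threshold are discarded, which directly encodes the assumption that only a few materials are active in each pixel.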
3.2. Classification
In the considered application, two loss functions associated with the classification problem have been investigated, namely the quadratic loss and the cross-entropy loss. One advantage of these two loss functions is that they can be used in a multi-class classification (i.e., with more than two classes). Moreover, this choice may fulfill the required conditions stated in Section 2.5 to apply PALM since, coupled with an appropriate φ(·) function, both loss costs are smooth and gradient-Lipschitz with respect to each estimated variable.
3.2.1. Quadratic loss
The quadratic loss is the simplest way to perform a classification task and has been extensively used [25,44,45]. It is defined as

$$\mathcal{J}_c(C \mid \hat{C}) = \frac{1}{2}\|CD - \hat{C}D\|_F^2 \tag{16}$$

where Ĉ denotes the estimated attribution matrix. In (16), the P × P matrix D is introduced to weight the contribution of the labeled data with respect to the unlabeled one and to deal with the case of unbalanced classes in the training set. Weights are chosen to be inversely proportional to class frequencies in the input data. The weight matrix is defined as the diagonal matrix D = diag[d_1, ..., d_P] with

$$d_p = \begin{cases} \dfrac{1}{|\mathcal{L}_i|}, & \text{if } p \in \mathcal{L}_i; \\[4pt] \dfrac{1}{|\mathcal{U}|}, & \text{if } p \in \mathcal{U} \end{cases} \tag{17}$$

where L_i denotes the set of indexes of labeled pixels of the ith class (i = 1, ..., C). Thus, considering a linear classifier, the generic classification problem in (7) can be specified for the quadratic loss as

$$\min_{Q, C_{\mathcal{U}}}\; \frac{1}{2}\|CD - QZD\|_F^2 + \lambda_c \mathcal{R}_c(C) + \imath_{\mathcal{S}_C^{|\mathcal{U}|}}(C_{\mathcal{U}}) \tag{18}$$
where no additional constraint nor penalization is applied to the classifier parameters Q. Besides, when samples obey a spatially coherent structure, as is the case when analyzing hyperspectral images, it is often desirable to transfer this structure to the classification map. Such a characteristic can be achieved by considering a spatial regularization R_c(C) applied to the attribution vectors. Following this assumption, this work considers a regularized counterpart of the weighted vectorial total variation (vTV), promoting a spatially piecewise constant behavior of the classification map [46]

$$\|C\|_{\mathrm{vTV}} = \sum_{m,n} \beta_{m,n} \sqrt{\|[\nabla_h C]_{m,n}\|_2^2 + \|[\nabla_v C]_{m,n}\|_2^2 + \epsilon} \tag{19}$$

where (m, n) are the spatial position pixel indexes and [∇_h(·)]_{m,n} and [∇_v(·)]_{m,n} stand for the horizontal and vertical discrete gradient operators evaluated at a given pixel,¹ respectively, i.e.,

$$[\nabla_h C]_{m,n} = c(m+1, n) - c(m, n), \qquad [\nabla_v C]_{m,n} = c(m, n+1) - c(m, n).$$

The weights β_{m,n} can be computed beforehand to adjust the penalization with respect to expected spatial variations of the scene. They can be estimated directly from the image to be analyzed or extracted from a complementary dataset as in [47]. They will be specified during the experiments reported in Section 4. Moreover, the smoothing parameter ε > 0 ensures the gradient-Lipschitz property of the coupling term g(·), as required in Section 2.5.
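The smoothed weighted vTV of (19) can be evaluated with forward differences as in the following Python sketch (our own illustration; the reshaping of C into an (m, n, class) array and the zero-gradient boundary handling are our assumptions):

```python
import numpy as np

def vtv_smoothed(C_img, beta, eps=1e-3):
    """Smoothed weighted vectorial TV of a class-attribution image C_img
    of shape (n_rows, n_cols, n_classes); beta has shape (n_rows, n_cols).
    Forward differences, zero gradient at the image border."""
    dh = np.zeros_like(C_img)
    dv = np.zeros_like(C_img)
    dh[:-1, :, :] = C_img[1:, :, :] - C_img[:-1, :, :]   # [grad_h C]_{m,n}
    dv[:, :-1, :] = C_img[:, 1:, :] - C_img[:, :-1, :]   # [grad_v C]_{m,n}
    mag = np.sqrt((dh ** 2).sum(axis=2) + (dv ** 2).sum(axis=2) + eps)
    return float((beta * mag).sum())
```

On a spatially constant attribution map the penalty reaches its minimum, sqrt(eps) times the sum of the weights, while any class boundary increases it, which is exactly the piecewise-constant prior intended by (19).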
3.2.2. Cross-entropy loss

The quadratic loss has the advantage of being expressed simply, and the associated Lipschitz constants of the partial gradients are trivially obtained. However, this loss function is known to be highly influenced by outliers, which can result in a degraded predictive accuracy [48]. A more sophisticated way to conduct the classification task is to consider a cross-entropy loss

$$\mathcal{J}_c(C \mid \hat{C}) = -\sum_{p \in \mathcal{P}} d_p^2 \sum_{i=1}^{C} c_{i,p} \log\left(\hat{c}_{i,p}\right) \tag{20}$$

combined with a logistic regression, i.e., where the nonlinear mapping (5) is element-wise defined as

$$[\phi(X)]_{i,j} = \frac{1}{1 + \exp(-x_{i,j})} = \mathrm{sigm}(x_{i,j}) \tag{21}$$

with i ∈ {1, ..., C} and j ∈ P. This classifier can actually be interpreted as a one-layer neural network with a sigmoid non-linearity. Cross-entropy loss is indeed a very conventional loss function in the neural network/deep learning community [38]. In the present case, the corresponding optimization problem can be written

$$\min_{Q, C_{\mathcal{U}}}\; -\sum_{p \in \mathcal{P}} d_p^2 \sum_{i=1}^{C} c_{i,p} \log\left(\mathrm{sigm}(q_{i:} z_p)\right) + \lambda_q \mathcal{R}_q(Q) + \lambda_c \|C\|_{\mathrm{vTV}} + \imath_{\mathcal{S}_C^{|\mathcal{U}|}}(C_{\mathcal{U}}) \tag{22}$$

where q_{i:} ∈ ℝ^{1×K} denotes the ith line of the matrix Q. The penalization R_q(Q) is here chosen as R_q(Q) = ½‖Q‖_F², to prevent the loss function from artificially decreasing when ‖q_{i:}‖_2 increases. This regularization has been extensively studied in the neural network literature, where it is referred to as weight decay [38]. In (22), the regularization R_c(C) applied to the attribution matrix is again chosen as a vTV-like penalization (see (19)).
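The weighted cross-entropy objective of (22), without its regularization terms, can be sketched as follows in Python (our illustration; d collects the weights d_p of (17)):

```python
import numpy as np

def sigm(x):
    """Element-wise logistic sigmoid of (21)."""
    return 1.0 / (1.0 + np.exp(-x))

def weighted_cross_entropy(C, Q, Z, d, eps=1e-12):
    """J_c = - sum_p d_p^2 sum_i c_{i,p} log(sigm(q_i: z_p)).
    C: (C x P) attribution matrix, Q: (C x K), Z: (K x P), d: (P,) weights.
    eps guards against log(0) for very confident wrong predictions."""
    C_hat = sigm(Q @ Z)                 # relaxed predicted attributions
    return float(-np.sum(d ** 2 * np.sum(C * np.log(C_hat + eps), axis=0)))

# A confident correct prediction yields a loss close to zero:
Q = np.array([[5.0], [-5.0]])
Z = np.array([[1.0]])
C = np.array([[1.0], [0.0]])
loss = weighted_cross_entropy(C, Q, Z, np.ones(1))   # ~ -log(sigm(5))
```

Unlike the quadratic loss (16), already-well-classified samples contribute almost nothing here, which is what makes the loss less sensitive to confidently separated points and shifts the effort toward uncertain ones.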
3.3. Clustering

For the considered application, the conventional k-means algorithm has been chosen because of its straightforward formulation as an optimization problem. By denoting θ = {B}, with B an R × K matrix collecting the K centroids, the clustering task (9) can be rewritten as the following NMF problem [40]

$$\min_{Z, B}\; \frac{1}{2}\|H - BZ\|_F^2 + \lambda_z \mathcal{R}_z(Z) + \imath_{\mathcal{S}_K^P}(Z) + \imath_{\mathbb{R}_+^{R \times K}}(B) \tag{23}$$

¹ With a slight abuse of notations, c(m, n) refers to the pth column of C where the pth pixel is spatially indexed by (m, n).

where R_z(Z) should promote Z to be composed of orthogonal lines. Combined with the nonnegativity and sum-to-one constraints, it would ensure that z_p is a vector of zeros except for its kth component equal to 1, i.e., meaning that the pth pixel belongs to the kth cluster. However, handling this orthogonality property within the PALM optimization scheme detailed in Section 2.5 is not straightforward, in particular because the proximal operator associated with this penalization cannot be explicitly computed. In this work, we propose to remove this orthogonality constraint since relaxed attribution vectors may be richer feature vectors for the classification task.
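For intuition, the hard-assignment limit of (23), i.e., with binary columns of Z before the relaxation adopted in this paper, is exactly Lloyd's k-means written in matrix form. A compact Python sketch (our illustration, not the relaxed PALM update actually used in the paper):

```python
import numpy as np

def kmeans_nmf(H, K, n_iter=50, seed=0):
    """k-means seen as min ||H - B Z||_F^2 with Z a binary column-stochastic
    attribution matrix (hard assignments).
    H: (R x P) coding matrix, B: (R x K) centroids, Z: (K x P)."""
    rng = np.random.default_rng(seed)
    R, P = H.shape
    B = H[:, rng.choice(P, size=K, replace=False)]       # init centroids on samples
    Z = np.zeros((K, P))
    for _ in range(n_iter):
        # assignment step: z_p = indicator of the nearest centroid
        d2 = ((H[:, None, :] - B[:, :, None]) ** 2).sum(axis=0)   # (K, P)
        Z[:] = 0.0
        Z[d2.argmin(axis=0), np.arange(P)] = 1.0
        # update step: each centroid = mean of its assigned samples
        counts = Z.sum(axis=1)
        nonempty = counts > 0
        B[:, nonempty] = (H @ Z.T)[:, nonempty] / counts[nonempty]
    return B, Z
```

Relaxing Z onto the simplex, as done in the paper, replaces the hard argmin assignment by a proximal gradient step on (23), so that z_p becomes a soft membership vector usable as a classification feature.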
3.4. Multi-objective problem

Based on the quadratic and cross-entropy loss functions considered in the classification task, two distinct global optimization problems are obtained. When considering the quadratic loss of Section 3.2.1, the multi-objective problem (10) writes

$$\begin{aligned} \min_{H, Q, Z, C_{\mathcal{U}}, B}\; & \frac{\lambda_0}{2}\|Y - WH\|_F^2 + \lambda_h \|H\|_1 + \imath_{\mathbb{R}_+^{R \times P}}(H) \\ & + \frac{\lambda_1}{2}\|CD - QZD\|_F^2 + \lambda_c \|C\|_{\mathrm{vTV}} + \imath_{\mathcal{S}_C^{|\mathcal{U}|}}(C_{\mathcal{U}}) \\ & + \frac{\lambda_2}{2}\|H - BZ\|_F^2 + \imath_{\mathcal{S}_K^P}(Z) + \imath_{\mathbb{R}_+^{R \times K}}(B). \end{aligned} \tag{24}$$
Instead, when considering the cross-entropy loss function proposed in Section 3.2.2, the optimization problem (10) is defined as

$$\begin{aligned} \min_{H, Q, Z, C_{\mathcal{U}}, B}\; & \frac{\lambda_0}{2}\|Y - WH\|_F^2 + \lambda_h \|H\|_1 + \imath_{\mathbb{R}_+^{R \times P}}(H) \\ & - \lambda_1 \sum_{p \in \mathcal{P}} d_p^2 \sum_{i=1}^{C} c_{i,p} \log\left(\mathrm{sigm}(q_{i:} z_p)\right) + \frac{\lambda_q}{2}\|Q\|_F^2 + \lambda_c \|C\|_{\mathrm{vTV}} + \imath_{\mathcal{S}_C^{|\mathcal{U}|}}(C_{\mathcal{U}}) \\ & + \frac{\lambda_2}{2}\|H - BZ\|_F^2 + \imath_{\mathcal{S}_K^P}(Z) + \imath_{\mathbb{R}_+^{R \times K}}(B). \end{aligned} \tag{25}$$
Both problems are particular instances of nonnegative matrix co-factorization [27,28]. To summarize, each hyperspectral pixel is first described as a combination of elementary spectra through the representation learning step, aka spectral unmixing. Then, assuming that there exist groups of pixels resulting from the same mixture of materials, a clustering is performed among the abundance vectors. Finally, the attribution vectors to the clusters are used as feature vectors for the classification, supporting the idea that classes are made of a mixture of clusters. For both multi-objective problems (24) and (25), all conditions required for the use of the PALM algorithm described in Section 2.5 are met. Details regarding the two optimization schemes dedicated to these two problems are reported in the Appendix.
3.5. Complexity analysis

Regarding the computational complexity of the proposed Algorithm 1, deriving the gradients shows that it is dominated by matrix product operations. It yields that the algorithm has an overall computational cost in O(NK²P), where N is the number of iterations.
4. Experiments
4.1. Implementation details

Before presenting the experimental results, it is worth clarifying the choices which have been made regarding the practical