

To cite this version: Adrien Lagrange, Mathieu Fauvel, Stéphane May, José M. Bioucas-Dias, Nicolas Dobigeon. Matrix cofactorization for joint representation learning and supervised classification: application to hyperspectral image analysis. Neurocomputing, Elsevier, 2020, 385, pp. 132-147. doi:10.1016/j.neucom.2019.12.068. hal-02887755 (https://hal.archives-ouvertes.fr/hal-02887755).


Matrix cofactorization for joint representation learning and supervised classification – Application to hyperspectral image analysis


Adrien Lagrange^a, Mathieu Fauvel^b, Stéphane May^c, José Bioucas-Dias^e, Nicolas Dobigeon^{a,d}

a University of Toulouse, IRIT/INP-ENSEEIHT Toulouse, BP 7122, Toulouse Cedex 7 31071, France
b CESBIO, University of Toulouse, CNES/CNRS/INRA/IRD/UPS, BPI 2801, Toulouse Cedex 9 31401, France
c CNES, DCT/SI/AP, 18 Avenue Edouard Belin, Toulouse 31400, France
d Institut Universitaire de France, France
e Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Lisbon 1049-001, Portugal

Keywords: Image interpretation, Supervised learning, Representation learning, Hyperspectral images, Non-convex optimization, Matrix cofactorization

Abstract

Supervised classification and representation learning are two widely used classes of methods to analyze multivariate images. Although complementary, these methods have been scarcely considered jointly in a hierarchical modeling. In this paper, a method coupling these two approaches is designed using a matrix cofactorization formulation. Each task is modeled as a matrix factorization problem and a term relating both coding matrices is then introduced to drive an appropriate coupling. The link can be interpreted as a clustering operation over the low-dimensional representation vectors. The attribution vectors of the clustering are then used as feature vectors for the classification task, i.e., the coding vectors of the corresponding factorization problem. A proximal gradient descent algorithm, ensuring convergence to a critical point of the objective function, is then derived to solve the resulting non-convex non-smooth optimization problem. An evaluation of the proposed method is finally conducted both on synthetic and real data in the specific context of hyperspectral image interpretation, unifying two standard analysis techniques, namely unmixing and classification.

1. Introduction

Numerous frameworks have been developed to efficiently analyze the increasing amount of remote sensing images [1,2]. Among those methods, supervised classification has received considerable attention leading to the development of current state-of-the-art classification methods based on advanced statistical tools, such as convolutional neural networks [3-5], kernel methods [6], random forests [7] or Bayesian models [8]. In the context of remote sensing image classification, these methods aim at retrieving the class of each pixel of the image given a specific class nomenclature.

✩ Part of this work has been supported by Centre National d'Études Spatiales (CNES), Occitanie Region, EU FP7 through the ERANETMED JC-WATER program (project ANR-15-NMED-0002-02 MapInvPlnt), by the ANR-3IA Artificial and Natural Intelligence Toulouse Institute (ANITI) and by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme under grant agreement No 681839 (project FACTORY).

Corresponding author.

E-mail addresses: adrien.lagrange@enseeiht.fr (A. Lagrange), mathieu.fauvel@inra.fr (M. Fauvel), stephane.may@cnes.fr (S. May), bioucas@lx.it.pt (J. Bioucas-Dias), nicolas.dobigeon@enseeiht.fr (N. Dobigeon).

Within a supervised framework, a set of pixels is assumed to be annotated by an expert and subsequently used as examples through a learning process. Thanks to extensive research efforts of the community, classification methods have become very efficient. Nevertheless, they still face some challenging issues, such as the high dimension of the data, often coupled with the lack of training data [9]. Handling multi-modal and/or composite classes with intrinsic intra-variability is also a recurrent issue [10]: for instance, a class referred to as building can gather very dissimilar samples when metallic and tiled roofs are present in a scene. Besides, the resulting classification remains a high-level interpretation of the scene since it only gives a single class to summarize all information in a given pixel.

Hence, more recent works have emerged in order to provide a richer interpretation [11,12]. In particular, representation learning methods assume that the data results from the composition of a reduced number of elementary patterns. More precisely, the observed measurements can be approximated by mixtures of dictionary elements able to simultaneously capture the variability and redundancy in the dataset. Representation learning can be tackled from different perspectives, in particular known as dictionary learning [13], source separation [14], compressive sensing [15], factor analysis [16], matrix factorization [17] or subspace learning [18]. Various models have been proposed to learn a dedicated representation relevant to the field of interest, differing by specific assumptions and/or constraints. Most of them attempt to identify a dictionary and a mixture function by minimizing a reconstruction error measuring the discrepancy between the chosen model and the dataset. For instance, non-negative matrix factorization (NMF) aims at recovering a linear mixture of non-negative elements with non-negative activation coefficients leading to additive part-based decompositions of the observations [19,20]. Contrary to a classification task, representation learning methods have generally the great advantage of being unsupervised. However, for particular purposes, they can be specialized to learn a representation suited for a particular task, e.g., classification or regression [21]. Thus, representation learning provides a rich yet compact description of the data whereas supervised classification offers a univocal interpretation based on prior knowledge from experts.

The idea of combining the representation learning and classification tasks has already been considered, mostly to use the representation learning method as a dimensionality reduction step prior to the classification [22], where the low-dimensional representation is used as input features. Nonetheless, some works introduce the idea of performing the two tasks simultaneously [23]. For example, the discriminative K-SVD algorithm associates a linear mixture model to a linear classifier [24]. In the end, the method tries to learn a dictionary well-fitted for the classification task, i.e., the learned representation minimizes the reconstruction error but also ensures a good separability of the classes. More intertwined frameworks can also be considered, as the one proposed in [25] where elements of the dictionary are class-specific. Joint representation learning and classification can be cast as a cofactorization problem. Both tasks are interpreted as individual factorization problems and constraints between the dictionaries and coding matrices associated with the two problems can then be imposed. These cofactorization-based models have proven to be highly efficient in many application fields, e.g., for text mining [26], music source separation [27], or image analysis [28,29].

However, most of the available methods tend to focus on classification results and generally oppose reconstruction accuracy and discriminative abilities of the models instead of designing a unifying hierarchical structure. Capitalizing on recent advances and a first attempt in [30] in a Bayesian setting, this paper proposes a particular cofactorization method, with a dedicated application to multivariate image analysis. The representation learning and classification tasks are related through the coding matrices of the two factorization problems. A clustering is performed on the low-dimensional representation and the clustering attribution vectors are used as coding vectors for the classification. This novel coupling approach produces a coherent and fully-interpretable hierarchical model. To solve the resulting non-convex non-smooth optimization problem, a proximal alternating linearized minimization (PALM) algorithm is derived, yielding guarantees of convergence to a critical point of the objective function [31].

The main contributions reported in this paper can be summarized as follows. A generic framework is proposed to demonstrate that two ubiquitous image analysis methods, namely supervised classification and representation learning, can be unified into a unique joint cofactorization problem. This framework is instantiated for one particular application in the context of hyperspectral image analysis where supervised classification and spectral unmixing are performed jointly. The proposed method offers a comprehensive and meaningful analysis of the image as well as competitive quantitative results for the two considered tasks.

This paper is organized as follows. Section 2 defines the two factorization problems used to perform representation learning and classification and further discusses the joint cofactorization problem. It also details the optimization scheme developed to solve the resulting non-convex minimization problem. To illustrate the generic framework introduced in the previous section, an application to hyperspectral image analysis is conducted in Section 3 through the dual scope of spectral unmixing and classification. Performance of the proposed framework is illustrated thanks to experiments conducted on synthetic and real data in Section 4. Finally, Section 5 concludes the paper and presents some research perspectives to this work.

2. Proposed generic framework

The representation learning and classification tasks are generically defined as matrix factorization problems in Sections 2.1 and 2.2. To derive a unified cofactorization formulation, a third step consists in drawing the link between these two independent problems. In this work, this coupling is ensured by imposing a consistent structure between the two coding matrices corresponding to the low-dimensional representation and the feature matrices, respectively. As detailed in Section 2.3, it is expressed as a clustering task where the parameters describing the attribution to the clusters are the feature vectors, i.e., the coding matrix resulting from the classification task. Particular instances of these three tasks will be detailed in Section 3 for an application to multiband image analysis.

2.1. Representation learning

The fundamental assumption in representation learning is that the P considered L-dimensional samples, gathered in the matrix Y ∈ R^{L×P}, belong to an R-dimensional subspace such that R ≪ L. The aim is then to recover this manifold, where samples can be expressed as combinations of elementary vectors, herein the columns of the matrix W ∈ R^{L×R}, sometimes referred to as a dictionary. These samples can be subsequently represented thanks to the so-called coding matrix H ∈ R^{R×P}. Formally, identifying the dictionary and the coding matrices can be generally expressed as the minimization problem

min_{W,H}  J_r(Y | ψ(W,H)) + λ_w R_w(W) + ı_W(W) + λ_h R_h(H) + ı_H(H)    (1)

where ψ(·) is a mixture function (e.g., a linear or bilinear operator), J_r(·) is an appropriate cost function, for example derived from a β-divergence [32], R_·(·) denote penalizations weighted by the parameters λ_·, and ı_·(·) are the indicator functions defined here on the respective sets W ⊂ R^{L×R} and H ⊂ R^{R×P} imposing some constraints on the dictionary and coding matrices.

In the case of the linear embedding adopted in this work, the mixture function writes

ψ(W,H) = WH.    (2)

In this context, problem (1) can be cast as a factor analysis driven by the cost function J_r(·). Depending on the applicative field, typical data-fitting measures include the Itakura-Saito, the Euclidean and the Kullback-Leibler divergences [32]. Assuming a low-rank model (i.e., R ≪ L), specific choices for the sets H and W lead to various standard factor models. For instance, when W is chosen as the Stiefel manifold, the solution of (1) is given by a principal component analysis (PCA) [33]. When W and H impose nonnegativity of the dictionary and coding matrix elements, the problem is known as nonnegative matrix factorization [19,34]. Within a supervised context, the dictionary W can be chosen thanks to end-user expertise or estimated beforehand. Without loss of generality but for the sake of conciseness, the framework described in this paper assumes that this dictionary is known, possibly overcomplete as proposed in the experimental illustration described in Section 4. In this case, as in many applications, it makes sense to look for a sparse representation of the signal of interest to retrieve its most compact achievable representation [21,35]. Following this strategy, we propose to consider an ℓ1-norm sparsity penalization on the coding vectors, leading to the representation learning task defined by

min_H  J_r(Y | WH) + λ_h ‖H‖_1 + ı_H(H)    (3)

where ‖H‖_1 = Σ_{p=1}^P ‖h_p‖_1, with h_p denoting the pth column of H.

2.2. Supervised classification

To clearly define the classification task, let us first introduce some key notations. The index subset of samples with an available ground truth is denoted L, while the index subset of unlabeled samples is U, such that L ∩ U = ∅ and L ∪ U = P with P ≜ {1, ..., P}. Classifying the unlabeled samples consists in assigning each of them to one of the C classes. This can be reformulated as the estimation of a C × P matrix C whose columns correspond to unknown C-dimensional attribution vectors c_p = [c_{1,p}, ..., c_{C,p}]^T. Each vector is made of 0s, except for c_{i,p} = 1 when the pth sample is assigned to the ith class.

Numerous classification rules have been proposed in the literature [36]. Most of them rely on a K × P matrix Z = [z_1, ..., z_P] of features z_p (p ∈ P) associated with each sample and derived from the raw data. Within a supervised framework, the attribution matrix C_L and feature matrix Z_L of the labeled data are exploited during the learning step, where ·_L denotes the corresponding submatrix whose columns are indexed by L. For a wide range of classifiers, deriving a classification rule can be achieved by solving the optimization problem

min_Q  J_c(C_L | φ(Q, Z_L)) + λ_q R_q(Q)    (4)

where Q ∈ R^{C×K} is the set of classifier parameters to be inferred, R_q(·) refers to regularizations imposed on Q, and J_c is a cost function measuring the quality of the classification, such as the quadratic loss [24] or the cross-entropy [37]. Moreover, in (4), φ(Q, ·) defines an element-wise nonlinear mapping between the features and the class attribution vectors parametrized by Q, e.g., derived from a sigmoid or a softmax operator. In this work, the classifier is assumed to be linear, which leads to a vector-wise post-nonlinear mapping

φ(Q, Z_L) = φ(QZ_L)    (5)

with

φ(X) = [φ(x_1), ..., φ(x_P)].    (6)

Once the classifier parameters have been estimated by solving (4), the unknown attribution vectors C_U can be subsequently inferred during the testing step by applying the nonlinear transformation to the corresponding predicted features Ẑ_U associated with the unlabeled samples. The obtained outputs are relaxed attribution vectors ĉ_p = φ(Qẑ_p) (p ∈ U) and the most probable predicted class is computed as argmax_i ĉ_{i,p}.
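As an illustration, the following sketch implements this testing step with φ chosen as a softmax, one of the operators mentioned above; the function name and interface are ours, not the paper's:

```python
import numpy as np

def predict(Q, Z_hat_U):
    """Relaxed attribution vectors and hard decisions from predicted features."""
    X = Q @ Z_hat_U                          # (C, |U|) linear classifier outputs
    X = X - X.max(axis=0, keepdims=True)     # stabilize the exponentials
    C_hat = np.exp(X) / np.exp(X).sum(axis=0, keepdims=True)  # relaxed c_p
    return C_hat, C_hat.argmax(axis=0)       # most probable class per sample
```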

Under the proposed formulation of the classification task, the learning and testing steps can be conducted simultaneously, a framework usually referred to as semi-supervised, with the beneficial opportunity to introduce additional regularizations and/or constraints on the submatrix of unknown attribution vectors C_U. The initial problem (4) is thus extended to the following one

min_{Q,C_U}  J_c(C | φ(QZ)) + λ_q R_q(Q) + λ_c R_c(C) + ı_C(C_U)    (7)

Fig. 1. Structure of the cofactorization model. Variables in blue stand for observations or available external data. Variables in olive green are linked through the clustering task, here formulated as an optimization problem. The variable in a dotted box is assumed to be known or estimated beforehand in this work.

where C = [C_L C_U] and C ⊂ R^{C×|U|} denotes a feasible set for the attribution matrix C_U. As discussed above, the cost function J_c(C | Ĉ) measures the actual classification loss, i.e., the discrepancy between the attribution vectors C of the training set and the attribution vectors predicted by the classifier. Two particular cases fitting this generic model are provided in Sections 3.2.1 and 3.2.2. The attribution vectors are defined as Ĉ = φ(QẐ) where φ(·) is a nonlinear function applied to the output of a linear classifier. The regularization term R_q(Q) penalizes the parameters of the classifier. A typical example is a quadratic penalization, which aims at avoiding overfitting, as conventionally done when optimizing neural networks and generally referred to as weight decay [38]. Finally, the regularization term R_c(C) penalizes the attribution matrix. Typical examples include spatial regularizations, such as the total variation (TV), when dealing with image classification. The indicator function ı_C(C_U) enforces sum-to-one and non-negativity constraints such that each attribution vector c_p (p ∈ U) can then be interpreted as a vector of probabilities of belonging to each class. In such a case, the feasible set is chosen as C = S_C^{|U|} where

S_C ≜ { u ∈ R^C : ∀k, u_k ≥ 0 and Σ_{k=1}^C u_k = 1 }.    (8)
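Euclidean projections onto this probability simplex (the proximal operator of the indicator ı_{S_C}) are needed repeatedly by the optimization scheme of Section 2.5. A minimal sketch of the classical sort-based routine is given below; it is a standard algorithm, not code from the paper, and the helper names are ours:

```python
import numpy as np

def project_simplex(u):
    """Euclidean projection of a vector u onto the probability simplex."""
    s = np.sort(u)[::-1]                       # sorted in decreasing order
    css = np.cumsum(s)
    rho = np.nonzero(s + (1.0 - css) / np.arange(1, u.size + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)     # optimal shift
    return np.maximum(u - theta, 0.0)

def project_columns_simplex(X):
    """Apply the projection to every column (every attribution vector)."""
    return np.apply_along_axis(project_simplex, 0, X)
```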

2.3. Coupling representation learning and classification

Up to this point, the representation learning and supervised classification tasks have been formulated as two independent matrix factorization problems given by (3) and (7), respectively. This work proposes to join them by drawing an implicit relation between two factors involved in these two problems. Inspired by hierarchical Bayesian models such as the one proposed in [30], both problems are coupled through the activation matrices H and Z, as illustrated in Fig. 1. More precisely, the coding vectors in H are clustered such that the feature vectors in Z are defined as the attribution vectors to the K clusters. Ideally, the clustering attribution vectors z_p are filled with zeros except for z_{k,p} = 1 when h_p is associated with the kth cluster. Thus, the vectors z_p (p ∈ P) are assumed to be defined on the K-dimensional probability simplex S_K, similarly defined as (8) and ensuring non-negativity and sum-to-one constraints. Many clustering algorithms can be expressed as optimization problems, such as the well-known k-means algorithm and many of its variants [39,40]. Adopting this formulation, and denoting θ the set of parameters of the clustering algorithm, the clustering task can be defined as the minimization problem below (see Table 1 for an overview of notations).


Table 1
Overview of notations.

P ∈ N             Number of observations
L ∈ N             Dimension of observations
C ∈ N             Number of classes
K ∈ N             Number of features/clusters
P = {1, ..., P}   Index set of observations
L ⊂ P             Index set of labeled samples
L_i ⊂ L           Index set of labeled samples in the ith class
U = P \ L         Index set of unlabeled samples
Y ∈ R^{L×P}       Observations
W ∈ R^{L×R}       Dictionary
H ∈ R^{R×P}       Coding matrix
Q ∈ R^{C×K}       Classifier parameters
C_L ∈ R^{C×|L|}   Attribution matrix of labeled data
C_U ∈ R^{C×|U|}   Attribution matrix of unlabeled data
C = [C_L C_U]     Class attribution matrix
Z ∈ R^{K×P}       Cluster attribution matrix
θ                 Clustering parameters

min_{Z,θ}  J_g(H, Z; θ) + λ_z R_z(Z) + λ_θ R_θ(θ) + ı_{S_K^P}(Z) + ı_Θ(θ)    (9)

where Θ defines a feasible set for the parameters θ.

It is worth noting that introducing this coupling term is one of the major novelties of the proposed approach. When considering task-driven dictionary learning methods, it is usual to intertwine the representation learning and the classification tasks by directly imposing H = Z [24,41]. Since these methods generally rely on a linear classifier, one major drawback of such approaches is their inability to deal with non-separable classes in the low-dimensional representation space. In such cases, the underlying model cannot be discriminative and descriptive simultaneously and the resulting tasks become adversarial. When considering the proposed coupling term, the cluster attribution vectors z_p offer the possibility of linearly separating any group of clusters from the others. As a consequence, the model benefits from more flexibility, with both discriminative and descriptive abilities in a more general sense.

2.4. Global cofactorization problem

Unifying the representation learning task (3) and the classification task (7) through the clustering task (9) leads to the following joint cofactorization problem

min_{H,Q,C_U,Z,θ}  λ_0 J_r(Y | WH) + λ_h ‖H‖_1
    + λ_1 J_c(C | φ(QZ)) + λ_q R_q(Q) + λ_c R_c(C)
    + λ_2 J_g(H, Z; θ) + λ_z R_z(Z) + λ_θ R_θ(θ)
    + ı_H(H) + ı_{S_C^{|U|}}(C_U) + ı_{S_K^P}(Z) + ı_Θ(θ)    (10)

where λ_0, λ_1 and λ_2 control the respective contributions of the data-fitting term of each task. All notations and parameter dimensions are summarized in Table 1. A generic algorithmic scheme solving problem (10) is proposed in the next section.

2.5. Optimization scheme

The minimization problem defined by (10) is not globally convex. To reach a local minimizer, we propose to resort to the proximal alternating linearized minimization (PALM) algorithm introduced in [31]. This algorithm is based on proximal descent steps, which allows non-smooth terms to be handled. Moreover, it is guaranteed to converge to a critical point of the objective function even in the case of non-convex problems. This means that, if the initialization is good enough, it is expected to likely converge to a solution close to the global optimum. To implement PALM, the problem (10) is rewritten in the form of an unconstrained problem expressed as a sum of a smooth coupling term g(·) and separable non-smooth terms f_j(·) (j ∈ {0, ..., 3}) as follows

min_{H,θ,Z,Q,C_U}  f_0(H) + f_1(θ) + f_2(Z) + f_3(C_U) + g(H, θ, Z, C_U, Q)    (11)

where

f_0(H) = ı_H(H) + λ_h ‖H‖_1,    f_1(θ) = ı_Θ(θ),
f_2(Z) = ı_{S_K^P}(Z),    f_3(C_U) = ı_{S_C^{|U|}}(C_U),

and the coupling function is

g(H, θ, Z, C_U, Q) = λ_0 J_r(Y | WH) + λ_1 J_c(C | φ(QZ)) + λ_q R_q(Q) + λ_c R_c(C) + λ_2 J_g(H, Z; θ) + λ_z R_z(Z) + λ_θ R_θ(θ).    (12)

To ensure the stated guarantees of PALM, all f_j(·) have to be proper, lower semi-continuous functions f_j : R^{n_j} → (−∞, +∞], which ensures in particular that the associated proximal operators are well-defined. Additionally, sufficient conditions on the coupling function are that g(·) is a C² function (i.e., with continuous first and second derivatives) and that its partial gradients are globally Lipschitz. For example, the partial gradient ∇_H g(H, θ, Z, C_U, Q) should be globally Lipschitz for any fixed θ, Z, C_U, Q, that is

‖∇_H g(H_1, θ, Z, C_U, Q) − ∇_H g(H_2, θ, Z, C_U, Q)‖ ≤ L_H(θ, Z, C_U, Q) ‖H_1 − H_2‖,   ∀ H_1, H_2 ∈ R^{R×P}    (13)

where L_H(θ, Z, C_U, Q), simply denoted L_H hereafter, is the Lipschitz constant. For the sake of conciseness, we refer to [31] for further details.

The main idea of the algorithm is then to update each variable of the problem alternately using a proximal gradient descent step. The overall scheme is summarized in Algorithm 1.

Algorithm 1: PALM.
 1: Initialize variables H⁰, θ⁰, Z⁰, C_U⁰ and Q⁰;
 2: Set α > 1;
 3: while stopping criterion not reached do
 4:   H^{k+1} ← prox^{f_0}_{αL_H}( H^k − (1/(αL_H)) ∇_H g(H^k, θ^k, Z^k, C_U^k, Q^k) );
 5:   θ^{k+1} ← prox^{f_1}_{αL_θ}( θ^k − (1/(αL_θ)) ∇_θ g(H^{k+1}, θ^k, Z^k, C_U^k, Q^k) );
 6:   Z^{k+1} ← prox^{f_2}_{αL_Z}( Z^k − (1/(αL_Z)) ∇_Z g(H^{k+1}, θ^{k+1}, Z^k, C_U^k, Q^k) );
 7:   Q^{k+1} ← Q^k − (1/(αL_Q)) ∇_Q g(H^{k+1}, θ^{k+1}, Z^{k+1}, C_U^k, Q^k);
 8:   C_U^{k+1} ← prox^{f_3}_{αL_{C_U}}( C_U^k − (1/(αL_{C_U})) ∇_{C_U} g(H^{k+1}, θ^{k+1}, Z^{k+1}, C_U^k, Q^{k+1}) );
 9: end
10: return H^end, θ^end, Z^end, Q^end, C_U^end

For a practical implementation, one needs to compute the partial gradients of g(·) explicitly, as well as their Lipschitz constants, to perform a gradient descent step, followed by a proximal mapping associated with the non-smooth terms f_j(·). The objective function is then monitored at each iteration and the algorithm is stopped when convergence is reached. Note that, when a specific penalization R_·(·) is non-smooth or non-gradient-Lipschitz, it is possible to move it into the corresponding independent term f_j(·) to ensure the required property of the coupling function g(·). This is for instance the case for the sparse penalization used over H, which has been moved into f_0(·). Nonetheless, as mentioned above, the proximal operator associated with each f_j(·) is needed. Thus, even when the function consists of several terms, a closed-form expression of this operator should be known. Alternatively, one should be able to compose the proximal operators associated with each term of f_j(·) [42].
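To make the scheme concrete, the following Python skeleton sketches one possible implementation of Algorithm 1. It is a hypothetical helper under our own interface (dictionaries of callables for the partial gradients, Lipschitz constants and proximal operators), not the authors' code:

```python
import numpy as np

def palm(variables, grads, lipschitz, proxes, n_iter=500, alpha=1.1):
    """Minimal PALM skeleton for problem (11).

    variables : dict of arrays, e.g. {"H": H0, "theta": ..., "Z": ..., ...},
                in the update order of Algorithm 1
    grads     : dict mapping name -> callable(variables) returning the partial
                gradient of the smooth coupling term g w.r.t. that variable
    lipschitz : dict mapping name -> callable(variables) returning the
                Lipschitz constant of that partial gradient
    proxes    : dict mapping name -> prox(X, step) of the matching f_j
    """
    for _ in range(n_iter):
        for name in variables:                     # dicts keep insertion order
            L = lipschitz[name](variables)
            step = 1.0 / (alpha * L)
            X = variables[name] - step * grads[name](variables)
            variables[name] = proxes[name](X, step)
    return variables
```

For a variable with no non-smooth term, such as Q in Algorithm 1, the prox entry is simply the identity, `lambda X, step: X`.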


Fig. 2. Spectral unmixing concept (source US Navy NEMO).

3. Application: hyperspectral image analysis

A general framework has been introduced in the previous section. As an illustration, a particular instance of this generic framework is now considered, where explicit representation learning, classification and clustering tasks are introduced. The specific case of hyperspectral image analysis is considered for this use case example.

Contrary to conventional color imaging, which only captures the reflectance for three wavelengths (red, blue, green), hyperspectral imaging makes it possible to measure the reflectance of the observed scene for several hundreds of wavelengths, from the visible to the invisible domain. Each pixel of the image can thus be represented as a vector of reflectances, called a spectrum, which characterizes the observed material.

One drawback of hyperspectral images is usually a weaker spatial resolution due to sensor limitations. The direct consequence of this poor spatial resolution is the presence of mixed pixels, i.e., pixels corresponding to areas containing several materials. Observed spectra are in this case the result of a specific mixture of the elementary spectra, called endmembers, associated with the individual materials present in the pixel. The problem of retrieving the proportions of each material in each pixel is referred to as spectral unmixing [11]. This problem can be seen as a specific case of representation learning where the dictionary is composed of the endmember spectra and the coding matrix is the so-called abundance matrix containing the proportion of each material in each pixel.

Spectral unmixing is introduced as a representation learning task in Section 3.1. The specific classifier used for this application is then explained in Section 3.2, and finally Section 3.3 presents the clustering adopted to relate the abundance matrix and the classification feature matrix.

3.1. Spectral unmixing

As explained, each pixel of a hyperspectral image is characterized by a reflectance spectrum that physical theory approximates as a combination of endmembers, each corresponding to a specific material, as illustrated in Fig. 2. Formally, in this applicative scenario, the sample y_p denotes the L-dimensional spectrum of the pth pixel of the hyperspectral image (p ∈ P). Each observation vector y_p can be expressed as a function of the endmember matrix W (containing the R elementary spectra) and the abundance vector h_p ∈ R^R with R ≪ L.

In the case of the most commonly adopted linear mixture model, each observation y_p is assumed to be a linear combination of the endmember spectra w_r (r = 1, ..., R) corrupted by some noise, underlying the linear embedding (2). Assuming a quadratic data-fitting term, the cost function associated with the representation learning task in (1) is written

J_r(Y | WH) = ½ ‖Y − WH‖_F².    (14)

The abundance vector h_p is usually interpreted as a vector of proportions describing the proportion of each elementary component in the pixel. Thus, to derive an additive composition of the observed pixels, a nonnegativity constraint is considered for each element of the abundance matrix H, i.e., H = R_+^{R×P}. In this work, no sum-to-one constraint is considered since it has been argued that leaving this constraint out offers a better adaptation to possible changes of illumination in the scene [43]. Additionally, as the endmember matrix W is the collection of reflectance spectra of the endmembers, it is also expected to be non-negative. When this dictionary needs to be estimated, the resulting problem is a sparse non-negative matrix factorization (NMF) task. When the dictionary is known or estimated beforehand, the resulting optimization problem is the nonnegative sparse coding problem

min_H  ½ ‖Y − WH‖_F² + λ_h ‖H‖_1 + ı_{R_+^{R×P}}(H)    (15)

where the sparsity penalization actually supports the assumption that only a few materials are present in a given pixel.
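For this nonnegative sparse coding problem, one proximal gradient update (the H-step of Algorithm 1, here written for problem (15) alone, with the coupling to the other variables ignored) can be sketched as follows; a minimal illustration assuming W is known, not the authors' implementation:

```python
import numpy as np

def update_H(Y, W, H, lam_h, alpha=1.1):
    """One PALM-style proximal gradient step for problem (15)."""
    L_H = np.linalg.norm(W.T @ W, 2)     # Lipschitz constant: spectral norm of W^T W
    step = 1.0 / (alpha * L_H)
    grad = W.T @ (W @ H - Y)             # gradient of 0.5 * ||Y - WH||_F^2
    # prox of step*(lam_h*||.||_1 + nonnegativity): one-sided soft-thresholding
    return np.maximum(H - step * grad - step * lam_h, 0.0)
```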

3.2. Classification

In the considered application, two loss functions associated with the classification problem have been investigated, namely the quadratic loss and the cross-entropy loss. One advantage of these two loss functions is that they can be used in a multi-class classification (i.e., with more than two classes). Moreover, this choice fulfills the required conditions stated in Section 2.5 to apply PALM since, coupled with an appropriate φ(·) function, both loss costs are smooth and gradient-Lipschitz with respect to each estimated variable.

3.2.1. Quadratic loss

The quadratic loss is the simplest way to perform a classification task and has been extensively used [25,44,45]. It is defined as

J_c(C | Ĉ) = ½ ‖CD − ĈD‖_F²    (16)

where Ĉ denotes the estimated attribution matrix. In (16), the P × P matrix D is introduced to weight the contribution of the labeled data with respect to the unlabeled ones and to deal with the case of unbalanced classes in the training set. Weights are chosen to be inversely proportional to class frequencies in the input data. The weight matrix is defined as the diagonal matrix D = diag[d_1, ..., d_P] with

d_p = { 1/|L_i|   if p ∈ L_i,
        1/|U|     if p ∈ U,    (17)

where L_i denotes the set of indexes of labeled pixels of the ith class (i = 1, ..., C). Thus, considering a linear classifier, the generic classification problem in (7) can be specified for the quadratic loss as

min_{Q,C_U}  ½ ‖CD − QZD‖_F² + λ_c R_c(C) + ı_{S_C^{|U|}}(C_U)    (18)

where no additional constraint nor penalization is applied to the classifier parameters Q. Besides, when samples obey a spatially coherent structure, as is the case when analyzing hyperspectral images, it is often desirable to transfer this structure to the classification map. Such a characteristic can be achieved by considering a spatial regularization R_c(C) applied to the attribution vectors. Following this assumption, this work considers a regularized counterpart of the weighted vectorial total variation (vTV), promoting a spatially piecewise constant behavior of the classification map [46]

‖C‖_vTV = Σ_{m,n} β_{m,n} √( ‖[∇_h C]_{m,n}‖_2² + ‖[∇_v C]_{m,n}‖_2² + ε )    (19)

where (m,n) are the spatial position pixel indexes, and [∇_h(·)]_{m,n} and [∇_v(·)]_{m,n} stand for the horizontal and vertical discrete gradient operators evaluated at a given pixel, respectively (with a slight abuse of notation, c_{(m,n)} refers to the pth column of C where the pth pixel is spatially indexed by (m,n)), i.e.,

[∇_h C]_{m,n} = c_{(m+1,n)} − c_{(m,n)}
[∇_v C]_{m,n} = c_{(m,n+1)} − c_{(m,n)}.

The weights β_{m,n} can be computed beforehand to adjust the penalization with respect to expected spatial variations of the scene. They can be estimated directly from the image to be analyzed or extracted from a complementary dataset as in [47]. They will be specified during the experiments reported in Section 4. Moreover, the smoothing parameter ε > 0 ensures the gradient-Lipschitz property of the coupling term g(·), as required in Section 2.5.
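A direct way to evaluate this smoothed vTV penalty is sketched below. It is illustrative only; it assumes the pixels are stored column-wise in C following a row-major spatial order, which is our own convention rather than one stated in the paper:

```python
import numpy as np

def vtv(Cmat, shape, beta, eps=1e-3):
    """Smoothed weighted vectorial TV of eq. (19).

    Cmat  : (C_classes, P) attribution matrix, pixels in row-major order
    shape : (M, N) spatial dimensions with P = M * N
    beta  : (M, N) nonnegative weights beta_{m,n}
    """
    M, N = shape
    Cimg = Cmat.reshape(Cmat.shape[0], M, N)
    dh = np.zeros_like(Cimg)
    dv = np.zeros_like(Cimg)
    dh[:, :-1, :] = Cimg[:, 1:, :] - Cimg[:, :-1, :]   # horizontal differences
    dv[:, :, :-1] = Cimg[:, :, 1:] - Cimg[:, :, :-1]   # vertical differences
    mag = np.sqrt((dh ** 2).sum(axis=0) + (dv ** 2).sum(axis=0) + eps)
    return float((beta * mag).sum())
```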

3.2.2. Cross-entropy loss

The quadratic loss has the advantage of being expressed simply, and the associated Lipschitz constants of the partial gradients are trivially obtained. However, this loss function is known to be highly influenced by outliers, which can result in a degraded predictive accuracy [48]. A more sophisticated way to conduct the classification task is to consider a cross-entropy loss

J_c(C | Ĉ) = − Σ_{p∈P} d_p² Σ_{i∈C} c_{i,p} log(ĉ_{i,p})    (20)

combined with a logistic regression, i.e., where the nonlinear mapping (5) is element-wise defined as

[φ(X)]_{i,j} = 1 / (1 + exp(−x_{i,j})) = sigm(x_{i,j})    (21)

Cross-entropy loss is indeeda very conventional loss function in theneural network/deeplearningcommunity[38].Inthe present case,thecorrespondingoptimizationproblemcanbewritten minQ,CU

min_{Q,C_U}  − Σ_{p∈P} d_p² Σ_{i∈C} c_{i,p} log(sigm(q_{i:} z_p)) + λ_q R_q(Q) + λ_c ‖C‖_vTV + ı_{S_C^{|U|}}(C_U)    (22)

where q_{i:} ∈ R^{1×K} denotes the ith line of the matrix Q. The penalization R_q(Q) is here chosen as R_q(Q) = ½ ‖Q‖_F² to prevent the loss function from artificially decreasing when ‖q_{i:}‖_2 increases. This regularization has been extensively studied in the neural network literature, where it is referred to as weight decay [38]. In (22), the regularization R_c(C_U) applied to the attribution matrix is chosen again as a vTV-like penalization (see (19)).
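The gradient of this loss with respect to Q has a compact closed form, which the following sketch evaluates; illustrative Python under our own conventions, with a small constant guarding the logarithm:

```python
import numpy as np

def sigm(X):
    return 1.0 / (1.0 + np.exp(-X))

def loss_and_grad_Q(Q, Z, Cmat, d, lam_q):
    """Cross-entropy term of (22) with weight decay: value and gradient in Q.

    Q    : (C, K) classifier parameters
    Z    : (K, P) cluster attribution (feature) matrix
    Cmat : (C, P) class attribution matrix
    d    : (P,)  weights d_p of eq. (17)
    """
    S = sigm(Q @ Z)                               # (C, P) predicted memberships
    eps = 1e-12                                   # numerical safeguard for log
    loss = -np.sum((d ** 2) * np.sum(Cmat * np.log(S + eps), axis=0)) \
           + 0.5 * lam_q * np.sum(Q ** 2)
    # d/dx log sigm(x) = 1 - sigm(x), hence the closed-form gradient below
    grad = -((Cmat * (1.0 - S)) * (d ** 2)) @ Z.T + lam_q * Q
    return loss, grad
```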

3.3. Clustering

For the considered application, the conventional k-means algorithm has been chosen because of its straightforward formulation as an optimization problem. By denoting θ = {B}, with B an R × K matrix collecting the K centroids, the clustering task (9) can be rewritten as the following NMF problem [40]

min_{Z,B}  ½ ‖H − BZ‖_F² + λ_z R_z(Z) + ı_{S_K^P}(Z) + ı_{R_+^{R×K}}(B)    (23)


where R_z(Z) should promote Z to be composed of orthogonal lines. Combined with the nonnegativity and sum-to-one constraints, it would ensure that z_p is a vector of zeros except for its kth component equal to 1, meaning that the pth pixel belongs to the kth cluster. However, handling this orthogonality property within the PALM optimization scheme detailed in Section 2.5 is not straightforward, in particular because the proximal operator associated with this penalization cannot be explicitly computed. In this work, we propose to remove this orthogonality constraint since relaxed attribution vectors may be richer feature vectors for the classification task.
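Under this relaxation (and dropping R_z), the two corresponding PALM updates take a simple form. The sketch below is our own code, not the paper's; `project_simplex_cols` stands for any column-wise simplex projection, such as the one given after eq. (8):

```python
import numpy as np

def update_clustering(H, B, Z, project_simplex_cols, alpha=1.1):
    """One PALM-style sweep for the clustering problem (23) with lam_z = 0."""
    # B-step: gradient of 0.5*||H - BZ||_F^2 in B, then nonnegative clipping
    L_B = max(np.linalg.norm(Z @ Z.T, 2), 1e-12)
    B = np.maximum(B - (1.0 / (alpha * L_B)) * (B @ Z - H) @ Z.T, 0.0)
    # Z-step: gradient step followed by projection of each column onto S_K
    L_Z = max(np.linalg.norm(B.T @ B, 2), 1e-12)
    Z = project_simplex_cols(Z - (1.0 / (alpha * L_Z)) * B.T @ (B @ Z - H))
    return B, Z
```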

3.4. Multi-objective problem

Based on the quadratic and cross-entropy loss functions considered in the classification task, two distinct global optimization problems are obtained. When considering the quadratic loss of Section 3.2.1, the multi-objective problem (10) writes

min_{H,Q,Z,C_U,B}  (λ_0/2) ‖Y − WH‖_F² + λ_h ‖H‖_1 + ı_{R_+^{R×P}}(H)
    + (λ_1/2) ‖CD − QZD‖_F² + λ_c ‖C‖_vTV + ı_{S_C^{|U|}}(C_U)
    + (λ_2/2) ‖H − BZ‖_F² + ı_{S_K^P}(Z) + ı_{R_+^{R×K}}(B).    (24)

Instead, when considering the cross-entropy loss function proposed in Section 3.2.2, the optimization problem (10) is defined as

min_{H,Q,Z,C_U,B}  (λ_0/2) ‖Y − WH‖_F² + λ_h ‖H‖_1 + ı_{R_+^{R×P}}(H)
    − λ_1 Σ_{p∈P} d_p² Σ_{i∈C} c_{i,p} log(sigm(q_{i:} z_p))
    + (λ_q/2) ‖Q‖_F² + λ_c ‖C‖_vTV + ı_{S_C^{|U|}}(C_U)
    + (λ_2/2) ‖H − BZ‖_F² + ı_{S_K^P}(Z) + ı_{R_+^{R×K}}(B).    (25)

Both problems are particular instances of nonnegative matrix co-factorization [27,28]. To summarize, each hyperspectral pixel is first described as a combination of elementary spectra through the representation learning step, aka spectral unmixing. Then, assuming that there exist groups of pixels resulting from the same mixture of materials, a clustering is performed among the abundance vectors. Finally, the attribution vectors to the clusters are used as feature vectors for the classification, supporting the idea that classes are made of a mixture of clusters. For both multi-objective problems (24) and (25), all conditions required for the use of the PALM algorithm described in Section 2.5 are met. Details regarding the two optimization schemes dedicated to these two problems are reported in the Appendix.
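For reference, the quadratic-loss objective (24) can be evaluated as follows, for instance to monitor convergence as suggested in Section 2.5. This is an illustrative sketch: the label encoding and helper names are ours, the vTV value is passed in precomputed (see the sketch after eq. (19)), and the indicator terms are assumed satisfied:

```python
import numpy as np

def weights_d(labels):
    """Weights d_p of eq. (17). `labels` is a (P,) int array using -1 for
    unlabeled pixels and 0..C-1 for labeled ones (a hypothetical encoding)."""
    d = np.empty(labels.shape, dtype=float)
    unlabeled = labels < 0
    d[unlabeled] = 1.0 / max(int(unlabeled.sum()), 1)
    for i in np.unique(labels[~unlabeled]):
        idx = labels == i
        d[idx] = 1.0 / int(idx.sum())
    return d

def objective_quadratic(Y, W, H, Cmat, Q, Z, B, d, lam, vtv_value=0.0):
    """Value of the cofactorization objective (24).

    lam : dict with entries 'l0', 'h', 'l1', 'c', 'l2' standing for
          lambda_0, lambda_h, lambda_1, lambda_c and lambda_2.
    d   : (P,) weights of eq. (17), so that D = diag(d).
    """
    unmix = 0.5 * lam['l0'] * np.sum((Y - W @ H) ** 2) + lam['h'] * np.abs(H).sum()
    # ||CD - QZD||_F^2 = ||(C - QZ)D||_F^2: each column p is scaled by d_p
    classif = 0.5 * lam['l1'] * np.sum(((Cmat - Q @ Z) * d) ** 2) \
              + lam['c'] * vtv_value
    cluster = 0.5 * lam['l2'] * np.sum((H - B @ Z) ** 2)
    return unmix + classif + cluster
```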

3.5. Complexity analysis

Regarding the computational complexity of the proposed Algorithm 1, deriving the gradients shows that it is dominated by matrix product operations. It follows that the algorithm has an overall computational cost in O(NK²P), where N is the number of iterations.

4. Experiments

4.1. Implementation details

Before presenting the experimental results, it is worth clarifying the choices which have been made regarding the practical
