Maximum likelihood covariance matrix estimation from two possibly mismatched data sets


HAL Id: hal-02572461

https://hal.archives-ouvertes.fr/hal-02572461

Submitted on 13 May 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.



Olivier Besson

To cite this version:

Olivier Besson. Maximum likelihood covariance matrix estimation from two possibly mismatched data sets. Signal Processing, Elsevier, 2020, 167, pp. 107285-107294. ⟨10.1016/j.sigpro.2019.107285⟩. ⟨hal-02572461⟩


Author's version: https://oatao.univ-toulouse.fr/25984

https://doi.org/10.1016/j.sigpro.2019.107285

Besson, Olivier Maximum likelihood covariance matrix estimation from two possibly mismatched data sets. (2020) Signal Processing, 167. 107285-107294. ISSN 0165-1684



Olivier Besson

ISAE-SUPAERO, 10 Avenue Edouard Belin, Toulouse 31055, France

Keywords: Covariance matrix estimation; Maximum likelihood; Mismatch

Abstract

We consider estimating the covariance matrix from two data sets, one whose covariance matrix $R_1$ is the sought one and another set of samples whose covariance matrix $R_2$ slightly differs from the sought one, due e.g. to different measurement configurations. We assume however that the two matrices are rather close, which we formulate by assuming that $R_1^{1/2} R_2^{-1} R_1^{1/2} | R_1$ follows a Wishart distribution around the identity matrix. It turns out that this assumption results in two data sets with different marginal distributions, hence the problem becomes that of covariance matrix estimation from two data sets which are distribution-mismatched. The maximum likelihood estimator (MLE) is derived and is shown to depend on the values of the number of samples in each set. We show that it involves whitening of one data set by the other one, shrinkage of eigenvalues and colorization, at least when one data set contains more samples than the size $p$ of the observation space. When both data sets have less than $p$ samples but the total number is larger than $p$, the MLE again entails eigenvalues shrinkage but this time after a projection operation. Simulation results compare the new estimator to state of the art techniques.

1. Problem statement

Analysis or processing of multichannel data most often relies on the covariance matrix, which is a fundamental tool e.g., for principal component analysis, spectral analysis, adaptive filtering, detection, direction of arrival estimation among others [1–3]. In practical applications, the $p \times p$ covariance matrix $R$ needs to be estimated from a finite number $n$ of samples. When the latter are independent and Gaussian distributed, the maximum likelihood estimator of $R$ is $n^{-1}S$ where $X$ is the $p \times n$ data matrix and $S = XX^T$ is the sample covariance matrix (SCM) [1]. However, in low sample support or when deviation from the Gaussian assumption is at hand, the SCM tends to behave poorly. In particular it was observed that the sample covariance matrix is usually less well-conditioned than the true covariance matrix, and therefore considerable effort has been dedicated to regularizing it with a view to improving its performance.

One of the most important approaches in this respect is due to Stein [4–6] who, instead of maximizing the likelihood function, advocated minimizing a meaningful loss function within a given class of estimators. Stein hence introduced the concept of admissible estimation and minimax estimators under the so-called Stein's loss. He showed that the SCM-based estimator is not minimax and derived minimax estimators in two important classes, namely estimators of the form $\hat{R} = G D G^T$ where $D$ is a diagonal matrix and $G$ is the Cholesky factor of $S$, or of the form $\hat{R} = U \,\mathrm{diag}(\phi(\lambda))\, U^T$ where $U \,\mathrm{diag}(\lambda)\, U^T$ is the eigenvalue decomposition of $S$ and $\phi(\lambda)$ is a non-linear function of $\lambda$. This seminal work of Stein gave rise to a great number of studies, see for instance [7–13] and references therein. A second class of robust estimates is based on linear shrinkage of the SCM to a target matrix (an approach which can be interpreted as an empirical Bayes technique), i.e., estimates of the form $\hat{R} = \alpha R_t + \beta S$ where $R_t = I$ is the most widely spread choice, see e.g., [14–20]. Note that these techniques applied with $R_t = I$ achieve an affine transformation of the eigenvalues of $S$, while retaining the eigenvectors, and therefore bear resemblance with Stein's method, although the selection of $\alpha$, $\beta$ may not be driven by the same principle. Robustness to a possibly non-Gaussian distribution has also been a topic of considerable interest and many papers have focused on robust estimation for elliptically distributed data, see e.g., [21–30] and references therein.
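The remark that linear shrinkage toward the identity only moves the eigenvalues can be checked numerically. The sketch below is illustrative only: the function name `linear_shrinkage`, the sizes, the seed and the weights `alpha`, `beta` are assumptions chosen for the example, not values or a selection rule taken from the references. It applies the shrinkage to the normalized SCM $n^{-1}S$ and rebuilds the same matrix from the eigenvectors of the SCM and the affinely transformed eigenvalues.

```python
import numpy as np

def linear_shrinkage(S_n, alpha, beta):
    """Return alpha * I + beta * S_n, i.e. linear shrinkage toward R_t = I."""
    p = S_n.shape[0]
    return alpha * np.eye(p) + beta * S_n

rng = np.random.default_rng(1)
p, n = 6, 10                      # illustrative sizes (n > p so the SCM is invertible)
X = rng.standard_normal((p, n))   # white Gaussian data, i.e. true covariance I
S_n = (X @ X.T) / n               # normalized SCM n^{-1} S
alpha, beta = 0.3, 0.7            # arbitrary weights, not an optimized choice

R_hat = linear_shrinkage(S_n, alpha, beta)

# Eigenvalue decomposition S_n = U diag(lam) U^T (eigenvalues in ascending order).
lam, U = np.linalg.eigh(S_n)
# Rebuild the estimate from the same eigenvectors and affinely mapped eigenvalues.
R_rebuilt = U @ np.diag(alpha + beta * lam) @ U.T

print("max rebuild error:", np.max(np.abs(R_hat - R_rebuilt)))
print("condition number of SCM:      ", lam[-1] / lam[0])
print("condition number of shrinkage:", (alpha + beta * lam[-1]) / (alpha + beta * lam[0]))
```

For any $\alpha > 0$, $\beta > 0$ and a positive definite SCM with distinct extreme eigenvalues, the ratio $(\alpha + \beta\lambda_{\max})/(\alpha + \beta\lambda_{\min})$ is strictly smaller than $\lambda_{\max}/\lambda_{\min}$, which is the conditioning improvement that motivates this class of estimators.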

Most of the above cited works deal with estimation of a covariance matrix from a single data set. In this paper, we consider a situation where two data sets $X_1$ and $X_2$ are available, with respective covariance matrices $R_1$ and $R_2$. This situation typically arises in radar applications when one wishes to detect a target buried in clutter with unknown statistics [31,32]. In order to infer the latter, training samples are generally used, which hopefully share the same statistics as the clutter in the cell under test (CUT). However, it has been evidenced that clutter is most often heterogeneous [31], with a discrepancy compared to the CUT that may grow with the distance to the CUT [33]. Therefore, one is led to use some clustering that separates training samples, either based on their proximity to the CUT or by means of some statistical criterion, such as the power selected training [34]. The samples so selected are deemed to be representative of the clutter in the CUT while others are less reliable, which corresponds to the situation considered herein. A second example is in the field of synthetic aperture radar in the case where a scene is imaged on two consecutive days, with possible changes in between [35]. Finally, in hyperspectral imagery, the problem of target or anomaly detection leads to a very similar framework. Indeed, the background in a pixel under test has to be estimated from the local pixels around it and from pixels located further apart [36]. In the present paper, we assume that $R_2$ is close to $R_1$, the covariance matrix we wish to estimate. Since $R_2$ differs from but is close to $R_1$, we investigate using both $X_1$ and $X_2$ to estimate $R_1$. The reason for also using $X_2$ is that, although its covariance matrix is not $R_1$, it is close to it. Additionally, one might face situations where the number of samples in $X_1$ is very small.

This paper constitutes a first approach to this specific problem and we focus herein on the most natural approach, namely maximum likelihood estimation. The objective is to figure out the pros and cons of the latter and the conditions under which it is an accurate estimator. The paper is organized as follows. In Section 2 we formulate the statistical assumptions: more precisely, we assume that $R_1^{1/2} R_2^{-1} R_1^{1/2} | R_1$ is a random matrix with a Wishart distribution around the identity matrix, and we derive the joint distribution of $(X_1, X_2)$. Section 3 is devoted to the derivation of the maximum likelihood estimator of $R_1$ from $(X_1, X_2)$, taking into account the possible configurations regarding the number of samples in each data set. Numerical simulations illustrate the performance of the MLE and compare it with existing alternatives in Section 4. Conclusions and possible extensions of the present work are drawn in Section 5.

2. Data model

Let us assume that we have two sets of measurements $X_1$ ($p \times n_1$) and $X_2$ ($p \times n_2$) which are distributed according to $X_1 \overset{d}{=} \mathcal{N}(0, R_1, I)$ and $X_2 \overset{d}{=} \mathcal{N}(0, R_2, I)$ where $\mathcal{N}(0, \Sigma, \Omega)$ denotes the matrix-variate normal distribution whose density is $(2\pi)^{-pn/2} |\Sigma|^{-n/2} |\Omega|^{-p/2} \,\mathrm{etr}\{-\frac{1}{2} X^T \Sigma^{-1} X \Omega^{-1}\}$ with $|\cdot|$ the determinant and $\mathrm{etr}\{\cdot\}$ the exponential of the trace of a matrix. Note that we consider real-valued data here whereas in radar applications it is customary to consider complex-valued signals. In Appendix A we show how the results below can be readily extended to the complex case. Our goal in this paper is to estimate $R_1$, using both $X_1$ and $X_2$ even if $R_1 \neq R_2$. However we assume that the two matrices are close to each other. In order to define a model that can reflect the proximity between $R_1$ and $R_2$, we note that the natural distance between them is given by $d^2(R_1, R_2) = \sum_{k=1}^{p} \log^2 \lambda_k(G_1^T R_2^{-1} G_1)$ [37,38] where $G_1$ is a square-root of $R_1$, i.e., $R_1 = G_1 G_1^T$ and $\lambda_k(G_1^T R_2^{-1} G_1)$ stands for the $k$th eigenvalue of $G_1^T R_2^{-1} G_1$. This matrix is pivotal in adaptive detection problems also. More precisely, in the case of a covariance mismatch between the training samples and the data under test, it is shown in [39] that the performance of the well-known adaptive matched filter depends essentially on this matrix. Therefore, it becomes natural to encapsulate the difference between $R_1$ and $R_2$ through the matrix $W = G_1^T R_2^{-1} G_1$ and its proximity to the identity matrix. There are of course different ways to translate this constraint in the model. For instance a frequentist approach may be advocated where the joint probability density function of $(X_1, X_2)$ would be maximized under the constraint that the distance between $W$ and $I$ is smaller than some value. Alternatively, and this is what we elect here, one can resort to an empirical Bayes approach where the random matrix $W$ follows some prior distribution rather concentrated around $I$. For mathematical tractability, we choose a conjugate prior for $W$ and we assume that $W$ follows a Wishart distribution with $\nu$ degrees of freedom and parameter matrix $\mu^{-1} I$, i.e., $W \overset{d}{=} \mathcal{W}_p(\nu, \mu^{-1} I)$. Of course, this is a rather strong assumption whose validity would be difficult to check, e.g., on real data. However, it is in accordance with the mere knowledge we have about the relation between $R_1$ and $R_2$, and it allows for tractable derivations.
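As a concrete reading of the distance $d^2(R_1, R_2)$, the following sketch evaluates it with NumPy. The function name `natural_distance_sq`, the use of the Cholesky factor as the square root $G_1$, and the Toeplitz test matrix are assumptions made for the example; any square root satisfying $R_1 = G_1 G_1^T$ would give the same eigenvalues.

```python
import numpy as np

def natural_distance_sq(R1, R2):
    """d^2(R1, R2) = sum_k log^2 lambda_k(G1^T R2^{-1} G1), with R1 = G1 G1^T."""
    G1 = np.linalg.cholesky(R1)                # one possible square root of R1
    W = G1.T @ np.linalg.solve(R2, G1)         # W = G1^T R2^{-1} G1
    lam = np.linalg.eigvalsh((W + W.T) / 2)    # symmetrize against round-off
    return float(np.sum(np.log(lam) ** 2))

p = 4
idx = np.arange(p)
R1 = 0.6 ** np.abs(idx[:, None] - idx[None, :])   # illustrative Toeplitz covariance

d_same = natural_distance_sq(R1, R1)           # W = I, so the distance vanishes
d_scaled = natural_distance_sq(R1, 2.0 * R1)   # W = I / 2, so d^2 = p * log(2)^2

print("d^2(R1, R1)   =", d_same)
print("d^2(R1, 2 R1) =", d_scaled, "expected", p * np.log(2.0) ** 2)
```

The two checks follow directly from the definition: $R_2 = R_1$ gives $W = I$, and $R_2 = 2R_1$ gives $W = I/2$, whose $p$ eigenvalues each contribute $\log^2 2$.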

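Using the fact that $X_1 | R_1$ and $X_2 | R_2$ are independent and Gaussian distributed with respective covariance matrices $R_1$ and $R_2$, and since $R_2 = G_1 W^{-1} G_1^T$, we thus assume the following stochastic model:

$$
p(X_1, X_2 | R_1, W) = (2\pi)^{-p(n_1+n_2)/2} |R_1|^{-n_1/2} |W^{-1} R_1|^{-n_2/2} \,\mathrm{etr}\left\{-\frac{1}{2} X_1^T R_1^{-1} X_1 - \frac{1}{2} X_2^T G_1^{-T} W G_1^{-1} X_2\right\} \tag{1a}
$$

$$
p(W) = \frac{\mu^{\nu p/2}}{2^{\nu p/2} \Gamma_p(\nu/2)} |W|^{(\nu-p-1)/2} \,\mathrm{etr}\left\{-\frac{1}{2} \mu W\right\} \tag{1b}
$$

Note that $E\{W^{-1}\} = (\nu-p-1)^{-1} \mu I$ so that $E\{R_2\} = E\{G_1 W^{-1} G_1^T\} = (\nu-p-1)^{-1} \mu R_1$: therefore, for $E\{R_2\}$ to be equal to $R_1$, one must select $\mu = \nu-p-1$. Observe also that $W$ comes closer to $I$ as $\nu$ grows large. Indeed, $E\{W\} = \nu(\nu-p-1)^{-1} I$ and $E\{(W - E\{W\})^2\} = p\nu(\nu-p-1)^{-2} I$ which goes to zero as $\nu \to \infty$ [40].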
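The stochastic model can be simulated directly, which gives a quick sanity check of the choice $\mu = \nu - p - 1$. The sketch below is a minimal illustration under stated assumptions: $\nu$ is taken to be an integer larger than $p + 1$ so that $W$ can be generated as a scaled Gram matrix of Gaussian vectors, and the function names `draw_R2` and `draw_data`, the matrix `R1`, the seed and the number of trials are all hypothetical choices for the example.

```python
import numpy as np

def draw_R2(R1, nu, rng):
    """Draw R2 = G1 W^{-1} G1^T with W ~ W_p(nu, mu^{-1} I) and mu = nu - p - 1."""
    p = R1.shape[0]
    mu = nu - p - 1                          # choice that gives E{R2} = R1
    G1 = np.linalg.cholesky(R1)              # R1 = G1 G1^T
    Z = rng.standard_normal((p, nu))         # integer nu > p + 1 assumed here
    W = (Z @ Z.T) / mu                       # Wishart, nu dof, parameter mu^{-1} I
    return G1 @ np.linalg.solve(W, G1.T)     # G1 W^{-1} G1^T

def draw_data(R1, n1, n2, nu, rng):
    """Draw (X1, X2) with X1 | R1 ~ N(0, R1, I) and X2 | R2 ~ N(0, R2, I)."""
    p = R1.shape[0]
    R2 = draw_R2(R1, nu, rng)
    X1 = np.linalg.cholesky(R1) @ rng.standard_normal((p, n1))
    X2 = np.linalg.cholesky((R2 + R2.T) / 2) @ rng.standard_normal((p, n2))
    return X1, X2, R2

rng = np.random.default_rng(2)
R1 = np.array([[2.0, 0.5, 0.0],
               [0.5, 1.0, 0.3],
               [0.0, 0.3, 1.5]])
nu, trials = 50, 4000

# Monte Carlo average of R2, to be compared with R1.
R2_mean = sum(draw_R2(R1, nu, rng) for _ in range(trials)) / trials
rel_err = np.linalg.norm(R2_mean - R1) / np.linalg.norm(R1)
print("relative error between mean(R2) and R1:", rel_err)

X1, X2, R2 = draw_data(R1, n1=5, n2=20, nu=nu, rng=rng)
```

Increasing `nu` concentrates the draws of `R2` around `R1`, while a value of `nu` close to $p + 1$ produces strongly mismatched covariance matrices.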

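The marginal distribution of $(X_1, X_2)$ is obtained by integrating (1) with respect to $W$, which results in

$$
\begin{aligned}
p(X_1, X_2 | R_1) &= \int_{W>0} p(X_1, X_2 | R_1, W)\, p(W)\, dW \\
&= \frac{(2\pi)^{-p(n_1+n_2)/2} \mu^{\nu p/2}}{2^{\nu p/2} \Gamma_p(\nu/2)} |R_1|^{-(n_1+n_2)/2} \,\mathrm{etr}\left\{-\frac{1}{2} X_1^T R_1^{-1} X_1\right\} \\
&\quad \times \int_{W>0} |W|^{(\nu+n_2-p-1)/2} \,\mathrm{etr}\left\{-\frac{1}{2} W \left[\mu I + G_1^{-1} X_2 X_2^T G_1^{-T}\right]\right\} dW \\
&= \frac{(2\pi)^{-p(n_1+n_2)/2} \mu^{\nu p/2}}{2^{\nu p/2} \Gamma_p(\nu/2)} \, 2^{(\nu+n_2)p/2} \Gamma_p((\nu+n_2)/2) \\
&\quad \times |R_1|^{-(n_1+n_2)/2} \left|\mu I + G_1^{-1} X_2 X_2^T G_1^{-T}\right|^{-(\nu+n_2)/2} \mathrm{etr}\left\{-\frac{1}{2} X_1^T R_1^{-1} X_1\right\} \\
&= (2\pi)^{-pn_1/2} |R_1|^{-n_1/2} \,\mathrm{etr}\left\{-\frac{1}{2} X_1^T R_1^{-1} X_1\right\} \\
&\quad \times \pi^{-pn_2/2} \frac{\Gamma_p((\nu+n_2)/2)}{\Gamma_p(\nu/2)} |\mu R_1|^{-n_2/2} \left|I + X_2^T [\mu R_1]^{-1} X_2\right|^{-(\nu+n_2)/2}
\end{aligned}
\tag{2}
$$

In order to obtain the third equality, we made use of the fact that, if $S \overset{d}{=} \mathcal{W}_p(\nu, \Sigma)$,

$$
\int_{S>0} p(S)\, dS = 1 \;\Rightarrow\; \int_{S>0} |S|^{(\nu-p-1)/2} \,\mathrm{etr}\left\{-\frac{1}{2} S \Sigma^{-1}\right\} dS = 2^{\nu p/2} \Gamma_p(\nu/2) |\Sigma|^{\nu/2} \tag{3}
$$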
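The last line of (2) can be evaluated in the log domain, which is the form one would typically use when maximizing it numerically. The sketch below is one possible implementation, not the derivation of the MLE itself: the function names `log_multigamma`, `log_p_X1`, `log_p_X2` and `log_marginal` are assumptions, and the multivariate gamma function is computed from the standard identity $\Gamma_p(a) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma(a - (j-1)/2)$.

```python
import math
import numpy as np

def log_multigamma(a, p):
    """log Gamma_p(a) = p(p-1)/4 * log(pi) + sum_{j=1}^p log Gamma(a - (j-1)/2)."""
    return p * (p - 1) / 4 * math.log(math.pi) + sum(
        math.lgamma(a - j / 2) for j in range(p))

def log_p_X1(X1, R1):
    """Log of the Gaussian factor of (2) that depends on X1."""
    p, n1 = X1.shape
    _, logdet_R1 = np.linalg.slogdet(R1)
    quad = np.trace(X1.T @ np.linalg.solve(R1, X1))     # tr{X1^T R1^{-1} X1}
    return -0.5 * p * n1 * math.log(2 * math.pi) - 0.5 * n1 * logdet_R1 - 0.5 * quad

def log_p_X2(X2, R1, nu, mu):
    """Log of the factor of (2) that depends on X2."""
    p, n2 = X2.shape
    _, logdet_muR1 = np.linalg.slogdet(mu * R1)
    M = np.eye(n2) + X2.T @ np.linalg.solve(mu * R1, X2)  # I + X2^T [mu R1]^{-1} X2
    _, logdet_M = np.linalg.slogdet(M)
    return (-0.5 * p * n2 * math.log(math.pi)
            + log_multigamma((nu + n2) / 2, p) - log_multigamma(nu / 2, p)
            - 0.5 * n2 * logdet_muR1 - 0.5 * (nu + n2) * logdet_M)

def log_marginal(X1, X2, R1, nu, mu):
    """log p(X1, X2 | R1) as given by the last line of (2)."""
    return log_p_X1(X1, R1) + log_p_X2(X2, R1, nu, mu)

# Illustrative call with arbitrary sizes and white data (assumed values).
rng = np.random.default_rng(3)
p, n1, n2, nu = 3, 4, 6, 12
mu = nu - p - 1
X1 = rng.standard_normal((p, n1))
X2 = rng.standard_normal((p, n2))
print("log p(X1, X2 | R1 = I) =", log_marginal(X1, X2, np.eye(p), nu, mu))
```

Keeping the two factors in separate functions mirrors the structure of (2), and it makes each factor easy to check on its own: in the scalar case $p = n_2 = 1$, for instance, the second factor reduces to a scaled Student density that should integrate to one.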

Note that $p(X_1, X_2 | R_1)$ in (2) can be factored as $p(X_1, X_2 | R_1) = f_1(X_1, R_1) \times f_2(X_2, R_1)$ which shows that $X_1$ and $X_2$ are marginally independent and that $p(X_1, X_2 | R_1) = p(X_1 | R_1)\, p(X_2 | R_1)$ with $p(X_1 | R_1) \propto \mathrm{etr}\{-\frac{1}{2} X_1^T R_1^{-1} X_1\}$ and $p(X_2 | R_1) \propto |I + X_2^T [\mu R_1]^{-1} X_2|^{-(\nu+n_2)/2}$. Due to the model adopted for the random matrix $W = G_1^T R_2^{-1} G_1$, $X_2$ follows a matrix variate Student distribution [41]. Therefore, the fact that $R_2 \neq R_1$ results here in two data sets with different distributions: one set