HAL Id: hal-02572461
https://hal.archives-ouvertes.fr/hal-02572461
Submitted on 13 May 2020
Maximum likelihood covariance matrix estimation from two possibly mismatched data sets
Olivier Besson
To cite this version:
Olivier Besson. Maximum likelihood covariance matrix estimation from two possibly mismatched data sets. Signal Processing, Elsevier, 2020, 167, pp.107285-107294. doi:10.1016/j.sigpro.2019.107285. hal-02572461.
Author's version: https://oatao.univ-toulouse.fr/25984
https://doi.org/10.1016/j.sigpro.2019.107285
Besson, Olivier Maximum likelihood covariance matrix estimation from two possibly mismatched data sets. (2020) Signal Processing, 167. 107285-107294. ISSN 0165-1684
ISAE-SUPAERO, 10 Avenue Edouard Belin, Toulouse 31055, France
Keywords: Covariance matrix estimation; Maximum likelihood; Mismatch

Abstract

We consider estimating the covariance matrix from two data sets, one whose covariance matrix $R_1$ is the sought one and another set of samples whose covariance matrix $R_2$ slightly differs from the sought one, due e.g. to different measurement configurations. We assume however that the two matrices are rather close, which we formulate by assuming that $R_1^{1/2} R_2^{-1} R_1^{1/2} \mid R_1$ follows a Wishart distribution around the identity matrix. It turns out that this assumption results in two data sets with different marginal distributions, hence the problem becomes that of covariance matrix estimation from two data sets which are distribution-mismatched. The maximum likelihood estimator (MLE) is derived and is shown to depend on the number of samples in each set. We show that it involves whitening of one data set by the other one, shrinkage of eigenvalues and colorization, at least when one data set contains more samples than the size $p$ of the observation space. When both data sets have fewer than $p$ samples but the total number is larger than $p$, the MLE again entails eigenvalue shrinkage, but this time after a projection operation. Simulation results compare the new estimator to state-of-the-art techniques.

1. Problem statement
Analysis or processing of multichannel data most often relies on the covariance matrix, which is a fundamental tool e.g., for principal component analysis, spectral analysis, adaptive filtering, detection and direction of arrival estimation, among others [1–3]. In practical applications, the $p \times p$ covariance matrix $R$ needs to be estimated from a finite number $n$ of samples. When the latter are independent and Gaussian distributed, the maximum likelihood estimator of $R$ is $n^{-1}S$ where $X$ is the $p \times n$ data matrix and $S = XX^T$ is the sample covariance matrix (SCM) [1]. However, in low sample support or when deviation from the Gaussian assumption is at hand, the SCM tends to behave poorly. In particular, it was observed that the sample covariance matrix is usually less well-conditioned than the true covariance matrix, and therefore considerable effort has been dedicated to regularizing it with a view to improving its performance.

One of the most important approaches in this respect is due to Stein [4–6] who, instead of maximizing the likelihood function, advocated minimizing a meaningful loss function within a given class of estimators. Stein hence introduced the concept of admissible estimation and minimax estimators under the so-called Stein's loss. He showed that the SCM-based estimator is not minimax and
derived minimax estimators in two important classes, namely estimators of the form $\hat{R} = G D G^T$ where $D$ is a diagonal matrix and $G$ is the Cholesky factor of $S$, or of the form $\hat{R} = U \operatorname{diag}(\varphi(\lambda)) U^T$ where $U \operatorname{diag}(\lambda) U^T$ is the eigenvalue decomposition of $S$ and $\varphi(\lambda)$ is a non-linear function of $\lambda$. This seminal work of Stein gave rise to a great number of studies, see for instance [7–13] and references therein. A second class of robust estimates is based on linear shrinkage of the SCM to a target matrix (an approach which can be interpreted as an empirical Bayes technique), i.e., estimates of the form $\hat{R} = \alpha R_t + \beta S$ where $R_t = I$ is the most widely spread choice, see e.g., [14–20]. Note that these techniques applied with $R_t = I$ achieve an affine transformation of the eigenvalues of $S$, while retaining the eigenvectors, and therefore bear resemblance with Stein's method, although the selection of $\alpha$, $\beta$ may not be driven by the same principle. Robustness to a possibly non-Gaussian distribution has also been a topic of considerable interest and many papers have focused on robust estimation for elliptically distributed data, see e.g., [21–30] and references therein.
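The remark on linear shrinkage toward $R_t = I$ can be checked numerically. The following is a minimal sketch, not taken from the references above: the function names `sample_covariance` and `linear_shrinkage` and the values of $\alpha$ and $\beta$ are arbitrary choices made for illustration only. It verifies that the eigenvalues of $\alpha I + \beta S$ are an affine function of those of $S$.

```python
import numpy as np

def sample_covariance(X):
    """Unnormalized SCM S = X X^T for a p x n zero-mean data matrix X."""
    return X @ X.T

def linear_shrinkage(S, alpha, beta):
    """Linear shrinkage estimate alpha * R_t + beta * S with target R_t = I."""
    p = S.shape[0]
    return alpha * np.eye(p) + beta * S

rng = np.random.default_rng(0)
p, n = 8, 12                      # illustrative sizes (assumption)
X = rng.standard_normal((p, n))
S = sample_covariance(X)

alpha, beta = 0.3, 0.7 / n        # arbitrary illustrative weights (assumption)
R_hat = linear_shrinkage(S, alpha, beta)

# eigvalsh returns eigenvalues in ascending order; beta > 0 keeps that order.
eig_S = np.linalg.eigvalsh(S)
eig_R = np.linalg.eigvalsh(R_hat)
print(np.allclose(eig_R, alpha + beta * eig_S))
```

Since $\alpha I$ commutes with $S$, the eigenvectors are unchanged and only the eigenvalues move, which is why this family resembles Stein's eigenvalue-based estimators.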
Most of the above cited works deal with estimation of a covariance matrix from a single data set. In this paper, we consider a situation where two data sets $X_1$ and $X_2$ are available, with respective covariance matrices $R_1$ and $R_2$. This situation typically arises in radar applications when one wishes to detect a target buried in clutter with unknown statistics [31,32]. In order to infer the latter, training samples are generally used, which hopefully share the same statistics as the clutter in the cell under test (CUT). However, it has been evidenced that clutter is most often heterogeneous [31], with a discrepancy compared to the CUT that may grow with the distance to the CUT [33]. Therefore, one is led to use some clustering that separates training samples, either based on their proximity to the CUT or by means of some statistical criterion, such as the power selected training [34]. The samples so selected are deemed to be representative of the clutter in the CUT while others are less reliable, which corresponds to the situation considered herein. A second example is in the field of synthetic aperture radar in the case where a scene is imaged on two consecutive days, with possible changes in between [35]. Finally, in hyperspectral imagery, the problem of target or anomaly detection leads to a very similar framework. Indeed, the background in a pixel under test has to be estimated from the local pixels around it and from pixels located further apart [36]. In the present paper, we assume that $R_2$ is close to $R_1$, the covariance matrix we wish to estimate. Since $R_2$ differs from but is close to $R_1$, we investigate using both $X_1$ and $X_2$ to estimate $R_1$. The reason for also using $X_2$ is that, although its covariance matrix is not $R_1$, it is close to it. Additionally, one might face situations where the number of samples in $X_1$ is very small.

This paper constitutes a first approach to this specific problem and we focus herein on the most natural approach, namely maximum likelihood estimation. The objective is to figure out the pros and cons of the latter and the conditions under which it is an accurate estimator. The paper is organized as follows. In Section 2 we formulate the statistical assumptions: more precisely, we assume that $R_1^{1/2} R_2^{-1} R_1^{1/2} \mid R_1$ is a random matrix with a Wishart distribution around the identity matrix, and we derive the joint distribution of $(X_1, X_2)$. Section 3 is devoted to the derivation of the maximum likelihood estimator of $R_1$ from $(X_1, X_2)$, taking into account the possible configurations regarding the number of samples in each data set. Numerical simulations illustrate the performance of the MLE and compare it with existing alternatives in Section 4. Conclusions and possible extensions of the present work are drawn in Section 5.
2. Data model

Let us assume that we have two sets of measurements $X_1$ ($p \times n_1$) and $X_2$ ($p \times n_2$) which are distributed according to $X_1 \overset{d}{=} \mathcal{N}(0, R_1, I)$ and $X_2 \overset{d}{=} \mathcal{N}(0, R_2, I)$ where $\mathcal{N}(0, \Sigma, \Omega)$ denotes the matrix-variate normal distribution whose density is $(2\pi)^{-pn/2} |\Sigma|^{-n/2} |\Omega|^{-p/2} \operatorname{etr}\{-\tfrac{1}{2} X^T \Sigma^{-1} X \Omega^{-1}\}$, with $|\cdot|$ the determinant and $\operatorname{etr}\{\cdot\}$ the exponential of the trace of a matrix. Note that we consider real-valued data here whereas in radar applications it is customary to consider complex-valued signals. In Appendix A we show how the results below can be readily extended to the complex case. Our goal in this paper is to estimate $R_1$, using both $X_1$ and $X_2$ even if $R_1 \neq R_2$. However, we assume that the two matrices are close to each other. In order to define a model that can reflect the proximity between $R_1$ and $R_2$, we note that the natural distance between them is given by $d^2(R_1, R_2) = \sum_{k=1}^{p} \log^2 \lambda_k(G_1^T R_2^{-1} G_1)$ [37,38] where $G_1$ is a square root of $R_1$, i.e., $R_1 = G_1 G_1^T$, and $\lambda_k(G_1^T R_2^{-1} G_1)$ stands for the $k$th eigenvalue of $G_1^T R_2^{-1} G_1$. This matrix is also pivotal in adaptive detection problems. More precisely, in the case of a covariance mismatch between the training samples and the data under test, it is shown in [39] that the performance of the well-known adaptive matched filter depends essentially on this matrix. Therefore, it becomes natural to encapsulate the difference between $R_1$ and $R_2$ through the matrix $W = G_1^T R_2^{-1} G_1$ and its proximity to the identity matrix. There are of course different ways to translate this constraint in the model. For instance, a frequentist approach may be advocated where the joint probability density function of $(X_1, X_2)$ would be maximized under the constraint that the distance between $W$ and $I$ is smaller than some value. Alternatively, and this is what we elect here, one can resort to an empirical Bayes approach where the random matrix $W$ follows some prior distribution rather concentrated around $I$. For mathematical tractability, we choose a conjugate prior for $W$ and we assume that $W$ follows a Wishart distribution with $\nu$ degrees of freedom and parameter matrix $\mu^{-1} I$, i.e., $W \overset{d}{=} \mathcal{W}_p(\nu, \mu^{-1} I)$. Of course, this is a rather strong assumption whose validity would be difficult to check, e.g., on real data. However, it is in accordance with the mere knowledge we have about the relation between $R_1$ and $R_2$, and it allows for tractable derivations.
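As an illustration of the natural distance $d^2(R_1, R_2)$ and of the matrix $W = G_1^T R_2^{-1} G_1$, here is a minimal sketch. It assumes that $G_1$ is taken as the Cholesky factor of $R_1$ (any square root gives the same eigenvalues); the function name `natural_distance_sq` is hypothetical and not part of the paper.

```python
import numpy as np

def natural_distance_sq(R1, R2):
    """d^2(R1, R2) = sum_k log^2 lambda_k(G1^T R2^{-1} G1), with R1 = G1 G1^T."""
    G1 = np.linalg.cholesky(R1)           # assumption: Cholesky square root
    W = G1.T @ np.linalg.solve(R2, G1)    # W = G1^T R2^{-1} G1
    W = 0.5 * (W + W.T)                   # symmetrize against round-off
    lam = np.linalg.eigvalsh(W)
    return float(np.sum(np.log(lam) ** 2))

# Two illustrative covariance matrices (assumption, for demonstration only).
p = 4
idx = np.arange(p)
R1 = 0.5 ** np.abs(np.subtract.outer(idx, idx))
R2 = 0.6 ** np.abs(np.subtract.outer(idx, idx))

d12 = natural_distance_sq(R1, R2)
d21 = natural_distance_sq(R2, R1)
print(d12, d21)
```

The distance vanishes when $W = I$, and swapping the two matrices inverts the eigenvalues of $W$, which leaves the squared logarithms unchanged.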
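Using the fact that $X_1 \mid R_1$ and $X_2 \mid R_2$ are independent and Gaussian distributed with respective covariance matrices $R_1$ and $R_2$, and since $R_2 = G_1 W^{-1} G_1^T$, we thus assume the following stochastic model:

$$
p(X_1, X_2 \mid R_1, W) = (2\pi)^{-p(n_1+n_2)/2} \, |R_1|^{-n_1/2} \, |W^{-1} R_1|^{-n_2/2} \operatorname{etr}\left\{ -\tfrac{1}{2} X_1^T R_1^{-1} X_1 - \tfrac{1}{2} X_2^T G_1^{-T} W G_1^{-1} X_2 \right\} \tag{1a}
$$

$$
p(W) = \frac{\mu^{\nu p/2}}{2^{\nu p/2} \, \Gamma_p(\nu/2)} \, |W|^{(\nu-p-1)/2} \operatorname{etr}\left\{ -\tfrac{1}{2} \mu W \right\} \tag{1b}
$$

Note that $E\{W^{-1}\} = (\nu-p-1)^{-1} \mu I$ so that $E\{R_2\} = E\{G_1 W^{-1} G_1^T\} = (\nu-p-1)^{-1} \mu R_1$: therefore, for $E\{R_2\}$ to be equal to $R_1$, one must select $\mu = \nu-p-1$. Observe also that $W$ comes closer to $I$ as $\nu$ grows large. Indeed, $E\{W\} = \nu (\nu-p-1)^{-1} I$ and $E\{(W - E\{W\})^2\} = p \nu (\nu-p-1)^{-2} I$ which goes to zero as $\nu \to \infty$ [40].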
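The stochastic model (1) can be simulated directly. The sketch below is an illustration under stated assumptions, not the simulation code of the paper: it restricts $\nu$ to an integer so that $W$ can be drawn as $ZZ^T$ with Gaussian $Z$, uses the Cholesky factor for $G_1$, and the helper name `sample_two_sets` as well as the numerical values are hypothetical. It also checks by Monte Carlo that $E\{R_2\}$ is close to $R_1$ when $\mu = \nu - p - 1$.

```python
import numpy as np

def sample_two_sets(R1, n1, n2, nu, mu, rng):
    """Draw (X1, X2, R2) from model (1); nu is assumed to be an integer >= p."""
    p = R1.shape[0]
    G1 = np.linalg.cholesky(R1)                     # R1 = G1 G1^T
    Z = rng.standard_normal((p, nu)) / np.sqrt(mu)
    W = Z @ Z.T                                     # W ~ W_p(nu, mu^{-1} I)
    R2 = G1 @ np.linalg.solve(W, G1.T)              # R2 = G1 W^{-1} G1^T
    R2 = 0.5 * (R2 + R2.T)                          # symmetrize against round-off
    X1 = G1 @ rng.standard_normal((p, n1))          # columns ~ N(0, R1)
    X2 = np.linalg.cholesky(R2) @ rng.standard_normal((p, n2))  # columns ~ N(0, R2)
    return X1, X2, R2

rng = np.random.default_rng(1)
p, nu = 3, 40                      # illustrative values (assumption)
mu = nu - p - 1                    # choice that makes E{R2} = R1
idx = np.arange(p)
R1 = 0.5 ** np.abs(np.subtract.outer(idx, idx))

draws = 2000
R2_mean = np.zeros((p, p))
for _ in range(draws):
    X1, X2, R2 = sample_two_sets(R1, 5, 5, nu, mu, rng)
    R2_mean += R2 / draws

rel_err = np.linalg.norm(R2_mean - R1) / np.linalg.norm(R1)
print(f"relative error of the Monte Carlo mean of R2: {rel_err:.3f}")
```

Increasing `nu` (with `mu` adjusted accordingly) concentrates $W$ around $I$, so that individual draws of $R_2$, and not only their mean, get closer to $R_1$.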
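The marginal distribution of $(X_1, X_2)$ is obtained by integrating (1) with respect to $W$, which results in

$$
\begin{aligned}
p(X_1, X_2 \mid R_1) &= \int_{W>0} p(X_1, X_2 \mid R_1, W) \, p(W) \, dW \\
&= \frac{(2\pi)^{-p(n_1+n_2)/2} \mu^{\nu p/2}}{2^{\nu p/2} \, \Gamma_p(\nu/2)} \, |R_1|^{-(n_1+n_2)/2} \operatorname{etr}\left\{ -\tfrac{1}{2} X_1^T R_1^{-1} X_1 \right\} \\
&\quad \times \int_{W>0} |W|^{(\nu+n_2-p-1)/2} \operatorname{etr}\left\{ -\tfrac{1}{2} W \left[ \mu I + G_1^{-1} X_2 X_2^T G_1^{-T} \right] \right\} dW \\
&= \frac{(2\pi)^{-p(n_1+n_2)/2} \mu^{\nu p/2}}{2^{\nu p/2} \, \Gamma_p(\nu/2)} \, 2^{(\nu+n_2)p/2} \, \Gamma_p((\nu+n_2)/2) \\
&\quad \times |R_1|^{-(n_1+n_2)/2} \left| \mu I + G_1^{-1} X_2 X_2^T G_1^{-T} \right|^{-(\nu+n_2)/2} \operatorname{etr}\left\{ -\tfrac{1}{2} X_1^T R_1^{-1} X_1 \right\} \\
&= (2\pi)^{-p n_1/2} \, |R_1|^{-n_1/2} \operatorname{etr}\left\{ -\tfrac{1}{2} X_1^T R_1^{-1} X_1 \right\} \\
&\quad \times \pi^{-p n_2/2} \, \frac{\Gamma_p((\nu+n_2)/2)}{\Gamma_p(\nu/2)} \, |\mu R_1|^{-n_2/2} \left| I + X_2^T [\mu R_1]^{-1} X_2 \right|^{-(\nu+n_2)/2}
\end{aligned}
\tag{2}
$$

In order to obtain the third equality, we made use of the fact that, if $S \overset{d}{=} \mathcal{W}_p(\nu, \Sigma)$,

$$
\int_{S>0} p(S) \, dS = 1 \;\Rightarrow\; \int_{S>0} |S|^{(\nu-p-1)/2} \operatorname{etr}\left\{ -\tfrac{1}{2} S \Sigma^{-1} \right\} dS = 2^{\nu p/2} \, \Gamma_p(\nu/2) \, |\Sigma|^{\nu/2} \tag{3}
$$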
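The last two lines of (2) can be cross-checked numerically. The following sketch is illustrative only and rests on stated assumptions: $G_1$ is taken as the Cholesky factor of $R_1$, the multivariate gamma function is computed from its standard product form $\Gamma_p(a) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma(a - (j-1)/2)$, and the function names and test values are hypothetical. It evaluates the logarithm of the marginal density once with the $n_2 \times n_2$ determinant (last line of (2)) and once with the $p \times p$ determinant (third equality), and compares the two.

```python
import math
import numpy as np

def log_multigamma(a, p):
    """Log of the multivariate gamma function Gamma_p(a)."""
    return p * (p - 1) / 4 * math.log(math.pi) + sum(
        math.lgamma(a - 0.5 * j) for j in range(p))

def log_marginal(X1, X2, R1, nu, mu):
    """log p(X1, X2 | R1) from the last line of (2)."""
    p, n1 = X1.shape
    n2 = X2.shape[1]
    _, logdet_R1 = np.linalg.slogdet(R1)
    quad1 = np.sum(X1 * np.linalg.solve(R1, X1))      # tr(X1^T R1^{-1} X1)
    M = np.eye(n2) + X2.T @ np.linalg.solve(mu * R1, X2)
    _, logdet_M = np.linalg.slogdet(M)
    log_p1 = -0.5 * p * n1 * math.log(2 * math.pi) - 0.5 * n1 * logdet_R1 - 0.5 * quad1
    log_p2 = (-0.5 * p * n2 * math.log(math.pi)
              + log_multigamma(0.5 * (nu + n2), p) - log_multigamma(0.5 * nu, p)
              - 0.5 * n2 * (p * math.log(mu) + logdet_R1)
              - 0.5 * (nu + n2) * logdet_M)
    return log_p1 + log_p2

def log_marginal_pxp(X1, X2, R1, nu, mu):
    """Same quantity from the third equality of (2), using a p x p determinant."""
    p, n1 = X1.shape
    n2 = X2.shape[1]
    G1 = np.linalg.cholesky(R1)                       # assumption: Cholesky square root
    Y2 = np.linalg.solve(G1, X2)                      # G1^{-1} X2
    _, logdet_R1 = np.linalg.slogdet(R1)
    _, logdet_A = np.linalg.slogdet(mu * np.eye(p) + Y2 @ Y2.T)
    quad1 = np.sum(X1 * np.linalg.solve(R1, X1))
    return (-0.5 * p * (n1 + n2) * math.log(2 * math.pi)
            + 0.5 * nu * p * math.log(mu) - 0.5 * nu * p * math.log(2)
            - log_multigamma(0.5 * nu, p)
            + 0.5 * (nu + n2) * p * math.log(2) + log_multigamma(0.5 * (nu + n2), p)
            - 0.5 * (n1 + n2) * logdet_R1 - 0.5 * (nu + n2) * logdet_A - 0.5 * quad1)

rng = np.random.default_rng(2)
p, n1, n2, nu, mu = 4, 6, 5, 10.0, 5.0                # illustrative values (assumption)
idx = np.arange(p)
R1 = 0.5 ** np.abs(np.subtract.outer(idx, idx))
X1 = rng.standard_normal((p, n1))
X2 = rng.standard_normal((p, n2))

ll_a = log_marginal(X1, X2, R1, nu, mu)
ll_b = log_marginal_pxp(X1, X2, R1, nu, mu)
print(ll_a, ll_b)
```

The agreement of the two values reflects the identity $|\mu I_p + G_1^{-1} X_2 X_2^T G_1^{-T}| = \mu^p \, |I_{n_2} + X_2^T [\mu R_1]^{-1} X_2|$, so that whichever determinant is smaller can be used in practice.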
Note that $p(X_1, X_2 \mid R_1)$ in (2) can be factored as $p(X_1, X_2 \mid R_1) = f_1(X_1, R_1) \times f_2(X_2, R_1)$, which shows that $X_1$ and $X_2$ are marginally independent and that $p(X_1, X_2 \mid R_1) = p(X_1 \mid R_1) \, p(X_2 \mid R_1)$ with $p(X_1 \mid R_1) \propto \operatorname{etr}\{-\tfrac{1}{2} X_1^T R_1^{-1} X_1\}$ and $p(X_2 \mid R_1) \propto |I + X_2^T [\mu R_1]^{-1} X_2|^{-(\nu+n_2)/2}$. Due to the model adopted for the random matrix $W = G_1^T R_2^{-1} G_1$, $X_2$ follows a matrix-variate Student distribution [41]. Therefore, the fact that $R_2 \neq R_1$ results here in two data sets with different distributions: one set