Maximum likelihood covariance matrix estimation from two possibly mismatched data sets


https://oatao.univ-toulouse.fr/25984

https://doi.org/10.1016/j.sigpro.2019.107285

Besson, Olivier Maximum likelihood covariance matrix estimation from two possibly mismatched data sets. (2020)

Signal Processing, 167. 107285-107294. ISSN 0165-1684

Olivier Besson

ISAE-SUPAERO, 10 Avenue Edouard Belin, Toulouse 31055, France

Keywords: Covariance matrix estimation; Maximum likelihood; Mismatch

Abstract

We consider estimating the covariance matrix from two data sets, one whose covariance matrix $R_1$ is the sought one and another set of samples whose covariance matrix $R_2$ slightly differs from the sought one, due e.g. to different measurement configurations. We assume however that the two matrices are rather close, which we formulate by assuming that $R_1^{1/2} R_2^{-1} R_1^{1/2} \,|\, R_1$ follows a Wishart distribution around the identity matrix. It turns out that this assumption results in two data sets with different marginal distributions, hence the problem becomes that of covariance matrix estimation from two data sets which are distribution-mismatched. The maximum likelihood estimator (MLE) is derived and is shown to depend on the values of the number of samples in each set. We show that it involves whitening of one data set by the other one, shrinkage of eigenvalues and colorization, at least when one data set contains more samples than the size $p$ of the observation space. When both data sets have less than $p$ samples but the total number is larger than $p$, the MLE again entails eigenvalues shrinkage but this time after a projection operation. Simulation results compare the new estimator to state of the art techniques.

1. Problem statement

Analysis or processing of multichannel data most often relies on the covariance matrix, which is a fundamental tool e.g., for principal component analysis, spectral analysis, adaptive filtering, detection, direction of arrival estimation among others [1–3]. In practical applications, the $p \times p$ covariance matrix $R$ needs to be estimated from a finite number $n$ of samples. When the latter are independent and Gaussian distributed, the maximum likelihood estimator of $R$ is $n^{-1}S$ where $X$ is the $p \times n$ data matrix and $S = XX^T$ is the sample covariance matrix (SCM) [1]. However, in low sample support or when deviation from the Gaussian assumption is at hand, the SCM tends to behave poorly. In particular it was observed that the sample covariance matrix is usually less well-conditioned than the true covariance matrix, and therefore considerable effort has been dedicated to regularizing it with a view to improving its performance.

One of the most important approaches in this respect is due to Stein [4–6] who, instead of maximizing the likelihood function, advocated minimizing a meaningful loss function within a given class of estimators. Stein hence introduced the concept of admissible estimation and minimax estimators under the so-called Stein's loss. He showed that the SCM-based estimator is not minimax and derived minimax estimators in two important classes, namely estimators of the form $\hat{R} = G D G^T$ where $D$ is a diagonal matrix and $G$ is the Cholesky factor of $S$, or of the form $\hat{R} = U \,\mathrm{diag}(\phi(\lambda))\, U^T$ where $U \,\mathrm{diag}(\lambda)\, U^T$ is the eigenvalue decomposition of $S$ and $\phi(\lambda)$ is a non-linear function of $\lambda$. This seminal work of Stein gave rise to a great number of studies, see for instance [7–13] and references therein. A second class of robust estimates is based on linear shrinkage of the SCM to a target matrix (an approach which can be interpreted as an empirical Bayes technique), i.e., estimates of the form $\hat{R} = \alpha R_t + \beta S$ where $R_t = I$ is the most widely spread choice, see e.g., [14–20]. Note that these techniques applied with $R_t = I$ achieve an affine transformation of the eigenvalues of $S$, while retaining the eigenvectors, and therefore bear resemblance with Stein's method, although the selection of $\alpha$, $\beta$ may not be driven by the same principle. Robustness to a possibly non Gaussian distribution has also been a topic of considerable interest and many papers have focused on robust estimation for elliptically distributed data, see e.g., [21–30] and references therein.

Most of the above cited works deal with estimation of a covariance matrix from a single data set. In this paper, we consider a situation where two data sets $X_1$ and $X_2$ are available, with respective covariance matrices $R_1$ and $R_2$. This situation typically arises in radar applications when one wishes to detect a target buried in clutter with unknown statistics [31,32]. In order to infer the latter, training samples are generally used, which hopefully share the same statistics as the clutter in the cell under test (CUT). However, it has been evidenced that clutter is most often heterogeneous [31], with a discrepancy compared to the CUT that may grow with the distance to the CUT [33]. Therefore, one is led to use some clustering that separates training samples, either based on their proximity to the CUT or by means of some statistical criterion, such as the power selected training [34]. The samples so selected are deemed to be representative of the clutter in the CUT while others are less reliable, which corresponds to the situation considered herein. A second example is in the field of synthetic aperture radar in the case where a scene is imaged on two consecutive days, with possible changes in between [35]. Finally, in hyperspectral imagery, the problem of target or anomaly detection leads to a very similar framework. Indeed, the background in a pixel under test has to be estimated from the local pixels around and pixels located further apart [36]. In the present paper, we assume that $R_2$ is close to $R_1$, the covariance matrix we wish to estimate. Since $R_2$ differs from but is close to $R_1$, we investigate using both $X_1$ and $X_2$ to estimate $R_1$. The reason for also using $X_2$ is that, although its covariance matrix is not $R_1$, it is close to it. Additionally, one might face situations where the number of samples in $X_1$ is very small. This paper constitutes a first approach to this specific problem and we focus herein on the most natural approach, namely maximum likelihood estimation. The objective is to figure out the pros and cons of the latter and the conditions under which it is an accurate estimator.

The paper is organized as follows. In Section 2 we formulate the statistical assumptions: more precisely, we assume that $R_1^{1/2} R_2^{-1} R_1^{1/2} \,|\, R_1$ is a random matrix with a Wishart distribution around the identity matrix, and we derive the joint distribution of $(X_1, X_2)$. Section 3 is devoted to the derivation of the maximum likelihood estimator of $R_1$ from $(X_1, X_2)$, taking into account the possible configurations regarding the number of samples in each data set. Numerical simulations illustrate the performance of the MLE and compare it with existing alternatives in Section 4. Conclusions and possible extensions of the present work are drawn in Section 5.

2. Data model

Let us assume that we have two sets of measurements $X_1$ ($p \times n_1$) and $X_2$ ($p \times n_2$) which are distributed according to $X_1 \overset{d}{=} \mathcal{N}(0, R_1, I)$ and $X_2 \overset{d}{=} \mathcal{N}(0, R_2, I)$ where $\mathcal{N}(0, \Sigma, \Omega)$ denotes the matrix-variate normal distribution whose density is $(2\pi)^{-pn/2} |\Sigma|^{-n/2} |\Omega|^{-p/2} \,\mathrm{etr}\{-\tfrac{1}{2} X^T \Sigma^{-1} X \Omega^{-1}\}$ with $|\cdot|$ the determinant and $\mathrm{etr}\{\cdot\}$ the exponential of the trace of a matrix. Note that we consider real-valued data here whereas in radar applications it is customary to consider complex-valued signals. In Appendix A we show how the results below can be readily extended to the complex case. Our goal in this paper is to estimate $R_1$, using both $X_1$ and $X_2$ even if $R_1 \neq R_2$. However we assume that the two matrices are close to each other. In order to define a model that can reflect the proximity between $R_1$ and $R_2$, we note that the natural distance between them is given by

$$d^2(R_1, R_2) = \sum_{k=1}^{p} \log^2 \lambda_k\left(G_1^T R_2^{-1} G_1\right)$$

[37,38] where $G_1$ is a square-root of $R_1$, i.e., $R_1 = G_1 G_1^T$ and $\lambda_k(G_1^T R_2^{-1} G_1)$ stands for the $k$th eigenvalue of $G_1^T R_2^{-1} G_1$. This matrix is pivotal in adaptive detection problems also. More precisely, in the case of a covariance mismatch between the training samples and the data under test, it is shown in [39] that the performance of the well-known adaptive matched filter depends essentially on this matrix. Therefore, it becomes natural to encapsulate the difference between $R_1$ and $R_2$ through the matrix $W = G_1^T R_2^{-1} G_1$ and its proximity to the identity matrix. There are of course different ways to translate this constraint in the model. For instance a frequentist approach may be advocated where the joint probability density function of $(X_1, X_2)$ would be maximized under the constraint that the distance between $W$ and $I$ is smaller than some value. Alternatively, and this is what we elect here, one can resort to an empirical Bayes approach where the random matrix $W$ follows some prior distribution rather concentrated around $I$. For mathematical tractability, we choose a conjugate prior for $W$ and we assume that $W$ follows a Wishart distribution with $\nu$ degrees of freedom and parameter matrix $\mu^{-1} I$, i.e., $W \overset{d}{=} \mathcal{W}_p(\nu, \mu^{-1} I)$. Of course, this is a rather strong assumption whose validity would be difficult to check, e.g., on real data. However, it is in accordance with the mere knowledge we have about the relation between $R_1$ and $R_2$, and it allows for tractable derivations.

Using the fact that $X_1 | R_1$ and $X_2 | R_2$ are independent and Gaussian distributed with respective covariance matrices $R_1$ and $R_2$, and since $R_2 = G_1 W^{-1} G_1^T$, we thus assume the following stochastic model:

$$p(X_1, X_2 | R_1, W) = (2\pi)^{-p(n_1+n_2)/2} |R_1|^{-n_1/2} |W^{-1} R_1|^{-n_2/2} \,\mathrm{etr}\left\{-\tfrac{1}{2} X_1^T R_1^{-1} X_1 - \tfrac{1}{2} X_2^T G_1^{-T} W G_1^{-1} X_2\right\} \tag{1a}$$

$$p(W) = \frac{\mu^{\nu p/2}}{2^{\nu p/2}\, \Gamma_p(\nu/2)} |W|^{(\nu-p-1)/2} \,\mathrm{etr}\left\{-\tfrac{1}{2} \mu W\right\} \tag{1b}$$

Note that $E\{W^{-1}\} = (\nu - p - 1)^{-1} \mu I$ so that $E\{R_2\} = E\{G_1 W^{-1} G_1^T\} = (\nu - p - 1)^{-1} \mu R_1$: therefore, for $E\{R_2\}$ to be equal to $R_1$, one must select $\mu = \nu - p - 1$. Observe also that $W$ comes closer to $I$ as $\nu$ grows large. Indeed, $E\{W\} = \nu (\nu - p - 1)^{-1} I$ and $E\{(W - E\{W\})^2\} = p \nu (\nu - p - 1)^{-2} I$, which goes to zero as $\nu \to \infty$ [40].

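The marginal distribution of $(X_1, X_2)$ is obtained by integrating (1) with respect to $W$, which results in

$$\begin{aligned}
p(X_1, X_2 | R_1) &= \int_{W>0} p(X_1, X_2 | R_1, W)\, p(W)\, dW \\
&= \frac{(2\pi)^{-p(n_1+n_2)/2} \mu^{\nu p/2}}{2^{\nu p/2}\, \Gamma_p(\nu/2)} |R_1|^{-(n_1+n_2)/2} \,\mathrm{etr}\left\{-\tfrac{1}{2} X_1^T R_1^{-1} X_1\right\} \\
&\quad \times \int_{W>0} |W|^{(\nu+n_2-p-1)/2} \,\mathrm{etr}\left\{-\tfrac{1}{2} W \left[\mu I + G_1^{-1} X_2 X_2^T G_1^{-T}\right]\right\} dW \\
&= \frac{(2\pi)^{-p(n_1+n_2)/2} \mu^{\nu p/2}}{2^{\nu p/2}\, \Gamma_p(\nu/2)}\, 2^{(\nu+n_2)p/2}\, \Gamma_p((\nu+n_2)/2) \\
&\quad \times |R_1|^{-(n_1+n_2)/2} \left|\mu I + G_1^{-1} X_2 X_2^T G_1^{-T}\right|^{-(\nu+n_2)/2} \mathrm{etr}\left\{-\tfrac{1}{2} X_1^T R_1^{-1} X_1\right\} \\
&= (2\pi)^{-pn_1/2} |R_1|^{-n_1/2} \,\mathrm{etr}\left\{-\tfrac{1}{2} X_1^T R_1^{-1} X_1\right\} \\
&\quad \times \pi^{-pn_2/2}\, \frac{\Gamma_p((\nu+n_2)/2)}{\Gamma_p(\nu/2)}\, |\mu R_1|^{-n_2/2} \left|I + X_2^T [\mu R_1]^{-1} X_2\right|^{-(\nu+n_2)/2}
\end{aligned} \tag{2}$$

In order to obtain the third equality, we made use of the fact that, if $S \overset{d}{=} \mathcal{W}_p(\nu, \Sigma)$,

$$\int_{S>0} p(S)\, dS = 1 \;\Rightarrow\; \int_{S>0} |S|^{(\nu-p-1)/2} \,\mathrm{etr}\left\{-\tfrac{1}{2} S \Sigma^{-1}\right\} dS = 2^{\nu p/2}\, \Gamma_p(\nu/2)\, |\Sigma|^{\nu/2} \tag{3}$$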

Note that $p(X_1, X_2 | R_1)$ in (2) can be factored as $p(X_1, X_2 | R_1) = f_1(X_1, R_1) \times f_2(X_2, R_1)$, which shows that $X_1$ and $X_2$ are marginally independent and that $p(X_1, X_2 | R_1) = p(X_1 | R_1)\, p(X_2 | R_1)$ with $p(X_1 | R_1) \propto \mathrm{etr}\{-\tfrac{1}{2} X_1^T R_1^{-1} X_1\}$ and $p(X_2 | R_1) \propto |I + X_2^T [\mu R_1]^{-1} X_2|^{-(\nu+n_2)/2}$. Due to the model adopted for the random matrix $W = G_1^T R_2^{-1} G_1$, $X_2$ follows a matrix-variate Student distribution [41]. Therefore, the fact that $R_2 \neq R_1$ results in the two data sets having different distributions: $X_1$ is Gaussian distributed with covariance matrix $R_1$ while the uncertainty in $R_2$ leads to a Student distribution for $X_2$. This is a rather original situation where one has to carry out covariance matrix estimation from two data sets which are mismatched in their distributions. This peculiarity will result in new schemes compared to the conventional case of a single set with given distribution, as detailed now.

3. Maximum likelihood estimation

In this section we address estimation of $R_1$ from $(X_1, X_2)$ and we focus on the most natural estimator, i.e., the maximum likelihood estimator. From (2), the log-likelihood function is, up to an additive constant term,

$$\begin{aligned}
f(R_1) &= -\frac{n_1+n_2}{2} \log|R_1| - \frac{\nu+n_2}{2} \log\left|I + \mu^{-1} R_1^{-1} S_2\right| - \frac{1}{2} \mathrm{Tr}\left\{R_1^{-1} S_1\right\} \\
&= \frac{\nu-n_1}{2} \log|R_1| - \frac{\nu+n_2}{2} \log\left|R_1 + \mu^{-1} S_2\right| - \frac{1}{2} \mathrm{Tr}\left\{R_1^{-1} S_1\right\}
\end{aligned} \tag{4}$$

where $S_1 = X_1 X_1^T$ and $S_2 = X_2 X_2^T$. Differentiating the previous equation and using the fact that $d|R| = |R|\, \mathrm{Tr}\{R^{-1} dR\}$ and $dR^{-1} = -R^{-1} (dR) R^{-1}$, we obtain the following equation that the ML solution should satisfy:

$$(\nu - n_1) R_1^{-1} - (\nu + n_2) \left(R_1 + \mu^{-1} S_2\right)^{-1} + R_1^{-1} S_1 R_1^{-1} = 0 \tag{5}$$

In order to solve (5), we must investigate various configurations for $(n_1, n_2)$ as the solution will depend on them. Before going to the technical details of each case, we give an overview of the results obtained.
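As a numerical aid, the log-likelihood (4) and the left-hand side of the ML equation (5) can be coded directly. The sketch below is only illustrative: it assumes NumPy, and the function names `log_likelihood` and `ml_equation_residual` are hypothetical, not taken from the paper. Since $df = \tfrac{1}{2}\mathrm{Tr}\{M\, dR_1\}$ with $M$ the left-hand side of (5), a finite-difference check of (4) against (5) is a convenient sanity test.

```python
import numpy as np

def log_likelihood(R1, S1, S2, n1, n2, nu, mu):
    """Second form of the log-likelihood (4), up to an additive constant."""
    _, logdet_R1 = np.linalg.slogdet(R1)
    _, logdet_sum = np.linalg.slogdet(R1 + S2 / mu)
    trace_term = np.trace(np.linalg.solve(R1, S1))   # Tr{R1^{-1} S1}
    return (0.5 * (nu - n1) * logdet_R1
            - 0.5 * (nu + n2) * logdet_sum
            - 0.5 * trace_term)

def ml_equation_residual(R1, S1, S2, n1, n2, nu, mu):
    """Left-hand side of (5); it is zero at a stationary point of (4)."""
    R1_inv = np.linalg.inv(R1)
    return ((nu - n1) * R1_inv
            - (nu + n2) * np.linalg.inv(R1 + S2 / mu)
            + R1_inv @ S1 @ R1_inv)
```

A candidate estimate can then be checked by verifying that the norm of `ml_equation_residual` is negligible.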

3.1. Summary of results

As is illustrated below, the expression of the maximum likelihood estimator depends on the respective values of $n_1$ and $n_2$. In the sequel three cases will be distinguished: a first situation where $n_1 < p$ and $n_2 \geq p$, a second one which is the mirror situation, namely $n_1 \geq p$ and $n_2 < p$, and finally a third more challenging case where $n_1 < p$, $n_2 < p$ and $n_1 + n_2 \geq p$.

In the first [respectively second] case, the ML solution is given by (11) [resp. (21)]: it will be shown that the estimation process entails whitening of $X_1$ [resp. $X_2$] by the inverse of the square-root of the sample covariance matrix of $X_2$ [resp. $X_1$], followed by shrinkage of eigenvalues and finally colorization by the square-root of the sample covariance matrix of $X_2$ [resp. $X_1$]. The technique of eigenvalue shrinkage is rather well known but usually applied to the SCM of a single set: herein, due to the presence of two data sets, this technique is applied to one data set after it has been whitened by the second one. Interestingly enough, the ML solution can also be written as (14) [resp. (22)], that is as a weighted sum of the SCM of each data set, where the weighting matrix is diagonal for one set of samples, and non diagonal for the other set.

Finally, when $n_2 < p$, $n_1 < p$ and $n_1 + n_2 \geq p$, the procedure includes a partitioning between the subspace spanned by the columns of $X_2$ and its orthogonal complement. In the former, shrinkage of eigenvalues is used while, in the latter, projection of the SCM of $X_1$ is retained.

3.2. Case $n_1 < p$ and $n_2 \geq p$

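We consider first the case where $n_1 < p$ and $n_2 \geq p$, i.e., $n_1$ is not large enough for $S_1$ to be positive definite and one needs to use $X_2$ in order to estimate $R_1$, even though $R_2 \neq R_1$. Eq. (5) can be rewritten as

$$\begin{aligned}
&(\nu - n_1) R_1^{-1} \left[R_1 + \mu^{-1} S_2\right] - (\nu + n_2) I + R_1^{-1} S_1 R_1^{-1} \left[R_1 + \mu^{-1} S_2\right] = 0 \\
&\Rightarrow -(n_1 + n_2) I + (\nu - n_1) \mu^{-1} R_1^{-1} S_2 + R_1^{-1} S_1 + \mu^{-1} R_1^{-1} S_1 R_1^{-1} S_2 = 0 \\
&\Rightarrow -(n_1 + n_2) R_1 S_2^{-1} R_1 + (\nu - n_1) \mu^{-1} R_1 + S_1 S_2^{-1} R_1 + \mu^{-1} S_1 = 0 \\
&\Rightarrow R_1 S_2^{-1} R_1 - \left[\frac{\nu - n_1}{\mu (n_1 + n_2)} I + \frac{1}{n_1 + n_2} S_1 S_2^{-1}\right] R_1 - \frac{1}{\mu (n_1 + n_2)} S_1 = 0
\end{aligned} \tag{6}$$

Let $S_2 = L_2 L_2^T$ and let us define $\tilde{R}_{12} = L_2^{-1} R_1 L_2^{-T}$ and $\tilde{S}_1 = L_2^{-1} S_1 L_2^{-T}$. Then, pre-multiplying the previous equation by $L_2^{-1}$ and post-multiplying it by $L_2^{-T}$, we obtain

$$\tilde{R}_{12}^2 - \left[\frac{\nu - n_1}{\mu (n_1 + n_2)} I + \frac{1}{n_1 + n_2} \tilde{S}_1\right] \tilde{R}_{12} - \frac{1}{\mu (n_1 + n_2)} \tilde{S}_1 = 0 \tag{7}$$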

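Let $w$ be an eigenvector of $\tilde{R}_{12}$ associated with eigenvalue $\xi$. Then,

$$\begin{aligned}
&\xi^2 w - \xi \left[\frac{\nu - n_1}{\mu (n_1 + n_2)} I + \frac{1}{n_1 + n_2} \tilde{S}_1\right] w - \frac{1}{\mu (n_1 + n_2)} \tilde{S}_1 w = 0 \\
&\Rightarrow \left[\frac{1}{\mu (n_1 + n_2)} + \frac{\xi}{n_1 + n_2}\right] \tilde{S}_1 w = \xi \left[\xi - \frac{\nu - n_1}{\mu (n_1 + n_2)}\right] w
\end{aligned} \tag{8}$$

which implies that $w$ is also an eigenvector of $\tilde{S}_1$. Either it is associated with a zero eigenvalue (there are $p - n_1$ of them) and, in this case, $\xi = \frac{\nu - n_1}{\mu (n_1 + n_2)}$, or it is associated with a strictly positive eigenvalue $\lambda$ and $\xi$ satisfies the second-order polynomial equation

$$\xi^2 - \xi \left[\frac{\nu - n_1}{\mu (n_1 + n_2)} + \frac{\lambda}{n_1 + n_2}\right] - \frac{\lambda}{\mu (n_1 + n_2)} = 0 \tag{9}$$

The above polynomial obviously has two real-valued roots, one being negative, the other being positive, and thus the latter is the eigenvalue of $\tilde{R}_{12}$. To summarize, if we let $L_2^{-1} X_1 = U \Sigma V^T = \sum_{k=1}^{n_1} \sigma_k u_k v_k^T$ be the singular value decomposition of $L_2^{-1} X_1$, we have

$$\begin{aligned}
\tilde{R}_{12} &= \sum_{k=1}^{n_1} \xi_k u_k u_k^T + \frac{\nu - n_1}{\mu (n_1 + n_2)} \sum_{k=n_1+1}^{p} u_k u_k^T \\
&= \sum_{k=1}^{n_1} \left[\xi_k - \frac{\nu - n_1}{\mu (n_1 + n_2)}\right] u_k u_k^T + \frac{\nu - n_1}{\mu (n_1 + n_2)} I
\end{aligned} \tag{10}$$

where $\xi_k$ is the positive root of (9) with $\lambda = \sigma_k^2$. The MLE of $R_1$ is thus

$$\hat{R}_1 = \sum_{k=1}^{n_1} \left[\xi_k - \frac{\nu - n_1}{\mu (n_1 + n_2)}\right] L_2 u_k u_k^T L_2^T + \frac{\nu - n_1}{\mu (n_1 + n_2)} S_2 \tag{11}$$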
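The whitening, shrinkage and colorization steps leading to (11) can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the author's implementation: it assumes NumPy, $n_1 < p \leq n_2$ with $S_2$ positive definite, and $\nu > n_1$ so that all shrunk eigenvalues are positive. The function name `mle_case1` is hypothetical. Instead of the SVD of $L_2^{-1} X_1$, the sketch uses the full eigendecomposition of $\tilde{S}_1$, which yields the same $u_k$ and $\lambda_k = \sigma_k^2$ and treats the zero eigenvalues through the same root formula.

```python
import numpy as np

def mle_case1(X1, X2, nu, mu):
    """Sketch of the estimator (11) for n1 < p <= n2 (assumes nu > n1)."""
    p, n1 = X1.shape
    n2 = X2.shape[1]
    n = n1 + n2
    S2 = X2 @ X2.T
    L2 = np.linalg.cholesky(S2)            # S2 = L2 L2^T
    Z = np.linalg.solve(L2, X1)            # whitened data L2^{-1} X1
    lam, U = np.linalg.eigh(Z @ Z.T)       # eigen-decomposition of S1 tilde
    lam = np.clip(lam, 0.0, None)          # remove tiny round-off negatives
    c = (nu - n1) / (mu * n)               # eigenvalue attached to lam = 0
    b = c + lam / n
    xi = 0.5 * (b + np.sqrt(b**2 + 4.0 * lam / (mu * n)))  # positive root of (9)
    R_tilde = (U * xi) @ U.T               # shrunk eigenvalues, same eigenvectors
    return L2 @ R_tilde @ L2.T             # colorization by L2
```

A quick way to gain confidence in such a sketch is to plug its output into the left-hand side of (5) and check that the residual is numerically zero.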

It is instructive to study the form of this solution. The original data $X_1$ is first adaptively whitened by $L_2^{-1}$ and its sample covariance matrix is computed. The eigenvectors of the latter are retained and the eigenvalues are modified. Then, data is re-colored by $L_2$. Note that the technique of regularizing eigenvalues while keeping eigenvectors is classical in robust covariance matrix estimation. However, this technique usually applies to one set of samples. Here it applies to one set of samples after it has been "whitened" by the other set. Indeed a whitening-colorization operation is performed before and after the modification of the eigenvalues. Another important observation is that the transformation $\lambda \to \xi$ preserves the order of the eigenvalues, an important issue in Stein's estimation using eigenvalue decomposition [42–44]. This can be seen by differentiating (9) with respect to $\lambda$, which gives

$$\frac{\partial \xi}{\partial \lambda} \left[2 \xi - \frac{\nu - n_1}{\mu (n_1 + n_2)} - \frac{\lambda}{n_1 + n_2}\right] = \frac{\xi}{n_1 + n_2} + \frac{1}{\mu (n_1 + n_2)} \tag{12}$$

Since the bracketed term on the left-hand side of the previous equation is positive (it is the difference between the positive and the negative roots of (9)), it follows that $\partial \xi / \partial \lambda > 0$ and therefore the transformation preserves ordering of the eigenvalues. This property will hold true in the other cases developed below.

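A comment is also in order regarding the behavior of the MLE when $\nu$ grows large, i.e., when $W$ comes closer to $I$. Indeed, with $\mu = \nu - p - 1$, one has

$$\begin{aligned}
\lim_{\nu \to \infty} \xi_k = \frac{1 + \lambda_k}{n_1 + n_2} &\Rightarrow \lim_{\nu \to \infty} \tilde{R}_{12} = \frac{1}{n_1 + n_2} \left[\tilde{S}_1 + I\right] \\
&\Rightarrow \lim_{\nu \to \infty} \hat{R}_1 = \frac{1}{n_1 + n_2} L_2 \left[L_2^{-1} S_1 L_2^{-T} + I\right] L_2^T = \frac{1}{n_1 + n_2} \left[S_1 + S_2\right]
\end{aligned} \tag{13}$$

which shows that, as $W$ comes closer to $I$, i.e., as $R_2$ comes closer to $R_1$, the MLE is simply the sample covariance matrix of the whole data, as may be expected.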
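Both the order-preserving property and the large-$\nu$ limit can be observed on the scalar map $\lambda \to \xi$ alone. The short sketch below assumes NumPy; the helper name `shrink` and the numerical values used in the demonstration are illustrative choices, not values from the paper.

```python
import numpy as np

def shrink(lam, n1, n2, nu, mu):
    """Positive root xi of (9) for eigenvalue(s) lam of the whitened SCM."""
    n = n1 + n2
    b = (nu - n1) / (mu * n) + lam / n
    return 0.5 * (b + np.sqrt(b**2 + 4.0 * lam / (mu * n)))

if __name__ == "__main__":
    p, n1, n2 = 6, 3, 10
    lam = np.linspace(0.0, 50.0, 6)
    for nu in (8.0, 1e2, 1e4, 1e8):
        xi = shrink(lam, n1, n2, nu, nu - p - 1)
        # Gap to the limiting value (1 + lam) / (n1 + n2) of (13).
        print(nu, np.max(np.abs(xi - (1.0 + lam) / (n1 + n2))))
```

The printed gap should decrease as $\nu$ increases, in line with (13).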

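Finally, another interpretation of the MLE can be obtained by rewriting the MLE in another form. Noting that the range space of $L_2^{-1} X_1$ coincides with the range space of $u_1, \ldots, u_{n_1}$, it follows that $u_k = L_2^{-1} X_1 \eta_k$ for some vector $\eta_k$. Therefore, (11) can be rewritten as

$$\begin{aligned}
\hat{R}_1 &= X_1 \left\{\sum_{k=1}^{n_1} \left[\xi_k - \frac{\nu - n_1}{\mu (n_1 + n_2)}\right] \eta_k \eta_k^T\right\} X_1^T + \frac{\nu - n_1}{\mu (n_1 + n_2)} S_2 \\
&= X_1 \Gamma_1 X_1^T + \frac{\nu - n_1}{\mu (n_1 + n_2)} X_2 X_2^T
\end{aligned} \tag{14}$$

Consequently, the MLE is a weighted version of the sample covariance matrices of each data set. In fact, it can be shown (we omit the details) that if a solution to (5) is sought which is of the form (14), then $\Gamma_1$ is solution to the equation

$$\Gamma_1^2 + \left[\frac{\nu - n_1}{\mu (n_1 + n_2)} \left(X_1^T S_2^{-1} X_1\right)^{-1} - \frac{1}{n_1 + n_2} I\right] \Gamma_1 - \frac{\nu + n_2}{\mu (n_1 + n_2)^2} \left(X_1^T S_2^{-1} X_1\right)^{-1} = 0 \tag{15}$$

It ensues that $\Gamma_1$ and $X_1^T S_2^{-1} X_1$ share the same eigenvectors, which are indeed the right singular vectors $v_k$ of $L_2^{-1} X_1$. Moreover, the eigenvalues $\gamma_k$ of $\Gamma_1$ satisfy

$$\gamma_k^2 + \gamma_k \left[\frac{(\nu - n_1)\, \sigma_k^{-2}}{\mu (n_1 + n_2)} - \frac{1}{n_1 + n_2}\right] - \frac{(\nu + n_2)\, \sigma_k^{-2}}{\mu (n_1 + n_2)^2} = 0 \tag{16}$$

To summarize, the MLE of $R_1$ can either be written as in (11) where the eigenvalues $\xi_k$ are related to the eigenvalues $\lambda_k$ of $L_2^{-1} S_1 L_2^{-T}$ by (9), or as in (14) where $\Gamma_1$ is given by (15).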

3.3. Case $n_2 < p$ and $n_1 \geq p$

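We now consider a situation where $n_2 < p$ and $n_1 \geq p$ under which one has a sufficient number of "good" samples $X_1$ for $S_1$ to be full-rank. Yet, it might be of interest to use $X_2$ even though its covariance matrix $R_2 \neq R_1$. The derivation of the MLE follows along the same lines as in the previous case, except that now $S_2$ is rank-deficient and $S_1$ is full-rank. Starting from the ML Eq. (5), one can write

$$\begin{aligned}
&(\nu - n_1) R_1^{-1} \left[R_1 + \mu^{-1} S_2\right] - (\nu + n_2) I + R_1^{-1} S_1 R_1^{-1} \left[R_1 + \mu^{-1} S_2\right] = 0 \\
&\Rightarrow -(n_1 + n_2) I + (\nu - n_1) \mu^{-1} R_1^{-1} S_2 + R_1^{-1} S_1 + \mu^{-1} R_1^{-1} S_1 R_1^{-1} S_2 = 0 \\
&\Rightarrow -(n_1 + n_2) R_1 S_1^{-1} R_1 + (\nu - n_1) \mu^{-1} R_1 S_1^{-1} S_2 + R_1 + \mu^{-1} S_2 = 0 \\
&\Rightarrow R_1 S_1^{-1} R_1 - R_1 \left[\frac{\nu - n_1}{\mu (n_1 + n_2)} S_1^{-1} S_2 + \frac{1}{n_1 + n_2} I\right] - \frac{1}{\mu (n_1 + n_2)} S_2 = 0
\end{aligned} \tag{17}$$

Let $S_1 = L_1 L_1^T$ and let us define $\tilde{R}_{11} = L_1^{-1} R_1 L_1^{-T}$ and $\tilde{S}_2 = L_1^{-1} S_2 L_1^{-T}$. Then, taking the transpose of the previous equation, pre-multiplying by $L_1^{-1}$ and post-multiplying by $L_1^{-T}$, we obtain

$$\tilde{R}_{11}^2 - \left[\frac{\nu - n_1}{\mu (n_1 + n_2)} \tilde{S}_2 + \frac{1}{n_1 + n_2} I\right] \tilde{R}_{11} - \frac{1}{\mu (n_1 + n_2)} \tilde{S}_2 = 0 \tag{18}$$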

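As before, it can be seen that $\tilde{R}_{11}$ and $\tilde{S}_2$ share the same eigenvectors. The $p - n_2$ eigenvectors of $\tilde{S}_2$ associated with zero eigenvalue will correspond to a constant eigenvalue for $\tilde{R}_{11}$ equal to $(n_1 + n_2)^{-1}$. A strictly positive eigenvalue $\zeta$ of $\tilde{R}_{11}$ is related to its counterpart $\lambda$ of $\tilde{S}_2$ by

$$\zeta^2 - \zeta \left[\frac{\lambda (\nu - n_1)}{\mu (n_1 + n_2)} + \frac{1}{n_1 + n_2}\right] - \frac{\lambda}{\mu (n_1 + n_2)} = 0 \tag{19}$$

Now, if we let $L_1^{-1} X_2 = Y \Theta Z^T = \sum_{k=1}^{n_2} \theta_k y_k z_k^T$ be the singular value decomposition of $L_1^{-1} X_2$, we have

$$\begin{aligned}
\tilde{R}_{11} &= \sum_{k=1}^{n_2} \zeta_k y_k y_k^T + \frac{1}{n_1 + n_2} \sum_{k=n_2+1}^{p} y_k y_k^T \\
&= \sum_{k=1}^{n_2} \left[\zeta_k - \frac{1}{n_1 + n_2}\right] y_k y_k^T + \frac{1}{n_1 + n_2} I
\end{aligned} \tag{20}$$

where $\zeta_k$ is the positive root of (19) with $\lambda = \theta_k^2$. The MLE of $R_1$ becomes

$$\hat{R}_1 = \sum_{k=1}^{n_2} \left[\zeta_k - \frac{1}{n_1 + n_2}\right] L_1 y_k y_k^T L_1^T + \frac{1}{n_1 + n_2} S_1 \tag{21}$$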
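The mirror case can be sketched in the same way as the previous one, with the roles of the two data sets exchanged and (19) replacing (9). Again this is only an illustration under stated assumptions: NumPy, $n_2 < p \leq n_1$ with $S_1$ positive definite, and a hypothetical function name `mle_case2`. The full eigendecomposition of $\tilde{S}_2$ is used in place of the SVD of $L_1^{-1} X_2$; for a zero eigenvalue the root formula returns $(n_1 + n_2)^{-1}$, as stated in the text.

```python
import numpy as np

def mle_case2(X1, X2, nu, mu):
    """Sketch of the estimator (21) for n2 < p <= n1."""
    p, n1 = X1.shape
    n2 = X2.shape[1]
    n = n1 + n2
    S1 = X1 @ X1.T
    L1 = np.linalg.cholesky(S1)            # S1 = L1 L1^T
    Z = np.linalg.solve(L1, X2)            # whitened data L1^{-1} X2
    lam, Y = np.linalg.eigh(Z @ Z.T)       # eigen-decomposition of S2 tilde
    lam = np.clip(lam, 0.0, None)          # remove tiny round-off negatives
    b = lam * (nu - n1) / (mu * n) + 1.0 / n
    zeta = 0.5 * (b + np.sqrt(b**2 + 4.0 * lam / (mu * n)))  # positive root of (19)
    R_tilde = (Y * zeta) @ Y.T
    return L1 @ R_tilde @ L1.T             # colorization by L1
```

As for the first case, the output can be validated by checking that it cancels the left-hand side of (5).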

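Again, since the range space of $L_1^{-1} X_2$ is spanned by $y_1, \ldots, y_{n_2}$, one has $y_k = L_1^{-1} X_2 \chi_k$ and hence

$$\begin{aligned}
\hat{R}_1 &= X_2 \left\{\sum_{k=1}^{n_2} \left[\zeta_k - \frac{1}{n_1 + n_2}\right] \chi_k \chi_k^T\right\} X_2^T + \frac{1}{n_1 + n_2} X_1 X_1^T \\
&= X_2 \Gamma_2 X_2^T + \frac{1}{n_1 + n_2} X_1 X_1^T
\end{aligned} \tag{22}$$

Note that (22) differs from (14) in that the weighting matrix applied between $X_1$ and $X_1^T$ is now diagonal while that applied between $X_2$ and $X_2^T$ is no longer diagonal. Furthermore, if one looks for a solution of the form (22) then $\Gamma_2$ is the solution to

$$\Gamma_2^2 + \left[\frac{\left(X_2^T S_1^{-1} X_2\right)^{-1}}{n_1 + n_2} - \frac{\nu - n_1}{\mu (n_1 + n_2)} I\right] \Gamma_2 - \frac{\nu + n_2}{\mu (n_1 + n_2)^2} \left(X_2^T S_1^{-1} X_2\right)^{-1} = 0 \tag{23}$$

$\Gamma_2$ and $X_2^T S_1^{-1} X_2$ share the same eigenvectors (actually $z_k$) and the eigenvalue $\gamma_k$ of $\Gamma_2$ is obtained as the positive solution to

$$\gamma_k^2 + \gamma_k \left[\frac{\theta_k^{-2}}{n_1 + n_2} - \frac{\nu - n_1}{\mu (n_1 + n_2)}\right] - \frac{\theta_k^{-2} (\nu + n_2)}{\mu (n_1 + n_2)^2} = 0 \tag{24}$$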

Remark 1. When $n_1 \geq p$ and $n_2 \geq p$, the previous techniques can still be used, with slight variations. In this case, $\tilde{S}_1$ and $\tilde{S}_2$ are now full-rank, and therefore the MLE of $R_1$ is given by (11) but with the first sum extended to $p$ eigenvectors ($\tilde{S}_1$ now has $p$ non-zero eigenvalues), and the second term vanishes. The ML solution is also given by (21) with the first term extended to $p$ eigenvectors and the second term vanishing.

3.4. Case $n_1 < p$, $n_2 < p$ and $n_1 + n_2 \geq p$

We now consider the more challenging case where neither of the two data sets contains enough samples for their respective sample covariance matrices to be full rank, and thus it becomes mandatory to combine both sets. This situation is a bit trickier and requires some care. Going back to (5), the MLE of $R_1$ should satisfy

$$\begin{aligned}
&(\nu - n_1) R_1^{-1} - (\nu + n_2) \left(R_1 + \mu^{-1} S_2\right)^{-1} + R_1^{-1} S_1 R_1^{-1} = 0 \\
&\Rightarrow (\nu - n_1) R_1^{-1} + R_1^{-1} S_1 R_1^{-1} - (\nu + n_2) \left[R_1^{-1} - \mu^{-1} R_1^{-1} X_2 \left(I + \mu^{-1} X_2^T R_1^{-1} X_2\right)^{-1} X_2^T R_1^{-1}\right] = 0 \\
&\Rightarrow (n_1 + n_2) R_1 = (\nu + n_2) \mu^{-1} X_2 \left(I + \mu^{-1} X_2^T R_1^{-1} X_2\right)^{-1} X_2^T + X_1 X_1^T
\end{aligned} \tag{25}$$

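Before pursuing, it is worth looking at the previous equation to get some insight. We observe that the projection of $(n_1 + n_2) R_1$ onto the subspace orthogonal to $X_2$ will be equal to the projection of $S_1$ onto this same subspace. This suggests using a decomposition that splits data in $\mathcal{R}(X_2)$, the range space of $X_2$, and its orthogonal complement. To do so, let us consider the SVD of $X_2$ as $X_2 = C D E^T = \begin{bmatrix} C_a & C_b \end{bmatrix} \begin{bmatrix} D_a \\ 0 \end{bmatrix} E^T = C_a D_a E^T$, where $C$ is $p \times p$, $C_a$ is $p \times n_2$ and $D_a$ is the $n_2 \times n_2$ diagonal matrix of singular values. Let us also operate a change of coordinates and define

$$\Phi = C^T R_1 C = \begin{bmatrix} C_a^T R_1 C_a & C_a^T R_1 C_b \\ C_b^T R_1 C_a & C_b^T R_1 C_b \end{bmatrix} = \begin{bmatrix} \Phi_{aa} & \Phi_{ab} \\ \Phi_{ba} & \Phi_{bb} \end{bmatrix} \tag{26}$$

With these definitions, it is straightforward to show that $X_2^T R_1^{-1} X_2 = E D_a \Phi_{a.b}^{-1} D_a E^T$ where $\Phi_{a.b} = \Phi_{aa} - \Phi_{ab} \Phi_{bb}^{-1} \Phi_{ba}$ and thus

$$\begin{aligned}
X_2 \left(I + \mu^{-1} X_2^T R_1^{-1} X_2\right)^{-1} X_2^T &= C_a D_a \left(I + \mu^{-1} D_a \Phi_{a.b}^{-1} D_a\right)^{-1} D_a C_a^T \\
&= C_a \left(D_a^{-2} + \mu^{-1} \Phi_{a.b}^{-1}\right)^{-1} C_a^T
\end{aligned} \tag{27}$$

Therefore, pre-multiplying (25) by $C^T$ and post-multiplying it by $C$, we obtain

$$(n_1 + n_2) \begin{bmatrix} \Phi_{aa} & \Phi_{ab} \\ \Phi_{ba} & \Phi_{bb} \end{bmatrix} = (\nu + n_2) \mu^{-1} \begin{bmatrix} \left(D_a^{-2} + \mu^{-1} \Phi_{a.b}^{-1}\right)^{-1} & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} C_a^T S_1 C_a & C_a^T S_1 C_b \\ C_b^T S_1 C_a & C_b^T S_1 C_b \end{bmatrix} \tag{28}$$

which immediately implies that

$$(n_1 + n_2) \Phi_{ba} = C_b^T S_1 C_a, \qquad (n_1 + n_2) \Phi_{bb} = C_b^T S_1 C_b \tag{29}$$

This corroborates the comments we made after Eq. (25) since one has

$$\begin{aligned}
(n_1 + n_2) C_b \Phi_{bb} C_b^T &= (n_1 + n_2) C_b C_b^T R_1 C_b C_b^T = (n_1 + n_2) P_{X_2}^{\perp} R_1 P_{X_2}^{\perp} \\
&= C_b C_b^T S_1 C_b C_b^T = P_{X_2}^{\perp} S_1 P_{X_2}^{\perp}
\end{aligned} \tag{30}$$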

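It now remains to find $\Phi_{aa}$ or equivalently $\Phi_{a.b}$. Towards this end, note that

$$(n_1 + n_2) \Phi_{aa} = (\nu + n_2) \mu^{-1} \left(D_a^{-2} + \mu^{-1} \Phi_{a.b}^{-1}\right)^{-1} + C_a^T S_1 C_a \tag{31}$$

However,

$$\begin{aligned}
(n_1 + n_2) \Phi_{aa} &= (n_1 + n_2) \left[\Phi_{a.b} + \Phi_{ab} \Phi_{bb}^{-1} \Phi_{ba}\right] \\
&= (n_1 + n_2) \Phi_{a.b} + \left(C_a^T S_1 C_b\right) \left(C_b^T S_1 C_b\right)^{-1} \left(C_b^T S_1 C_a\right)
\end{aligned} \tag{32}$$

which leads to

$$(n_1 + n_2) \Phi_{a.b} = (\nu + n_2) \mu^{-1} \left(D_a^{-2} + \mu^{-1} \Phi_{a.b}^{-1}\right)^{-1} + \left(C^T S_1 C\right)_{a.b} \tag{33}$$

where $(C^T S_1 C)_{a.b} = C_a^T S_1 C_a - C_a^T S_1 C_b (C_b^T S_1 C_b)^{-1} C_b^T S_1 C_a$. For the sake of notational convenience, let us denote $F = (C^T S_1 C)_{a.b}$. Post-multiplying the previous equation by $D_a^{-2} + \mu^{-1} \Phi_{a.b}^{-1}$ results in

$$\begin{aligned}
&(n_1 + n_2) \Phi_{a.b} D_a^{-2} - \left[(\nu - n_1) \mu^{-1} I + F D_a^{-2}\right] - \mu^{-1} F \Phi_{a.b}^{-1} = 0 \\
&\Rightarrow \Phi_{a.b} D_a^{-2} \Phi_{a.b} - \left[\frac{\nu - n_1}{\mu (n_1 + n_2)} I + \frac{1}{n_1 + n_2} F D_a^{-2}\right] \Phi_{a.b} - \frac{1}{\mu (n_1 + n_2)} F = 0 \\
&\Rightarrow \tilde{\Phi}_{a.b}^2 - \left[\frac{\nu - n_1}{\mu (n_1 + n_2)} I + \frac{1}{n_1 + n_2} \tilde{F}\right] \tilde{\Phi}_{a.b} - \frac{1}{\mu (n_1 + n_2)} \tilde{F} = 0
\end{aligned} \tag{34}$$

where $\tilde{\Phi}_{a.b} = D_a^{-1} \Phi_{a.b} D_a^{-1}$ and $\tilde{F} = D_a^{-1} F D_a^{-1}$. Similarly to what was done before, $\tilde{\Phi}_{a.b}$ and $\tilde{F}$ share the same eigenvectors. When the eigenvalue $\lambda$ of $\tilde{F}$ is zero (there are actually $p - n_1$ of them [45]) the corresponding eigenvalue $\varphi$ of $\tilde{\Phi}_{a.b}$ is $\frac{\nu - n_1}{\mu (n_1 + n_2)}$. For each of the $r = n_1 + n_2 - p$ non-zero $\lambda$, the corresponding $\varphi$ is the unique positive root of

$$\varphi^2 - \left[\frac{\nu - n_1}{\mu (n_1 + n_2)} + \frac{\lambda}{n_1 + n_2}\right] \varphi - \frac{\lambda}{\mu (n_1 + n_2)} = 0 \tag{35}$$

Therefore, if $\tilde{u}_k$ are the eigenvectors of $\tilde{F}$, $\tilde{\Phi}_{a.b}$ is given by

$$\begin{aligned}
\tilde{\Phi}_{a.b} &= \sum_{k=1}^{r} \varphi_k \tilde{u}_k \tilde{u}_k^T + \frac{\nu - n_1}{\mu (n_1 + n_2)} \sum_{k=r+1}^{n_2} \tilde{u}_k \tilde{u}_k^T \\
&= \sum_{k=1}^{r} \left[\varphi_k - \frac{\nu - n_1}{\mu (n_1 + n_2)}\right] \tilde{u}_k \tilde{u}_k^T + \frac{\nu - n_1}{\mu (n_1 + n_2)} I
\end{aligned} \tag{36}$$

Once $\tilde{\Phi}_{a.b}$ is computed, $\Phi_{a.b} = D_a \tilde{\Phi}_{a.b} D_a$ and $\Phi_{aa}$ can be obtained from (32). Finally, the MLE of $R_1$ is given by $C \Phi C^T$.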

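We now present an alternative way to compute the solution. From (25), it appears that $R_1$ can be written as $(n_1 + n_2) R_1 = X_1 X_1^T + X_2 \Gamma_2 X_2^T$ where $\Gamma_2 = (\nu + n_2) \mu^{-1} \left(I + \mu^{-1} X_2^T R_1^{-1} X_2\right)^{-1}$. Let $X = \begin{bmatrix} X_1 & X_2 \end{bmatrix}$ and let $X^T = QR$ be the QR decomposition of $X^T$ with $Q$ a $(n_1 + n_2) \times p$ semi-unitary matrix, i.e., $Q^T Q = I_p$, and $R$ the $p \times p$ triangular factor. Let us partition $Q$ as $Q = \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix}$ so that $X_1^T = Q_1 R$ and $X_2^T = Q_2 R$. Then, one has $(n_1 + n_2) R_1 = X_1 X_1^T + X_2 \Gamma_2 X_2^T = R^T \left[Q_1^T Q_1 + Q_2^T \Gamma_2 Q_2\right] R$ and therefore

$$\begin{aligned}
(n_1 + n_2)^{-1} X_2^T R_1^{-1} X_2 &= Q_2 \left[Q_1^T Q_1 + Q_2^T \Gamma_2 Q_2\right]^{-1} Q_2^T \\
&= Q_2 \left[I + Q_2^T (\Gamma_2 - I) Q_2\right]^{-1} Q_2^T \\
&= Q_2 \left\{I - Q_2^T \left[(\Gamma_2 - I)^{-1} + Q_2 Q_2^T\right]^{-1} Q_2\right\} Q_2^T \\
&= Q_2 Q_2^T - Q_2 Q_2^T \left[(\Gamma_2 - I)^{-1} + Q_2 Q_2^T\right]^{-1} Q_2 Q_2^T \\
&= \left\{\left(Q_2 Q_2^T\right)^{-1} + \Gamma_2 - I\right\}^{-1}
\end{aligned} \tag{37}$$

Consequently, if we define $B_2 = \left(Q_2 Q_2^T\right)^{-1} - I$,

$$\begin{aligned}
\Gamma_2^{-1} &= (\nu + n_2)^{-1} \mu \left[I + \mu^{-1} X_2^T R_1^{-1} X_2\right] \\
&= (\nu + n_2)^{-1} \mu I + (\nu + n_2)^{-1} X_2^T R_1^{-1} X_2 \\
&= (\nu + n_2)^{-1} \mu I + (\nu + n_2)^{-1} (n_1 + n_2) \left(\Gamma_2 + B_2\right)^{-1}
\end{aligned} \tag{38}$$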

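Pre-multiplying the previous equation by $(\Gamma_2 + B_2)$ and post-multiplying by $\Gamma_2$, we obtain the following second-order polynomial equation:

$$\Gamma_2^2 + \left[B_2 - \frac{\nu - n_1}{\mu} I\right] \Gamma_2 - \frac{\nu + n_2}{\mu} B_2 = 0 \tag{39}$$

It follows that $\Gamma_2$ and $B_2$ share the same eigenvectors. If $\lambda$ is a non-zero eigenvalue of $B_2$ (there are $r = n_1 + n_2 - p$ of them), then the corresponding eigenvalue $\gamma$ of $\Gamma_2$ is the unique positive root to the following polynomial equation

$$\gamma^2 + \left[\lambda - \frac{\nu - n_1}{\mu}\right] \gamma - \frac{\nu + n_2}{\mu} \lambda = 0 \tag{40}$$

If $\lambda = 0$ then $\gamma = (\nu - n_1) \mu^{-1}$. Finally, the solution $\Gamma_2$ is given by

$$\begin{aligned}
\Gamma_2 &= \sum_{k=1}^{r} \gamma_k b_k b_k^T + \frac{\nu - n_1}{\mu} \sum_{k=r+1}^{n_2} b_k b_k^T \\
&= \sum_{k=1}^{r} \left[\gamma_k - \frac{\nu - n_1}{\mu}\right] b_k b_k^T + \frac{\nu - n_1}{\mu} I
\end{aligned} \tag{41}$$

where $b_k$ are the eigenvectors of $B_2$. Note that

$$B_2 b = \lambda b \;\Rightarrow\; \left(Q_2 Q_2^T\right)^{-1} b - b = \lambda b \;\Rightarrow\; \left(Q_2 Q_2^T\right)^{-1} b = (1 + \lambda) b \;\Rightarrow\; \left(Q_2 Q_2^T\right) b = (1 + \lambda)^{-1} b$$

and hence $b$ is an eigenvector of $Q_2 Q_2^T$ associated with eigenvalue $(1 + \lambda)^{-1}$, or equivalently a right singular vector of $Q_2^T$. Observe also that, since $X_2^T = Q_2 R$ and $X X^T = R^T Q^T Q R = R^T R$, one has

$$Q_2 Q_2^T = X_2^T R^{-1} R^{-T} X_2 = X_2^T \left(R^T R\right)^{-1} X_2 = X_2^T \left(X X^T\right)^{-1} X_2 = X_2^T \left(X_1 X_1^T + X_2 X_2^T\right)^{-1} X_2$$

Hence, if we let $S = X_1 X_1^T + X_2 X_2^T = L L^T$, then $Q_2^T$ and $L^{-1} X_2$ share the same right singular vectors.

Fig. 1. Average distance between $\hat{R}_1$ and $R_1$ in case 1.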
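The alternative route through the QR decomposition, i.e., (37) to (41), translates into a few lines of linear algebra. The sketch below is an illustration only, assuming NumPy, $n_1 < p$, $n_2 < p$, $n_1 + n_2 \geq p$, $\nu > n_1$, and data in general position so that $Q_2 Q_2^T$ is invertible. The function name `mle_case3` is hypothetical. The zero eigenvalues of $B_2$ are handled by the same root formula, which then returns $(\nu - n_1)/\mu$.

```python
import numpy as np

def mle_case3(X1, X2, nu, mu):
    """Sketch of the MLE for n1 < p, n2 < p, n1 + n2 >= p via (39)-(41)."""
    p, n1 = X1.shape
    n2 = X2.shape[1]
    n = n1 + n2
    X = np.hstack((X1, X2))                        # p x (n1 + n2)
    Q, _ = np.linalg.qr(X.T)                       # X^T = Q R, Q^T Q = I_p
    Q2 = Q[n1:, :]                                 # rows of Q attached to X2
    B2 = np.linalg.inv(Q2 @ Q2.T) - np.eye(n2)     # B2 = (Q2 Q2^T)^{-1} - I
    lam, B = np.linalg.eigh(0.5 * (B2 + B2.T))
    lam = np.clip(lam, 0.0, None)                  # remove tiny round-off negatives
    a = (nu - n1) / mu
    d = (nu + n2) / mu
    # Positive root of gamma^2 + (lam - a) gamma - d lam = 0, see (40).
    gam = 0.5 * ((a - lam) + np.sqrt((a - lam) ** 2 + 4.0 * d * lam))
    Gamma2 = (B * gam) @ B.T                       # same eigenvectors as B2
    return (X1 @ X1.T + X2 @ Gamma2 @ X2.T) / n
```

The design choice here is to avoid the partitioned matrices of (26) to (36) altogether: only one QR decomposition and one small $n_2 \times n_2$ eigenproblem are needed.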

4. Numerical simulations

In this section, we evaluate numerically the performance of the MLE presented above through Monte-Carlo simulations. We consider a scenario where the size of the observation space is $p = 128$. Three cases will be considered for the covariance matrix $R_1$, which correspond to different kinds of processes. In the first case the $(k, \ell)$ element is $R_1(k, \ell) = P \rho^{|k - \ell|} + \delta(k, \ell)$ with $\rho = 0.7$. The second case assumes that $R_1(k, \ell) = P e^{-0.5 (2 \pi \sigma_f |k - \ell|)^2} + \delta(k, \ell)$ with $\sigma_f = 0.02$. In the third case, $R_1(k, \ell) = r_{AR}(|k - \ell|) + \delta(k, \ell)$ where $r_{AR}(|k - \ell|)$ corresponds to the correlation of an autoregressive process whose poles are located at $0.95 e^{\pm i 2 \pi 0.05}$, $0.9 e^{\pm i 2 \pi 0.15}$, $0.9 e^{\pm i 2 \pi 0.18}$. Finally $P = 100$ and $r_{AR}(0) = 100$. The corresponding processes are rather lowpass in cases 1 and 2, while case 3 concerns processes with sharp peaks in their spectrum. In each simulation $X_1$ is generated from a Gaussian distribution with covariance matrix $R_1$. Then $W$ is generated from a Wishart distribution with $\nu = p + 2$ degrees of freedom and parameter matrix $(\nu - p - 1)^{-1} I$ and $R_2$ is computed as $R_2 = G_1 W^{-1} G_1^T$. Then $X_2$ is generated from a Gaussian distribution with covariance matrix $R_2$.
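The data generation procedure just described can be sketched as follows for the first covariance case. This is a minimal illustration, assuming NumPy, an integer $\nu$ (so that the Wishart matrix can be drawn as a sum of $\nu$ outer products), $\mu = \nu - p - 1$, and a Cholesky factor as the square-root $G_1$. The function names `case1_covariance` and `draw_data` are hypothetical.

```python
import numpy as np

def case1_covariance(p, P=100.0, rho=0.7):
    """R1(k, l) = P * rho^|k - l| + delta(k, l), first case of the simulations."""
    k = np.arange(p)
    return P * rho ** np.abs(k[:, None] - k[None, :]) + np.eye(p)

def draw_data(R1, n1, n2, nu, rng):
    """Draw (X1, X2) and R2 from the model of Section 2, with mu = nu - p - 1."""
    p = R1.shape[0]
    mu = nu - p - 1
    G1 = np.linalg.cholesky(R1)                  # R1 = G1 G1^T
    Y = rng.standard_normal((p, nu))
    W = (Y @ Y.T) / mu                           # Wishart, nu dof, parameter mu^{-1} I
    R2 = G1 @ np.linalg.inv(W) @ G1.T            # R2 = G1 W^{-1} G1^T
    R2 = 0.5 * (R2 + R2.T)                       # enforce exact symmetry
    X1 = G1 @ rng.standard_normal((p, n1))
    X2 = np.linalg.cholesky(R2) @ rng.standard_normal((p, n2))
    return X1, X2, R2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = 128
    X1, X2, R2 = draw_data(case1_covariance(p), n1=16, n2=240, nu=p + 2, rng=rng)
    print(X1.shape, X2.shape)
```

The sample sizes in the usage lines are arbitrary values chosen so that $n_1 + n_2 = 2p$; they are not taken from the paper.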

The MLE is compared with four competitors. The first is the sample covariance matrix based on all samples, i.e., $(n_1 + n_2)^{-1} S$ where $S = X_1 X_1^T + X_2 X_2^T$. The second is of the form $G_{SCM} D G_{SCM}^T$, where $G_{SCM}$ is the Cholesky factor of $S$, and $D$ is a diagonal matrix which is chosen to minimize Stein's loss and is given by $D_{k,k} = 1 / (n_1 + n_2 + p - 2k + 1)$. The third is of the same form but is aimed at minimizing the natural distance between $R_1$ and its estimate: as shown in [13], it amounts to choosing $D_{k,k} = \exp\left\{-E\left\{\log \chi^2_{n_1 + n_2 - k + 1}\right\}\right\}$. Finally, we consider the class of orthogonally invariant estimators of the form $U_{SCM} \,\mathrm{diag}(\phi(\lambda))\, U_{SCM}^T$ where $S = U_{SCM} \,\mathrm{diag}(\lambda)\, U_{SCM}^T$ is the eigenvalue decomposition of $S$ and $\phi(\lambda) = \begin{bmatrix} \phi_1(\lambda) & \ldots & \phi_p(\lambda) \end{bmatrix}$. Stein showed that the choice $\phi_k = \lambda_k / \left(n_1 + n_2 - p + 1 + 2 \lambda_k \sum_{j \neq k} (\lambda_k - \lambda_j)^{-1}\right)$ is the best with respect to Stein's loss. However this choice has two drawbacks: it can result in some $\phi_k < 0$ and it does not preserve the order of the eigenvalues $\lambda_k$, which is a problem [42]. In order to overcome these problems, Stein proposed an isotonizing scheme that guarantees $\phi_k > 0$ and preserves order, see [46] for details of this scheme. We consider this improved estimator as the fourth alternative. The figure of merit for all estimators will be the natural distance between the true and the estimated covariance matrices, $d^2(R_1, \hat{R}_1) = \sum_{k=1}^{p} \log^2 \lambda_k\left(R_1^{-1} \hat{R}_1\right)$.

Fig. 4. Average distance between $\hat{R}_1$ and $R_1$ in case 1 versus $\nu$, with $n_1 + n_2 = 2p$.
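The figure of merit and the two simplest competitors are easy to code. The sketch below assumes NumPy; the function names are hypothetical, and only the SCM and the Cholesky-based estimator with $D_{k,k} = 1/(n_1 + n_2 + p - 2k + 1)$ are shown, since their expressions are fully given in the text. The eigenvalues of $R_1^{-1} \hat{R}_1$ are computed from the symmetric matrix $L^{-1} \hat{R}_1 L^{-T}$, with $R_1 = L L^T$, which has the same spectrum.

```python
import numpy as np

def natural_distance_sq(R, R_hat):
    """Squared natural distance: sum_k log^2 lambda_k(R^{-1} R_hat)."""
    L = np.linalg.cholesky(R)                       # R = L L^T
    A = np.linalg.solve(L, R_hat)                   # L^{-1} R_hat
    M = np.linalg.solve(L, A.T)                     # L^{-1} R_hat L^{-T} (R_hat symmetric)
    eigvals = np.linalg.eigvalsh(0.5 * (M + M.T))   # same spectrum as R^{-1} R_hat
    return float(np.sum(np.log(eigvals) ** 2))

def scm(S, n):
    """Sample covariance matrix built from S = X X^T and n samples."""
    return S / n

def stein_cholesky(S, n):
    """Estimator G D G^T with D_kk = 1 / (n + p - 2k + 1), k = 1, ..., p."""
    p = S.shape[0]
    G = np.linalg.cholesky(S)                       # S = G G^T, G lower triangular
    k = np.arange(1, p + 1)
    d = 1.0 / (n + p - 2 * k + 1)
    return (G * d) @ G.T                            # scales column k of G by d_k
```

In a Monte-Carlo loop, `natural_distance_sq(R1, R_hat)` would be averaged over the trials for each estimator, with `n` set to $n_1 + n_2$ for the two benchmarks.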

The simulation results are shown in Figs. 1-4 where we consider different values for the total number of samples $n = n_1 + n_2$, namely $n = p$, $n = 3p/2$ and $n = 2p$. The main conclusions regarding these simulations are the following:

• the MLE is shown to outperform its competitors when $n_1$ is small and $n$ is large enough, typically it has the best performance for $n = 3p/2$ and $n = 2p$. One can observe that the improvement achieved by the MLE is more important when $n = 2p$ and $n_1$ is small, i.e., when one has very few samples drawn from $R_1$ and a large majority of samples drawn from $R_2$.

• in contrast, when $n = p$ the other methods can perform better than the MLE, especially when $n_1$ is above a threshold, i.e., when the number of "good" samples is large enough.

• among the Stein-like methods, that based on eigenvalue decomposition (with isotonizing) is the best, but the method based on Cholesky factorization and minimization of the geodesic distance comes very close.

In a final simulation, we evaluate the influence of $\nu$: recall that, as $\nu$ increases, $W$ is closer to $I$ and thus $R_2$ is closer to $R_1$, which means that $X_2$ should be nearly as informative as $X_1$. In Fig. 4 we display the average distance as a function of $\nu$ in case 1 with $n_1 + n_2 = 2p$. It is observed that, as $\nu$ increases, the performance of all estimators improves. The proposed MLE is no longer the most accurate above a threshold, where it is dominated by Stein's estimator based on the eigenvalues of the whole sample covariance matrix. However, the proposed MLE still performs better than all the other estimators.

5. Conclusions

In this paper, we considered the problem of estimating a covariance matrix $R_1$ from two data sets, one set $X_1$ whose covariance matrix is actually $R_1$ and another set $X_2$ whose covariance matrix $R_2$ is different but close to $R_1$. Since the distance between $R_1$ and $R_2$ depends on the eigenvalues of $W = G_1^T R_2^{-1} G_1$, we embedded the latter in a statistical model and assumed that it followed a Wishart distribution around the identity matrix. We showed that the problem is that of estimating $R_1$ from two data sets with different distributions. The maximum likelihood estimator was derived and its expression was shown to depend on the number of samples in $X_1$ and $X_2$. The MLE was shown to perform quite well, as compared to state of the art algorithms, at least when the number of samples in $X_1$ is small and the total number of samples $n$ is large enough. However, as in a classical framework with a single data set, there is room for improvement of the MLE, especially in low sample support. Therefore, future work should be devoted to improving the MLE in this situation. For instance, one could study how the MLE could be regularized or could investigate whether a Stein-like approach is possible for this two data sets framework. Alternatively, a frequentist approach where joint estimation of $R_1$ and $W$ is performed under some constraints constitutes a worthy path of investigation.

Declaration of Competing Interest

None.

Appendix A. Extension to complex-valued data

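In this appendix, we briefly show that the derivations concerning the maximum likelihood estimator can be extended in a straightforward manner to the complex case. Let us assume here that $X_1 | R_1 \overset{d}{=} \mathcal{CN}(0, R_1, I)$ and $X_2 | R_2 \overset{d}{=} \mathcal{CN}(0, R_2, I)$ are complex-valued data and distributed according to a circularly symmetric complex-valued matrix-variate normal distribution. Let $R_1 = G_1 G_1^H$, where $^H$ stands for the Hermitian transpose, and $R_2 = G_1 W^{-1} G_1^H$ where $W \overset{d}{=} \mathcal{CW}_p(\nu, \mu^{-1} I)$ follows a complex Wishart distribution. The statistical (complex-valued) model is thus

$$p(X_1, X_2 | R_1, W) = \pi^{-p(n_1+n_2)} |R_1|^{-n_1} |W^{-1} R_1|^{-n_2} \,\mathrm{etr}\left\{-X_1^H R_1^{-1} X_1 - X_2^H G_1^{-H} W G_1^{-1} X_2\right\} \tag{A.1a}$$

$$p(W) = \frac{\mu^{\nu p}}{\tilde{\Gamma}_p(\nu)} |W|^{\nu - p} \,\mathrm{etr}\{-\mu W\} \tag{A.1b}$$

Note that, in the complex case, $E\{W^{-1}\} = (\nu - p)^{-1} \mu I$ [40] so that $E\{R_2\} = E\{G_1 W^{-1} G_1^H\} = (\nu - p)^{-1} \mu R_1$. Therefore, for $E\{R_2\}$ to be equal to $R_1$, one must have $\mu = \nu - p$ in the complex case, instead of $\mu = \nu - p - 1$ in the real case. The marginal distribution of $(X_1, X_2)$ is now

$$\begin{aligned}
p(X_1, X_2 | R_1) &= \int_{W>0} p(X_1, X_2 | R_1, W)\, p(W)\, dW \\
&= \frac{\pi^{-p(n_1+n_2)} \mu^{\nu p}}{\tilde{\Gamma}_p(\nu)} |R_1|^{-(n_1+n_2)} \,\mathrm{etr}\left\{-X_1^H R_1^{-1} X_1\right\} \\
&\quad \times \int_{W>0} |W|^{\nu + n_2 - p} \,\mathrm{etr}\left\{-W \left[\mu I + G_1^{-1} X_2 X_2^H G_1^{-H}\right]\right\} dW \\
&= \frac{\pi^{-p(n_1+n_2)} \mu^{\nu p}\, \tilde{\Gamma}_p(\nu + n_2)}{\tilde{\Gamma}_p(\nu)} |R_1|^{-(n_1+n_2)} \,\mathrm{etr}\left\{-X_1^H R_1^{-1} X_1\right\} \left|\mu I + G_1^{-1} X_2 X_2^H G_1^{-H}\right|^{-(\nu + n_2)} \\
&= \pi^{-pn_1} |R_1|^{-n_1} \,\mathrm{etr}\left\{-X_1^H R_1^{-1} X_1\right\} \\
&\quad \times \pi^{-pn_2}\, \frac{\tilde{\Gamma}_p(\nu + n_2)}{\tilde{\Gamma}_p(\nu)}\, |\mu R_1|^{-n_2} \left|I + X_2^H [\mu R_1]^{-1} X_2\right|^{-(\nu + n_2)}
\end{aligned} \tag{A.2}$$

and we recover the fact that $X_1 | R_1$ is Gaussian distributed and that $X_2 | R_1$ is Student distributed. From (A.2), the log-likelihood function is, up to an additive constant term,

$$\begin{aligned}
\tilde{f}(R_1) &= -(n_1 + n_2) \log|R_1| - (\nu + n_2) \log\left|I + \mu^{-1} R_1^{-1} S_2\right| - \mathrm{Tr}\left\{R_1^{-1} S_1\right\} \\
&= (\nu - n_1) \log|R_1| - (\nu + n_2) \log\left|R_1 + \mu^{-1} S_2\right| - \mathrm{Tr}\left\{R_1^{-1} S_1\right\}
\end{aligned} \tag{A.3}$$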
