an author's
https://oatao.univ-toulouse.fr/25984
https://doi.org/10.1016/j.sigpro.2019.107285
Besson, Olivier Maximum likelihood covariance matrix estimation from two possibly mismatched data sets. (2020)
Signal Processing, 167. 107285-107294. ISSN 0165-1684
Maximum
likelihood
covariance
matrix
estimation
from
two
possibly
mismatched
data
sets
Olivier
Besson
ISAE-SUPAERO, 10 Avenue Edouard Belin, Toulouse 31055, France
Keywords:
Covariance matrix estimation Maximum likelihood Mismatch
a
b
s
t
r
a
c
t
Weconsiderestimatingthecovariancematrixfromtwodatasets,onewhosecovariancematrixR1isthe soughtoneandanothersetofsampleswhosecovariancematrixR2slightlydiffersfromthesoughtone, duee.g.todifferentmeasurementconfigurations.Weassumehoweverthatthetwomatricesarerather close, whichweformulatebyassuming thatR11/2R2−1R11/2|R1 followsaWishartdistributionaround the identitymatrix.Itturnsoutthatthisassumptionresultsintwodatasetswithdifferentmarginal distri-butions,hencetheproblembecomesthatofcovariancematrixestimationfromtwodatasetswhichare distribution-mismatched.Themaximumlikelihoodestimator(MLE)isderived andisshowntodepend onthevaluesofthenumberofsamplesineachset.Weshowthatitinvolveswhiteningofonedataset bytheotherone,shrinkage ofeigenvaluesand colorization,atleastwhenonedatasetcontainsmore samplesthanthesizepoftheobservationspace.Whenbothdatasetshavelessthanpsamplesbutthe totalnumberislargerthanp,theMLEagainentailseigenvaluesshrinkagebutthistimeafteraprojection operation.Simulationresultscomparethenewestimatortostate of the art techniques.
1. Problemstatement
Analysisorprocessingofmultichanneldatamostoftenrelieson thecovariancematrix,whichisafundamentaltoole.g.,for princi-palcomponentanalysis,spectralanalysis, adaptivefiltering, detec-tion,directionofarrivalestimationamongothers[1–3].Inpractical applications,the p× p covariancematrix Rneeds tobe estimated froma finitenumbernofsamples.When thelatterare indepen-dentandGaussiandistributed,themaximumlikelihoodestimator ofRisn−1S whereXisthe p× n datamatrixandS=XXT is the
samplecovariancematrix(SCM)[1].However,inlowsample sup-port orwhen deviationfromtheGaussian assumptionisathand, theSCMtendstobehavepoorly.Inparticularitwasobservedthat thesamplecovariancematrixisusuallylesswell-conditionedthan the true covariancematrix, and thereforeconsiderable efforthas been dedicatedto regularizingit withaview to improveits per-formance.
One ofthe mostimportantapproach in thisrespectis dueto Stein [4–6] who, instead of maximizing the likelihood function, advocated tominimize a meaningfullossfunction within agiven classofestimators.Steinhenceintroducedtheconceptof admissi-bleestimationandminimaxestimatorsundertheso-calledStein’s loss.HeshowedthattheSCM-basedestimatorisnotminimaxand
E-mail address: [email protected]
derivedminimax estimators intwoimportantclasses,namely es-timatorsofthe formRˆ=GDGT whereDisadiagonalmatrixand
Gisthe CholeskyfactorofS,orof theformRˆ=Udiag
ϕ
(
λ
)
UTwhereUdiag(
λ
)UTistheeigenvaluedecompositionofSandϕ
(λ
)isanon-linearfunction of
λ
.ThisseminalworkofSteingaverise to a great number of studies, see forinstance [7–13] and refer-encestherein.Asecond classofrobust estimatesisbasedon lin-ear shrinkageof the SCMto a target matrix(an approach which can be interpreted as an empirical Bayes technique), i.e., esti-matesof theformRˆ=α
Rt+β
S whereRt=I isthe mostwidelyspread choice, see e.g., [14–20]. Note that these techniques ap-plied with Rt=I achieve an affine transformation of the
eigen-values of S, while retaining the eigenvectors, andtherefore bear resemblancewith Stein’smethod, although theselection of
α
,β
may not be driven by the same principle. Robustness to a pos-sibly non Gaussian distribution has also been a topic of consid-erable interest andmany papers havefocused on robust estima-tionforellipticallydistributeddata,seee.g.,[21–30]andreferences therein.
Mostof the above cited works deal with estimation of a co-variancematrixfromasingledataset.Inthispaper,weconsidera situationwheretwodatasetsX1andX2areavailable,with
respec-tive covariancematricesR1 andR2.Thissituationtypically arises
inradarapplicationswhenonewishestodetectatargetburiedin clutter with unknown statistics [31,32].In order to infer the lat-ter,trainingsamplesaregenerallyused,whichhopefullysharethe
samestatisticsastheclutterinthecellundertest(CUT).However, it has been evidenced that clutter is most often heterogeneous
[31],withadiscrepancycomparedtotheCUTthatmaygrowwith the distance to the CUT [33]. Therefore, one is led to use some clustering that separates training samples, either based on their proximity to the CUT or by means of some statistical criterion, suchasthepowerselectedtraining[34].Thesamplessoselected are deemed to be representative ofthe clutter inthe CUT while othersarelessreliable,whichcorrespondstothesituation consid-eredherein. A second example is in the field of synthetic aper-tureradarinthecasewhereasceneisimagedontwoconsecutive days,withpossiblechangesinbetween[35].Finally,in hyperspec-tralimagery,theproblemoftarget oranomaly detectionleadsto averysimilarframework.Indeed,thebackgroundinapixelunder testhastobeestimatedfromthelocalpixelsaroundandpixels lo-catedfurtherapart[36].Inthepresentpaper,we assumethat R2
isclosetoR1,thecovariancematrixwewishtoestimate.SinceR2
differs frombutis closeto R1 we investigateusing both X1 and X2 toestimateR1.ThereasonforusingalsoX2 isthatdespiteits
covariancematrixisnot R1,itiscloseto.Additionally,one might
facesituationswherethe numberofsamplesinX1 is verysmall.
Thispaperconstitutesafirstapproachtothisspecificproblemand wefocushereinonthe mostnaturalapproach,namelymaximum likelihoodestimation. The objectiveis to figureout thepros and consof the latterand the conditionsunder which it is an accu-rateestimator.The paperisorganizedasfollows.Insection 2we formulate thestatistical assumptions:more precisely, we assume thatR11/2R−12 R11/2
|
R1 isa random matrixwitha Wishartdistribu-tion around the identity matrix, and we derive the joint distri-bution of (X1, X2). Section 3 is devoted to the derivation of the
maximumlikelihoodestimatorofR1from(X1,X2),takinginto
ac-countthepossibleconfigurationsregardingthenumberofsamples ineachdataset.Numericalsimulationsillustratetheperformance oftheMLEandcompareitwithexistingalternatives insection 4. Conclusionsandpossibleextensionsofthepresentworkaredrawn insection5.
2. Datamodel
Let us assume that we have two sets of measurements
X1(p× n1) and X2(p× n2) which are distributed according to X1=d N
(
0,R1,I)
and X2=d N(
0,R2,I)
where N(
0,,
)
de-notes the matrix-variate normal distribution whose density is
(
2π
)
−pn/2|
|
−n/2|
|
−p/2etr{
−12XT
−1X
−1
}
with |.|thedetermi-nant and etr{.} the exponential of the trace of a matrix. Note that we consider real-valued data here whereas in radar appli-cations it is customary to consider complex-valued signals. In
Appendix A we show how the results below can be readily
ex-tended to the complex case. Our goal in this paper is to esti-mateR1, usingboth X1 andX2 even if R1=R2.However we
as-sumethat the two matrices are close to each other. In orderto define a model that can reflect the proximity between R1 and R2, we note that the natural distancebetween them is givenby
d2
(
R1,R2
)
=kp=1log 2λ
k
(
GT1R−12 G1)
[37,38]whereG1 isasquare-root of R1, i.e., R1=G1GT1 and
λ
k(
GT1R−12 G1)
stands for the ktheigenvalue of GT
1R−12 G1. This matrix is pivotal in adaptive
detec-tionproblemsalso.Moreprecisely,inthecaseofacovariance mis-matchbetweenthetrainingsamplesandthedataundertest,itis shownin [39] that the performance of the well-known adaptive matchedfilterdependsessentiallyonthismatrix.Therefore,it be-comes natural to encapsulate the difference between R1 and R2
through the matrixW=GT
1R−12 G1 and its proximity to the
iden-tity matrix. There are of course different ways to translate this constraintinthe model.Forinstance afrequentist approach may beadvocated wherethe jointprobability densityfunction of(X1,
X2) would be maximized under the constraintthat the distance
between W and I is smaller than some value. Alternatively, and thisis what we elect here, one can resortto an empirical Bayes approach wherethe randommatrix W followssome prior distri-butionratherconcentratedaroundI.Formathematicaltractability, we choose a conjugate prior for W and we assume that W fol-lowsaWishartdistributionwith
ν
degreesoffreedomand param-etermatrixμ
−1I,i.e.,W=dWpν
,μ
−1I.Ofcourse,thisisaratherstrongassumptionwhosevaliditywouldbedifficulttocheck,e.g., on realdata. However, it is inaccordance withthe mere knowl-edgewehaveabouttherelationbetweenR1andR2,anditallows
fortractablederivations.
UsingthefactthatX1|R1andX2|R2areindependentand
Gaus-sian distributed with respective covariance matrices R1 and R2,
andsinceR2=G1W−1GT1,wethusassumethefollowingstochastic
model: p
(
X1,X2|
R1,W)
=(
2π
)
−p(n1+n2)/2|
R1|
−n1/2W−1R1 −n2/2 ×etr−1 2X T 1R−11 X1− 1 2X T 2G−T1 WG−11 X2 (1a) p(
W)
=μ
νp/2 2νp/2p
(
ν
/2)
|
W|
(ν−p−1)/2etr −1 2μ
W (1b)Note that E
W−1=
(
ν
− p− 1)
−1μ
I so that E{
R2}
=E
G1W−1GT1=
(
ν
− p− 1)
−1μ
R1:therefore,forE{
R2}
tobeequaltoR1,one mustselect
μ
=ν
− p− 1.Observealso thatW comescloserto I as
ν
grows large.Indeed, E{
W}
=ν
(
ν
− p− 1)
−1I and E(
W− E{
W}
)
2=p
ν
(
ν
− p− 1)
2I which goesto zero asν
→∞[40].
Themarginaldistributionof(X1,X2) isobtainedbyintegrating
(1)withrespecttoW,whichresultsin
p
(
X1,X2|
R1)
= W>0 p(
X1,X2|
R1,W)
p(
W)
dW =(
2π
)
−p(n1+n2)/2μ
νp/2 2νp/2p
(
ν
/2)
|
R1|
−(n1+n2)/2etr −1 2X T 1R−11 X1 × W>0|
W|
(ν+n2−p−1)/2etr−1 2Wμ
I+G−11 X2XT2G−T1 dW =(
2π
)
−p(n1+n2)/2μ
νp/2 2νp/2p
(
ν
/2)
2(ν+n2)p/2p
((
ν
+n2)
/2)
×|
R1|
−(n1+n2)/2μ
I+ G−11 X2XT2G−T1 −(ν+n2)/2 etr−1 2X T 1R−11 X1 =(
2π
)
−pn1/2|
R 1|
−n1/2etr −1 2X T 1R−11 X1 ×π
−pn2/2p
((
ν
+n2)
/2)
p
(
ν
/2)
|
μ
R1|
−n2/2I+XT2[μ
R1]−1X2 −(ν+n2)/2 (2)Inordertoobtainthethirdequality,wemadeuseofthefactthat, ifS=dWp
(
ν
,)
, S>0 p(
S)
dS=1⇒ S>0|
S|
(ν−p−1)/2etr−1 2S−1 dS =2νp/2
p
(
ν
/2)
|
|
ν/2 (3)Note that p(X1, X2|R1) in (2) can be factored as
p
(
X1,X2|
R1)
=f1(
X1,R1)
× f2(
X2,R1)
which shows that X1and X2 are marginally independent and that p
(
X1,X2|
R1)
=p
(
X1|
R1)
p(
X2|
R1)
with p(
X1|
R1)
∝etr −1 2XT1R−11 X1and p
(
X2|
R1)
∝I+XT2[μ
R1]−1X2 −(ν+n2)/2.Duetothemodeladopted fortherandommatrixW=GT
1R−12 G1,X2 followsamatrixvariate
Student distribution [41]. Therefore, the fact that R2=R1 results
X1 is Gaussian distributed with covariance matrix R1 while the
uncertaintyin R2 leadstoa StudentdistributionforX2.Thisisa
ratheroriginalsituationwhereonehastocarrycovariancematrix estimationfromtwodatasetswhicharemismatchedintheir dis-tributions. Thispeculiarity will resultin new schemes compared totheconventionalcaseofasinglesetwithgivendistribution,as detailednow.
3. Maximumlikelihoodestimation
Inthissection we addressestimation ofR1 from(X1,X2) and
we focusonthemostnaturalestimator,i.e., themaximum likeli-hood estimator.From (2),the log-likelihoodfunction is,up to an additiveandconstantterm
f(R1)=− n1+n2 2 log|R1|− ν+n2 2 logI+μ −1R−1 1 S2− 1 2Tr R−11 S1 =ν− n1 2 log|R1|− ν+n2 2 logR1+μ−1S2− 1 2Tr R−11 S1 (4)
where S1=X1XT1 and S2=X2XT2. Differentiating the previous
equation andusingthefact thatd
|
R|
=|
R|
TrR−1dRanddR−1= −R−1
(
dR)
R−1,weobtainthefollowingequationthattheMLsolu-tionshouldsatisfy
(
ν
− n1)
R−11 −(
ν
+n2)
R1+
μ
−1S2 −1+R−11 S1R−11 =0 (5)
Inordertosolve(5),wemustinvestigatevariousconfigurationsfor (n1,n2) asthesolutionwilldepend onthem.Before goingtothe
technicaldetails of eachcase, we give an overviewoftheresults obtained.
3.1. Summaryofresults
As is illustrated below, the expression of the maximum like-lihood estimator depends on the respective valuesof n1 and n2.
In the sequel three cases will be distinguished: a first situation where n1<pandn2≥ p, asecond one whichis themirror
situa-tion,namelyn1≥ pandn2<p,andfinallyathirdmorechallenging
casewheren1<p,n2<pandn1+n2≥ p.
Inthefirst[respectivelysecond]case, theMLsolutionisgiven by (11) [resp.(21)]: itwill be shownthat theestimation process entails whitening of X1 [resp. X2] by the inverse of the
square-rootofthesamplecovariancematrixofX2 [resp.X1],followedby
shrinkageofeigenvaluesandfinallycolorizationbythesquare-root ofthesamplecovariancematrixofX2[resp.X1].Thetechnique of
eigenvalue shrinkageisrather well knownbutusually applied to the SCMofa singleset: herein,due tothe presenceof twodata sets, this technique is applied to one data set after it has been whitened by the second one. Interestingly enough, the ML solu-tioncan alsobe writtenas(14)[resp.(22)],thatisasaweighted sumof theSCM ofeach dataset, wherethe weighting matrixis diagonal for one set of samples, andnon diagonal for the other set.
Finally, when n2<p, n1<p and n1+n2≥ p, the procedure
includes a partitioning between the subspace spanned by the columns of X2 and its orthogonal complement. In the former,
shrinkageofeigenvaluesis usedwhile,inthelatter,projectionof theSCMofX1isretained.
3.2. Casen1<pandn2≥ p
We consider first the casewhere n1<p and n2≥ p, i.e., n1 is
not large enough forS1 to be positive definite andone needs to
use X2 inorderto estimate R1, eventhough R2=R1. Eq.(5)can
berewrittenas
(
ν
− n1)
R−11 R1+μ
−1S2 −(
ν
+n2)
I+R−11 S1R−11 R1+μ
−1S2 =0 ⇒−(
n1+n2)
I+(
ν
− n1)
μ
−1R−11 S2+R−11 S1+μ
−1R−11 S1R−11 S2=0 ⇒−(
n1+n2)
R1S−12 R1+(
ν
− n1)
μ
−1R1+S1S2−1R1+μ
−1S1=0 ⇒R1S−12 R1−ν
− n 1μ
(
n1+n2)
I+n 1 1+n2 S1S−12 R1 −μ
(
n1 1+n2)
S1=0 (6)LetS2=L2LT2 andletusdefineR˜12=L2−1R1L−T2 andS˜1=L−12 S1L−T2 .
Then, pre-multiplying the previous equation by L−12 and post-multiplyingitbyL−T2 ,weobtain ˜ R2 12−
ν
− n1μ
(
n1+n2)
I+ 1 n1+n2 ˜ S1 ˜ R12− 1μ
(
n1+n2)
˜ S1=0 (7)Let w be an eigenvector of R˜12 associated with eigenvalue
ξ
.Then,
ξ
2w−ξ
ν
− n1μ
(
n1+n2)
I+ 1 n1+n2 ˜ S1 w− 1μ
(
n1+n2)
˜ S1w=0 ⇒ 1μ
(
n1+n2)
+ξ
n1+n2 ˜ S1w=ξ
ξ
−ν
− n1μ
(
n1+n2)
w (8)whichimpliesthat wisalso aneigenvector ofS˜1.Either itis
as-sociatedwithazeroeigenvalue(thereare p− n1 ofthem) and,in
thiscase,
ξ
= ν−n1μ(n1+n2),orit is associatedwith astrictly positive
eigenvalue
λ
andξ
satisfiesthesecond-orderpolynomialequationξ
2−ξ
ν
− n1μ
(
n1+n2)
+λ
n1+n2 −μ
(
λ
n1+n2)
= 0 (9)The above polynomial has obviously two real-valued roots, one being negative, the other being positive, and thus the latter is the eigenvalue of R˜12. To summarize, if we let L−12 X1=U
VT=
n1
k=1
σ
kukvTk be the singular value decomposition of L−12 X1, wehave ˜ R12= n1 k=1
ξ
kukuTk+ν
− n1μ
(
n1+n2)
p k=n1+1 ukuTk = n1 k=1ξ
k−μ
(
ν
− n1 n1+n2)
ukuTk+ν
− n1μ
(
n1+n2)
I (10)where
ξ
kisthepositiverootof(9)withλ
substitutedforσ
2k.The MLEofR1 isthus R1= n1 k=1
ξ
k−μ
(
ν
n− n1 1+n2)
L2ukuTkL T 2+ν
− n1μ
(
n1+n2)
S2 (11)Itisinstructivetostudytheformofthissolution.Theoriginaldata
X1 isfirst adaptively whitened by L−12 and its samplecovariance
matrixiscomputed.Theeigenvectorsofthelatterareretainedand theeigenvaluesaremodified. Then,dataisre-coloredby L2.Note
thatthetechniqueofregularizingeigenvalueswhilekeeping eigen-vectors is classical in robust covariance matrix estimation. How-ever,thistechniqueusually appliesto onesetofsamples.Here it appliestoone setofsamplesafterithasbeen“whitened” bythe other set.Indeedawhitening-colorizationoperation isperformed pre and posteigenvalues modification. Anotherimportant obser-vationisthatthetransformation
λ
→ξ
preservestheorderofthe eigenvalues,an importantissueinStein’sestimation using eigen-value decomposition [42–44]. Thiscan be seen by differentiating(9)withrespectto
λ
,whichgives∂ξ
∂λ
2ξ
−ν
− n1μ
(
n1+n2)
−λ
n1+n2 =ξ
n1+n2 + 1μ
(
n1+n2)
(12)Since the bracketed term on the left-hand side of the previous equation is positive, it follows that
∂
ξ
/∂
λ
>0 and therefore the transformationpreservesorderingoftheeigenvalues.Thisproperty willholdtrueintheothercasesdevelopedbelow.AcommentisalsoinorderregardingthebehavioroftheMLE when
ν
grow large,i.e., when W comes closertoI. Indeed,withμ
=ν
− p− 1,onehas lim ν→∞ξ
k= 1+λ
k n1+n2 ⇒ lim ν→∞R˜12= 1 n1+n2 ˜ S1+I ⇒ lim ν→∞R1= 1 n1+n2 L2 L−12 S1L−T2 +I LT 2 =n 1 1+n2 [S1+S2] (13)whichshowsthat,asWcomesclosertoI,i.e.,asR2 comescloser
to R1, the MLE is simply the sample covariance matrix of the
wholedata,asmaybeexpected.
Finally, another interpretation of theMLE can be obtainedby rewritingtheMLEinanotherform.Notingthattherangespaceof
L−12 X1coincides with the range space of u1,. . .,un1, it follows that
uk=L−12 X1
η
kforsome vectorη
k.Therefore,(11)canbe rewrittenas R1=X1
n1 k=1ξ
k−ν
− n1μ
(
n1+n2)
η
kη
Tk XT 1+ν
− n1μ
(
n1+n2)
S2 =X11XT1+
ν
− n1μ
(
n1+n2)
X2XT2 (14)Consequently,theMLEisaweightedversionofthesample covari-ancematricesofeachdataset.In fact,itcanbe shown(weomit thedetails)thatifasolutionto(5)issought whichisoftheform
(14),then
1 issolutiontotheequation
2 1+
ν
− n 1μ
(
n1+n2)
XT 1S−12 X1 −1 − 1 n1+n2 I1 −
ν
+n2μ
(
n1+n2)
2 XT 1S−12 X1 −1 =0 (15)Itensuesthat
1 andXT1S−12 X1sharethesameeigenvectors,which
are indeed the rightsingular vectors vk of L−12 X1. Moreover, the
eigenvalues
γ
kof1satisfy
γ
2 k +γ
k(
ν
− n 1)
σ
k−2μ
(
n1+n2)
− 1 n1+n2 −(
ν
+n2)
σ
k−2μ
(
n1+n2)
2 = 0 (16)To summarize, the MLE of R1 can either be written as in
(11)wheretheeigenvalues
ξ
karerelatedtotheeigenvaluesλ
kofL−12 S1L−T2 by(9),orasin(14)where
1isgivenby(15).
3.3.Casen2<pandn1≥ p
We now consider a situation where n2<p and n1≥ p under
whichonehasasufficientnumberof“good” samplesX1 forS1to
befull-rank.Yet,itmightbeofinteresttouseX2 eventhoughits
covariancematrixR2=R1.ThederivationoftheMLEfollowsalong
thesamelinesasinthepreviouscase,exceptthatnowS2is
rank-deficientandS1 isfull-rank. StartingfromtheML Eq.(5),oncan
write
(ν
− n1)
R−11 R1+μ
−1S2 −(ν
+n2)
I+R−11 S1R−11 R1+μ
−1S2 =0 ⇒−(
n1+n2)
I+(ν
− n1)
μ
−1R−11 S2+R−11 S1+μ
−1R−11 S1R−11 S2=0 ⇒−(
n1+n2)
R1S−11 R1+(ν
− n1)
μ
−1R1S−11 S2+R1+μ
−1S2=0 ⇒R1S−11 R1− R1(ν
− n1)
μ(
n1+n2)
S−11 S2+ 1 n1+n2 I −μ(
1 n1+n2)
S2=0 (17)LetS1=L1LT1andletusdefineR˜11=L1−1R1L−T1 andS˜2=L−11 S2L−T1 .
Then, taking the transpose of the previous equation, pre-multiplyingbyL−11 andpost-multiplyingbyL−T1 ,weobtain
˜ R2 11−
(
ν
− n1)
μ
(
n1+n2)
˜ S2+ 1 n1+n2 I ˜ R11− 1μ
(
n1+n2)
˜ S2=0 (18)As before, it can be seen that R˜11 andS˜2 share the same
eigen-vectors. The p− n2 eigenvectorsofS˜2 associatedwithzero
eigen-value will correspond to a constant eigenvalue for R˜11 equal to
(
n1+n2)
−1.Astrictlypositiveeigenvalueζ
ofR˜11 isrelatedtoitscounterpart
λ
ofS˜2byζ
2−ζ
λ
(
ν
− n1)
μ
(
n1+n2)
+ 1 n1+n2 −λ
μ
(
n1+n2)
=0 (19)Now,ifweletL−11 X2=Y
ZT=nk=12
θ
kykzTk be thesingularvaluedecompositionofL−11 X2,wehave ˜ R11= n2 k=1
ζ
kykyTk+ 1 n1+n2 p k=n2+1 ykyTk = n2 k=1ζ
k− 1 n1+n2 ykyTk+ 1 n1+n2 I (20)where
ζ
kisthepositiverootof(19)withλ
substitutedforθ
k2.TheMLEofR1becomes R1= n2 k=1
ζ
k− 1 n1+n2 L1ykyTkLT1+ 1 n1+n2 S1 (21)Again, since the range space of L−11 X2 is spanned by y1,...,yn2,
onehasyk=L−11 X2
χ
kandhenceR1=X2
n 2 k=1ζ
k− 1 n1+n2χ
kχ
Tk XT2+ 1 n1+n2 X1XT1 =X22XT2+ 1 n1+n2 X1XT1 (22)
Notethat (22)differs from(14) inthat theweighting matrix ap-pliedbetweenX1 andXT1 isnow diagonalwhilethat applied
be-tweenX2 andXT2 isnolongerdiagonal.Furthermore,ifone looks
forasolutionoftheform(22)then
2 isthesolutionto
2 2+
(
XT 2S−11 X2)
−1 n1+n2 −(
ν
− n1)
μ
(
n1+n2)
I2 −
ν
+n2μ
(
n1+n2)
2(
XT2S−11 X2)
−1=0 (23)2andXT2S−11 X2sharethesameeigenvectors(actuallyzk)andthe
eigenvalue
γ
kof2isobtainedasthepositivesolutionto
γ
2 k +γ
kθ
−2 k n1+n2−(
ν
− n1)
μ
(
n1+n2)
−θ
k−2(
ν
+n2)
μ
(
n1+n2)
2 = 0 (24)Remark 1. When n1≥ p andn2≥ p, the previous techniques can
stillbeused,withslightvariations.Inthiscase,S˜1andS˜2arenow
full-rank, andtherefore the MLE ofR1 is given by (11) but with
thefirst sumextendedto peigenvectors(S˜1 has nowpnon-zero
eigenvalues),andthesecondtermvanishes.TheMLsolutionisalso givenby (21) withthefirst termextended to peigenvectorsand thesecondtermvanishing.
3.4. Casen1<p,n2<pandn1+n2≥ p
We nowconsider themore challengingcasewhere neitherof the two data sets contains enough samples for their respective samplecovariance matrices to be full rank, andthus it becomes mandatorytocombinebothsets.Thissituationisabittrickierand requiressomecarefulness.Goingbackto(5),theMLEofR1should
satisfy
(
ν
− n1)
R−11 −(
ν
+n2)
R1+μ
−1S2 −1 +R−11 S1R−11 =0 ⇒(
ν
− n1)
R−11 +R−11 S1R−11−
(
ν
+n2)
R−11 −
μ
−1R−1 1 X2 I+μ
−1XT 2R−11 X2 −1 XT2R−11 =0 ⇒(
n1+n2)
R1=(
ν
+n2)
μ
−1X2 I+μ
−1XT 2R−11 X2 −1 XT 2+X1XT1 (25)Before pursuing, itis worthylooking atthe previous equation to get some insight. We observe that the projection of R1 onto the
subspace orthogonal to X2 will be equal to the projection of S1
onthissamesubspace.Thissuggeststouseadecompositionthat splits data in R
(
X2)
and its orthogonal complement. To do so,letusconsidertheSVDofX2 asX2=CDET=
Ca Cb Da0
ET=
CaDaET, where C is p× p, Ca is p× n2 and Da is the n2× n2
di-agonal matrix ofsingularvalues. Letusalso operatea changeof coordinatesanddefine
=CTR1C= CT aR1Ca CTaR1Cb CT bR1Ca CTbR1Cb =
aa
ab
ba
bb (26)
With these definitions, it is straightforward to show that
XT
2R−11 X2=EDa
−1a.bDaET where
a.b=
aa−
ab
−1bb
ba and
thus X2
I+μ
−1XT 2R−11 X2 −1 XT2=CaDaI+μ
−1Da−1a.bDa−1DaCTa =Ca D−2a +
μ
−1−1a.b −1 CT a (27)
Therefore,pre-multiplying(25)byCTandpost-multiplyingitbyC,
weobtain
(
n1+n2)
aa
ab
ba
bb =
(
ν
+n2)
μ
−1 D−2a +μ
−1−1a.b −1 0 0 0 + CT aS1Ca CTaS1Cb CT bS1Ca C T bS1Cb (28)
whichimmediatelyimpliesthat
(
n1+n2)
ba=CTbS1Ca
(
n1+n2)
bb=CTbS1Cb (29)
ThiscorroboratesthecommentswemadeafterEq.(25)sinceone has
(
n1+n2)
CbbbCTb=
(
n1+n2)
CbCTbR1CbCTb=P⊥X2R1P ⊥ X2 =CbCTbS1CbCTb=P⊥X2S1P ⊥ X2 (30)Itnowremainstofind
aaorequivalently
a.b.Towardsthisend,
notethat
(
n1+n2)
aa=
(
ν
+n2)
μ
−1 D−2a +μ
−1−1 a.b −1 +CT aS1Ca (31) However,
(
n1+n2)
aa=
(
n1+n2)
a.b+
ab
−1bb
ba =
(
n1+n2)
a.b+CTaS1Cb CT bS1Cb −1 CT bS1Ca (32) whichleadsto
(
n1+n2)
a.b=
(
ν
+n2)
μ
−1 D−2a +μ
−1−1a.b −1 +CTS 1C a.b (33)
For the sake of notational convenience, let us denote
F=
CTS1C
a.b. Post-multiplying the previous equation by
D−2a +μ
−1−1a.b resultsin (n1+n2)a.bD−2a − (ν− n1)μ−1I+FD−2a −μ−1F−1 a.b=0 ⇒a.bD−2a a.b− ν− n 1 μ(n1+n2)I+ 1 n1+n2FD −2 a a.b−μ(n1 1+n2)F=0 ⇒˜2 a.b− ν− n 1 μ(n1+n2) I+n 1 1+n2 ˜ F˜a.b− 1 μ(n1+n2) ˜ F=0 (34)
where
˜a.b=D−1a
a.bD−1a andF˜=D−1a FD−1a .Similarlytowhatwas
done before,
˜a.b and F˜ share the same eigenvectors. When the eigenvalue
λ
ofF˜ is zero(there areactually p− n1 ofthem [45])thecorrespondingeigenvalue
φ
of˜a.b is μ(νn−n1+1n2).Foreachofthe
r=n1+n2− p non-zero
λ
,thecorrespondingφ
istheuniquepos-itiverootof
φ
2−ν
− n1μ
(
n1+n2)
+λ
n1+n2φ
−μ
(
λ
n1+n2)
= 0 (35)Therefore,ifu˜karetheeigenvectorsofF˜,
˜a.bisgivenby ˜
a.b= r k=1
φ
ku˜ku˜Tk+ν
− n1μ
(
n1+n2)
p k=r+1 ˜ uku˜Tk = r k=1φ
k−ν
− n1μ
(
n1+n2)
˜ uku˜Tk+ν
− n1μ
(
n1+n2)
I (36)Once
˜a.b iscomputed,
a.b=Da
˜a.bDa and
aacanbe obtained
from(32).Finally,theMLEofR1 isgivenbyC
CT.
We now present an alternative way to compute the solution. From (25), it appears that R1 can be written as
(
n1+n2)
R1= X1XT1+X22XT2 where
2=
(
ν
+n2)
μ
−1I+μ
−1XT2R−11 X2 −1. Let
X=
X1 X2and let XT=QR be the QR decomposition of XT
withQa
(
n1+n2)
× p semi-unitary matrix,i.e., QTQ=Ip.LetuspartitionQasQ=
Q1 Q2 sothatXT1=Q1RandXT2=Q2R.Then,one
has
(
n1+n2)
R1=X1XT1+X22XT2 =RTQT 1Q1+QT2
2Q2 R andtherefore
(
n1+n2)
−1XT2R−11 X2=Q2 QT 1Q1+QT22Q2 −1 QT 2 =Q2 I+QT2
(
2− I
)
Q2 −1 QT2 =Q2I− QT 2
(
2− I
)
−1+Q2QT2 −1 Q2 QT 2 =Q2QT2− Q2QT2(
2− I
)
−1+Q2QT2 −1 Q2QT2 =Q2QT2 −1 +
2− I −1 (37) Consequently,ifwedefineB2=Q2QT2 −1 − I
−1 2 =
(
ν
+n2)
−1μ
I+μ
−1XT 2R−11 X2 =(
ν
+n2)
−1μ
I+(
ν
+n2)
−1XT2R−11 X2 =(
ν
+n2)
−1μ
I+(
ν
+n2)
−1(
n1+n2)
(
2+B2
)
−1 (38)Pre-multiplying the previous equation by
(
2+B2
)
andpost-multiplying by
2, weobtain the followingsecond-order
polyno-mialequation:
2 2+
B2−
ν
− nμ
1I2−
ν
+ n2μ
B2=0 (39)It follows that
2 and B2 share the same eigenvectors. If
λ
is anon-zeroeigenvalueofB2(therearen1+n2− p of them),then the
correspondingeigenvalue
γ
of2istheuniquepositiveroottothe
followingpolynomialequation
γ
2+λ
−ν
− n1μ
γ
−ν
+n2Fig. 1. Average distance between ˆ R1 and R 1 in case 1.
If
λ
=0 thenγ
=(
ν
− n2)
μ
−1. Finally, the solution2 is given by
2= r k=1
γ
kbkbTk+ν
− n1μ
n2 k=r+1γ
kbkbTk = r k=1γ
k−ν
− nμ
1 bkbTk+ν
− n1μ
I (41)wherebkaretheeigenvectorsofB2.Notethat
B2b=
λ
b⇒(
Q2QT2)
−1b− b=λ
b⇒
(
Q2QT2)
−1b=(
1+λ
)
b⇒
(
Q2QT2)
b=(
1+λ
)
−1bandhencebisaneigenvectorofQ2QT2 associatedwitheigenvalue
(
1+λ
)
−1, or equivalentlya right singularvector of QT2. Observe
alsothat,sinceXT
2=Q2RandXXT=RTQTQR=RTR,onehas
Q2QT2=XT2R−1R−TX2=XT2
(
RTR)
−1X2=XT
2
(
XXT)
−1X2=XT2(
X1XT1+X2XT2)
−1X2Hence,ifweletS=X1XT1+X2XT2=LLT,thenQT2 andL−1X2 share
thesamerightsingularvectors.
4. Numericalsimulations
In this section, we evaluate numerically the performance of the MLE presented above through Monte-Carlo simulations. We consider a scenario where the size of the observation space is
p=128.Threecaseswillbe considered forthe covariancematrix
R1, which correspond to different kind of processes. In the first
casethe(k,)elementisR1
(
k,)
=Pρ
|k−|+δ
(
k,)
withρ
=0.7.ThesecondcaseassumesthatR1
(
k,)
=Pe−0.5(2πσf|k−|) 2+
δ
(
k,)
with
σ
f=0.02. In the third case, R1(
k,)
=rAR(
|
k−|
)
+δ
(
k,)
whererAR
(
|
k−|
)
correspondstothecorrelationofanautoregres-siveprocesswhosepolesarelocatedat0.95e± i2π0.05,0.9e± i2π0.15,
0.9e± i2π0.18. FinallyP=100 and r
AR
(
0)
=100. The correspondingprocesses are rather lowpass in case1 and 2,while case 3 con-cernsprocesseswithsharppeaksintheirspectrum.Ineach simu-lationX1isgeneratedfromaGaussiandistributionwithcovariance
matrixR1.Then W isgeneratedfromaWishartdistribution with
ν
=p+2degreesoffreedomandparametermatrix(
ν
− p− 1)
−1IandR2iscomputedasR2=G1W−1GT1.ThenX2isgeneratedfrom
aGaussiandistributionwithcovariancematrixR2.
The MLE is compared with four competitors. The first is the sample covariancematrixbased on all samples,i.e.,
(
n1+n2)
−1Swhere S=X1XT1+X2X2T. The second is of the form GSCMDGTSCM,
where GSCM isthe Choleskyfactorof S, andDis a diagonal
ma-trix which is chosen to minimize Stein’s loss and is given by
Dk,k=1/
(
n1+n2+p− 2k+1)
.Thethird isofthesameformbutis meant at minimizing the natural distance between R1 and
its estimate: as shown in [13], it amounts to choosing Dk,k=
exp
−E
logχ
2n1+n2−i+1
. Finally, we consider the class of or-thogonally invariant estimators of the form USCMdiag
ϕ
(
λ
)
UT SCM whereS=USCMdiagλ
UTSCM istheeigenvaluedecompositionofS
and
ϕ
(
λ
)
=ϕ
1(
λ
)
...ϕ
p(
λ
)
.Stein showed that the choiceϕ
k=λ
k/(
n1+n2− p+1+2λ
kj=k(
λ
k−λ
j)
−1)
is the best withrespect to Stein’s loss. However this choice has two drawbacks: it can result in some
ϕ
k<0 and it does not preserve the order of theeigenvaluesλ
k, whichis aproblem [42].In ordertoover-come theseproblems,Stein proposed an isotonizingscheme that guarantees
ϕ
k>0 andpreservesorder, see[46] fordetails ofthisscheme.We considerthisimprovedestimatorasthefourth alter-native. The figure of merit for all estimators will be the natural
Fig. 4. Average distance between ˆ R1 and R 1 in case 1 versus ν. n 1 + n 2 = 2 p.
distancebetweenthetrueandthe estimatedcovariancematrices
d2
(
R1,Rˆ1
)
=kp=1log2λ
k(
R−11 Rˆ1)
.The simulation resultsare shownin Figs. 1-4 where we con-siderdifferentvaluesforthetotalnumberofsamplesn=n1+n2,
namelyn=p,n=3p/2andn=2p.Themainconclusions regard-ingthesesimulationsarethefollowing:
• the MLE is shown to outperform its competitors when n1 is
small and n is large enough, typically it has the best perfor-manceforn=3p/2andn=2p. Onecanobservethat the im-provement achievedby the MLEismore importantwhen n= 2p andn1 issmall,i.e.,whenonehasveryfewsamplesdrawn
fromR1 andalargemajorityofsamplesdrawnfromR2.
• in contrast, when n=p the other methods can perform bet-ter thantheMLE,especiallywhenn1 isaboveathreshold,i.e.,
whenthenumberof“good” samplesislargeenough.
• among the Stein-like methods, that based on eigenvalue de-composition (with isotonizing) is the best, but the method based on Cholesky factorization and minimization of the geodesicdistancecomesveryclose.
Inafinalsimulation,weevaluatetheinfluenceof
ν
:recallthat, asν
increases,WisclosertoI andthusR2 isclosertoR1,whichmeansthat X2 shouldbenearlyasinformativeasX1.InFig.4we
display the average distance as a function of
ν
in case 1 withn1+n2=2p. Itis observedthat, as
ν
increases, theperformanceofallestimatorsimprove.TheproposedMLEisnolongerthemost accurate above a threshold,where it isdominated by the Stein’s estimatorbased on the eigenvalues of the whole sample covari-ancematrix.However,theproposedMLEstillperformsbetterthan allotherestimators.
5. Conclusions
Inthispaper,weconsideredtheproblemofestimatinga covari-ancematrix R1 fromtwo datasets, oneset X1 whosecovariance
matrixisactuallyR1 andanothersetX2whosecovariancematrix R2 isdifferentbutclosetoR1.SincethedistancebetweenR1 and R2dependsontheeigenvaluesofW=GT1R−12 G1,weembeddedthe
latterinastatisticalmodelandassumedthatitfollowedaWishart distributionaroundtheidentitymatrix.Weshowedthatthe prob-lemisthatofestimatingR1fromtwodatasetswithdifferent
dis-tributions.Themaximumlikelihoodestimatorwasderivedandits expressionwasshowntodependonthenumberofsamplesinX1
andX2. TheMLE wasshowntoperform quitewell, ascompared
to stateof theartalgorithms, atleastwhen thenumberof sam-ples in X1 is small and the total number of samples n is large
enough. However, as ina classical framework witha singledata set,thereisroomfromimprovementoftheMLE,especiallyinlow samplesupport.Therefore,futurework should bedevoted to im-proving the MLE in this situation. Forinstance, one could study howtheMLEcouldbe regularizedorcouldinvestigatewhethera Stein-like approach is possible for thistwo datasets framework. Alternatively,a frequentistapproach wherejointestimationofR1
andWis performedundersome constraintsconstitutesaworthy pathofinvestigation.
DeclarationofCompetingInterest
None.
AppendixA. Extensiontocomplex-valueddata
In this appendix, we briefly show that the derivations con-cerning the maximum likelihood estimatorcan be extended in a straightforward mannerto the complexcase. Letusassume here that X1
|
R1=d CN(
0,R1,I)
and X2|
R2=dCN(
0,R2,I)
arecomplex-valued data and distributed according to a circularly symmetric complex-valued matrix-variatenormaldistribution.LetR1=G1GH1
-whereH standsforthe Hermitiantranspose-andR
2=G1W−1GH1
whereW=dCWp
ν
,μ
−1Ifollowsacomplex Wishartdistribution.Thestatistical(complex-valued)modelisthus
p
(
X1,X2|
R1,W)
=π
−p(n1+n2)|
R1|
−n1W−1R1 −n2 ×etr−XH 1R−11 X1− X2HG−H1 WG−11 X2(A.1a) p
(
W)
= ˜μ
νpp
(
ν
)
|
W|
ν−petr{
−μ
W}
(A.1b)Notethat,inthecomplexcase, E
W−1=
(
ν
− p)
−1μ
I[40] sothat E
{
R2}
=EG1W−1GH1
=
(
ν
− p)
−1μ
R1. Therefore, for E{
R2}
tobe equalto R1,onemusthave
μ
=ν
− p inthe complexcase,insteadof
μ
=ν
− p− 1intherealcase. Themarginaldistributionof(X1,X2)isnowp
(
X1,X2|
R1)
= W>0 p(
X1,X2|
R1,W)
p(
W)
dW =π
−p(˜n1+n2)μ
νpp
(
ν
)
|
R1|
−(n1+n2)etr −XH 1R−11 X1× W>0
|
W|
ν+n2−petr−Wμ
I+G−1 1 X2XH2G−H1dW =
π
−p(n1+n2)μ
νp˜p
(
ν
)
˜p
(
ν
)
|
R1|
−(n1+n2) etr−XH 1R−11 X1μ
I+G−11 X2XH2G−H1 −(ν+n2) =π
−pn1|
R 1|
−n1etr −XH 1R−11 X1×
π
−pn˜2˜p
(
ν
)
p
(
ν
)
|
μ
R1|
−n2I+XH2[μ
R1]−1X2 −(ν+n2) (A.2)andwerecoverthefactthatX1|R1isGaussiandistributedandthat X2|R1isStudentdistributed.From(A.2),thelog-likelihoodfunction
is,uptoanadditiveandconstantterm
˜ f(R1)=−(n1+n2)log|R1|−(ν+n2)logI+μ−1R−11 S2− Tr R−11 S1 =(ν− n1)log|R1|−(ν+n2)logR1+μ−1S2− Tr R−11 S1 (A.3)