HAL Id: hal-01634877
https://hal-upec-upem.archives-ouvertes.fr/hal-01634877
Submitted on 14 Nov 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Data-driven kernel representations for sampling with an
unknown block dependence structure under correlation
constraints
Guillaume Perrin, Christian Soize, N. Ouhbi
To cite this version:
Guillaume Perrin, Christian Soize, N. Ouhbi. Data-driven kernel representations for sampling with an unknown block dependence structure under correlation constraints. Computational Statistics and Data Analysis, Elsevier, 2018, 119, pp.139-154. 10.1016/j.csda.2017.10.005. hal-01634877
Data-driven kernel representations for sampling with an unknown block dependence structure under correlation constraints

G. Perrin (a), C. Soize (b), N. Ouhbi (c)

(a) CEA/DAM/DIF, F-91297, Arpajon, France
(b) Université Paris-Est, MSME UMR 8208 CNRS, Marne-la-Vallée, France
(c) Innovation and Research Department, SNCF, Paris, France
Abstract

The multidimensional Gaussian kernel-density estimation (G-KDE) is a powerful tool to identify the distribution of random vectors when the maximal information is a set of independent realizations. For these methods, a key issue is the choice of the kernel and the optimization of the bandwidth matrix. To optimize these kernel representations, two adaptations of the classical G-KDE are presented. First, it is proposed to add constraints on the mean and the covariance matrix in the G-KDE formalism. Secondly, it is suggested to separate into different groups the components of the random vector of interest that could reasonably be considered as independent. This block by block decomposition is carried out by looking for the maximum of a cross-validation likelihood quantity that is associated with the block formation. This leads to a tensorized version of the classical G-KDE. Finally, it is shown on a series of examples how these two adaptations can improve the nonparametric representations of the densities of random vectors, especially when the number of available realizations is relatively low compared to their dimensions.

Key words: Kernel density estimation, optimal bandwidth, nonparametric representation, data-driven sampling
1. Introduction

The generation of independent realizations of a second-order $\mathbb{R}^d$-valued random vector $X$, whose distribution $P_X(dx)$ is unknown and can only be approximated from a finite set of $N \geq 1$ realizations, is a central issue in uncertainty quantification, signal processing and data analysis. One possible approach to address this problem is to suppose that the searched distribution belongs to an algebraic class of distributions, which can be mapped from a relatively small number of parameters (for instance, the multidimensional Gaussian distribution). Generating new realizations of random vector $X$ then amounts to identifying the parameters that best suit the available data, and then to sampling independent realizations associated with the identified parametric distribution. However, when the dependence structure associated with the components of $X$ is complex, such that its distribution can be concentrated on an unknown subset of $\mathbb{R}^d$, the definition of a relevant parametric class to represent $P_X(dx)$ can become very difficult. In that case, nonparametric approaches are generally preferred to these parametric constructions [16, 11]. In particular, the multidimensional Gaussian kernel-density estimation (G-KDE) method approximates the probability density function (PDF) of $X$, if it exists, as a sum of $N$ multidimensional Gaussian PDFs, which are centred at each available independent realization of $X$. Optimizing the covariance matrices associated with these $N$ PDFs is a central issue, as they control the influence of each realization of $X$ on the final approximation of $P_X(dx)$. Even if there are many contributions on this subject (see for instance [5, 4, 3, 6, 18]), when the dimension $d$ of $X$ is high ($d \sim 10-100$), constant covariance matrices parametrized by a unique scaling parameter are generally considered. In particular, the Silverman rule of thumb [12] for choosing this scaling parameter is widely used because of its simplicity and its good asymptotic behaviour when $N$ tends to infinity. However, for fixed values of $N$, this Silverman choice often overestimates the scattering of $P_X(dx)$, and can have difficulties concentrating the newly generated realizations of $X$ on their regions of high probability.

To overcome this problem, a two-step procedure is introduced. First, we suggest centring and uncorrelating the random vector $X$ (using a Principal Component Analysis for instance). Then, based on the maximization of a global "Leave-One-Out" likelihood, the idea is to separate into different blocks the elements of $X$ that could reasonably be considered as statistically independent. For a given number of realizations of $X$, the fewer elements there are in each group, the more chance we have to correctly infer the multidimensional distribution of each sub-vector constituted of each group's elements, and so the better should be the estimation of the PDF of $X$. Nevertheless, the identification of this (unknown) block decomposition is a difficult combinatorial problem. This paper therefore presents two algorithms to find relevant block decompositions in a reasonable computational time.

The outline of this work is as follows. Section 2 presents the theoretical framework associated with the G-KDE and the optimization of the covariance matrices on which it is based. The block decomposition we propose is then detailed in Section 3. At last, the efficiency of the method is illustrated on a series of analytic and industrial examples in Section 4.
2. Theoretical framework

Let $X := \{X(\omega), \omega \in \Omega\}$ be a second-order random vector defined on a probability space $(\Omega, \mathcal{T}, P)$, with values in $\mathbb{R}^d$. We assume that the probability density function (PDF) of $X$ exists. By definition, this PDF, which is denoted by $p_X$, is an element of $\mathcal{M}_1(\mathbb{R}^d, \mathbb{R}_+)$, the set of positive-valued functions whose integral over $\mathbb{R}^d$ is 1. It is assumed that the maximal available information about $p_X$ is a set of $N > d$ independent and distinct realizations of $X$, which are gathered in the deterministic set $S(N) := \{X(\omega_n), 1 \leq n \leq N\}$. Given these realizations of $X$, the kernel estimator of $p_X$ is

$$\widehat{p}_X(x; H, S(N)) = \frac{\det(H)^{-1/2}}{N} \sum_{n=1}^{N} K\left( H^{-1/2} (x - X(\omega_n)) \right), \quad (1)$$

where $\det(\cdot)$ is the determinant operator, $K$ is any function of $\mathcal{M}_1(\mathbb{R}^d, \mathbb{R}_+)$, and $H$ is a $(d \times d)$-dimensional positive definite symmetric matrix that is generally referred to as the "bandwidth matrix". In the following, we focus on the classical case where $K$ is the Gaussian multidimensional density. Hence, the PDF $p_X$ is approximated by a mixture of $N$ Gaussian PDFs, for which the means are the available realizations of $X$ and the covariance matrices are all equal to $H$:

$$\widehat{p}_X(x; H, S(N)) = \frac{1}{N} \sum_{n=1}^{N} \phi\left(x; X(\omega_n), H\right), \quad x \in \mathbb{R}^d, \quad (2)$$

where for any $\mathbb{R}^d$-dimensional vector $\mu$ and for any $(\mathbb{R}^d \times \mathbb{R}^d)$-dimensional symmetric positive definite matrix $C$, $\phi(\cdot; \mu, C)$ is the PDF of an $\mathbb{R}^d$-dimensional Gaussian random vector with mean $\mu$ and covariance matrix $C$:

$$\phi(x; \mu, C) := \frac{\exp\left( -\frac{1}{2} (x - \mu)^T C^{-1} (x - \mu) \right)}{(2\pi)^{d/2} \sqrt{\det(C)}}, \quad x \in \mathbb{R}^d. \quad (3)$$
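For readers who want to experiment with the G-KDE of Eqs. (2)-(3), the mixture density can be evaluated directly. Below is a minimal NumPy sketch; the function name `gaussian_kde_pdf` is ours, not from the paper:

```python
import numpy as np

def gaussian_kde_pdf(x, samples, H):
    """G-KDE estimate of Eq. (2): the average of N Gaussian PDFs
    phi(x; X(w_n), H), each centred at one available realization."""
    N, d = samples.shape
    Hinv = np.linalg.inv(H)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(H))
    diffs = x - samples                                   # (N, d): x - X(w_n)
    quad = np.einsum('ni,ij,nj->n', diffs, Hinv, diffs)   # Mahalanobis terms
    return np.exp(-0.5 * quad).sum() / (N * norm)
```

With a single sample at the origin and $H = I_2$, this reduces to the standard bivariate Gaussian density, $1/(2\pi)$ at $x = 0$.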
By construction, the matrix $H$ in Eq. (2) characterizes the local contribution of each realization of $X$. Thus, its value has to be optimized to minimize the difference between $p_X$, which is unknown, and $\widehat{p}_X(\cdot; H, S(N))$. The mean integrated squared error (MISE) performance criterion

$$\text{MISE}(H; d, N) = E\left[ \int_{\mathbb{R}^d} \left( p_X(x) - \widehat{p}_X(x; H, S(N)) \right)^2 dx \right] \quad (4)$$

is generally considered to quantify such a difference. Here $E[\cdot]$ is the mathematical expectation. For this criterion, it can be noticed that the set $S(N)$ is random, whereas in the rest of this paper it is deterministic. Given sufficient regularity conditions on $p_X$, an asymptotic approximation of this criterion can be derived. In low dimension, the value of $H$ that minimizes this asymptotic criterion can be explicitly calculated, but its value depends on the unknown PDF $p_X$ and its derivatives (see [10] for more details). Studies have therefore been conducted to estimate these functions (generally iteratively) from the only available information given by $S(N)$ (see for instance [11, 5]). However, the convergence of these methods is rather slow in high dimension, such that in practice, a widely used value for $H$ is given by the Silverman bandwidth matrix

$$H_{\text{Silv}}(d, N) := (h_{\text{Silv}}(d, N))^2 \, \mathrm{diag}\left( \widehat\sigma_1^2, \ldots, \widehat\sigma_d^2 \right), \quad (5)$$

where for all $1 \leq i \leq d$, $\widehat\sigma_i^2$ is the empirical estimation of the variance of $X_i$, and

$$h_{\text{Silv}}(d, N) := \left( \frac{1}{N} \, \frac{4}{d+2} \right)^{\frac{1}{d+4}}. \quad (6)$$
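Eqs. (5)-(6) translate into a few lines of code. A sketch (NumPy again; `silverman_bandwidth_matrix` is our own name):

```python
import numpy as np

def silverman_bandwidth_matrix(samples):
    """Silverman bandwidth matrix of Eqs. (5)-(6):
    H_Silv = h_Silv^2 * diag(var(X_1), ..., var(X_d))."""
    N, d = samples.shape
    h_silv = (4.0 / ((d + 2) * N)) ** (1.0 / (d + 4))       # Eq. (6)
    sigma2 = samples.var(axis=0, ddof=1)                    # empirical variances
    return h_silv ** 2 * np.diag(sigma2)
```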
This expression, which is derived from a Gaussian assumption on $p_X$, is thought to be a good compromise between complexity and precision. However, it is generally observed that, for fixed values of $N$, when the distribution of $X$ is concentrated on an unknown subset of $\mathbb{R}^d$, the more complex and disconnected this subset, the less relevant the value of $H_{\text{Silv}}(d, N)$. To face this problem, the diffusion maps theory [14] can be used to bias the generation of independent realizations under $\widehat{p}_X(\cdot; H_{\text{Silv}}(d, N), S(N))$ and make them closer to the ones we could have got if they had been generated under the true PDF $p_X$. Indeed, diffusion maps are a very powerful mathematical tool to discover and characterize sets on which the distribution of $X$ is concentrated, and their coupling to nonparametric statistical representations has shown promising results, even when dealing with very high values of $d$ [2].

From another point of view, the likelihood $L(S(N)|H)$ associated with $H$ can also directly be used to identify relevant values of $H$. From Eq. (1), it follows that

$$L(S(N)|H) := \prod_{n=1}^{N} \widehat{p}_X(X(\omega_n); H, S(N)) = \frac{1}{N^N} \prod_{n=1}^{N} \sum_{m=1}^{N} \phi_{n,m}(H), \quad (7)$$

$$\phi_{n,m}(H) := \phi\left( X(\omega_n); X(\omega_m), H \right), \quad 1 \leq n, m \leq N. \quad (8)$$
The function $L(S(N)|H)$ uses the same information twice (to compute $\widehat{p}_X(\cdot; H, S(N))$ and to evaluate it). Hence, it tends to infinity when $H$ tends to zero, which can be seen as an overfitting of the available data. In order to avoid this phenomenon, it is proposed in [15] to consider its "Leave-One-Out" (LOO) expression

$$L_{\text{LOO}}(S(N)|H) := \prod_{n=1}^{N} \frac{1}{N-1} \sum_{m=1, m \neq n}^{N} \phi_{n,m}(H) \quad (9)$$

instead. Given this approximate likelihood obtained from an LOO cross-validation, and an a priori density $p_H$ for $H$, Bayesian approaches can be used, based on the posterior density

$$p_H(H|S(N)) := c \, L_{\text{LOO}}(S(N)|H) \, p_H(H), \quad H \in \mathcal{M}_+(d). \quad (10)$$

Here, $c$ is a normalizing constant and $\mathcal{M}_+(d)$ is the set of all $(d \times d)$-dimensional symmetric positive definite matrices. In particular, the maximum likelihood estimate of $H$ is denoted by

$$H_{\text{MLE}}(d, N) := \arg\max_{H \in \mathcal{M}_+(d)} L_{\text{LOO}}(S(N)|H). \quad (11)$$
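The LOO likelihood of Eq. (9) and a crude approximation of Eq. (11) can be sketched as follows. The sketch assumes the scalar parametrization $H = h^2 I_d$ (which the paper adopts later, in Section 3.2), and a grid search stands in for the continuous optimization over $\mathcal{M}_+(d)$; all function names are ours:

```python
import numpy as np

def loo_log_likelihood(samples, h):
    """log of Eq. (9) for H = h^2 I_d: each point X(w_n) is evaluated under
    the mixture built from the N-1 other points."""
    N, d = samples.shape
    sq = ((samples[:, None, :] - samples[None, :, :]) ** 2).sum(-1)
    log_phi = -0.5 * sq / h**2 - d * np.log(h) - 0.5 * d * np.log(2 * np.pi)
    np.fill_diagonal(log_phi, -np.inf)        # exclude the m = n term (LOO)
    m = log_phi.max(axis=1, keepdims=True)    # log-sum-exp for stability
    log_sum = m[:, 0] + np.log(np.exp(log_phi - m).sum(axis=1))
    return (log_sum - np.log(N - 1)).sum()

def h_mle(samples, grid):
    """Crude grid approximation of Eq. (11) restricted to H = h^2 I_d."""
    lls = [loo_log_likelihood(samples, h) for h in grid]
    return grid[int(np.argmax(lls))]
```

In the paper, the scalar maximization is performed with a golden section / parabolic interpolation search rather than a grid; the grid version keeps the sketch short.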
Additionally, considering that the best available approximations of the true mean and covariance matrix of $X$ are given by their empirical estimations

$$\widehat\mu_X := \frac{1}{N} \sum_{n=1}^{N} X(\omega_n), \quad \widehat{R}_X := \frac{1}{N-1} \sum_{n=1}^{N} (X(\omega_n) - \widehat\mu_X) \otimes (X(\omega_n) - \widehat\mu_X),$$

the expression given by Eq. (1) can be slightly modified to ensure that the mean and the covariance matrix of the G-KDE approximation of $X$ are equal to these estimations. Following [13], this can be done by considering the subsequent proposition. The proof is given in Appendix.

Proposition 1. If the PDF of $\widetilde{X}$ is equal to

$$\widetilde{p}_X(\cdot; H, S(N)) := \frac{1}{N} \sum_{n=1}^{N} \phi\left( \cdot ; A X(\omega_n) + \beta, H \right), \quad (12)$$

$$\beta := (I_d - A) \widehat\mu_X, \quad H := \widehat{R}_X - \frac{N-1}{N} A \widehat{R}_X A^T, \quad (13)$$

where $A$ is any $(d \times d)$-dimensional matrix such that $H$ is positive definite, then the mean and the covariance matrix of $\widetilde{X}$ are equal to $\widehat\mu_X$ and $\widehat{R}_X$ respectively.
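Proposition 1 can be checked numerically: in closed form, the mixture mean is $A\widehat\mu_X + \beta = \widehat\mu_X$, and the mixture covariance is $H$ plus the spread of the kernel centres, $A \frac{N-1}{N}\widehat{R}_X A^T$, which sums to $\widehat{R}_X$ by Eq. (13). A sketch (function name ours):

```python
import numpy as np

def constrained_mixture_moments(samples, A):
    """Closed-form mean and covariance of the mixture of Eqs. (12)-(13).
    Mixture mean = mean of the shifted centres A X(w_n) + beta.
    Mixture covariance = H + spread of the centres (law of total covariance)."""
    N, d = samples.shape
    mu = samples.mean(axis=0)
    R = np.cov(samples, rowvar=False, ddof=1)
    beta = (np.eye(d) - A) @ mu                       # Eq. (13)
    H = R - (N - 1) / N * A @ R @ A.T                 # Eq. (13)
    centres = samples @ A.T + beta                    # kernel centres A X(w_n) + beta
    S = np.cov(centres, rowvar=False, ddof=0)         # 1/N-normalized spread
    return centres.mean(axis=0), H + S, H
```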
Given $S(N)$, the G-KDE of the PDF of $X$ under constraints on its mean and its covariance matrix is denoted by $\widetilde{p}_X(\cdot; H_{\text{MLE}}(d, N), S(N))$. Here, $H_{\text{MLE}}(d, N)$ is the argument that maximizes the LOO likelihood of $H$ associated with $\widetilde{p}_X$.

Given $\widehat\mu_X$, $\widehat{R}_X$ and $H_{\text{MLE}}(d, N)$, the generation of independent realizations of $\widetilde{X} \sim \widetilde{p}_X(\cdot; H_{\text{MLE}}(d, N), S(N))$ is straightforward. Indeed, for any $M \geq 1$, Algorithm 1 (defined below) can be used to generate a $(d \times M)$-dimensional matrix $Z$, whose columns are independent realizations of $\widetilde{X}$. There, $\mathcal{U}\{1, \ldots, N\}$ denotes the discrete uniform distribution over $\{1, \ldots, N\}$ and $\mathcal{N}(0, 1)$ denotes the standard Gaussian distribution.

1. Let $Q(\omega'_1), \ldots, Q(\omega'_M)$ be $M$ independent realizations that are drawn from $\mathcal{U}\{1, \ldots, N\}$;
2. Let $\mathbf{M}$ be a $(d \times M)$-dimensional matrix whose columns are all equal to $\widehat\mu_X$;
3. Compute $A$ such that $H_{\text{MLE}}(d, N) = \widehat{R}_X - \frac{N-1}{N} A \widehat{R}_X A^T$;
4. Define $\bar{X} := \left[ X(\omega_{Q(\omega'_1)}) \cdots X(\omega_{Q(\omega'_M)}) \right]$;
5. Let $\Xi$ be a $(d \times M)$-dimensional matrix, whose components are $dM$ independent realizations that are drawn from $\mathcal{N}(0, 1)$;
6. Assemble $Z = \mathbf{M} + A (\bar{X} - \mathbf{M}) + H_{\text{MLE}}(d, N)^{1/2} \, \Xi$.

Algorithm 1: Generation of $M$ independent realizations of $\widetilde{X}$.
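Algorithm 1 is straightforward to implement. A hedged NumPy sketch (we use a symmetric eigendecomposition for the matrix square root $H^{1/2}$; the paper does not prescribe a particular square root, and the function name is ours):

```python
import numpy as np

def sample_constrained_gkde(samples, A, H, M, rng):
    """Sketch of Algorithm 1: draw M realizations of the constrained G-KDE.
    H is assumed to equal R_hat - (N-1)/N * A R_hat A^T (Eq. (13))."""
    N, d = samples.shape
    mu = samples.mean(axis=0)
    idx = rng.integers(0, N, size=M)           # step 1: uniform kernel indices
    Mmat = np.tile(mu[:, None], (1, M))        # step 2: columns all equal to mu_hat
    Xbar = samples[idx].T                      # step 4: (d, M) selected centres
    Xi = rng.standard_normal((d, M))           # step 5: i.i.d. N(0, 1) entries
    w, V = np.linalg.eigh(H)                   # symmetric square root of H
    Hsqrt = V @ np.diag(np.sqrt(np.maximum(w, 0.0))) @ V.T
    return Mmat + A @ (Xbar - Mmat) + Hsqrt @ Xi   # step 6
```

By Proposition 1, the sample mean and covariance of the generated columns should match the empirical estimations up to Monte Carlo error.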
Finally, this section has presented the general framework to nonparametrically approximate the PDF of a random vector when the maximal information is a set of $N$ independent realizations. Some adjustments of the classical formulation have been proposed to take into account constraints on the first and second statistical moments of the approximated PDF, and it has been proposed to search the kernel density bandwidth as the solution of a computationally demanding LOO likelihood maximization problem.

However, from the analysis of a series of test cases, it appears that $\widehat{R}_X$ is a rather good approximation of $H_{\text{MLE}}(d, N)$ for the nonparametric modelling of high-dimensional random vectors ($d \sim 10-100$) with limited information ($N \sim 10d$ for instance). From Eqs. (12) and (13), this means that we are approximating the PDF of $X$ as a unique Gaussian PDF, whose parameters correspond to the empirical mean and covariance matrix of $X$:

$$\lim_{H \to \widehat{R}_X} \widetilde{p}_X(\cdot; H, S(N)) = \phi(\cdot; \widehat\mu_X, \widehat{R}_X). \quad (14)$$

This could prevent us from recovering the subset of $\mathbb{R}^d$ on which the distribution of $X$ is concentrated, which would require choosing smaller values for the components of $H$ in the nonparametric model. If all the components of $X$ are actually dependent, there is however no reason to do so without biasing the final constructed distribution by focusing too much on the available data. Thus, instead of artificially decreasing the most likely value of $H$ (according to the available data), the next section proposes several adaptations of this G-KDE formalism.
3. Data-driven tensor-product representation

This section presents some adaptations of the classical G-KDE to improve the nonparametric representations of $p_X$ when the number $N$ of available realizations of $X$ is relatively small compared to its dimension $d$. Following [16] and [17], we first suggest pre-processing the realizations of $X$ (from a Principal Component Analysis for instance), such that $X$ can now be supposed centred and uncorrelated:

$$\widehat\mu_X = 0, \quad \widehat{R}_X = I_d.$$

Here, $I_d$ is the $(d \times d)$-dimensional identity matrix. This makes independent the components of $X$ that were only linearly dependent. Then, the idea is to identify groups of components of $X$ that can reasonably be considered as statistically independent, if they exist. Instead of using statistical tests, we propose to search these groups by looking for the maximum of a cross-validation likelihood quantity that is associated with each block formation. Thus, given a block by block decomposition of the components of $X$, the PDF $p_X$ is approximated as the product of the nonparametric estimations of the PDFs associated with each sub-vector of $X$. For instance, if the $d$ components of $X$ are sorted in $d$ distinct groups, the approximation of $p_X$ corresponds to the product of the $d$ nonparametric estimations of the marginal PDFs of $X$. Indeed, if the identified block decomposition is correctly adapted to the (unknown) dependence structure of $X$, there are good chances for the nonparametric representation of $p_X$ to be improved.

More details about this block decomposition are presented in the rest of this section. First, we introduce the notations and the formalism on which this decomposition is based. Then, several algorithms are proposed for its practical identification.
For any $b$ in $\{1, \ldots, d\}^d$ and for all $1 \leq i \leq d$, $b_i$ can be used as a block index for the $i$-th component $X_i$ of $X$. This means that if $b_i = b_j$, $X_i$ and $X_j$ are supposed to be dependent and have to belong to the same block. On the contrary, if $b_i \neq b_j$, $X_i$ and $X_j$ are supposed to be independent and they can belong to two different blocks. In order to avoid any redundancy in the block by block parametrization of $X$, the following subset of $\{1, \ldots, d\}^d$ is considered:

$$B(d) := \left\{ b \in \{1, \ldots, d\}^d \;\middle|\; b_1 = 1, \; 1 \leq b_j \leq 1 + \max_{1 \leq i \leq j-1} b_i, \; 2 \leq j \leq d \right\}. \quad (15)$$

Additionally, for any $b$ in $B(d)$, let

- $\text{Max}(b)$ be the maximal value of $b$,
- $s^{(\ell)}(X; b)$ be the random vector that gathers all the components of $X$ with a block index equal to $\ell$,
- $d_\ell$ be the number of elements of $b$ that are equal to $\ell$,
- $S_\ell(N)$ be the set that gathers the $N$ independent realizations of $s^{(\ell)}(X; b)$ that have been extracted from the $N$ independent realizations of $X$ in $S(N)$.

There exists a bijection between $B(d)$ and the set of all block by block decompositions of $X$. For instance, for $d = 5$, all the elements of $\{(i, j, i, k, k), \; 1 \leq i \neq j \neq k \leq 5\}$ correspond to the same block decomposition of $X$, but only $b = (1, 2, 1, 3, 3)$ is in $B(d)$. We can also identify

$$s^{(1)}(X; b) = (X_1, X_3), \quad s^{(2)}(X; b) = X_2, \quad s^{(3)}(X; b) = (X_4, X_5), \quad (16)$$

$$\text{Max}(b) = 3, \quad d_1 = 2, \quad d_2 = 1, \quad d_3 = 2. \quad (17)$$
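The canonical representative of Eq. (15) is obtained by relabelling blocks in order of first appearance; for instance, $(2, 5, 2, 7, 7)$ maps to the paper's example $b = (1, 2, 1, 3, 3)$. A small illustrative sketch (helper names ours):

```python
import numpy as np

def canonical_block_vector(b):
    """Map any block labelling to its unique representative in B(d), Eq. (15):
    blocks are renumbered in order of first appearance, so b_1 = 1 and each
    new label exceeds the running maximum by exactly one."""
    relabel, out = {}, []
    for label in b:
        if label not in relabel:
            relabel[label] = len(relabel) + 1
        out.append(relabel[label])
    return tuple(out)

def sub_vectors(x, b):
    """Split x into the sub-vectors s^(l)(x; b), l = 1, ..., Max(b)."""
    b = np.asarray(b)
    return [np.asarray(x)[b == l] for l in range(1, b.max() + 1)]
```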
According to Eq. (12), for any $H_\ell$ in $\mathcal{M}_+(d_\ell)$, the PDF of $s^{(\ell)}(X; b)$ can be approximated by $\widetilde{p}_{s^{(\ell)}(X; b)}(\cdot; H_\ell, S_\ell(N))$. It follows that the PDF of $X$ can be approximated by

$$\widetilde{p}_X(x; H_1, \ldots, H_{\text{Max}(b)}, S(N), b) := \prod_{\ell=1}^{\text{Max}(b)} \widetilde{p}_{s^{(\ell)}(X; b)}\left( s^{(\ell)}(x; b); H_\ell, S_\ell(N) \right). \quad (18)$$

Such a construction for the PDF of $X$ means that the vectors $s^{(\ell)}(X; b)$, $1 \leq \ell \leq \text{Max}(b)$, are assumed to be independent. For any $b$ in $B(d)$, let $H^{\text{MLE}}_1(b), \ldots, H^{\text{MLE}}_d(b)$ be the arguments that maximize the LOO likelihood associated with $\widetilde{p}_X$. Hence, for a given block by block decomposition of $X$ that is characterized by a given value of $b$, the most likely G-KDE of $p_X$ is given by

$$\widetilde{p}_X(x; H^{\text{MLE}}_1(b), \ldots, H^{\text{MLE}}_d(b), S(N), b). \quad (19)$$

Using Eqs. (9), (12) and (18), for any $b$ in $B(d)$ and any $(H_1, \ldots, H_{\text{Max}(b)})$ in $\mathcal{M}_+(d_1) \times \cdots \times \mathcal{M}_+(d_{\text{Max}(b)})$, this LOO likelihood is given by

$$L_{\text{LOO}}(S(N)|H_1, \ldots, H_d, b) = \prod_{\ell=1}^{\text{Max}(b)} \prod_{n=1}^{N} \frac{1}{N-1} \sum_{m=1, m \neq n}^{N} \widetilde\phi_{n,m}(H_\ell, b), \quad (20)$$

$$\widetilde\phi_{n,m}(H_\ell, b) := \phi\left( s^{(\ell)}(X(\omega_n); b); \; A_\ell \, s^{(\ell)}(X(\omega_m); b), \; H_\ell \right), \quad (21)$$

$$H_\ell := I_{d_\ell} - \frac{N-1}{N} A_\ell A_\ell^T. \quad (22)$$
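Eqs. (20)-(22) can be evaluated block by block. The sketch below assumes the scalar parametrization $H_\ell = h_\ell^2 I_{d_\ell}$ of Section 3.2, so that $A_\ell$ follows from Eq. (26), and it assumes the data have already been centred and uncorrelated; the function name is ours:

```python
import numpy as np

def block_loo_loglik(samples, b, h):
    """log of Eq. (20) with H_l = h_l^2 I_{d_l} and A_l from Eq. (26).
    `h` maps each block label l to its scalar bandwidth h_l (0 < h_l < 1)."""
    samples = np.asarray(samples)
    N = samples.shape[0]
    b = np.asarray(b)
    total = 0.0
    for l in range(1, b.max() + 1):
        S = samples[:, b == l]                       # realizations of s^(l)(X; b)
        d_l, h_l = S.shape[1], h[l]
        a = np.sqrt(N / (N - 1) * (1.0 - h_l**2))    # scalar A_l, Eq. (26)
        diff = S[:, None, :] - a * S[None, :, :]     # s_n - A_l s_m
        log_phi = (-0.5 * (diff**2).sum(-1) / h_l**2
                   - d_l * np.log(h_l) - 0.5 * d_l * np.log(2 * np.pi))
        np.fill_diagonal(log_phi, -np.inf)           # LOO: drop the m = n term
        m = log_phi.max(axis=1)
        total += (m + np.log(np.exp(log_phi - m[:, None]).sum(axis=1))
                  - np.log(N - 1)).sum()
    return total
```

By construction, the likelihood tensorizes: splitting components into separate blocks gives exactly the sum of the per-block log-likelihoods.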
Noticing that

$$\max_{H_1, \ldots, H_{\text{Max}(b)}, b} \; \prod_{\ell=1}^{\text{Max}(b)} \prod_{n=1}^{N} \frac{1}{N-1} \sum_{m=1, m \neq n}^{N} \widetilde\phi_{n,m}(H_\ell, b) = \max_b \; \prod_{\ell=1}^{\text{Max}(b)} \; \max_{H_\ell} \; \prod_{n=1}^{N} \frac{1}{N-1} \sum_{m=1, m \neq n}^{N} \widetilde\phi_{n,m}(H_\ell, b), \quad (23)$$

it follows that for a given block by block decomposition of $X$, the most likely values of $H_1, \ldots, H_{\text{Max}(b)}$ can be computed independently, and saved for a possible re-use for another value of $b$. Indeed, if $b^{(1)} = (1, 1, 2, 2)$, two values $H^{(1)}_1$ and $H^{(1)}_2$ have to be chosen for the bandwidth matrices (one for each block). This means that two independent LOO likelihood maximization problems have to be solved. In the same manner, if $b^{(2)} = (1, 1, 2, 3)$, three values $H^{(2)}_1$, $H^{(2)}_2$ and $H^{(2)}_3$ have to be chosen. However, given the same set of realizations of $X$, it is clear that the most likely value of $H^{(1)}_1$ is equal to the most likely value of $H^{(2)}_1$. Hence, the most likely value of $b$, which is denoted by $b_{\text{MLE}}$, is eventually the solution of

$$b_{\text{MLE}} := \arg\max_{b \in B(d)} L_{\text{LOO}}\left( S(N) | H^{\text{MLE}}_1(b), \ldots, H^{\text{MLE}}_d(b), b \right). \quad (24)$$
There, we remind that for any $b$ in $B(d)$ and any $1 \leq \ell \leq \text{Max}(b)$,

$$H^{\text{MLE}}_\ell(b) := \arg\max_{H_\ell \in \mathcal{M}_+(d_\ell)} \; \prod_{n=1}^{N} \frac{1}{N-1} \sum_{m=1, m \neq n}^{N} \widetilde\phi_{n,m}(H_\ell, b). \quad (25)$$

Analyzing the value of $b_{\text{MLE}}$ can give information on the actual dependence structure of the components of $X$. Indeed, if $b_{\text{MLE}} = (1, \ldots, 1)$, the most appropriate representation for the PDF of $X$ is its classical multidimensional Gaussian kernel estimation. This would mean that all the components of $X$ are likely to be dependent. On the contrary, if $b_{\text{MLE}} = (1, 2, \ldots, d)$, the most likely representation corresponds to the assumption that all the components of $X$ are independent. Other values of $b_{\text{MLE}}$ can also be used to identify groups of dependent components of $X$, these groups being likely to be independent of one another.
3.2. Practical solving of the block by block decomposition problem

The optimization problem defined by Eq. (24) being very complex, we suggest searching the most likely block by block decomposition of $X$ using very simple parametrizations of the bandwidth matrices. Indeed, once vector $X$ has been centred and uncorrelated, it is reasonable to parametrize each bandwidth matrix $H_\ell$ by a unique scalar $h_\ell$, such that $H_\ell = h_\ell^2 I_{d_\ell}$. From Eq. (22), it follows that

$$A_\ell = \sqrt{ \frac{N}{N-1} \left( 1 - h_\ell^2 \right) } \; I_{d_\ell}. \quad (26)$$

Hence, for a given precision $\epsilon$, the complex problem of searching the most likely values of $H_1, \ldots, H_{\text{Max}(b)}$ can be reduced to minimizing $\text{Max}(b)$ one-dimensional functions in parallel, and each minimization problem can be solved very efficiently using a combination of golden section search and successive parabolic interpolations (see [1] for further details about this method). However, solving the optimization problem defined by Eq. (24) can still be computationally demanding when $d$ increases. Indeed, as can be seen in Table 1, the number of admissible values of $b$, which is denoted by $N_{B(d)}$, increases exponentially with respect to $d$. Hence, a brute force approach, which would consist in testing all the possible values of $b$, cannot be used to identify $b_{\text{MLE}}$.

value of $d$:                     1   2   3   4    5    6     7     8      9      10
$N_{B(d)}$:                       1   2   5   15   52   203   877   4140   21147  115975
$N^{\max}_{\text{greedy}}(d)$:    1   3   8   17   31   51    78    113    157    211

Table 1: Evolution of $N_{B(d)}$ and $N^{\max}_{\text{greedy}}(d)$ with respect to $d$.
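The number of admissible block vectors $N_{B(d)}$ counts the partitions of $\{1, \ldots, d\}$, i.e. the Bell numbers, which can be checked against Table 1 with the classical recurrence $B_{n+1} = \sum_k \binom{n}{k} B_k$. A small verification sketch (function names ours):

```python
from math import comb

def n_block_decompositions(d):
    """N_B(d): number of elements of B(d), i.e. the d-th Bell number."""
    bell = [1]                                          # B_0 = 1
    for n in range(d):
        bell.append(sum(comb(n, k) * bell[k] for k in range(n + 1)))
    return bell[d]

def n_greedy_max(d):
    """Upper bound of Eq. (27) on the number of likelihood evaluations
    of the greedy search: 1 + sum_{i=0}^{d-2} (d - i)(i + 1)."""
    return 1 + sum((d - i) * (i + 1) for i in range(d - 1))
```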
As an alternative, we propose to consider a greedy algorithm, whose computational cost can be bounded. Starting from a configuration where all the components of $X$ are in the same block, which corresponds to $b = (1, \ldots, 1)$, the idea of this algorithm is to remove iteratively one element of this initial block, and to put it in a block that has already been built, or in a new block where it is the only element. Algorithm 2 provides a more detailed description of this procedure. By construction, the number $N_{\text{greedy}}(d)$ of evaluations of $b \mapsto \max_h L_{\text{LOO}}(S(N)|b, h)$ verifies

$$N_{\text{greedy}}(d) \leq N^{\max}_{\text{greedy}}(d) := 1 + \sum_{i=0}^{d-2} (d - i)(i + 1) \leq d^3. \quad (27)$$

For $d > 4$, such an algorithm can therefore be used to approximate $b_{\text{MLE}}$ at a computational cost that is much more affordable than a direct identification based on $N_{B(d)}$ evaluations of $b \mapsto \max_h L_{\text{LOO}}(S(N)|b, h)$. When modelling high-dimensional random vectors ($d \sim 50-100$), the value of $N^{\max}_{\text{greedy}}(d)$, which is definitely much smaller than $N_{B(d)}$, can also become very high:

$$N^{\max}_{\text{greedy}}(d = 50) = 22051, \quad N^{\max}_{\text{greedy}}(d = 100) = 171601. \quad (28)$$
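The greedy procedure described above (formalized in Algorithm 2) can be prototyped with a generic `score` callable standing in for $b \mapsto \max_h L_{\text{LOO}}(S(N)|b, h)$. This is a simplified, hypothetical rendition, not the authors' exact implementation:

```python
def greedy_block_search(d, score):
    """Sketch of Algorithm 2: iteratively move one not-yet-blocked component
    to an existing block or to a new one, keeping the move that maximizes
    `score`, and return the best block vector seen over all iterations."""
    b_star = [1] * d
    blocked = set()
    best_b, best_score = tuple(b_star), score(tuple(b_star))
    for _ in range(d):
        cand = []
        for i in range(d):
            if i in blocked:
                continue
            for j in range(2, min(d, max(b_star) + 1) + 1):
                b_tmp = list(b_star)
                b_tmp[i] = j                        # move component i to block j
                cand.append((score(tuple(b_tmp)), tuple(b_tmp), i))
        if not cand:
            break                                    # every component is blocked
        s, b_new, i_new = max(cand)
        b_star, blocked = list(b_new), blocked | {i_new}
        if s > best_score:                           # keep the overall best
            best_score, best_b = s, b_new
    return best_b
```

With an artificial score rewarding the partition {X1, X2 together, X3 apart}, the search recovers that structure.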
1. Initialization: $b^* = (1, \ldots, 1)$, ind.blocked $= \emptyset$;
2. for $k = 1 : d$ do
3.   $L^{(k)} = \emptyset$, $b^{(k)} = \emptyset$, index$^{(k)} = \emptyset$, $\ell = 1$;
4.   for $i \in \{1, \ldots, d\} \setminus$ ind.blocked do
5.     for $j = 2 : \min(d, \text{Max}(b^*) + 1)$ do
6.       Adapt the value of the block index: $b^{\text{temp}} := b^*$, $b^{\text{temp}}_i = j$;
7.       Compute: $L^{\text{temp}} = \max_h L_{\text{LOO}}(S(N)|b^{\text{temp}}, h)$;
8.       Save results: $L^{(k)}\{\ell\} = L^{\text{temp}}$, $b^{(k)}\{\ell\} = b^{\text{temp}}$, index$^{(k)}\{\ell\} = i$;
9.       Increment: $\ell \leftarrow \ell + 1$;
10.    end
11.  end
12.  Find the best block index at iteration $k$: $\ell^* = \arg\max_\ell L^{(k)}\{\ell\}$;
13.  Actualize: $b^* \leftarrow b^{(k)}\{\ell^*\}$, ind.blocked $\leftarrow$ ind.blocked $\cup$ index$^{(k)}\{\ell^*\}$;
14. end
15. Maximize over all iterations: $(\ell^{\text{greedy}}, k^{\text{greedy}}) := \arg\max_{\ell, k} L^{(k)}\{\ell\}$;
16. Approximate $b_{\text{MLE}} \approx b^{(k^{\text{greedy}})}\{\ell^{\text{greedy}}\}$.

Algorithm 2: Greedy search of $b_{\text{MLE}}$.

To identify relevant values of $b$ at a lower computational cost in such high-dimensional configurations, an adaptation of genetic
algorithms to the case of the identification of the most likely block by block decomposition of $X$ is proposed. The fusion and the mutation processes on which such algorithms are generally based, as well as a pseudo-projection onto $B(d)$, are detailed in Appendix. In these algorithms, for any set $S$ (which can be discrete or continuous), we denote by $\mathcal{U}(S)$ the uniform distribution over $S$. Based on these three functions, Algorithm 3 shows the genetic procedure we suggest for solving Eq. (24). The results given by this genetic algorithm depend on three parameters:
- the maximum number of iterations $i_{\max}$,
- the probability of mutation $p_{\text{Mut}}$,
- the size of the population considered in the genetic algorithm, $N_{\text{pop}}$.

For this algorithm, the number of evaluations of $b \mapsto \max_h L_{\text{LOO}}(S(N)|b, h)$ is equal to $N_{\text{tot}} = i_{\max} \times N_{\text{pop}}$. For a given value of $N_{\text{tot}}$, it is however hard to infer the optimal values of these three parameters, as they depend on $d$ and on the optimal block-by-block structure of the considered random vector of interest. However, from the analysis of a series of numerical examples, it is generally interesting to choose small values of $p_{\text{Mut}}$ to limit the number of spontaneous mutations, and to favour high values of the number of iterations $i_{\max}$ rather than of the population size $N_{\text{pop}}$.
Once a satisfying value $\widehat{b}_{\text{MLE}}$ of $b$ has been identified using the scalar parametrization of the bandwidth matrices, it is possible to enrich the parametrization of the bandwidth matrices to improve the nonparametric representation of the PDF of $X$. This amounts to solving

$$H^{\text{MLE}}_\ell(\widehat{b}_{\text{MLE}}) = \arg\max_{H_\ell \in \mathcal{M}_+(d_\ell)} \; \prod_{n=1}^{N} \frac{1}{N-1} \sum_{m=1, m \neq n}^{N} \widetilde\phi_{n,m}(H_\ell, \widehat{b}_{\text{MLE}}) \quad (29)$$

for all $1 \leq \ell \leq \text{Max}(\widehat{b}_{\text{MLE}})$. In practice, the interest of such an enrichment of the bandwidth matrix was assessed on a series of test cases.
1. Choose $N_{\text{pop}} \geq 2$, $0 \leq p_{\text{Mut}} \leq 1$ and $i_{\max} \geq 1$;
2. Initialization:
3. Define $B = \emptyset$, $L = \emptyset$, in $= 1$;
4. Choose at random $N_{\text{pop}}$ elements of $B(d)$: $\{b^{(1)}, \ldots, b^{(N_{\text{pop}})}\}$;
5. for $n = 1 : N_{\text{pop}}$ do
6.   Compute: $L^{\text{temp}} = \max_h L_{\text{LOO}}(S(N)|b^{(n)}, h)$;
7.   Save results: $L\{\text{in}\} = L^{\text{temp}}$, $B\{\text{in}\} = b^{(n)}$, in $=$ in $+ 1$;
8. end
9. Iteration:
10. for $i = 2 : i_{\max}$ do
11.   Gather in $S$ the $N_{\text{pop}}$ elements of $B$ associated with the $N_{\text{pop}}$ highest values of $L$;
12.   Choose at random $N_{\text{pop}}$ distinct pairs of elements of $S$: $\{(b^{(n,1)}, b^{(n,2)}), \; 1 \leq n \leq N_{\text{pop}}\}$;
13.   for $n = 1 : N_{\text{pop}}$ do
14.     Fusion: $b^{\text{Fus}} = \text{Fusion}(b^{(n,1)}, b^{(n,2)})$;
15.     Mutation: $b^{\text{Mut}} = \text{Mutation}(b^{\text{Fus}}, p_{\text{Mut}})$;
16.     Compute: $L^{\text{temp}} = \max_h L_{\text{LOO}}(S(N)|b^{\text{Mut}}, h)$;
17.     Save results: $L\{\text{in}\} = L^{\text{temp}}$, $B\{\text{in}\} = b^{\text{Mut}}$, in $=$ in $+ 1$;
18.   end
19. end
20. Maximize over all iterations: $k^{\text{gene}} = \arg\max_{1 \leq k \leq \text{in}-1} L\{k\}$;
21. Approximate $b_{\text{MLE}} \approx B\{k^{\text{gene}}\}$.

Algorithm 3: Genetic search of $b_{\text{MLE}}$. The functions Mutation() and Fusion() are presented in Appendix (Algorithms 4 and 5).
4. Applications

The purpose of this section is to illustrate the interest of the correlation constraints and of the tensorized formulation for the nonparametric representation of PDFs when the maximal information is a finite set of independent realizations. To this end, a series of examples is presented. The first examples are based on generated data, so that the errors can be controlled, whereas the last example presents an industrial application based on experimental data.
4.1. Monte Carlo simulation studies

4.1.1. Lemniscate function

Let $U$ be a random variable that is uniformly distributed on $[-0.85\pi, 0.85\pi]$, $\xi = (\xi_1, \xi_2)$ be a 2-dimensional random vector whose components are two independent standard Gaussian variables, and $X^L = (X^L_1, X^L_2)$ be the random vector such that

$$X^L = \left( \frac{\sin(U)}{1 + \cos(U)^2}, \; \frac{\sin(U)\cos(U)}{1 + \cos(U)^2} \right) + 0.05\,\xi. \quad (30)$$

We assume that $N = 200$ independent realizations of $X^L$ have been gathered in $S(N)$. Given this information, we would like to generate additional points that could sensibly be considered as new independent realizations of $X^L$. Based on the G-KDE formalism presented in Section 2, four kinds of generators are compared in Figure 1, depending on the value of the bandwidth and on the constraints on the statistical moments of $X^L$.
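Generating the synthetic data of Eq. (30) takes one line per coordinate. A reproducible sketch (function name ours):

```python
import numpy as np

def lemniscate_samples(N, rng):
    """Draw N independent realizations of X^L, Eq. (30):
    a noisy lemniscate parametrized by U ~ U[-0.85 pi, 0.85 pi]."""
    U = rng.uniform(-0.85 * np.pi, 0.85 * np.pi, size=N)
    xi = rng.standard_normal((N, 2))                  # independent N(0, 1) noise
    x1 = np.sin(U) / (1.0 + np.cos(U) ** 2)
    x2 = np.sin(U) * np.cos(U) / (1.0 + np.cos(U) ** 2)
    return np.column_stack([x1, x2]) + 0.05 * xi
```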
- Case 1: $p_{X^L}$ is approximated by $\widehat{p}_{X^L}(\cdot; (h_{\text{Silv}}(d, N))^2 I_d, S(N))$, which is defined by Eq. (1) (no constraints).
- Case 2: $p_{X^L}$ is approximated by $\widetilde{p}_{X^L}(\cdot; (h_{\text{Silv}}(d, N))^2 I_d, S(N))$, which is defined by Eq. (12) (constraints on the mean and the covariance).
- Case 3: $p_{X^L}$ is approximated by $\widehat{p}_{X^L}(\cdot; (h_{\text{MLE}}(d, N))^2 I_d, S(N))$ (no constraints).
- Case 4: $p_{X^L}$ is approximated by $\widetilde{p}_{X^L}(\cdot; (h_{\text{MLE}}(d, N))^2 I_d, S(N))$ (constraints on the mean and the covariance).
The relevance of the different approximations of $p_{X^L}$ can be analysed from a graphical point of view in Figure 1. It is also instructive to compare the associated values of the LOO likelihood, which is denoted by $L_{\text{LOO}}(S(N)|H)$, as the higher this value, the more likely the approximation. Hence, for this example, introducing constraints on the mean and the covariance of the G-KDE tends to slightly increase the values of $L_{\text{LOO}}(S(N)|H)$. Moreover, these results are strongly improved when choosing $h_{\text{MLE}}(d, N)$ instead of $h_{\text{Silv}}(d, N)$. Then, for these four cases, Figure 2 compares the evolution of $h_{\text{Silv}}(d, N)$ and $h_{\text{MLE}}(d, N)$ with respect to $N$, and shows the associated values of the LOO likelihood. For this example, it can therefore be seen that $h_{\text{Silv}}(d, N)$ strongly overestimates the scattering of the distribution of $X^L$, for any considered value of $N$. This is not the case when working with $h_{\text{MLE}}(d, N)$. It is also interesting to notice that for values of $N$ lower than $10^4$ (which is very high for 2-dimensional cases), the difference between $h_{\text{MLE}}(d, N)$ and $h_{\text{Silv}}(d, N)$ remains important.

4.1.2. Four-branches clover-knot function
In the same manner as in the previous section, let $U$ be a random variable that is uniformly distributed on $[-\pi, \pi]$, $\xi = (\xi_1, \xi_2, \xi_3)$ be a 3-dimensional random vector whose components are three independent standard Gaussian variables, and $X^{FB}$ be the random vector such that

$$X^{FB} = \left( \cos(U) + 2\cos(3U), \; \sin(U) - 2\sin(3U), \; 2\sin(4U) \right) + \xi. \quad (31)$$

Once again, starting from a data set of $N = 200$ independent realizations, we would like to be able to generate additional realizations of $X^{FB}$. For this 3-dimensional case, as in the previous section, Figures 3 and 4 allow us to underline the interest of considering G-KDE representations that are constrained in terms of mean and covariance, and for which the bandwidths are optimized from the likelihood maximization point of view.
4.1.3. Interest of the block-by-block decomposition in higher dimensions

As explained in Section 3, when $d$ is high, the G-KDE of $p_X$ requires very high values of $N$ to be able to identify the manifold on which the distribution of $X$ is concentrated. In other words, if $N$ is fixed, the higher $d$, the higher $h_{\text{MLE}}(d, N)$ and the more scattered the new realizations of $X$. As an illustration of this phenomenon, let us consider the two following random vectors:
[Figure 1 about here: four scatter plots in the $(x_1, x_2)$ plane, panels (a)-(d).]

Figure 1: Lemniscate case: $N = 200$ given data points (big black squares) and $10^4$ additional realizations (small red and green points) generated from a G-KDE approach for $h = h_{\text{Silv}}(d, N)$ (first row) and $h = h_{\text{MLE}}(d, N)$ (second row). The first column corresponds to the case where no constraints on the mean and the covariance of the generated points are introduced, whereas the second column corresponds to the case where the mean and the covariance of the generated points are equal to their empirical estimations that are computed from the available data. Under each graph is shown the value of the LOO likelihood: (a) $\log(L_{\text{LOO}}(S(N)|h^2 I_d)) = -147$, (b) $-142$, (c) $-39.0$, (d) $-38.7$.
[Figure 2 about here: two curves, panels (a) "Evolution of the bandwidth" and (b) "Evolution of $L_{\text{LOO}}(S(N)|h^2 I_d)$".]

Figure 2: Evolution of the bandwidth (left) and of the LOO-likelihood (right) with respect to $N$ for the Lemniscate function (2D). The red dotted lines correspond to the Silverman case: $h = h_{\text{Silv}}(d, N)$. The black solid lines correspond to the MLE case: $h = h_{\text{MLE}}(d, N)$. For this 2D example, the distinctions between the cases with correlation constraints or without were negligible compared to the difference between the Silverman and the MLE cases. Hence, only the cases where correlation constraints are imposed on the G-KDE are represented. Each curve corresponds to the mean values of $h$ and $\log(L_{\text{LOO}}(S(N)|h^2 I_d))$, which have been computed from 50 independently generated 200-dimensional sets of independent realizations of $X^L$.

- Case 1: $X^{(2D)} = (X^L, \Xi_3, \ldots, \Xi_d)$.
- Case 2: $X^{(3D)} = (X^{FB}, \Xi_4, \ldots, \Xi_d)$.
Here, $\Xi_3, \ldots, \Xi_d$ denote independent standard Gaussian random variables, whereas the random vectors $X^L$ and $X^{FB}$ have been introduced in Section 4.1. For these two cases, two configurations are compared.

- On the one hand, a classical G-KDE of the PDFs of $X^{(2D)}$ and $X^{(3D)}$ is computed. In that case, no block decomposition is carried out. The block by block vectors associated with these modellings, which are respectively denoted by $b^{(2D,1)}$ and $b^{(3D,1)}$, are equal to $(1, \ldots, 1)$.
- On the other hand, we impose $b^{(2D,2)} = (1, 1, 2, \ldots, d-1)$ and $b^{(3D,2)} = (1, 1, 1, 2, \ldots, d-2)$, and we build the associated tensorized versions of the G-KDE of the PDFs of $X^{(2D)}$ and $X^{(3D)}$.
.Hen e,when noblo k de omposition is arriedout, we anverify in
Fig-ure 5 that
h
MLE
[Figure 3 panels, values of log(L_LOO(S(N)|h^2 I_d)): (a) −1.102 × 10^3; (b) −1.101 × 10^3; (c) −8.00 × 10^2; (d) −7.98 × 10^2.]
Figure 3: Four-branches clover-knot case: N = 200 given data points (big black squares) and 10^4 additional realizations (small red and green points) generated from a G-KDE approach for h = h_Silv(d, N) (first row) and h = h_MLE(d, N) (second row). The first column corresponds to the case where no constraints on the mean and the covariance of the generated points are introduced, whereas the second column corresponds to the case where the mean and the covariance of the generated points are equal to their empirical estimations that are computed from the available data. Under each graph is shown the value of the LOO log-likelihood.
Figure 4: Evolution of the bandwidth (left) and of the LOO-likelihood (right) with respect to N for the four-branches clover-knot function (3D). The red dotted lines correspond to the Silverman case: h = h_Silv(d, N). The black solid lines correspond to the MLE case: h = h_MLE(d, N). For this 3D example, the distinctions between the cases with correlation constraints or without were negligible compared to the difference between the Silverman and the MLE cases. Hence, only the cases where correlation constraints are imposed on the G-KDE are represented. Each curve corresponds to the mean values of h and log(L_LOO(S(N)|h^2 I_d)), which have been computed from 50 independently generated 200-dimensional sets of independent realizations of X^FB.
considered cases. As a consequence, the capacity of the classical G-KDE formalism to concentrate the new realizations of X^(2D) and X^(3D) on the correct subspaces of R^d decreases when d increases. To illustrate this phenomenon, for N = 500, Figure 6 compares the positions of the first components of the available realizations of X^(2D) and X^(3D) with the corresponding positions of 10^4 additional points generated from a G-KDE approach. Hence, just by working on the optimization of the value of the bandwidth, it quickly becomes impossible to recover the subsets of R^2 and R^3 on which the true distributions of X^L and X^FB are concentrated. On the contrary, when the block by block decompositions given by b^(2D,2) and b^(3D,2) are considered, the approximation of the PDFs of the two first components of X^(2D) and X^(3D) is not affected by the presence of the additional random variables Ξ_3, …, Ξ_d. As a consequence, for each considered value of d, the new generated points are concentrated on the correct subspaces, as can be seen in Figure 6.

At last, the high interest of introducing the block by block decomposition for these two examples is emphasized by comparing in Figure 5 the values of
[Figure 5 panels: (a) h_MLE(d, N) vs. d, Lemniscate function; (b) h_MLE(d, N) vs. d, four-branches clover-knot function; (c) log(L_LOO(S(N)|h^2 I_d, b)) vs. d, Lemniscate function, for b = b^(2D,1) and b = b^(2D,2); (d) log(L_LOO(S(N)|h^2 I_d, b)) vs. d, four-branches clover-knot function, for b = b^(3D,1) and b = b^(3D,2).]
Figure 5: Evolutions of h_MLE(d, N) and L_LOO(S(N)|(h_MLE(d, N))^2 I_d) with respect to d, for N = 500. The left column corresponds to the modelling of X^(2D), whereas the right column corresponds to the modelling of X^(3D).
[Figure 6 panels: (a) d = 3, b = b^(2D,1); (b) d = 4, b = b^(3D,1); (c) d = 5, b = b^(2D,1); (d) d = 6, b = b^(3D,1); (e) d = 20, b = b^(2D,2); (f) d = 20, b = b^(3D,2).]
Figure 6: Comparison between the positions of N = 500 given values of X^(2D) and X^(3D) (big black squares) and the positions of 10^4 additional values (small red points) generated from a G-KDE approach for several values of d and two configurations of the block by block decomposition for the G-KDE of the PDF of X^(2D) and X^(3D). The left column corresponds to the modelling of X^(2D), whereas the right column corresponds to the modelling of X^(3D).
value of b | value of h_MLE(d, N) | log(L_LOO(S(N)|h_MLE(d, N), b))
(1,1,1,1,1) | 0.395 | −1.21 × 10^3
(1,2,3,4,5) | (0.115, 0.163, 0.0971, 0.108, 0.118) | −1.15 × 10^3
(1,2,1,2,1) | (0.290, 0.226) | −1.19 × 10^3
(1,1,2,2,2) | (0.113, 0.140) | −8.35 × 10^2
(1,1,2,2,3) | (0.113, 0.119, 0.118) | −9.96 × 10^2
Table 2: Influence of the choice of b on the LOO log-likelihood of the G-KDE for the modeling of the PDF of X = (X^L, X^FB) with N = 200 independent realizations.

In the same manner, if we define X as the concatenation of X^L and X^FB, which are chosen independent, the interest of introducing the correct block by block decomposition of X in terms of likelihood maximization is shown in Table 2. Indeed, choosing b = (1, 1, 2, 2, 2) instead of the two classical a priori choices b = (1, 1, 1, 1, 1) (all the components are modelled at the same time) and b = (1, 2, 3, 4, 5) (all the components are modelled separately) allows us to strongly increase the likelihood associated with the approximation of the PDF of X. Reciprocally, such an example seems to confirm the fact that maximizing L_LOO(S(N)|h_MLE(d, N), b) should help us to find the dependence structure in the components of X.
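The criterion being maximized here can be sketched numerically. In this hedged example (the names loo_log_likelihood and loo_block are illustrative, not the paper's), the LOO log-likelihood of an isotropic G-KDE is computed directly, and the tensorized criterion associated with a block vector b is obtained as the sum of the per-block criteria, since the leave-one-out density of a product of block-wise KDEs factorizes:

```python
import numpy as np

def loo_log_likelihood(S, h):
    """LOO log-likelihood L_LOO(S(N) | h^2 I_d) of a Gaussian KDE with
    bandwidth matrix h^2 I_d built on the sample S of shape (N, d)."""
    N, d = S.shape
    diff = S[:, None, :] - S[None, :, :]            # pairwise differences
    K = np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / h ** 2)
    K /= (2.0 * np.pi) ** (d / 2.0) * h ** d        # Gaussian kernel values
    np.fill_diagonal(K, 0.0)                        # leave each point out
    return np.sum(np.log(K.sum(axis=1) / (N - 1)))

def loo_block(S, b, h_blocks):
    """Tensorized criterion for a block vector b: sum of the per-block
    LOO log-likelihoods (one bandwidth per block)."""
    b = np.asarray(b)
    return sum(loo_log_likelihood(S[:, b == lab], h)
               for lab, h in zip(np.unique(b), h_blocks))
```

Maximizing such a criterion over b (and over the per-block bandwidths) is what the greedy and genetic algorithms of the next subsection automate.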
4.1.4. Efficiency of the proposed algorithms for the block-by-block decomposition
This section aims at comparing the efficiency of the proposed algorithms for solving the optimization problem given by Eq. (24). To this end, using the same notations as in Section 3.1, we denote by X the random vector such that, for all 1 ≤ ℓ ≤ Max(b),

s^(ℓ)(X, b) = ξ^(ℓ)/‖ξ^(ℓ)‖ + 0.15 Ξ^(ℓ).   (32)

Here, ξ^(ℓ) and Ξ^(ℓ) denote independent standard Gaussian random vectors, and ‖·‖ denotes the classical Euclidean norm. By construction, the random vectors s^(ℓ)(X, b) are concentrated on d_ℓ-dimensional hyper-spheres, d_ℓ being the dimension of s^(ℓ)(X, b). Thus, the random vector X presents a known block by block structure, and its distribution is concentrated on a subset of R^d.
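A minimal sketch of this construction (illustrative code, not the paper's): each block of components, as defined by the label vector b, is drawn as a normalized Gaussian vector plus small Gaussian noise, so that it concentrates near the unit hyper-sphere of the block's dimension, as in Eq. (32):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_block_structured(b, n_samples, noise=0.15):
    """Draw n_samples realizations of X following Eq. (32): for each block
    l of the label vector b, s_l = xi_l / ||xi_l|| + noise * Xi_l, with
    xi_l and Xi_l independent standard Gaussian vectors."""
    b = np.asarray(b)
    X = np.empty((n_samples, b.size))
    for lab in np.unique(b):
        idx = np.where(b == lab)[0]                 # components of block l
        xi = rng.standard_normal((n_samples, idx.size))
        Xi = rng.standard_normal((n_samples, idx.size))
        X[:, idx] = (xi / np.linalg.norm(xi, axis=1, keepdims=True)
                     + noise * Xi)
    return X

# first test case of Table 3: d = 7, block vector (1,2,2,1,3,4,1)
X = sample_block_structured((1, 2, 2, 1, 3, 4, 1), n_samples=500)
```

The components sharing a label then lie close to a sphere, which is exactly the dependence structure the greedy and genetic algorithms are asked to recover.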
Chosen values of b | d | N_B(d) | N_greedy(d) | N_b^(10,0.01)_gene(d)
(1,2,2,1,3,4,1) | 7 | 877 | 68 (52) | 32.4
(1,2,2,1,3,4,1,2,4,5) | 10 | 115975 | 174 (141) | 25.5
(1,2,2,1,3,4,1,2,4,5,5,6,3,4,7,6,8,1,2,7) | 20 | 5.17 × 10^17 | 968 (879) | 51.1

Table 3: Comparison of the efficiency of the greedy and the genetic algorithms for the identification of the block-by-block structure of X.
p_Mut \ N_pop | 2 | 5 | 10 | 20 | 50
0 | ∞ | 13.5 | 16.0 | 35.8 | 58.1
0.005 | ∞ | 15.9 | 19.5 | 39.7 | 77.7
0.01 | 13.4 | 15.7 | 25.5 | 36.0 | 62.9
0.1 | 22.9 | 44.3 | 41.0 | 64.7 | 78.7

Table 4: Influence of the parameters p_Mut and N_pop on the mean number of tested values of b, which is denoted by N_b^(N_pop, p_Mut)_gene(d).
The algorithms are compared using 500 independent realizations of X. For different values of d and b, the ability of the greedy and the genetic algorithms to recover the correct block by block structure of X is compared in Table 3. In this table, N_pop = 10, p_Mut = 0.01, and we denote by N_b^(N_pop, p_Mut)_gene(d) the mean number of distinct values of b that were tested by the genetic algorithm to identify the optimal value of b. These values were computed from 20 runs of the algorithm initialized with 20 different initial populations chosen at random in B(d). For the greedy case, the algorithm, which is deterministic, was run until it stopped, and we indicate in Table 3 two quantities: the total number of iterations N_greedy(d) and, in parentheses, the number of iterations that was actually needed to reach the best value of b. Hence, for these particular examples, the genetic algorithm was more efficient than the greedy one.
The influence of the parameters N_pop and p_Mut is then analysed in Table 4, for d = 10 and b = (1, 2, 2, 1, 3, 4, 1, 2, 4, 5). A value of N_b^(N_pop, p_Mut)_gene(d) equal to ∞ means that the correct value was never found after 10^5 iterations. Therefore, this example (the same behaviour was observed for the other examples we tried) seems to encourage the use of small (but not zero) values of p_Mut, as well as small values of N_pop.
The mechanical behaviour of the railway track strongly depends on the track superstructure and substructure components. In particular, the mechanical properties of the ballast layer are very important. Therefore, a series of studies are in progress to better analyse the influence of the ballast shape on the railway track performance. In that prospect, the shapes of N = 975 ballast grains have been measured very precisely. As an illustration, Figure 7 shows the scans of three ballast grains. These measurements can be considered as independent realizations of a complex random field. From this finite set of realizations, a Karhunen-Loève expansion (see [9, 8] for more details about this method) has been carried out to reduce the statistical dimension of this random field. Without entering too much into details, we admit in this paper that the random field associated with the varying ballast shape can finally be parametrized by a 117-dimensional random vector, which is denoted by X. As a consequence of the Karhunen-Loève expansion, this random vector is centred and its covariance matrix is equal to the 117-dimensional identity matrix:

E[X] = 0,   E[X ⊗ X] = I_117.   (33)

From the experimental data, we have access to N = 975 independent realizations of X, which are gathered in S(N). Based on this maximal available information, we would like to identify the PDF of X from a G-KDE approach. The results associated with several modellings based on the G-KDE formalism are summarized in Table 5. In this table, we notice the high interest of introducing correlation constraints. Indeed, for such a very high dimensional problem with relatively little data, if no constraints are introduced, we get very poor models associated with very low values of the LOO likelihood. In that case, assuming that all the components are independent leads to better results than assuming that they are all dependent. This can be explained by the fact that if all the components of X are chosen independent, we impose a diagonal structure for E[X ⊗ X], which is, in that case, very close to imposing that E[X ⊗ X] = I_117.
On the contrary, much higher values of the LOO likelihood are obtained by adding constraints on the mean value and the covariance matrix of the G-KDE of the PDF of X. In both cases, it can be noticed that it is worth working on the value of the bandwidth. Indeed, passing from h_Silv(d, N) to h_MLE(d, N) noticeably increases the LOO log-likelihood.

Value of b | Value of h | Correlation constraints | LOO log-likelihood
(1, …, 1) | h_Silv(d, N) | no | −179379
(1, …, 1) | h_MLE(d, N) | no | −176886
(1, …, d) | h_MLE(d, N) | no | −162398
(1, …, 1) | h_Silv(d, N) | yes | −161745
(1, …, 1) | h_MLE(d, N) | yes | −161262
(1, …, d) | h_MLE(d, N) | yes | −161775
b_MLE | h_MLE(d, N) | yes | −160930

Table 5: Influence of the value of the bandwidth, of the presence of constraints on the covariance, and of the choice of the block by block decomposition for the approximation of the PDF of X.
At last, introducing the tensorized representation as it is done in Section 3, and working on the value of the block-by-block decomposition of X, leads to another high increase of the LOO likelihood. For this application, the value of b_MLE has been approximated by coupling the greedy algorithm and the genetic algorithm presented in Section 3. The greedy algorithm was first launched, and stopped after 30000 iterations. Then, based on these results, 20000 additional iterations were performed using the genetic algorithm with N_pop = 500 and p_Mut = 0.005. Finally, by working on both the correlation constraints and the block by block decomposition of X, it is possible to construct, for this example, very interesting statistical models for X. Such models can then be used for the

This work considers the challenging problem of identifying complex PDFs when the maximal available information is a set of independent realizations. In that prospect, the multidimensional G-KDE method plays a key role, as it presents a good compromise between complexity and efficiency. Two adaptations of this method have been presented. First, a modified formalism is presented to make the mean and the covariance matrix of the estimated PDF equal to their empirical estimations. Then, tensorized representations are proposed. These constructions are based on the identification of a block by block dependence structure of the random vectors of interest. The interest of these two adaptations has finally been illustrated on a series of analytical examples and on a high-dimensional industrial example.

The identification of the bandwidth matrices and of the block structure is carried out in a frequentist framework. Investigating Bayesian sampling for the bandwidth matrices and the block structure selection could be interesting for future work.
Appendix

A1. Proof of Proposition 1

We can calculate:

E[X̃] = (1/N) Σ_{n=1}^N (A X(ω_n) + β) = μ̂.   (34)

Cov(X̃) = ∫_{R^d} x ⊗ x p̃_X(x; H, S(N)) dx − μ̂ ⊗ μ̂
        = (1/N) Σ_{n=1}^N [H + (A X(ω_n) + β) ⊗ (A X(ω_n) + β)] − μ̂ ⊗ μ̂
        = H + ((N − 1)/N) A R̂_X A^T = R̂_X.   (35)
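This proposition can be checked numerically. The sketch below is illustrative (it assumes an isotropic bandwidth matrix H = h^2 I_d and requires R̂_X − H to be positive definite): it builds an affine map (A, β) satisfying the relations of Eqs. (34)-(35), so that the constrained G-KDE reproduces the empirical mean and covariance exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, h = 200, 3, 0.3
S = rng.standard_normal((N, d)) * np.array([1.0, 2.0, 0.5])  # raw sample

mu_hat = S.mean(axis=0)                 # empirical mean
R_hat = np.cov(S, rowvar=False)         # empirical covariance (unbiased)
H = h ** 2 * np.eye(d)                  # isotropic bandwidth matrix

# Choose A so that A R_hat A^T = N/(N-1) (R_hat - H), and beta so that the
# kernel centres average to mu_hat, matching Eqs. (34)-(35).
L_R = np.linalg.cholesky(R_hat)
L_M = np.linalg.cholesky(N / (N - 1) * (R_hat - H))
A = L_M @ np.linalg.inv(L_R)
beta = mu_hat - A @ mu_hat

Y = S @ A.T + beta                      # shifted/scaled kernel centres

mean_kde = Y.mean(axis=0)                              # Eq. (34): = mu_hat
cov_kde = H + (N - 1) / N * np.cov(Y, rowvar=False)    # Eq. (35): = R_hat
```

With this choice, the mixture of Gaussian kernels centred at the transformed points Y has exactly the empirical mean μ̂ and covariance R̂_X of the data.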
This section presents the three algorithms that are used in the genetic algorithm defined in Section 3. Algorithm 4 presents the fusion function, Algorithm 5 describes the mutation function, and Algorithm 6 shows the pseudo projection on B(d) on which they are based.

1  Let b^(1) and b^(2) be two elements of B(d);
2  Initialization: b = (0, …, 0), index = {1, …, d}, n = 1;
3  while index is not empty do
4    Choose i ∼ U(index), j ∼ U({1, 2}), k ∼ U({1, 2});
5    Find u^(1) = which(b^(1) == b^(1)_i), u^(2) = which(b^(2) == b^(2)_i);
6    if k == 1 then
7      Define v = u^(j) ∩ index;
8    end
9    else
10     Define v = (u^(1) ∪ u^(2)) ∩ index;
11   end
12   Fill b[v] = n;
13   Actualize n ← n + 1, index ← index \ v.
14 end
15 Fusion(b^(1), b^(2)) := Π_{B(d)}(b).

Algorithm 4: Algorithm for the fusion of two elements b^(1) and b^(2) of B(d).
References

[1] R. Brent. Algorithms for Minimization without Derivatives. Englewood Cliffs, N.J.: Prentice-Hall, 1973.
[2] R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and S.W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps. Proc. Natl. Acad. Sci. USA, 2005.
[3] Tarn Duong, Arianna Cowling, Inge Koch, and M.P. Wand. Feature significance for multivariate kernel density estimation. Computational Statistics & Data Analysis.
1  Let b be an element of B(d) and 0 ≤ p_Mut ≤ 1;
2  for i = 1 : d do
3    Choose u ∼ U([0, 1]);
4    if u < p_Mut then
5      b_i ∼ U({1, …, d} \ {b_i});
6    end
7  end
8  Mutation(b, p_Mut) := Π_{B(d)}(b).

Algorithm 5: Algorithm for the mutation of an element b of B(d).
1  Let b be an element of {1, …, d}^d, index = (1, …, 1), n = 1, b* = (0, …, 0);
2  for i = 1 : d do
3    if sum(index) == 0 then
4      break;
5    end
6    else
7      Find u = which(b == b_i);
8      Fill b*[u] = n;
9      Actualize n = n + 1, index[u] = 0.
10   end
11 end
12 Π_{B(d)}(b) := b*.

Algorithm 6: Algorithm for the pseudo projection of an element b on B(d).