HAL Id: hal-00957880
https://hal.archives-ouvertes.fr/hal-00957880
Preprint submitted on 11 Mar 2014
Selection of GLM mixtures: a new criterion for clustering purpose
Olivier Lopez, Xavier Milhaud
To cite this version:
Olivier Lopez, Xavier Milhaud. Selection of GLM mixtures: a new criterion for clustering purpose. 2014. ⟨hal-00957880⟩
Selection of GLM mixtures: a new criterion for clustering purpose
Olivier Lopez^{a,b,c}, Xavier Milhaud^{a,b,∗}
a ENSAE ParisTech, 3 Avenue Pierre Larousse, 92245 Malakoff Cedex, France
b CREST (LFA lab), 15 Boulevard Gabriel Péri, 92245 Malakoff Cedex, France
c Sorbonne Universités, UPMC Université Paris VI, EA 3124, LSTA, 4 place Jussieu, 75005 Paris, France
Abstract
Model-based clustering from finite mixtures of generalized linear models is a challenging issue which has undergone many recent developments (Hennig and Liao (2013), Hannah, Blei, and Powell (2011), Aitkin (1999)). In practice, the model selection step is usually performed using the penalized criteria AIC or BIC. However, simulations show that they tend to overestimate the actual dimension of the model. This evidence led us to consider a new criterion close to ICL, first introduced in Baudry (2009). Its definition requires a contrast embedding an entropic term: using concentration inequalities, we derive key properties about the convergence of the associated M-estimator. The consistency of the corresponding classification criterion then follows under some classical requirements on the penalty term. Finally, a simulation study corroborates our theoretical results and shows the effectiveness of the method from a clustering perspective.
Keywords: Conditional classification likelihood, GLM, Model selection.
1. Introduction
The use of mixture modeling has boomed since the publication of Dempster, Laird, and Rubin (1977), who showed how to estimate the parameters of a finite mixture model thanks to the EM algorithm. Applications involving finite mixtures usually focus on explicitly dividing the population into subpopulations, given that we do not originally know to which subpopulation each individual belongs. This is what is commonly called an incomplete data problem, where the crucial point lies in determining the right number of subpopulations (or components) in order to perform a model-based clustering at the end. Meanwhile, generalized linear models (GLM) have undergone vigorous development since the early 1980's. Their great popularity may be due to their high flexibility and their ability to handle both categorical and continuous risk factors (see McCullagh and Nelder (1989) and references therein). In many situations, finite mixtures of GLM can be very useful to deal with large heterogeneity in the impact of those risk factors on some phenomenon in the population under study. This can be interpreted as complex interactions between a response and some covariates.

∗Corresponding author
Email addresses: olivier.lopez@ensae.fr (Olivier Lopez), xavier.milhaud@ensae.fr (Xavier Milhaud)
We focus in this paper on the topic of selecting the "best" GLM mixture, where "best" should be understood as the best trade-off between the fit and the clustering confidence (this notion will be detailed further on). In this view, we study a criterion originally proposed by Baudry (2009). We derive general exponential bounds for this type of criterion, and show how these results can be applied to the selection of the order (number of components) of GLM mixtures. Selecting one model within a collection is an important statistical problem that has undergone vigorous developments in the literature. In the mixture framework, no universal solution has emerged to the question of selecting the right order. Many articles have been dedicated to the implementation of algorithmic mixture calibration techniques; nevertheless, they often suffer from a lack of theoretical justification with respect to their convergence properties.
Based on information theory, Oliviera-Brochado and Vitorino Martins (2005) point out that there are basically two main approaches to infer the order of a mixture: hypothesis tests, and information and classification criteria. Garel (2007) highlights the difficulty of establishing multimodality by means of the generalized likelihood ratio test in the first approach, because the classical result according to which the test statistic is $\chi^2$-distributed is generally not applicable where mixtures are concerned. Azais, Gassiat, and Mercadier (2006) and Azais, Gassiat, and Mercadier (2009), building on Gassiat (2002), offer a detailed solution to overcome this issue. When using information criteria, most authors (McLachlan and Peel (2000), Fraley and Raftery (1998)) agree that the BIC criterion gives better results than AIC, since it seeks to minimize the Kullback-Leibler (KL) divergence to the true distribution (Raftery (1994), Ripley (1995)). For example, the convergence of BIC for estimating the order of Gaussian mixtures has been proved in Keribin (1999). More generally, Gassiat and Van Handen (2013) establishes the convergence properties (towards the theoretical model) of a model selected by a likelihood-penalized criterion (where the penalty depends linearly on the model dimension) in the context of mixtures. In practice, it is a well-known fact that these two criteria tend to overestimate the theoretical number of components, especially when the model is misspecified (Baudry (2009)). This is not really surprising: AIC, BIC and their variations were originally proposed for model selection in regular statistical models, and thus their use is not well motivated for model selection in non-standard models such as mixtures. Celeux and Soromenho (1996) and Biernacki (2000) were the first to introduce a classification criterion to avoid this overestimation, namely the ICL criterion. However, ICL consistency has not been proved in the context of maximum likelihood theory. Baudry (2009) recently demonstrated the consistency of a slightly modified version of ICL under the Gaussian mixture framework; but no such property has been established in the context of GLM mixtures.
The aim of this paper is two-fold: obtaining new theoretical results concerning the classification criterion introduced by Baudry (2009), and developing its application in the context of GLM mixtures. In Section 2, we consider a general mixture framework and define the conditional classification likelihood. By maximizing this contrast, we obtain an estimator of the mixture parameters which differs from the maximum likelihood estimator: its purpose is to find a compromise between a small classification error and a good fit to the data. We obtain a general bound for the estimation error based on concentration inequalities. In Section 3, general penalized criteria are considered to select the order of the mixture, and we determine conditions for the consistency of such procedures. The application of these results to GLM mixtures follows in Section 4. The practical behavior of this approach is then investigated through simulation studies in Section 5.
2. The maximum conditional classification likelihood estimator

2.1. Context of mixtures
Let $(\mathcal{Y}, \mathcal{F})$ be a measurable space and let $(f_\theta)_{\theta \in \Theta}$ be a parametric family of densities on $\mathcal{Y}$. The parameter $\theta$ is assumed to range over a set $\Theta \in \mathcal{B}(\mathbb{R}^d)$, where $\mathcal{B}(\cdot)$ denotes the Borel sets and $d \geq 1$. For any probability measure $\nu$ on $(\Theta, \mathcal{B}(\Theta))$, the mixture density $f_\nu$ is defined on $\mathcal{Y}$ by
$$f_\nu(y) = \int_\Theta f_\theta(y) \, \nu(d\theta) = \int_\Theta f(y; \theta) \, \nu(d\theta).$$
$\nu$ is the mixing distribution and the $(f_\theta)$ are called the mixands. If $\nu$ has finite support, $f_\nu$ is a finite mixture density. In the present paper, our interest lies in discrete mixtures: for any $y \in \mathcal{Y}$, the density $f_\nu$ is assumed to belong to a collection of densities $\mathcal{M}_g$ defined as
$$\mathcal{M}_g = \left\{ f(y; \psi_g) = \sum_{i=1}^{n_g} \pi_i f_i(y; \theta_i) \;\middle|\; \psi_g = (\pi_1, \ldots, \pi_{n_g}, \theta_1, \ldots, \theta_{n_g}) \in \Psi_g \right\}, \quad (1)$$
where $\Psi_g = \Pi_{n_g} \times \Theta^{n_g}$, with $\Pi_{n_g} \subset \left\{ (\pi_1, \ldots, \pi_{n_g}) : \sum_{i=1}^{n_g} \pi_i = 1 \text{ and } \pi_i \geq 0 \right\}$ and $\Theta^{n_g} = (\theta_1, \ldots, \theta_{n_g})$.
We will denote by $K_g$ the dimension of the parameter set $\Psi_g$. Based on i.i.d. observations $\mathbf{Y} = (Y_1, \ldots, Y_n)$, the corresponding likelihood is given by
$$\forall \psi_g \in \Psi_g, \quad L(\psi_g; Y_1, \ldots, Y_n) = L(\psi_g) = \prod_{j=1}^{n} \sum_{i=1}^{n_g} \pi_i f_i(Y_j; \theta_i). \quad (2)$$
Let us note that the true density function, $f_0(y)$, may not belong to any $\mathcal{M}_g$ in a misspecified case. The maximum likelihood estimator (MLE) $\hat{\psi}_g^{MLE}$ is defined as the maximizer of $L(\psi_g)$ over $\Psi_g$ (in full generality, it may not be unique). Under some regularity conditions, $\hat{\psi}_g^{MLE}$ converges towards $\psi_g^{MLE}$, which is the true parameter when the model is correctly specified.
2.2. A new contrast: the conditional classification likelihood

To understand what optimization by conditional classification likelihood consists in, we first have to define the conditional classification likelihood itself. This function is derived from the general principle of the EM algorithm and approximates the $j$-th individual likelihood of the complete data $(Y_j, \delta_j)$, where $\delta_j = (\delta_{ij})_{i \in \{1,\ldots,n_g\}}$ is the latent component indicator (more precisely, $\delta_{ij}$ equals 1 if observation $j$ belongs to component $i$, and 0 otherwise).
Several authors have attempted to exploit the link between the likelihood of the observed data and the likelihood of the complete data (Celeux and Govaert (1992)), originally noted by Hathaway (1986). A specific term appears when writing the likelihood relative to the complete data $(\mathbf{Y}, \delta)$, hereafter named the classification likelihood: $\forall \psi_g \in \Psi_g$,
$$\ln L_c(\psi_g; \mathbf{Y}, \delta) = \sum_{j=1}^{n} \sum_{i=1}^{n_g} \delta_{ij} \ln \left( \pi_i f_i(Y_j; \theta_i) \right)$$
$$= \sum_{j=1}^{n} \sum_{i=1}^{n_g} \delta_{ij} \ln \underbrace{\frac{\pi_i f_i(Y_j; \theta_i)}{\sum_{k=1}^{n_g} \pi_k f_k(Y_j; \theta_k)}}_{\tau_i(Y_j; \psi_g)} + \underbrace{\sum_{j=1}^{n} \overbrace{\sum_{i=1}^{n_g} \delta_{ij}}^{=1} \ln \left( \sum_{k=1}^{n_g} \pi_k f_k(Y_j; \theta_k) \right)}_{\ln L(\psi_g; \mathbf{Y})}$$
$$= \sum_{j=1}^{n} \sum_{i=1}^{n_g} \delta_{ij} \ln \tau_i(Y_j; \psi_g) + \ln L(\psi_g; \mathbf{Y}). \quad (3)$$
$\tau_i(Y_j; \psi_g)$ is the a posteriori probability that observation $j$ belongs to component $i$. The term that binds the two likelihoods is very close to what is commonly called the entropy:
$$\forall \psi_g \in \Psi_g, \; \forall y_j \in \mathbb{R}^d, \quad \mathrm{Ent}(\psi_g; y_j) = - \sum_{i=1}^{n_g} \tau_i(y_j; \psi_g) \ln \tau_i(y_j; \psi_g).$$
This function results from taking the expectation (w.r.t. $\delta$) of the first term on the right-hand side of (3), hence the "conditional classification likelihood" denoted further by $L_{cc}$:
$$\ln L_{cc}(\psi_g; \mathbf{Y}) = E_\delta[\ln L_c(\psi_g; \mathbf{Y}, \delta)] = \ln L(\psi_g; \mathbf{Y}) + \sum_{j=1}^{n} \sum_{i=1}^{n_g} E_\delta[\delta_{ij} \,|\, Y_j] \ln \tau_i(Y_j; \psi_g)$$
$$= \ln L(\psi_g; \mathbf{Y}) - \mathrm{Ent}(\psi_g; \mathbf{Y}), \quad (4)$$
where $\mathrm{Ent}(\psi_g; \mathbf{Y}) = \sum_{j=1}^{n} \mathrm{Ent}(\psi_g; Y_j)$.
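Decomposition (4) is easy to check numerically. The sketch below is our own illustration (the two-component Gaussian mixture, its parameter values and all variable names are assumptions, not taken from the paper): it evaluates the observed log-likelihood, the entropy term and their difference $\ln L_{cc}$ on a toy sample.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.standard_normal(500)                 # toy sample

# a fixed two-component Gaussian mixture parameter psi_g (illustrative values)
pi = np.array([0.4, 0.6])                    # mixing proportions pi_i
mu = np.array([-1.0, 1.0])                   # component means theta_i

dens = pi * norm.pdf(y[:, None], loc=mu)     # pi_i f_i(Y_j; theta_i), shape (n, 2)
mix = dens.sum(axis=1)                       # mixture density f(Y_j; psi_g)
tau = dens / mix[:, None]                    # a posteriori probabilities tau_i(Y_j)

lnL = np.log(mix).sum()                      # observed log-likelihood ln L
ent = -(tau * np.log(tau)).sum()             # entropy term Ent(psi_g; Y), always >= 0
lnLcc = lnL - ent                            # conditional classification likelihood (4)
print(lnL, ent, lnLcc)
```

Since the entropy is nonnegative, $\ln L_{cc}$ never exceeds $\ln L$, in line with its reading as a penalized likelihood.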
The entropy is maximal in the case of equiprobability ($\tau_1(Y_j; \psi_g) = \ldots = \tau_{n_g}(Y_j; \psi_g)$), and minimal when one of the a posteriori probabilities equals 1. As highlighted by equation (4), this term can be seen as a penalization of the observed likelihood: the bigger the lack of confidence when making the a posteriori classification (via the Bayes rule), the greater the penalization (and vice versa). In fact, the entropy has a zero limit when $\tau_i$ tends to 0 or 1.
However, it is not differentiable at 0: consider the function $h(\tau_i) = \tau_i \ln \tau_i$; then $\lim_{\tau_i \to 0^+} h'(\tau_i) = -\infty$. This will be a key point in the definition of the parameter space that is acceptable to ensure the convergence of the estimator based on the conditional classification likelihood: we must therefore prevent the a posteriori proportions of the mixture from tending to zero. It is also essential to keep in mind that the mixture model should be identifiable; see McLachlan and Peel (2000) (p. 26) for a further discussion of this issue. More generally, studying the limits of the $L_{cc}$ expression enables us to deduce additional constraints to be imposed on the parameter space (so that $L_{cc}$ does not diverge). Most of the time, this suggests that critical situations correspond mainly to parameters that would not be bounded (Baudry (2009)).
Define the maximum conditional classification likelihood estimator ($ML_{cc}E$)
$$\hat{\psi}_g^{ML_{cc}E} = \arg\max_{\psi_g \in \Psi_g} \frac{1}{n} \sum_{j=1}^{n} \ln L_{cc}(\psi_g; y_j). \quad (5)$$
It should converge towards $\psi_g^{ML_{cc}E} = \arg\max_{\psi_g \in \Psi_g} E_{f_0}[\ln L_{cc}(\psi_g; Y)]$.
Baudry (2009) provides the following example to illustrate how $\psi_g^{ML_{cc}E}$ differs from $\psi_g^{MLE}$. Recall that the latter aims at minimizing the KL divergence between $f(\cdot; \psi_g)$ and the theoretical distribution $f_0(\cdot)$.

Example. $f_0$ is the normal density $\mathcal{N}(0, 1)$. Consider the model
$$\mathcal{M} = \left\{ \frac{1}{2} f_{\mathcal{N}}(\cdot; -\mu, \sigma^2) + \frac{1}{2} f_{\mathcal{N}}(\cdot; \mu, \sigma^2) \;;\; \mu \in \mathbb{R}, \; \sigma^2 \in \mathbb{R}^{+*} \right\},$$
where no further condition is imposed. There is no closed-form expression for $\psi_g^{ML_{cc}E}$ in this example (even when $\sigma^2$ is fixed!). However, one can compute it numerically: $(\mu^{ML_{cc}E}, \sigma^{ML_{cc}E}) = (0.83, \sqrt{0.31})$. This means that there exists a unique maximizer of $E_{f_0}[\ln L_{cc}(\mu, \sigma^2)]$ in $\Psi_g$ (up to a label switch), which is obviously different from $\psi_g^{MLE}$. Indeed, $\psi_g^{MLE} = (\mu^{MLE}, \sigma^{MLE}) = (0, 1)$, which leads to nothing other than the theoretical distribution itself.
This shows that the $ML_{cc}E$ does not aim at recovering the theoretical distribution, even when it is contained in the model under consideration. Here, the MLE has no rule to designate two suitable classes (components) for this model: it would therefore construct the same two exactly superimposed densities $f_{\mathcal{N}}(\cdot; 0, 1)$. The allocation of observations to one or the other of these components would then be completely arbitrary (with probability 0.5, hence maximum entropy). In contrast, the compromise sought by the $ML_{cc}E$, which penalizes such excessive entropy, leads to another estimator resulting in greater confidence in the assignment of observations to mixture components.
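This example can be reproduced numerically. The sketch below is our own (the optimizer, starting point and sample size are assumptions): it maximizes the empirical criterion (5) over the symmetric two-component model on a large sample drawn from $f_0 = \mathcal{N}(0,1)$, and the fitted $(\mu, \sigma^2)$ should land near the values $(0.83, 0.31)$ quoted above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.standard_normal(20_000)          # sample from f0 = N(0, 1)

def neg_mean_lnLcc(params):
    """-(1/n) ln Lcc for the model (1/2) N(-mu, s2) + (1/2) N(mu, s2)."""
    mu, log_s2 = params
    s = np.exp(0.5 * log_s2)             # parameterize sigma^2 on the log scale
    f1 = 0.5 * norm.pdf(y, -mu, s)
    f2 = 0.5 * norm.pdf(y, mu, s)
    mix = f1 + f2
    tau1 = f1 / mix
    tau2 = 1.0 - tau1
    ent = -(tau1 * np.log(tau1 + 1e-300) + tau2 * np.log(tau2 + 1e-300))
    return -np.mean(np.log(mix) - ent)   # minimize the negative of (1/n) ln Lcc

res = minimize(neg_mean_lnLcc, x0=[0.5, np.log(0.5)], method="Nelder-Mead")
mu_hat, s2_hat = abs(res.x[0]), np.exp(res.x[1])
print(mu_hat, s2_hat)                    # compare with (0.83, 0.31) quoted above
```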
2.3. Exponential bound for the $ML_{cc}E$

To shorten the notation, let $\psi_g^b = \psi_g^{ML_{cc}E}$ and $\hat{\psi}_g = \hat{\psi}_g^{ML_{cc}E}$. Define
$$\phi(\psi_g; y) = \ln L_{cc}(\psi_g; y) - \ln L_{cc}(\psi_g^b; y), \quad d(\psi_g, \psi_g^b) = -E[\phi(\psi_g; Y)].$$
If the model is correctly specified, that is if $f_0(\cdot) = f(\cdot; \psi_g^b)$, the function $d$ is the Kullback-Leibler divergence between $f(\cdot; \psi_g)$ and $f(\cdot; \psi_g^b)$. In full generality, this quantity will be different from the Kullback-Leibler divergence, but will express some pseudo-distance between the parameters $\psi_g$ and $\psi_g^b$. Let us observe that, by definition of $\psi_g^b$, $d(\psi_g, \psi_g^b) \geq 0$ for all $\psi_g \in \Psi_g$.

The main result of this section provides an exponential bound for the deviation probability of the $L_{cc}$ contrast, centered by its expectation, in the case where $\ln L_{cc}$ is bounded. If the contrast is unbounded, up to some additional moment condition, the exponential bound is perturbed by a polynomial term multiplied by a constant which is a decreasing function of the sample size. As a corollary, we deduce bounds for $d(\hat{\psi}_g, \psi_g^b)$. Let us note that, in view of applying our result to GLM inference, we need a result adapted to unbounded contrasts $\ln L_{cc}$ (for many GLM distributions, the logarithm of the response density is unbounded).
To obtain the exponential bound, we first require an assumption which ensures a domination of the components $f_i$ as well as of their derivatives.

Assumption 1. Assume that $\Theta$ is a compact subset of $\mathbb{R}^d$. Denote by $\nabla_\theta f_i(y; \theta_i)$ (resp. $\nabla^2_\theta f_i(y; \theta_i)$) the vector (resp. the matrix) of partial derivatives of $f_i$ with respect to each component of $\theta_i$. Assume that, for all $i = 1, \ldots, n_g$ and all $\theta \in \Theta$,
$$f_i(y; \theta) \leq \tilde{\Lambda}_0(y) < \infty,$$
$$f_i(y; \theta) \geq \tilde{\Lambda}_-(y) > 0,$$
$$\|\nabla_\theta f_i(y; \theta)\|_\infty \leq \tilde{\Lambda}_1(y),$$
$$\|\nabla^2_\theta f_i(y; \theta)\|_\infty \leq \tilde{\Lambda}_2(y),$$
with $\sup_{l=0,1,2} \tilde{\Lambda}_l(y) \tilde{\Lambda}_-(y)^{-1} \leq \tilde{A}(y)$.

In the case where the functions $f_i$ are not bounded, we require some moment assumptions on both $\tilde{A}(y)$ defined in Assumption 1 and some functions related to the contrast evaluated at the true parameter.

Assumption 2. Using the notations of Assumption 1, assume that there exists $m > 0$ such that
$$E[\tilde{A}(Y)^m] + E[|\nabla_{\psi_g} \ln f(Y; \psi_g^b)|^m] + E[\sup_{i=1,\ldots,n_g} |g_{i,\psi_g^b}(Y)|^m] < \infty,$$
where the $g_{i,\psi}(y)$ correspond to the notations of Lemma B1.
We now state the main result of this section.

Theorem 1. Let
$$P(x; g) = P\left( \sup_{\psi_g \in \Psi_g} \left| \sum_{j=1}^{n} \frac{\ln L_{cc}(\psi_g; Y_j) - \ln L_{cc}(\psi_g^b; Y_j) + d(\psi_g, \psi_g^b)}{\|\psi_g - \psi_g^b\|} \right| > x \right),$$
where $\Psi_g$ is a set of parameters such that, for all $\psi_g = (\pi_1, \ldots, \pi_{n_g}, \theta_1, \ldots, \theta_{n_g}) \in \Psi_g$ and all $1 \leq i \leq n_g$, $\pi_i \geq \pi_- > 0$. Assume that $\psi_g^b$ is an interior point of $\Psi_g$. Under Assumptions 1 and 2 with $m - \varepsilon \geq 2$ for some $\varepsilon \geq 0$, there exist four constants $A_3$, $A_4$, $A_5$ and $A_6$ (depending on the parameter space $\Theta$ and on the functions $f_i$ only) such that
$$P(x; g) \leq 4 \left( \exp\left(-\frac{A_3 x^2}{n}\right) + \exp\left(-\frac{A_4 x}{n^{1/2-\varepsilon}}\right) \right) + A_5 x^{-(m-\varepsilon)/2},$$
for $x > A_6 n^{1/2} [\ln n]^{1/2}$.
Proof. Let
$$\phi_{\psi_g}(y) = \frac{\ln L_{cc}(\psi_g; y) - \ln L_{cc}(\psi_g^b; y)}{\|\psi_g - \psi_g^b\|}.$$
We can decompose $\phi_{\psi_g}(y) = \phi^1_{\psi_g}(y) - \phi^2_{\psi_g}(y)$, where
$$\phi^1_{\psi_g}(y) = \frac{\ln f(y; \psi_g) - \ln f(y; \psi_g^b)}{\|\psi_g - \psi_g^b\|}, \qquad \phi^2_{\psi_g}(y) = \frac{\mathrm{Ent}(\psi_g, y) - \mathrm{Ent}(\psi_g^b, y)}{\|\psi_g - \psi_g^b\|}.$$
The proof consists of applying the concentration inequality of Proposition A1, along with Proposition A2, to the classes of functions $\mathcal{F}_l = \{\phi^l_{\psi_g} : \psi_g \in \Psi_g\}$ for $l = 1, 2$. To apply Proposition A2, we have to check that polynomial bounds on the covering numbers of these two classes hold (condition (i) in Proposition A2). This is done in the first step of the proof. Nevertheless, Propositions A1 and A2 require the boundedness of the class of functions that one considers. Therefore the result can only be obtained for a truncated version of these two classes, that is $\mathcal{F}_l 1_{F_l(y) \leq M}$ for some $M$ going to infinity at some reasonable rate, and some appropriate envelope function $F_l$. The application of the concentration inequality to the truncated version is performed in the second step of the proof. In a third step, the difference between the truncated version and the remainder term is considered. Finally, in a fourth step, all the results are gathered.
Step 1: covering numbers requirements.

Let $\mathcal{A}_i = \{\pi f_i(y; \theta) : \theta \in \Theta, \pi \in [\pi_-, 1]\}$. Due to Assumption 1, a first order Taylor expansion shows that
$$\frac{|\ln f(y; \psi_g) - \ln f(y; \psi_g^b)|}{\|\psi_g - \psi_g^b\|} \leq n_g \, d \, \frac{\tilde{\Lambda}_1(y)}{\tilde{\Lambda}_-(y)},$$
where we recall that $d$ is the dimension of $\Theta$. So the class $\mathcal{F}_1$ admits the envelope $F_1(y) = |\nabla_{\psi_g} \ln f(y; \psi_g^b)| + n_g \, d \, \tilde{A}(y) \, \mathrm{diam}(\Psi_g)$, where $\mathrm{diam}(\Psi_g)$ denotes the diameter of $\Psi_g$ with respect to $\|\cdot\|$. Observe that, from Assumption 1, it follows from a second order Taylor expansion and Lemma 2.13 in Pakes and Pollard (1989) that $N_{F_1}(\varepsilon, \mathcal{A}_i) \leq C_1 \varepsilon^{-V_1}$, for some constants $C_1 > 0$ and $V_1 > 0$. Since $\mathcal{F}_1 = \sum_{i=1}^{n_g} \mathcal{A}_i$, Lemma 16 in Nolan and Pollard (1987) applies, so that $N_{F_1}(\varepsilon, \mathcal{F}_1) \leq C_1 n_g^{n_g V_1} \varepsilon^{-n_g V_1}$.

For the class $\mathcal{F}_2$, the bound on the covering number is a consequence of Lemma B2. The assumptions of Lemma B1, required to obtain Lemma B2, clearly hold from Assumption 1, with $\Lambda_-(y) = \tilde{\Lambda}_-(y)$, $\Lambda_0(y) = \tilde{\Lambda}_0(y)$, $\Lambda_1(y) = d \tilde{\Lambda}_1(y) + \tilde{\Lambda}_0(y)$, $\Lambda_2(y) = 2^{-1} d^2 \tilde{\Lambda}_2(y)$. The envelope of $\mathcal{F}_2$ is $F_2(y) = n_g [\Lambda_3(y) \, \mathrm{diam}(\Psi_g) + \sup_{i=1,\ldots,n_g} |g_{i,\psi_0}(y)|]$, with $\Lambda_3(y) = C \tilde{A}(y)^3$.
Step 2: concentration inequality for truncated classes.
Introduce a constant M
n> 0, and consider the classes F
lMn= F
l1
Fl(y)≤Mnfor l = 1, 2, where the functions F
lare the envelope functions defined in Step 1. Observe that the covering number of F
lMncan be bounded using the same bound as in Step 1, since truncation does not alter this bound (this can be seen as a consequence of Lemma A.1 in Einmahl and Mason (2000)). Let
P
1(l)(x; g ) = P sup
ψg∈Ψg
X
nj=1
{ φ
lψg(Y
j) − E[φ
lψg(Y )] } 1
Fl(Yj)≤Mn> x
!
.
To bound this probability, we combine Proposition A1 and Proposition A2.
The requirements (ii) and (iii) of Proposition A2 hold with M = M
n, σ
2= E [F
l(Y )
2], while the requirement (i) follows from Step 1. Observe that M
ncan be taken large enough so that M
n≥ σ (in Step 4 of the proof, we will make M
ntend to infinity). Following the notations of Proposition A2, introducing a sequence of Rademacher variables (ε
j)
1≤j≤nindependent from (Y
j)
1≤j≤n, we get
E
"
sup
ψg∈Ψg
X
nj=1
ε
jφ
lψg(Y
j)1
Fl(Yj)≤Mn#
≤ C
gn
1/2[log(M
n)]
1/2, (6) where C
gis a constant depending on K
g, the dimension of the model.
Taking u = x(2A
1)
−1in Proposition A1, we get, for x > 2A
1C
gn
1/2[log M
n]
1/2, that the probability P
1(l)(x; g) is bounded by
P sup
ψg∈Ψg
X
nj=1
{ φ
Mlψng(Y
j) − E[φ
Mlψng(Y )] }
> A
1E
"
sup
ψ∈Ψg
X
nj=1
ε
jφ
lψg(Y
j)1
Fl(Yj)≤Mn# + u
!!
, where φ
Mlψng(y) = φ
lψg(y)1
Fl(y)≤Mn. Hence, from Proposition A1 with σ
F2Ml
=
σ
2, we get
P
1(l)(x; g) ≤ 2
exp
− C
2x
2n
+ exp
− C
3x M
n, with C
2= A
2[4A
21σ
2]
−1, and C
3= A
2[2A
1]
−1.
Step 3: remainder term.
Define φ
Mlψngc(y) = φ
lψg(y)1
Fl(y)>Mn. We have
X
nj=1
φ
Mlψngc(Y
j) ≤
X
nj=1
F
l(Y
j)1
Fl(Yj)>Mn=: S
l,Mn.
Hence, from Markov’s inequality, P(S
l,Mn> x) ≤
nxkkE[F
l(Y )
k1
Fl(Y)>Mn].
Next, from Cauchy-Schwarz inequality,
E[F
l(Y )
k1
Fl(Y)>Mn] ≤ E[F
l(Y )
2k]
1/2P(F
l(Y ) > M
n)
1/2. Again, from Markov’s inequality, P(F
l(Y ) > M
n) ≤
E[FMl(Yk′)k′]n
.
This finally leads to
P(S
l,Mn> x) ≤ n
kx
kM
nk′/2E[F
l(Y )
k′]
1/2E[F
l(Y )
2k]
1/2. (7)
Take M
n= n
1/2−ε. Then n
kM
nk′/2is equal to 1 provided that k
′= 2k + ε. We take k
′= m, which corresponds to k = m/2 − ε/2. Next,
E[φ
Mlψngc(Y )]
≤ E
F
l(Y )
21/2P(F
l(Y ) > M
n)
1/2≤ E[F
l(Y )
m]
1/2M
nm/2E[F
l(Y )
2]
1/2. Therefore, since (m − ε) ≥ 2,
X
nj=1
E[φ
Mlψngc(Y )]
≤ E[F
l(Y )
4k]
1/2E[F
l(Y )
2]
1/2=: C
5. Hence, for x > C
5,
P sup
ψg∈Ψg
X
nj=1
E[φ
Mlψngc(Y )]
> x
!
= 0. (8)
Let
P
2(l)(x; g) = P sup
ψg∈Ψg
X
nj=1
{ φ
Mlψngc(Y
j) − E[φ
Mlψngc(Y )] }
> x
! . It follows from (7) and (8) that
P
2(l)(x; g) ≤ P(S
l,Mn> x/2) + P sup
ψg∈Ψg
X
nj=1
E[φ
Mlψngc(Y )]
> x/2
!
≤ C
6x
(m−ε)/2, for x > C
5.
Step 4: summary.

We have
$$P(x; g) \leq \sum_{l=1}^{2} P_1^{(l)}(x/4; g) + P_2^{(l)}(x/4; g).$$
From Steps 2 and 3, we deduce that
$$P(x; g) \leq 4 \left( \exp\left(-\frac{C_2 x^2}{16 n}\right) + \exp\left(-\frac{C_3 x}{4 M_n}\right) \right) + C_7 x^{-(m-\varepsilon)/2},$$
for $x > \max(C_5, 2 A_1 C_g n^{1/2} \log M_n)$. The result follows from the fact that we can impose $C_g$ large enough so that $C_5 \leq 2 A_1 C_g n^{1/2} \log M_n$, and from the fact that we imposed $M_n = n^{1/2 - \varepsilon}$ in Step 3.
Remark 1: in the bound of $P(x; g)$, two terms decrease exponentially, while a third one decreases polynomially. This additional term is the price to pay for considering potentially unbounded variables $Y$ (see Gassiat (2002) and Gassiat and Van Handen (2013) for related bounds in the bounded case). If we strengthen the assumptions on $Y$, by assuming the existence of an exponential moment for $\tilde{A}(Y)$ instead of a finite $m$-th moment for $m$ large enough in Assumption 2, a better bound can be obtained. This will especially be the case when one considers bounded variables $Y$, which lead to a bounded function $\tilde{A}(y)$. In Appendix C, we show how this bound can be obtained under this more restrictive assumption.
Remark 2: it is easy to see, from the proof of Theorem 1, that a similar bound holds if $\ln L_{cc}$ is replaced by the log-likelihood, and $\psi_g^b$ by the limit of the MLE. Indeed, the proof is divided into proving bounds for the classical log-likelihood and for the entropy term. In this last situation, note that the restriction of the probabilities $\pi_i$ to values larger than $\pi_-$ is not required. This restriction in Theorem 1 was imposed by the behavior of the derivative of the entropy near 0, which could explode otherwise. This problem does not appear when one only considers the log-likelihood.
2.4. Almost sure rates for the M L
ccE
Corollary 1. Assume that ln L
ccis twice differentiable with respect to ψ
g, and denote by H
ψgthe Hessian matrix of E[ln L
cc(ψ
g; Y )] evaluated at ψ
g. Assume that, for some c > 0, ψ
gTH
ψgψ
g> c k ψ
gk
22for all ψ
g∈ Ψ
g, where k · k
2denotes the L
2− norm. Then, under the assumptions of Theorem 1, for m ≥ 2 in Assumption 2 and for the norm k · k
2, we have
k ψ ˆ
g− ψ
gbk
2= O
P1 n
1/2. If m > 4,
k ψ ˆ
g− ψ
bgk
2= O
a.s.[ln n]
1/2n
1/2.
Proof. Observe that, from a second order Taylor expansion, $d(\hat{\psi}_g, \psi_g^b) \geq c \|\hat{\psi}_g - \psi_g^b\|_2^2$. By definition of $\hat{\psi}_g$, we have
$$\sum_{j=1}^{n} \frac{\ln L_{cc}(\hat{\psi}_g; Y_j) - \ln L_{cc}(\psi_g^b; Y_j)}{\|\hat{\psi}_g - \psi_g^b\|_2} \geq 0.$$
Therefore,
$$\sum_{j=1}^{n} \frac{\{\ln L_{cc}(\hat{\psi}_g; Y_j) - \ln L_{cc}(\psi_g^b; Y_j)\}}{\|\hat{\psi}_g - \psi_g^b\|_2} + \frac{n \, d(\hat{\psi}_g, \psi_g^b)}{\|\hat{\psi}_g - \psi_g^b\|_2} \geq \frac{n \, d(\hat{\psi}_g, \psi_g^b)}{\|\hat{\psi}_g - \psi_g^b\|_2} \geq c \, n \, \|\hat{\psi}_g - \psi_g^b\|_2.$$
Applying Theorem 1, we get, for $x > A_6 n^{1/2} [\ln n]^{1/2}$,
$$P\left( c \, n \, \|\hat{\psi}_g - \psi_g^b\|_2 > x \right) \leq P(x; g) \leq 4 \left( \exp\left(-\frac{A_3 x^2}{n}\right) + \exp\left(-\frac{A_4 x}{n^{1/2-\varepsilon}}\right) \right) + A_5 x^{-(m-\varepsilon)/2}.$$
Define the event $E_n(u) = \left\{ n^{1/2} \|\hat{\psi}_g - \psi_g^b\|_2 > u [\ln n]^{1/2} \right\}$. We have $P(E_n(u)) \leq P(x; g)$ with $x = u c n^{1/2} [\ln n]^{1/2}$, if $u > A_6$. Proving the almost sure rate of Corollary 1 is done by applying the Borel-Cantelli Lemma to the sets $\{E_n(u)\}_{n \in \mathbb{N}}$, for some $u$ large enough. We need to show that, for some $u$ large enough, $\sum_{n \geq 1} P(E_n(u)) < \infty$. We have, for $u > A_6$,
$$\sum_{n=1}^{\infty} P(E_n(u)) \leq \sum_{n=1}^{\infty} \frac{4}{n^{A_3 u^2}} + \sum_{n=1}^{\infty} 4 \exp\left(-A_4 u \, n^{\varepsilon} [\ln n]^{1/2}\right) + \sum_{n=1}^{\infty} \frac{A_5}{u^{(m-\varepsilon)/2} c^{(m-\varepsilon)/2} n^{(m-\varepsilon)/4} [\ln n]^{(m-\varepsilon)/4}}.$$
The first sum on the right-hand side is finite provided that $u > A_3^{-1/2}$. The second sum is finite if $\varepsilon > 0$. The third is finite if $m > 4$ and $\varepsilon$ is taken sufficiently small.

To prove the $O_P$-rate of Corollary 1, we need to show that $p_n(u) = P(E_n(u/[\ln n]^{1/2}))$ tends to zero when $u$ tends to infinity. Using the same arguments as before, for $m \geq 2$,
$$p_n(u) \leq 4 \exp(-A_3 u^2) + 4 \exp(-A_4 [\ln n]^{1/2} u) + 2^{4m} A_5 c^{-m/2} u^{-m/2},$$
where the right-hand side tends to zero when $u$ tends to infinity.
Remark 3: it also follows from the proof of Corollary 1 that the stronger result
$$\frac{1}{n} \sum_{j=1}^{n} \frac{\{\ln L_{cc}(\hat{\psi}_g; Y_j) - \ln L_{cc}(\psi_g^b; Y_j)\}}{\|\hat{\psi}_g - \psi_g^b\|} + \frac{d(\hat{\psi}_g, \psi_g^b)}{\|\hat{\psi}_g - \psi_g^b\|} = O_{a.s.}\left([\ln n]^{1/2} n^{-1/2}\right)$$
holds. This implies
$$\frac{1}{n} \sum_{j=1}^{n} \{\ln L_{cc}(\hat{\psi}_g; Y_j) - \ln L_{cc}(\psi_g^b; Y_j)\} + d(\hat{\psi}_g, \psi_g^b) = O_{a.s.}\left([\ln n] n^{-1}\right). \quad (9)$$
3. A new penalized selection criterion: ICL*

As mentioned before, a crucial issue in clustering and mixture analysis is to determine the appropriate order of the mixture to correctly describe the data. Biernacki (2000) tried to circumvent the difficulty faced by BIC in selecting the right number of classes, especially in the case of a misspecified mixture model. He wanted to emulate the BIC approach by replacing the observed likelihood with the classification likelihood, thereby eliminating the problem of overestimating the order of the mixture. In this way, he expected to find a criterion achieving a better compromise between the classification quality and the fit to the data. This criterion, called ICL, is hence well suited to population clustering issues. But particular attention should be paid to the definition of the penalty: early works considered the entropy as part of the penalty. Unfortunately, no theoretical result could be demonstrated from this viewpoint, despite promising results in practical applications (Biernacki et al. (2006)). Baudry (2009) then proposed to redefine the ICL criterion by combining it with the $L_{cc}$ contrast. The penalty thus becomes identical to that of BIC, and the estimator ($ML_{cc}E$) used to express the "new" ICL criterion differs from the maximum likelihood estimator. In this regard, we have shown in the previous section the strong convergence of this estimator towards the theoretical parameter of the underlying distribution under particular regularity conditions. We now focus on the selection process from a finite collection of nested models $\mathcal{M}_g$, $g \in \{1, \ldots, G\}$.
3.1. Previous works on ICL criteria

The ICL criterion was defined on the same basis as the BIC criterion: Biernacki (2000) suggests selecting in the collection the model satisfying
$$\mathcal{M}_{ICL} = \arg\min_{\mathcal{M}_g \in \{\mathcal{M}_1, \ldots, \mathcal{M}_G\}} \left\{ -\max_{\psi_g \in \Psi_g} \ln L_c(\psi_g; \mathbf{Y}, \delta) + \frac{K_g}{2} \ln n \right\}.$$
In practice, one approximates $\arg\max_{\psi_g} L_c(\psi_g; \mathbf{Y}, \delta)$ by $\hat{\psi}_g^{MLE}$ when $n$ gets large, which is clearly questionable since the contrast is different from the classical likelihood. Besides, the label vector $\delta$ is not observed, so the Bayes rule is applied to the a posteriori probabilities to assign observations to each mixture component: the predicted label is denoted $\hat{\delta}^B$ and also depends on the MLE. This leads to the following procedure:
$$\mathcal{M}_{ICL_a} = \arg\min_{\mathcal{M}_g \in \{\mathcal{M}_1, \ldots, \mathcal{M}_G\}} \left\{ -\ln L_c(\hat{\psi}_g^{MLE}; \mathbf{Y}, \hat{\delta}^B) + \frac{K_g}{2} \ln n \right\}$$
$$= \arg\min_{\mathcal{M}_g \in \{\mathcal{M}_1, \ldots, \mathcal{M}_G\}} \left\{ -\ln L(\hat{\psi}_g^{MLE}; \mathbf{Y}) - \sum_{j=1}^{n} \sum_{i=1}^{n_g} \hat{\delta}_{ij}^B \ln \tau_i(Y_j; \hat{\psi}_g^{MLE}) + \frac{K_g}{2} \ln n \right\}.$$
McLachlan and Peel (2000) suggest using the a posteriori probabilities $\tau_i(y; \hat{\psi}_g^{MLE})$ instead of $\hat{\delta}^B$:
$$\mathcal{M}_{ICL_b} = \arg\min_{\mathcal{M}_g \in \{\mathcal{M}_1, \ldots, \mathcal{M}_G\}} \left\{ -\ln L_c(\hat{\psi}_g^{MLE}; \mathbf{Y}, \tau(\hat{\psi}_g^{MLE})) + \frac{K_g}{2} \ln n \right\}$$
$$= \arg\min_{\mathcal{M}_g \in \{\mathcal{M}_1, \ldots, \mathcal{M}_G\}} \left\{ -\ln L(\hat{\psi}_g^{MLE}; \mathbf{Y}) + \underbrace{\mathrm{Ent}(\hat{\psi}_g^{MLE}) + \frac{K_g}{2} \ln n}_{\mathrm{pen}_{ICL_b}(K_g)} \right\}.$$
In practice, $ICL_a$ and $ICL_b$ really differ only when the a posteriori probabilities $\tau_i(Y_j; \hat{\psi}_g^{MLE})$ are not close to 0 or 1, i.e. when the assignment of observations is uncertain. Some basic algebra (using $\sum_i \tau_i \ln \tau_i \leq \ln \max_i \tau_i$) shows that $ICL_b \geq ICL_a$: this means that $ICL_b$ penalizes a model whose allocation of observations is uncertain to a greater extent than $ICL_a$ does. Biernacki (2000) and McLachlan and Peel (2000) have shown, through various simulated and real-life examples, that the ICL criterion is more robust than the BIC criterion when the model is misspecified (which is often the case in reality). Granted, BIC and ICL have similar behaviors when the mixture components are distinctly separated; but ICL severely penalizes the likelihood in the reverse case, while still taking its complexity into account. However, there is no clear relationship between maximum likelihood theory and the entropy. In addition, the criterion defined in this way is not fully satisfactory from a theoretical viewpoint. Indeed, its properties have not been proved yet: for instance, it is not consistent in the sense that BIC is, because its penalty does not satisfy Nishii's conditions (Nishii (1988)). In particular, it is not negligible in front of $n$:
$$\frac{1}{n} \mathrm{Ent}(\psi_g; \mathbf{Y}) \xrightarrow[n \to \infty]{P} E_{f_0}[\mathrm{Ent}(\psi_g; Y)] > 0.$$
It follows that $\mathrm{Ent}(\psi_g; \mathbf{Y}) = O(n)$. Until very recently, there was therefore a clear gap between the practical interest aroused by ICL and its theoretical justification. This gap was partly filled by Baudry (2009), who introduced a new version of ICL integrating a "BIC-type penalty":
$$\mathcal{M}_{ICL^*} = \arg\min_{\mathcal{M}_g \in \{\mathcal{M}_1, \ldots, \mathcal{M}_G\}} \left\{ -\ln L_{cc}(\hat{\psi}_g^{ML_{cc}E}) + \frac{K_g}{2} \ln n \right\}.$$
In the context of Gaussian mixtures, Baudry (2009) has rigorously shown that the number of components selected using this criterion converges weakly towards the theoretical one, but only in the bounded case.
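For intuition, the selection rule can be sketched numerically. The code below is our own illustration, not from the paper: it fits univariate Gaussian mixtures of orders $g = 1, \ldots, 4$ by a plain EM algorithm and computes an ICL*-type score with the BIC-type penalty $\frac{K_g}{2} \ln n$. Note that using the EM (maximum likelihood) fit in place of the $ML_{cc}E$ is only an approximation, of the kind the text flags for $ICL_a$ and $ICL_b$; the data-generating design and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
# two well-separated components
y = np.concatenate([rng.normal(-3, 1, n // 2), rng.normal(3, 1, n // 2)])

def em_fit(y, g, n_iter=300):
    """Plain EM for a g-component univariate Gaussian mixture."""
    pi = np.full(g, 1.0 / g)
    mu = np.quantile(y, (np.arange(g) + 0.5) / g)    # spread-out initialization
    s2 = np.full(g, y.var())
    for _ in range(n_iter):
        dens = pi * np.exp(-(y[:, None] - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
        tau = dens / dens.sum(axis=1, keepdims=True)                      # E-step
        nk = tau.sum(axis=0)                                              # M-step
        pi = nk / len(y)
        mu = (tau * y[:, None]).sum(axis=0) / nk
        s2 = np.maximum((tau * (y[:, None] - mu) ** 2).sum(axis=0) / nk, 1e-3)
    dens = pi * np.exp(-(y[:, None] - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    tau = dens / dens.sum(axis=1, keepdims=True)
    return np.log(dens.sum(axis=1)).sum(), tau

def icl_star_score(y, g):
    lnL, tau = em_fit(y, g)
    ent = -(tau * np.log(np.clip(tau, 1e-300, None))).sum()   # entropy term
    pen = 0.5 * (3 * g - 1) * np.log(len(y))   # K_g = (g-1) weights + g means + g variances
    return (lnL - ent) - pen                   # ln Lcc minus BIC-type penalty (maximize)

icl = {g: icl_star_score(y, g) for g in range(1, 5)}
g_hat = max(icl, key=icl.get)
print(g_hat)   # the two well-separated components should be recovered
```

Overfitting orders are doubly discouraged here: they pay both the dimension penalty and the entropy incurred by splitting a well-defined cluster.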
3.2. Consistency of selection criteria

Still in the mixture modelling framework, let $\mathcal{M}_{g^*}$ denote the model with smallest dimension $K_{g^*}$ such that
$$E[\ln L_{cc}(\psi_{g^*}^b)] = \max_{g=1,\ldots,G} E[\ln L_{cc}(\psi_g^b)].$$
The following theorem provides consistency properties of a class of penalized estimators. Related results can be found in Baudry (2009).
Theorem 2. Consider a collection of models $(\mathcal{M}_1, \ldots, \mathcal{M}_G)$ satisfying the assumptions of Theorem 1. Consider a penalty function $\mathrm{pen}(\mathcal{M}_g) = K_g u_n$, and
$$\hat{g} = \arg\max_{g=1,\ldots,G} \left( \frac{1}{n} \sum_{j=1}^{n} \ln L_{cc}(\hat{\psi}_g; Y_j) - \mathrm{pen}(\mathcal{M}_g) \right).$$
Then, if $m > 2$ in Assumption 2 and if $n u_n \to \infty$, we get, for all $g \neq g^*$, $P(\hat{g} = g) = o(1)$. If $m > 4$ in Assumption 2, there exists some constant $C$ such that, if $n u_n > C \ln n$, almost surely, $\hat{g} \neq g$ for $n$ large enough and for all $g \neq g^*$.

Proof. Let
$$\varepsilon_g = E[\ln L_{cc}(\psi_{g^*}^b; Y)] - E[\ln L_{cc}(\psi_g^b; Y)].$$
Decompose
$$\frac{1}{n} \sum_{j=1}^{n} \ln L_{cc}(\hat{\psi}_{g^*}; Y_j) - \ln L_{cc}(\hat{\psi}_g; Y_j) = \varepsilon_g + \frac{1}{n} \sum_{j=1}^{n} \left\{ \ln L_{cc}(\psi_{g^*}^b; Y_j) - E[\ln L_{cc}(\psi_{g^*}^b; Y)] \right\}$$
$$- \frac{1}{n} \sum_{j=1}^{n} \left\{ \ln L_{cc}(\psi_g^b; Y_j) - E[\ln L_{cc}(\psi_g^b; Y)] \right\}$$
$$+ \frac{1}{n} \sum_{j=1}^{n} \left\{ \ln L_{cc}(\hat{\psi}_{g^*}; Y_j) - \ln L_{cc}(\psi_{g^*}^b; Y_j) + d(\hat{\psi}_{g^*}, \psi_{g^*}^b) \right\}$$
$$- \frac{1}{n} \sum_{j=1}^{n} \left\{ \ln L_{cc}(\hat{\psi}_g; Y_j) - \ln L_{cc}(\psi_g^b; Y_j) + d(\hat{\psi}_g, \psi_g^b) \right\}. \quad (10)$$
It follows from the remark following Corollary 1 that the last two terms in (10) are $O_{a.s.}([\ln n] n^{-1})$ (or $O_P(n^{-1})$). We now distinguish two cases: $\varepsilon_g > 0$ and $\varepsilon_g = 0$.
Step 1: $\varepsilon_g > 0$.

It follows from the law of the iterated logarithm that
$$\frac{1}{n} \sum_{j=1}^{n} \left\{ \ln L_{cc}(\psi_{g^*}^b; Y_j) - E[\ln L_{cc}(\psi_{g^*}^b; Y)] \right\} + \frac{1}{n} \sum_{j=1}^{n} \left\{ \ln L_{cc}(\psi_g^b; Y_j) - E[\ln L_{cc}(\psi_g^b; Y)] \right\} = O_{a.s.}\left([\ln \ln n]^{1/2} n^{-1/2}\right).$$
Note that these two terms are $O_P(n^{-1/2})$ if we only focus on $O_P$-rates. If $\hat{g} = g$, we have
$$\frac{1}{n} \sum_{j=1}^{n} \left\{ \ln L_{cc}(\hat{\psi}_{g^*}; Y_j) - \ln L_{cc}(\hat{\psi}_g; Y_j) \right\} - \mathrm{pen}(g^*) + \mathrm{pen}(g) < 0. \quad (11)$$
However, due to the previous remarks, if we take $u_n = o(1)$, the left-hand side in (11) converges almost surely towards $\varepsilon_g$ (in probability rates, it is equal to $\varepsilon_g + o_P(1)$). This ensures that $\mathcal{M}_g$ is almost surely not selected for $n$ large enough (in probability rates, $P(\hat{g} = g) = o(1)$).
Step 2: $\varepsilon_g = 0$.

Since $\psi_g^b = \psi_{g^*}^b$,
$$\frac{1}{n} \sum_{j=1}^{n} \left\{ \ln L_{cc}(\psi_{g^*}^b; Y_j) - E[\ln L_{cc}(\psi_{g^*}^b; Y)] \right\} = \frac{1}{n} \sum_{j=1}^{n} \left\{ \ln L_{cc}(\psi_g^b; Y_j) - E[\ln L_{cc}(\psi_g^b; Y)] \right\},$$
and $\varepsilon_g = 0$, which shows that the first three terms in (10) cancel. This leads to
$$\frac{1}{n} \sum_{j=1}^{n} \left\{ \ln L_{cc}(\hat{\psi}_{g^*}; Y_j) - \ln L_{cc}(\hat{\psi}_g; Y_j) \right\} - \mathrm{pen}(g^*) + \mathrm{pen}(g) \geq u_n + O_{a.s.}\left(\frac{\ln n}{n}\right) + O_P\left(\frac{1}{n}\right)$$