HAL Id: hal-00796198
https://hal.archives-ouvertes.fr/hal-00796198
Preprint submitted on 1 Mar 2013
Semiparametric estimation of a mixture of two linear regressions in which one component is known
Laurent Bordes, Ivan Kojadinovic, Pierre Vandekerkhove
To cite this version:
Laurent Bordes, Ivan Kojadinovic, Pierre Vandekerkhove. Semiparametric estimation of a mixture of two linear regressions in which one component is known. 2013. ⟨hal-00796198⟩
L. Bordes$^a$, I. Kojadinovic$^a$ and P. Vandekerkhove$^b$
$^a$Laboratoire de mathématiques et applications, UMR CNRS 5142
Université de Pau et des Pays de l'Adour, B.P. 1155, 64013 Pau Cedex, France
{laurent.bordes,ivan.kojadinovic}@univ-pau.fr
$^b$Laboratoire d'analyse et de mathématiques appliquées
UMR CNRS 8050, Université Paris-Est, Marne-la-Vallée, France, and Georgia Institute of Technology, UMI Georgia Tech CNRS 2958
G.W. Woodruff School of Mechanical Engineering, Atlanta, USA. pierre.vandekerkhove@univ-mlv.fr
January 15, 2013
Abstract
A new estimation method for the two-component mixture model introduced in Vandekerkhove (2012) is proposed. This model, which consists of a two-component mixture of linear regressions in which one component is entirely known while the proportion, the slope, the intercept and the error distribution of the other component are unknown, seems to be of interest for the analysis of large datasets produced from two-color ChIP-chip high-density microarrays. In spite of good performance for datasets of reasonable size, the method proposed in Vandekerkhove (2012) suffers from a serious drawback when the sample size becomes large, as it is based on the optimization of a contrast function whose pointwise computation requires $O(n^2)$ operations. The range of applicability of the method derived in this work is substantially larger as it is based on a method-of-moments estimator whose computation only requires $O(n)$ operations. From a theoretical perspective, the asymptotic normality of both the estimator of the Euclidean parameter vector and the semiparametric estimator of the c.d.f. of the error is proved under weak conditions not involving the zero-symmetry assumption typically used during the last decade. The finite-sample performance of these estimators is studied under various scenarios through Monte Carlo experiments. From a more practical perspective, the proposed method is applied to the tone data analyzed, among others, by Hunter and Young (2012), and to the ChIPmix data studied by Martin-Magniette et al. (2008). An extension of the considered model involving an unknown scale parameter for the first component is discussed in the final section.
1 Introduction
Practitioners are frequently interested in modeling the relationship between a random response variable $Y$ and a $d$-dimensional random explanatory vector $X$ by means of a linear regression model estimated from a random sample $(X_i, Y_i)_{1 \le i \le n}$ of $(X, Y)$. Quite often, the homogeneity assumption claiming that the linear regression coefficients are the same for all the observations $(X_1, Y_1), \ldots, (X_n, Y_n)$ is inadequate. To allow different parameters for different groups of observations, a Finite Mixture of Regressions (FMR) can be considered; see Leisch (2004) and Grün and Leisch (2006) for a nice overview.
Statistical inference for the fully parametric FMR model was first considered by Quandt and Ramsey (1978), who proposed an estimation method based on the moment generating function. An EM estimating approach was proposed by De Veaux (1989) in the case of two components. Variations of the latter approach were also considered in Jones and McLachlan (1992) and Turner (2000). Hawkins et al. (2001) studied the problem of determining the number of components in the parametric FMR model using methods derived from the likelihood equation. In Hurn et al. (2003), the authors proposed a Bayesian approach to estimate the regression coefficients and also considered an extension of the model in which the number of components is unspecified. Zhu and Zhang (2004) established the asymptotic theory for maximum likelihood estimators in parametric FMR models. More recently, Städler et al. (2010) proposed an $\ell_1$-penalized method based on a Lasso-type estimator for a high-dimensional FMR model with $d \gg n$.
As an alternative to parametric approaches to the estimation of a FMR model, some authors suggested the use of more flexible semiparametric approaches. This research direction finds its origin in the work of Hall and Zhou (2003), in which $d$-variate semiparametric mixture models of random vectors with independent components were considered. These authors showed in particular that, for $d \ge 3$, it is possible to identify a two-component model without parametrizing the distributions of the component random vectors. To the best of our knowledge, Leung and Qin (2006) were the first to estimate a FMR model semiparametrically. In the two-component case, they studied the situation in which the components are related by Anderson (1979)'s exponential tilt model. Hunter and Young (2012) studied the identifiability of an $m$-component semiparametric FMR model and numerically investigated an EM algorithm for estimating its parameters. Vandekerkhove (2012) proposed an M-estimation method for a two-component semiparametric mixture of regressions with symmetric errors in which one component is known. The latter approach was applied to data extracted from a high-density microarray and modeled in Martin-Magniette et al. (2008) by
means of a parametric FMR. The semiparametric approach of Vandekerkhove (2012) is of interest for two main reasons. Due to its semiparametric nature, the method makes it possible to detect complex structures in the error of the unknown regression component. It can additionally be regarded as a tool to assess the relevance of the usual EM-type Euclidean parameter estimation. Its main drawbacks, however, are that it is not theoretically valid when the errors are not symmetric and that its use is very computationally expensive for large datasets, as it requires the optimization of a contrast function whose pointwise evaluation requires $O(n^2)$ operations.
The object of interest of this paper is the two-component FMR model studied by Vandekerkhove (2012) in which one component is entirely known while the proportion, the slope, the intercept and the error distribution of the other component are unknown. The estimation of the Euclidean parameter vector is achieved through a method of moments. Semiparametric estimators of the c.d.f. and the p.d.f. of the error of the unknown component are proposed. The proof of the asymptotic normality of the Euclidean and functional estimators is not based on zero-symmetry-like assumptions frequently found in the literature but only involves finite moments of order eight for the explanatory variable and the boundedness of the p.d.f.s of the errors and their derivatives. The almost sure uniform consistency of the estimator of the p.d.f. of the unknown error is obtained under similar conditions. A consequence of these theoretical results is that, unlike for EM-type approaches, the estimation uncertainty can be assessed through large-sample standard errors for the Euclidean parameters and by means of an approximate confidence band for the c.d.f. of the unknown error. The latter is computed using an unconditional weighted bootstrap whose asymptotic validity is proved.
From a practical perspective, it is worth mentioning that the range of applicability of the resulting semiparametric estimation procedure is substantially larger than the one of Vandekerkhove (2012) as its computation only requires $O(n)$ operations. As a consequence, very large datasets can be easily processed. For instance, as shall be seen in Section 6, the estimation of the parameters of the model from the ChIPmix data considered in Martin-Magniette et al. (2008), consisting of $n = 176{,}343$ observations, took less than 30 seconds on one 2.4 GHz processor. The estimation of the same model from a subset of $n = 30{,}000$ observations using the method of Vandekerkhove (2012) took more than two days on a similar processor.
The paper is organized as follows. Section 2 is devoted to a detailed description of the model, while Section 3 is concerned with its identifiability through the moment method. The estimators of the Euclidean parameter vector and of the functional parameter are described in detail in Section 4. The finite-sample performance of the proposed estimation method is studied for various scenarios through Monte Carlo experiments in Section 5. In Section 6, the proposed method is applied to the tone data analyzed, among others, by Hunter and Young (2012), and to the ChIPmix data considered in Martin-Magniette et al. (2008). An extension of the FMR model under consideration involving an unknown scale parameter for the first component is discussed in the final section.
2 Problem and notation
Let $Z$ be a Bernoulli random variable with unknown parameter $\pi_0 \in [0,1]$, let $X$ be an $\mathcal{X}$-valued random variable with $\mathcal{X} \subset \mathbb{R}$, and let $\varepsilon^*, \varepsilon^{**}$ be two absolutely continuous centered real-valued random variables with finite variances and independent of $X$. Assume additionally that $Z$ is independent of $X$, $\varepsilon^*$ and $\varepsilon^{**}$. Furthermore, for fixed $\alpha_0^*, \beta_0^*, \alpha_0^{**}, \beta_0^{**} \in \mathbb{R}$, let $\tilde{Y}$ be the random variable defined by
$$\tilde{Y} = (1 - Z)(\alpha_0^* + \beta_0^* X + \varepsilon^*) + Z(\alpha_0^{**} + \beta_0^{**} X + \varepsilon^{**}),$$
i.e.,
$$\tilde{Y} = \begin{cases} \alpha_0^* + \beta_0^* X + \varepsilon^* & \text{if } Z = 0, \\ \alpha_0^{**} + \beta_0^{**} X + \varepsilon^{**} & \text{if } Z = 1. \end{cases}$$
The above display is the equation of a mixture of two linear regressions with $Z$ as the mixing variable.
Let $F^*$ and $F^{**}$ denote the c.d.f.s of $\varepsilon^*$ and $\varepsilon^{**}$, respectively. Furthermore, $\alpha_0^*$, $\beta_0^*$ and $F^*$ are assumed known while $\alpha_0^{**}$, $\beta_0^{**}$, $\pi_0$ and $F^{**}$ are assumed unknown. The aim of this work is to propose and study an estimator of $(\alpha_0^{**}, \beta_0^{**}, \pi_0, F^{**})$ based on $n$ i.i.d. copies $(X_i, \tilde{Y}_i)_{1 \le i \le n}$ of $(X, \tilde{Y})$. Now, define $Y = \tilde{Y} - \alpha_0^* - \beta_0^* X$, $\alpha_0 = \alpha_0^{**} - \alpha_0^*$ and $\beta_0 = \beta_0^{**} - \beta_0^*$, and notice that
$$Y = \begin{cases} \varepsilon^* & \text{if } Z = 0, \\ \alpha_0 + \beta_0 X + \varepsilon & \text{if } Z = 1, \end{cases} \qquad (1)$$
where, to simplify the notation, $\varepsilon = \varepsilon^{**}$ and $F = F^{**}$. It follows that the previous estimation problem is equivalent to the problem of estimating $(\alpha_0, \beta_0, \pi_0, F)$ from the observation of $n$ i.i.d. copies $(X_i, Y_i)_{1 \le i \le n}$ of $(X, Y)$.
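To make the data-generating mechanism of model (1) concrete, it can be simulated in a few lines. The sketch below is purely illustrative: the distributions of $X$, $\varepsilon^*$ and $\varepsilon$ and the parameter values are our own assumptions, not those used later in the paper.

```python
import numpy as np

# Illustrative simulation of model (1); all parameter values and
# distributions below are arbitrary assumptions for this sketch.
rng = np.random.default_rng(0)

n = 5000
pi0, alpha0, beta0 = 0.7, 1.0, 2.0  # mixing weight, intercept, slope

X = rng.normal(0.0, 1.0, size=n)         # explanatory variable
Z = rng.binomial(1, pi0, size=n)         # latent component labels
eps_star = rng.normal(0.0, 1.0, size=n)  # error of the known component
eps = rng.normal(0.0, 0.5, size=n)       # error of the unknown component

# Y = eps* if Z = 0, and Y = alpha0 + beta0 * X + eps if Z = 1.
Y = (1 - Z) * eps_star + Z * (alpha0 + beta0 * X + eps)
```

Note that, by (5) below, a plain least-squares fit of $Y$ on $X$ estimates the products $\pi_0\alpha_0$ and $\pi_0\beta_0$ rather than the component parameters themselves, which is precisely why the moment system of Section 3 is needed.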
As we continue, the unknown c.d.f.s of $X$ and $Y$ will be denoted by $F_X$ and $F_Y$, respectively. Also, for any $x \in \mathcal{X}$, the conditional c.d.f. of $Y$ given $X = x$ will be denoted by $F_{Y|X}(\cdot|x)$, and we have
$$F_{Y|X}(y|x) = (1 - \pi_0)F^*(y) + \pi_0 F(y - \alpha_0 - \beta_0 x), \quad y \in \mathbb{R}. \qquad (2)$$
It follows that, for any $x \in \mathcal{X}$, $f_{Y|X}(\cdot|x)$, the conditional p.d.f. of $Y$ given $X = x$, can be expressed as
$$f_{Y|X}(y|x) = (1 - \pi_0)f^*(y) + \pi_0 f(y - \alpha_0 - \beta_0 x), \quad y \in \mathbb{R}, \qquad (3)$$
where $f^*$ and $f$ are the p.d.f.s of $\varepsilon^*$ and $\varepsilon$, assuming that they exist on $\mathbb{R}$.
Note that, as shall be discussed in Section 7, it is possible to consider a slightly more general version of this model involving an unknown scale parameter for the first component.
This more elaborate model remains identifiable and estimation through the moment method is theoretically possible. From a practical perspective, however, estimation of this scale parameter through the moment method seems quite unstable, so much so that an alternative estimation method appears to be required.
3 Identifiability through the moment method
Since (1) is clearly equivalent to
$$Y = (1 - Z)\varepsilon^* + Z(\alpha_0 + \beta_0 X + \varepsilon), \qquad (4)$$
we immediately obtain that
$$E(Y|X) = \pi_0\alpha_0 + \pi_0\beta_0 X \quad \text{a.s.} \qquad (5)$$
It follows that the coefficients $\gamma_{0,1} = \pi_0\alpha_0$ and $\gamma_{0,2} = \pi_0\beta_0$ can be identified from (5) if $\mathcal{X}$ is not reduced to a singleton. In addition, we have
$$\begin{aligned}
E(Y^2|X) &= E[\{(1 - Z)\varepsilon^* + Z(\alpha_0 + \beta_0 X + \varepsilon)\}^2 \mid X] \quad \text{a.s.} \\
&= E(1 - Z)E\{(\varepsilon^*)^2\} + E(Z)E\{(\alpha_0 + \beta_0 X)^2 + \varepsilon^2 \mid X\} \quad \text{a.s.} \\
&= (1 - \pi_0)(\sigma_0^*)^2 + \pi_0\{\alpha_0^2 + 2\alpha_0\beta_0 X + \beta_0^2 X^2 + \sigma_0^2\} \quad \text{a.s.} \\
&= (1 - \pi_0)(\sigma_0^*)^2 + \pi_0(\alpha_0^2 + \sigma_0^2) + 2\pi_0\alpha_0\beta_0 X + \pi_0\beta_0^2 X^2 \quad \text{a.s.,}
\end{aligned} \qquad (6)$$
where $\sigma_0^*$ and $\sigma_0$ are the standard deviations of $\varepsilon^*$ and $\varepsilon$, respectively. If $\mathcal{X}$ contains three points $x_1, x_2, x_3$ such that the vectors $\{(1, x_1, x_1^2), (1, x_2, x_2^2), (1, x_3, x_3^2)\}$ are linearly independent then, from (6), we can identify the coefficients $\gamma_{0,3} = (1 - \pi_0)(\sigma_0^*)^2 + \pi_0(\alpha_0^2 + \sigma_0^2)$, $\gamma_{0,4} = 2\pi_0\alpha_0\beta_0$ and $\gamma_{0,5} = \pi_0\beta_0^2$. In other words, under the aforementioned conditions on $X$, we have
$$\begin{cases}
\gamma_{0,1} = \pi_0\alpha_0 \\
\gamma_{0,2} = \pi_0\beta_0 \\
\gamma_{0,3} = (1 - \pi_0)(\sigma_0^*)^2 + \pi_0(\alpha_0^2 + \sigma_0^2) \\
\gamma_{0,4} = 2\pi_0\alpha_0\beta_0 = 2\alpha_0\gamma_{0,2} \\
\gamma_{0,5} = \pi_0\beta_0^2 = \beta_0\gamma_{0,2}.
\end{cases} \qquad (7)$$
From the above system of equations, we see that $\alpha_0$, $\beta_0$ and $\pi_0$ can be identified provided $\pi_0\beta_0 \neq 0$, that is, provided the unknown component actually exists and its slope is nonzero.
The latter condition will be assumed to hold in the rest of the paper.
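As a quick sanity check on this identification argument, the first, second, fourth and fifth equations of (7) can be inverted in closed form: $\alpha_0 = \gamma_{0,4}/(2\gamma_{0,2})$, $\beta_0 = \gamma_{0,5}/\gamma_{0,2}$ and $\pi_0 = \gamma_{0,2}^2/\gamma_{0,5}$. The sketch below, with arbitrary illustrative parameter values, verifies this inversion numerically.

```python
# Sanity check of the identification argument based on (7).
# Parameter values are arbitrary; only pi0 * beta0 != 0 is required.
pi0, alpha0, beta0 = 0.7, 1.0, 2.0

# Forward map: identified coefficients (third equation of (7) left aside).
g1 = pi0 * alpha0              # gamma_{0,1}
g2 = pi0 * beta0               # gamma_{0,2}
g4 = 2 * pi0 * alpha0 * beta0  # gamma_{0,4} = 2 * alpha0 * gamma_{0,2}
g5 = pi0 * beta0 ** 2          # gamma_{0,5} = beta0 * gamma_{0,2}

# Inversion, valid whenever g2 = pi0 * beta0 is nonzero.
alpha_rec = g4 / (2 * g2)
beta_rec = g5 / g2
pi_rec = g2 ** 2 / g5
```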
Let us now consider the functional part $F$ of the model. For any $\eta = (\alpha, \beta) \in \mathbb{R}^2$, denote by $J(\cdot, \eta)$ the c.d.f. defined by
$$J(t, \eta) = \Pr(Y \le t + \alpha + \beta X), \quad t \in \mathbb{R}. \qquad (8)$$
For any $t \in \mathbb{R}$, this can be rewritten as
$$J(t, \eta) = \int_{\mathbb{R}} F_{Y|X}(t + \alpha + \beta x \mid x)\, dF_X(x) = (1 - \pi_0)\int_{\mathbb{R}} F^*(t + \alpha + \beta x)\, dF_X(x) + \pi_0 \int_{\mathbb{R}} F\{t + (\alpha - \alpha_0) + (\beta - \beta_0)x\}\, dF_X(x).$$
For $\eta = \eta_0 = (\alpha_0, \beta_0)$, we then obtain
$$J(t, \eta_0) = (1 - \pi_0)\int_{\mathbb{R}} F^*(t + \alpha_0 + \beta_0 x)\, dF_X(x) + \pi_0 F(t), \quad t \in \mathbb{R}.$$
Now, for any $\eta \in \mathbb{R}^2$, let $K(\cdot, \eta)$ be defined by
$$K(t, \eta) = \int_{\mathbb{R}} F^*(t + \alpha + \beta x)\, dF_X(x), \quad t \in \mathbb{R}. \qquad (9)$$
It follows that $F$ is identified since
$$F(t) = \frac{1}{\pi_0}\{J(t, \eta_0) - (1 - \pi_0)K(t, \eta_0)\}, \quad t \in \mathbb{R}. \qquad (10)$$
The above equation is at the root of the derivation of an estimator for $F$.
4 Estimation
Let $P$ be the probability distribution of $(X, Y)$. For ease of exposition, we will frequently use the notation adopted in the theory of empirical processes in the sense of van der Vaart and Wellner (2000) or Kosorok (2008), for instance. Given a measurable function $f : \mathbb{R}^2 \to \mathbb{R}^k$, for some integer $k \ge 1$, $Pf$ will denote the integral $\int f\, dP$. Also, the empirical measure obtained from the random sample $(X_i, Y_i)_{1 \le i \le n}$ will be denoted by $\mathbb{P}_n = n^{-1}\sum_{i=1}^n \delta_{X_i, Y_i}$, where $\delta_{x,y}$ is the probability distribution that assigns a mass of 1 at $(x, y)$. The expectation of $f$ under the empirical measure is then $\mathbb{P}_n f = n^{-1}\sum_{i=1}^n f(X_i, Y_i)$, and the quantity $\mathbb{G}_n f = \sqrt{n}(\mathbb{P}_n f - Pf)$ is the empirical process evaluated at $f$. The arrow '$\rightsquigarrow$' will be used to denote weak convergence in the sense of Definition 1.3.3 of van der Vaart and Wellner (2000) and, for any set $S$, $\ell^\infty(S)$ will stand for the space of all bounded real-valued functions on $S$ equipped with the uniform metric. Key results and more details can be found, for instance, in van der Vaart (1998), van der Vaart and Wellner (2000) and Kosorok (2008).
4.1 Estimation of the Euclidean parameter vector
To estimate the Euclidean parameter vector $(\alpha_0, \beta_0, \pi_0) \in \mathbb{R} \times \mathbb{R} \setminus \{0\} \times (0, 1]$, we first need to estimate the vector $\gamma_0 = (\gamma_{0,1}, \ldots, \gamma_{0,5}) \in \mathbb{R}^5$ whose components were expressed in terms of $\alpha_0$, $\beta_0$ and $\pi_0$ in the previous section. From (5) and (6), it is natural to consider the regression function
$$d_n(\gamma) = \mathbb{P}_n \varphi_\gamma, \quad \gamma \in \mathbb{R}^5,$$
where, for any $\gamma \in \mathbb{R}^5$, $\varphi_\gamma : \mathbb{R}^2 \to \mathbb{R}$ is defined by
$$\varphi_\gamma(x, y) = (y - \gamma_1 - \gamma_2 x)^2 + (y^2 - \gamma_3 - \gamma_4 x - \gamma_5 x^2)^2, \quad x, y \in \mathbb{R}.$$
As an estimator of $\gamma_0$, we then naturally consider $\gamma_n = \arg\min_\gamma d_n(\gamma)$, which satisfies
$$\dot{d}_n(\gamma_n) = \mathbb{P}_n \dot{\varphi}_{\gamma_n} = 0,$$
where $\dot{\varphi}_\gamma$, the gradient of $\varphi_\gamma$ with respect to $\gamma$, is given by
$$\dot{\varphi}_\gamma(x, y) = -2\begin{pmatrix} y - \gamma_1 - \gamma_2 x \\ x(y - \gamma_1 - \gamma_2 x) \\ y^2 - \gamma_3 - \gamma_4 x - \gamma_5 x^2 \\ x(y^2 - \gamma_3 - \gamma_4 x - \gamma_5 x^2) \\ x^2(y^2 - \gamma_3 - \gamma_4 x - \gamma_5 x^2) \end{pmatrix}, \quad x, y \in \mathbb{R}.$$
Now, for any integers $p, q \ge 1$, define
$$\overline{X^p Y^q} = \frac{1}{n}\sum_{i=1}^n X_i^p Y_i^q,$$
and let
$$\Gamma_n = 2\begin{pmatrix} 1 & \overline{X} & 0 & 0 & 0 \\ \overline{X} & \overline{X^2} & 0 & 0 & 0 \\ 0 & 0 & 1 & \overline{X} & \overline{X^2} \\ 0 & 0 & \overline{X} & \overline{X^2} & \overline{X^3} \\ 0 & 0 & \overline{X^2} & \overline{X^3} & \overline{X^4} \end{pmatrix} \quad \text{and} \quad \theta_n = 2\begin{pmatrix} \overline{Y} \\ \overline{XY} \\ \overline{Y^2} \\ \overline{XY^2} \\ \overline{X^2 Y^2} \end{pmatrix},$$
which respectively estimate
$$\Gamma_0 = 2\begin{pmatrix} 1 & E(X) & 0 & 0 & 0 \\ E(X) & E(X^2) & 0 & 0 & 0 \\ 0 & 0 & 1 & E(X) & E(X^2) \\ 0 & 0 & E(X) & E(X^2) & E(X^3) \\ 0 & 0 & E(X^2) & E(X^3) & E(X^4) \end{pmatrix} \quad \text{and} \quad \theta_0 = 2\begin{pmatrix} E(Y) \\ E(XY) \\ E(Y^2) \\ E(XY^2) \\ E(X^2 Y^2) \end{pmatrix}.$$
The linear equation $\mathbb{P}_n \dot{\varphi}_{\gamma_n} = 0$ can then equivalently be rewritten as $\Gamma_n \gamma_n = \theta_n$. Provided the matrices $\Gamma_n$ and $\Gamma_0$ are invertible, we can write $\gamma_n = \Gamma_n^{-1}\theta_n$ and $\gamma_0 = \Gamma_0^{-1}\theta_0$.
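A minimal sketch of this step (our own illustrative implementation, not the authors' code) assembles the empirical moments into $\Gamma_n$ and $\theta_n$ and solves the linear system; on data simulated from the model it recovers the coefficients of (7).

```python
import numpy as np

def gamma_hat(X, Y):
    """Moment estimator gamma_n = Gamma_n^{-1} theta_n of the five
    coefficients in (7). Illustrative sketch, O(n) in the sample size."""
    m = lambda p, q: np.mean(X ** p * Y ** q)  # empirical moment of X^p Y^q
    x1, x2, x3, x4 = m(1, 0), m(2, 0), m(3, 0), m(4, 0)
    Gamma = np.array([
        [1.0, x1, 0.0, 0.0, 0.0],
        [x1, x2, 0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, x1, x2],
        [0.0, 0.0, x1, x2, x3],
        [0.0, 0.0, x2, x3, x4],
    ])
    theta = np.array([m(0, 1), m(1, 1), m(0, 2), m(1, 2), m(2, 2)])
    return np.linalg.solve(Gamma, theta)  # the common factor 2 cancels

# Check on simulated data (arbitrary illustrative parameter values).
rng = np.random.default_rng(1)
n = 200_000
pi0, alpha0, beta0 = 0.7, 1.0, 2.0
X = rng.normal(size=n)
Z = rng.binomial(1, pi0, size=n)
Y = (1 - Z) * rng.normal(size=n) + Z * (alpha0 + beta0 * X + 0.5 * rng.normal(size=n))

# Estimates (pi0*alpha0, pi0*beta0, gamma_{0,3}, 2*pi0*alpha0*beta0, pi0*beta0**2).
g = gamma_hat(X, Y)
```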
To obtain an estimator of $(\alpha_0, \beta_0, \pi_0)$, we use the relationships induced by (5) and (6) and recalled in (7). Leaving the third equation aside because it involves the unknown standard deviation $\sigma_0$ of $\varepsilon$, we obtain three possible estimators of $\alpha_0$:
$$\alpha_n = \frac{\gamma_{n,1}\gamma_{n,5}}{\gamma_{n,2}^2}, \quad \alpha_n = \frac{\gamma_{n,4}}{2\gamma_{n,2}}, \quad \text{or} \quad \alpha_n = \frac{\gamma_{n,4}^2}{4\gamma_{n,1}\gamma_{n,5}},$$
three possible estimators of $\beta_0$:
$$\beta_n = \frac{\gamma_{n,5}}{\gamma_{n,2}}, \quad \beta_n = \frac{\gamma_{n,4}}{2\gamma_{n,1}}, \quad \text{or} \quad \beta_n = \frac{\gamma_{n,2}\gamma_{n,4}^2}{4\gamma_{n,5}\gamma_{n,1}^2},$$
and three possible estimators of $\pi_0$:
$$\pi_n = \frac{\gamma_{n,2}^2}{\gamma_{n,5}}, \quad \pi_n = \frac{2\gamma_{n,1}\gamma_{n,2}}{\gamma_{n,4}}, \quad \text{or} \quad \pi_n = \frac{4\gamma_{n,1}^2\gamma_{n,5}}{\gamma_{n,4}^2}.$$
There are therefore 27 possible estimators of (α0, β0, π0). Their asymptotics can be obtained under very reasonable conditions. Unfortunately, all 27 estimators turned out to behave quite poorly in small samples. This prompted us to look for alternative estimators within the “same class”.
We now describe an estimator of $(\alpha_0, \beta_0, \pi_0)$ that was obtained empirically and that behaves significantly better for small samples than the aforementioned ones. The new regression function under consideration is $d_n(\gamma) = \mathbb{P}_n \varphi_\gamma$, $\gamma \in \mathbb{R}^8$, where, for any $(x, y) \in \mathbb{R}^2$,
$$\varphi_\gamma(x, y) = (y - \gamma_1 - \gamma_2 x)^2 + (y^2 - \gamma_3 - \gamma_4 x^2)^2 + (x - \gamma_5)^2 + (x^2 - \gamma_6)^2 + (x^3 - \gamma_7)^2 + (x^4 - \gamma_8)^2.$$
Now, let
$$\Gamma_n = 2\begin{pmatrix} 1 & \overline{X} & 0 & 0 & 0 & 0 & 0 & 0 \\ \overline{X} & \overline{X^2} & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & \overline{X^2} & 0 & 0 & 0 & 0 \\ 0 & 0 & \overline{X^2} & \overline{X^4} & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix} \quad \text{and} \quad \theta_n = 2\begin{pmatrix} \overline{Y} \\ \overline{XY} \\ \overline{Y^2} \\ \overline{X^2 Y^2} \\ \overline{X} \\ \overline{X^2} \\ \overline{X^3} \\ \overline{X^4} \end{pmatrix},$$
which respectively estimate
$$\Gamma_0 = 2\begin{pmatrix} 1 & E(X) & 0 & 0 & 0 & 0 & 0 & 0 \\ E(X) & E(X^2) & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & E(X^2) & 0 & 0 & 0 & 0 \\ 0 & 0 & E(X^2) & E(X^4) & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix} \quad \text{and} \quad \theta_0 = 2\begin{pmatrix} E(Y) \\ E(XY) \\ E(Y^2) \\ E(X^2 Y^2) \\ E(X) \\ E(X^2) \\ E(X^3) \\ E(X^4) \end{pmatrix}.$$
Then, proceeding as previously, provided the matrices $\Gamma_n$ and $\Gamma_0$ are invertible, the estimator $\gamma_n = \arg\min_\gamma d_n(\gamma)$ of $\gamma_0 = \Gamma_0^{-1}\theta_0$ is given by $\gamma_n = \Gamma_n^{-1}\theta_n$. To obtain an estimator of $(\alpha_0, \beta_0, \pi_0)$, we have, from the second term of the regression function, that
$$\gamma_{0,4} = \frac{\mathrm{cov}(X^2, Y^2)}{V(X^2)} = \frac{\mathrm{cov}(X^2, Y^2)}{\gamma_{0,8} - \gamma_{0,6}^2},$$
where the second equality comes from the fact that $\gamma_{0,6} = E(X^2)$ and $\gamma_{0,8} = E(X^4)$. Now, using (4), we find
$$\mathrm{cov}(X^2, Y^2) = \pi_0\beta_0^2 V(X^2) + 2\pi_0\alpha_0\beta_0\,\mathrm{cov}(X^2, X),$$
which, combined with the fact that $\gamma_{0,1} = \pi_0\alpha_0$ and $\gamma_{0,2} = \pi_0\beta_0$, gives
$$\mathrm{cov}(X^2, Y^2) = \gamma_{0,2}\beta_0(\gamma_{0,8} - \gamma_{0,6}^2) + 2\gamma_{0,1}\beta_0(\gamma_{0,7} - \gamma_{0,5}\gamma_{0,6}).$$
This leads to the following estimator of $(\alpha_0, \beta_0, \pi_0)$:
$$\beta_n = g_\beta(\gamma_n) = \frac{\gamma_{n,4}}{\gamma_{n,2} + 2\gamma_{n,1}(\gamma_{n,7} - \gamma_{n,5}\gamma_{n,6})/(\gamma_{n,8} - \gamma_{n,6}^2)}, \quad \pi_n = g_\pi(\gamma_n) = \frac{\gamma_{n,2}}{\beta_n}, \quad \alpha_n = g_\alpha(\gamma_n) = \frac{\gamma_{n,1}}{\pi_n}.$$
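The maps $g_\beta$, $g_\pi$ and $g_\alpha$ are straightforward to implement. The sketch below is illustrative: it applies them to the population value $\gamma_0$ computed for arbitrary parameter values and a standard normal $X$ (so that $E(X) = E(X^3) = 0$, $E(X^2) = 1$, $E(X^4) = 3$ and $\mathrm{cov}(X^2, X) = 0$), and recovers $(\alpha_0, \beta_0, \pi_0)$ exactly.

```python
import numpy as np

def euclidean_params(g):
    """Map a gamma vector of length 8 to (alpha, beta, pi) via the
    closed-form functions g_alpha, g_beta, g_pi; the denominators
    below must be nonzero."""
    g1, g2, g3, g4, g5, g6, g7, g8 = g
    beta = g4 / (g2 + 2.0 * g1 * (g7 - g5 * g6) / (g8 - g6 ** 2))
    pi = g2 / beta
    alpha = g1 / pi
    return alpha, beta, pi

# Population gamma_0 for arbitrary illustrative parameters and X ~ N(0, 1).
pi0, alpha0, beta0 = 0.7, 1.0, 2.0
var_star, var_eps = 1.0, 0.25  # variances of eps* and eps

g4 = pi0 * beta0 ** 2  # cov(X^2, Y^2) / V(X^2), since cov(X^2, X) = 0 here
EY2 = (1 - pi0) * var_star + pi0 * (alpha0 ** 2 + var_eps) + pi0 * beta0 ** 2
g3 = EY2 - g4 * 1.0    # gamma_{0,3} = E(Y^2) - gamma_{0,4} E(X^2)
gamma0 = np.array([pi0 * alpha0, pi0 * beta0, g3, g4, 0.0, 1.0, 0.0, 3.0])

alpha_n, beta_n, pi_n = euclidean_params(gamma0)
```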
As we continue, the subsets of $\mathbb{R}^8$ on which the functions $g_\alpha$, $g_\beta$ and $g_\pi$ exist and are differentiable will be denoted by $\mathcal{D}_\alpha$, $\mathcal{D}_\beta$ and $\mathcal{D}_\pi$, respectively, and $\mathcal{D}_{\alpha,\beta,\pi}$ will stand for $\mathcal{D}_\alpha \cap \mathcal{D}_\beta \cap \mathcal{D}_\pi$.
To derive the asymptotic behavior of the estimator $(\alpha_n, \beta_n, \pi_n) = (g_\alpha(\gamma_n), g_\beta(\gamma_n), g_\pi(\gamma_n))$, we consider the following assumptions:

A1. (i) $X$ has a finite fourth-order moment; (ii) $X$ has a finite eighth-order moment.

A2. $V(X) > 0$ and $V(X^2) > 0$.

Clearly, Assumption A1 (ii) implies Assumption A1 (i), and Assumption A2 implies that the matrix $\Gamma_0$ is invertible.
The following result, proved in Appendix A, characterizes the asymptotic behavior of the estimator (αn, βn, πn).
Proposition 4.1. Assume that $\gamma_0 \in \mathcal{D}_{\alpha,\beta,\pi}$.

(i) Under Assumptions A1 (i) and A2, $(\alpha_n, \beta_n, \pi_n) \xrightarrow{a.s.} (\alpha_0, \beta_0, \pi_0)$.

(ii) Suppose that Assumptions A1 (ii) and A2 are satisfied and let $\Psi_\gamma$ be the $3 \times 8$ matrix defined by
$$\Psi_\gamma = \begin{pmatrix} \frac{\partial g_\alpha}{\partial \gamma_1} & \cdots & \frac{\partial g_\alpha}{\partial \gamma_8} \\ \frac{\partial g_\beta}{\partial \gamma_1} & \cdots & \frac{\partial g_\beta}{\partial \gamma_8} \\ \frac{\partial g_\pi}{\partial \gamma_1} & \cdots & \frac{\partial g_\pi}{\partial \gamma_8} \end{pmatrix}(\gamma), \quad \gamma \in \mathcal{D}_{\alpha,\beta,\pi}.$$
Then,
$$\sqrt{n}(\alpha_n - \alpha_0, \beta_n - \beta_0, \pi_n - \pi_0) = -\mathbb{G}_n(\Psi_{\gamma_0}\Gamma_0^{-1}\dot{\varphi}_{\gamma_0}) + o_P(1).$$
As a consequence, $\sqrt{n}(\alpha_n - \alpha_0, \beta_n - \beta_0, \pi_n - \pi_0)$ converges in distribution to a centered normal random vector with covariance matrix $\Sigma = \Psi_{\gamma_0}\Gamma_0^{-1}P(\dot{\varphi}_{\gamma_0}\dot{\varphi}_{\gamma_0}^\top)\Gamma_0^{-1}\Psi_{\gamma_0}^\top$, which can be consistently estimated by $\Sigma_n = \Psi_{\gamma_n}\Gamma_n^{-1}\mathbb{P}_n(\dot{\varphi}_{\gamma_n}\dot{\varphi}_{\gamma_n}^\top)\Gamma_n^{-1}\Psi_{\gamma_n}^\top$ in the sense that $\Sigma_n \xrightarrow{a.s.} \Sigma$.
An immediate consequence of the previous result is that large-sample standard errors of $\alpha_n$, $\beta_n$ and $\pi_n$ are given by the square roots of the diagonal elements of the matrix $n^{-1}\Sigma_n$. The finite-sample performance of these estimators is investigated in Section 5, and they are used in the illustrations of Section 6.
4.2 Estimation of the functional parameter
To estimate the unknown c.d.f. $F$ of $\varepsilon$, it is natural to start from (10). For a known $\eta = (\alpha, \beta) \in \mathbb{R}^2$, the term $J(\cdot, \eta)$ defined in (8) may be estimated by the empirical c.d.f. of the random sample $(Y_i - \alpha - \beta X_i)_{1 \le i \le n}$, i.e.,
$$J_n(t, \eta) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(Y_i - \alpha - \beta X_i \le t), \quad t \in \mathbb{R}.$$
Similarly, since $F^*$ (the c.d.f. of $\varepsilon^*$) is known, a natural estimator of the term $K(t, \eta)$ defined in (9) is given by the empirical mean of the random sample $\{F^*(t + \alpha + \beta X_i)\}_{1 \le i \le n}$, i.e.,
$$K_n(t, \eta) = \frac{1}{n}\sum_{i=1}^n F^*(t + \alpha + \beta X_i), \quad t \in \mathbb{R}.$$
To obtain estimators of $J(\cdot, \eta_0)$ and $K(\cdot, \eta_0)$, it is then natural to consider the plug-in estimators $J_n(\cdot, \eta_n)$ and $K_n(\cdot, \eta_n)$, respectively, based on the estimator $\eta_n = (\alpha_n, \beta_n) = (g_\alpha, g_\beta)(\gamma_n)$ of $\eta_0$ proposed in the previous subsection.

We shall therefore consider the following nonparametric estimator of $F$:
$$F_n(t) = \frac{1}{\pi_n}\{J_n(t, \eta_n) - (1 - \pi_n)K_n(t, \eta_n)\}, \quad t \in \mathbb{R}. \qquad (11)$$
Note that $F_n$ is not necessarily a c.d.f. as it is not necessarily increasing and can be smaller than zero or greater than one. In practice, we shall consider the partially corrected estimator $(F_n \vee 0) \wedge 1$, where $\vee$ and $\wedge$ denote the maximum and minimum, respectively.
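Estimator (11) also translates directly into code. The following sketch is an illustrative implementation under the assumption that the known error $\varepsilon^*$ is standard normal; for simplicity it is evaluated at the true parameter values rather than at $(\alpha_n, \beta_n, \pi_n)$.

```python
import math
import numpy as np

def F_n(t_grid, X, Y, alpha, beta, pi, F_star):
    """Plug-in estimator (11) of the unknown error c.d.f., clipped to
    [0, 1] as in the partially corrected estimator (F_n v 0) ^ 1."""
    t = np.asarray(t_grid)[:, None]
    resid = (Y - alpha - beta * X)[None, :]
    J = np.mean(resid <= t, axis=1)                             # J_n(t, eta)
    K = np.mean(F_star(t + alpha + beta * X[None, :]), axis=1)  # K_n(t, eta)
    return np.clip((J - (1.0 - pi) * K) / pi, 0.0, 1.0)

# Standard normal c.d.f., used here as the assumed known error c.d.f. F*.
Phi = np.vectorize(lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0))))

# Simulated data with arbitrary illustrative parameter values; the unknown
# error is N(0, 0.5^2), so its c.d.f. at 0 equals 0.5.
rng = np.random.default_rng(2)
n = 50_000
pi0, alpha0, beta0 = 0.7, 1.0, 2.0
X = rng.normal(size=n)
Z = rng.binomial(1, pi0, size=n)
Y = (1 - Z) * rng.normal(size=n) + Z * (alpha0 + beta0 * X + 0.5 * rng.normal(size=n))

grid = np.linspace(-2.0, 2.0, 9)
F_hat = F_n(grid, X, Y, alpha0, beta0, pi0, Phi)  # true parameters, for illustration
```

In practice, $\alpha_n$, $\beta_n$ and $\pi_n$ from Section 4.1 would replace the true values, at the cost of the additional plug-in error handled in the asymptotic analysis.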
To derive the asymptotic behavior of the previous estimator, we consider the following additional assumptions on the p.d.f.s $f^*$ and $f$ of $\varepsilon^*$ and $\varepsilon$, respectively:

A3. (i) $f^*$ and $f$ exist and are bounded on $\mathbb{R}$; (ii) $(f^*)'$ and $f'$ exist and are bounded on $\mathbb{R}$.