
HAL Id: hal-00796198

https://hal.archives-ouvertes.fr/hal-00796198

Preprint submitted on 1 Mar 2013

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Semiparametric estimation of a mixture of two linear regressions in which one component is known

Laurent Bordes, Ivan Kojadinovic, Pierre Vandekerkhove

To cite this version:

Laurent Bordes, Ivan Kojadinovic, Pierre Vandekerkhove. Semiparametric estimation of a mixture of two linear regressions in which one component is known. 2013. ⟨hal-00796198⟩

(2)

Semiparametric estimation of a mixture of two linear regressions in which one component is known

L. Bordes^a, I. Kojadinovic^a and P. Vandekerkhove^b

^a Laboratoire de mathématiques et applications, UMR CNRS 5142, Université de Pau et des Pays de l'Adour, B.P. 1155, 64013 Pau Cedex, France
{laurent.bordes,ivan.kojadinovic}@univ-pau.fr

^b Laboratoire d'analyse et de mathématiques appliquées, UMR CNRS 8050, Université Paris-Est, Marne-la-Vallée, France, and Georgia Institute of Technology, UMI Georgia Tech CNRS 2958, G.W. Woodruff School of Mechanical Engineering, Atlanta, USA
pierre.vandekerkhove@univ-mlv.fr

January 15, 2013

Abstract

A new estimation method for the two-component mixture model introduced in Vandekerkhove (2012) is proposed. This model, which consists of a two-component mixture of linear regressions in which one component is entirely known while the proportion, the slope, the intercept and the error distribution of the other component are unknown, appears to be of interest for the analysis of large datasets produced by two-color ChIP-chip high-density microarrays. In spite of its good performance for datasets of reasonable size, the method proposed in Vandekerkhove (2012) suffers from a serious drawback when the sample size becomes large, as it is based on the optimization of a contrast function whose pointwise computation requires $O(n^2)$ operations. The range of applicability of the method derived in this work is substantially larger, as it is based on a method-of-moments estimator whose computation only requires $O(n)$ operations. From a theoretical perspective, the asymptotic normality of both the estimator of the Euclidean parameter vector and the semiparametric estimator of the c.d.f. of the error is proved under weak conditions that do not involve the zero-symmetry assumption typically used during the last decade. The finite-sample performance of these estimators is studied under various scenarios through Monte Carlo experiments. From a more practical perspective, the proposed method is applied to the tone data analyzed, among others, by Hunter and Young (2012), and to the ChIPmix data studied by Martin-Magniette et al. (2008). An extension of the considered model involving an unknown scale parameter for the first component is discussed in the final section.

1 Introduction

Practitioners are frequently interested in modeling the relationship between a random response variable $Y$ and a $d$-dimensional random explanatory vector $X$ by means of a linear regression model estimated from a random sample $(X_i, Y_i)_{1\le i\le n}$ of $(X, Y)$. Quite often, the homogeneity assumption stating that the linear regression coefficients are the same for all the observations $(X_1, Y_1), \dots, (X_n, Y_n)$ is inadequate. To allow different parameters for different groups of observations, a Finite Mixture of Regressions (FMR) can be considered; see Leisch (2004) and Grün and Leisch (2006) for a nice overview.

Statistical inference for the fully parametric FMR model was first considered by Quandt and Ramsey (1978), who proposed an estimation method based on the moment generating function. An EM estimating approach was proposed by De Veaux (1989) in the case of two components. Variations of the latter approach were also considered in Jones and McLachlan (1992) and Turner (2000). Hawkins et al. (2001) studied the problem of determining the number of components in the parametric FMR model using methods derived from the likelihood equation. In Hurn et al. (2003), the authors proposed a Bayesian approach to estimate the regression coefficients and also considered an extension of the model in which the number of components is unspecified. Zhu and Zhang (2004) established the asymptotic theory for maximum likelihood estimators in parametric FMR models. More recently, Städler et al. (2010) proposed an $\ell_1$-penalized method based on a Lasso-type estimator for a high-dimensional FMR model with $d \gg n$.

As an alternative to parametric approaches to the estimation of a FMR model, some authors suggested the use of more flexible semiparametric approaches. This research direction finds its origin in the work of Hall and Zhou (2003), in which $d$-variate semiparametric mixture models of random vectors with independent components were considered. These authors showed in particular that, for $d \ge 3$, it is possible to identify a two-component model without parametrizing the distributions of the component random vectors. To the best of our knowledge, Leung and Qin (2006) were the first to estimate a FMR model semiparametrically. In the two-component case, they studied the situation in which the components are related by Anderson (1979)'s exponential tilt model. Hunter and Young (2012) studied the identifiability of an $m$-component semiparametric FMR model and numerically investigated an EM algorithm for estimating its parameters. Vandekerkhove (2012) proposed an M-estimation method for a two-component semiparametric mixture of regressions with symmetric errors in which one component is known. The latter approach was applied to data extracted from a high-density microarray and modeled in Martin-Magniette et al. (2008) by


means of a parametric FMR. The semiparametric approach of Vandekerkhove (2012) is of interest for two main reasons. Due to its semiparametric nature, the method makes it possible to detect complex structures in the error of the unknown regression component. It can additionally be regarded as a tool to assess the relevance of the usual EM-type Euclidean parameter estimation. Its main drawbacks, however, are that it is not theoretically valid when the errors are not symmetric, and that its use is very computationally expensive for large datasets, as it requires the optimization of a contrast function whose pointwise evaluation requires $O(n^2)$ operations.

The object of interest of this paper is the two-component FMR model studied by Vandekerkhove (2012) in which one component is entirely known while the proportion, the slope, the intercept and the error distribution of the other component are unknown. The estimation of the Euclidean parameter vector is achieved through a method of moments. Semiparametric estimators of the c.d.f. and the p.d.f. of the error of the unknown component are proposed. The proof of the asymptotic normality of the Euclidean and functional estimators is not based on zero-symmetry-like assumptions frequently found in the literature, but only involves finite moments of order eight for the explanatory variable and the boundedness of the p.d.f.s of the errors and their derivatives. The almost sure uniform consistency of the estimator of the p.d.f. of the unknown error is obtained under similar conditions. A consequence of these theoretical results is that, unlike for EM-type approaches, the estimation uncertainty can be assessed through large-sample standard errors for the Euclidean parameters and by means of an approximate confidence band for the c.d.f. of the unknown error. The latter is computed using an unconditional weighted bootstrap whose asymptotic validity is proved.

From a practical perspective, it is worth mentioning that the range of applicability of the resulting semiparametric estimation procedure is substantially larger than that of Vandekerkhove (2012), as its computation only requires $O(n)$ operations. As a consequence, very large datasets can be easily processed. For instance, as will be seen in Section 6, the estimation of the parameters of the model from the ChIPmix data considered in Martin-Magniette et al. (2008), consisting of n = 176,343 observations, took less than 30 seconds on one 2.4 GHz processor. The estimation of the same model from a subset of n = 30,000 observations using the method of Vandekerkhove (2012) took more than two days on a similar processor.

The paper is organized as follows. Section 2 is devoted to a detailed description of the model, while Section 3 is concerned with its identifiability through the moment method. The estimators of the Euclidean parameter vector and of the functional parameter are described in detail in Section 4. The finite-sample performance of the proposed estimation method is studied for various scenarios through Monte Carlo experiments in Section 5. In Section 6, the proposed method is applied to the tone data analyzed, among others, by Hunter and Young (2012), and to the ChIPmix data considered in Martin-Magniette et al. (2008). An extension of the FMR model under consideration involving an unknown scale parameter for the first component is discussed in the final section.


2 Problem and notation

Let $Z$ be a Bernoulli random variable with unknown parameter $\pi_0 \in [0,1]$, let $X$ be an $\mathcal{X}$-valued random variable with $\mathcal{X} \subset \mathbb{R}$, and let $\varepsilon^*$, $\varepsilon^{**}$ be two absolutely continuous centered real-valued random variables with finite variances and independent of $X$. Assume additionally that $Z$ is independent of $X$, $\varepsilon^*$ and $\varepsilon^{**}$. Furthermore, for fixed $\alpha_0^*, \beta_0^*, \alpha_0^{**}, \beta_0^{**} \in \mathbb{R}$, let $\tilde{Y}$ be the random variable defined by
\[
\tilde{Y} = (1-Z)(\alpha_0^* + \beta_0^* X + \varepsilon^*) + Z(\alpha_0^{**} + \beta_0^{**} X + \varepsilon^{**}),
\]
i.e.,
\[
\tilde{Y} =
\begin{cases}
\alpha_0^* + \beta_0^* X + \varepsilon^* & \text{if } Z = 0, \\
\alpha_0^{**} + \beta_0^{**} X + \varepsilon^{**} & \text{if } Z = 1.
\end{cases}
\]

The above display is the equation of a mixture of two linear regressions with Z as mixing variable.

Let $F^*$ and $F^{**}$ denote the c.d.f.s of $\varepsilon^*$ and $\varepsilon^{**}$, respectively. Furthermore, $\alpha_0^*$, $\beta_0^*$ and $F^*$ are assumed known while $\alpha_0^{**}$, $\beta_0^{**}$, $\pi_0$ and $F^{**}$ are assumed unknown. The aim of this work is to propose and study an estimator of $(\alpha_0^{**}, \beta_0^{**}, \pi_0, F^{**})$ based on $n$ i.i.d. copies $(X_i, \tilde{Y}_i)_{1\le i\le n}$ of $(X, \tilde{Y})$. Now, define $Y = \tilde{Y} - \alpha_0^* - \beta_0^* X$, $\alpha_0 = \alpha_0^{**} - \alpha_0^*$ and $\beta_0 = \beta_0^{**} - \beta_0^*$, and notice that
\[
Y =
\begin{cases}
\varepsilon^* & \text{if } Z = 0, \\
\alpha_0 + \beta_0 X + \varepsilon & \text{if } Z = 1,
\end{cases}
\tag{1}
\]
where, to simplify the notation, $\varepsilon = \varepsilon^{**}$ and $F = F^{**}$. It follows that the previous estimation problem is equivalent to the problem of estimating $(\alpha_0, \beta_0, \pi_0, F)$ from the observation of $n$ i.i.d. copies $(X_i, Y_i)_{1\le i\le n}$ of $(X, Y)$.
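For the Monte Carlo experiments of Section 5 it is convenient to be able to draw from model (1) directly. The following Python sketch does so, assuming Gaussian errors and a uniform design; both choices are illustrative rather than imposed by the model, and all names and parameter values are ours:

```python
import random

def simulate(n, pi0, alpha0, beta0, sd_star=1.0, sd=1.0, seed=42):
    """Draw n i.i.d. copies of (X, Y) from model (1):
    Y = eps*                     if Z = 0 (known component),
    Y = alpha0 + beta0*X + eps   if Z = 1 (unknown component)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)        # illustrative non-degenerate design
        z = rng.random() < pi0           # mixing variable Z ~ Bernoulli(pi0)
        if z:
            y = alpha0 + beta0 * x + rng.gauss(0.0, sd)   # unknown component
        else:
            y = rng.gauss(0.0, sd_star)                   # known component
        xs.append(x)
        ys.append(y)
    return xs, ys
```

With $\pi_0 = 0.7$, $\alpha_0 = 1$, $\beta_0 = 2$ and $X \sim U(0,1)$, the sample mean of $Y$ should approach $\pi_0\{\alpha_0 + \beta_0 E(X)\} = 1.4$, in line with the conditional-mean computation of Section 3.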

As we continue, the unknown c.d.f.s of $X$ and $Y$ will be denoted by $F_X$ and $F_Y$, respectively. Also, for any $x \in \mathcal{X}$, the conditional c.d.f. of $Y$ given $X = x$ will be denoted by $F_{Y|X}(\cdot|x)$, and we have
\[
F_{Y|X}(y|x) = (1-\pi_0)F^*(y) + \pi_0 F(y - \alpha_0 - \beta_0 x), \qquad y \in \mathbb{R}. \tag{2}
\]
It follows that, for any $x \in \mathcal{X}$, $f_{Y|X}(\cdot|x)$, the conditional p.d.f. of $Y$ given $X = x$, can be expressed as
\[
f_{Y|X}(y|x) = (1-\pi_0)f^*(y) + \pi_0 f(y - \alpha_0 - \beta_0 x), \qquad y \in \mathbb{R}, \tag{3}
\]
where $f^*$ and $f$ are the p.d.f.s of $\varepsilon^*$ and $\varepsilon$, assuming that they exist on $\mathbb{R}$.

Note that, as shall be discussed in Section 7, it is possible to consider a slightly more general version of this model involving an unknown scale parameter for the first component.

This more elaborate model remains identifiable, and estimation through the moment method is theoretically possible. From a practical perspective, however, estimation of this scale parameter through the moment method appears quite unstable, so much so that an alternative estimation method seems required.


3 Identifiability through the moment method

Since (1) is clearly equivalent to
\[
Y = (1-Z)\varepsilon^* + Z(\alpha_0 + \beta_0 X + \varepsilon), \tag{4}
\]
we immediately obtain that
\[
E(Y|X) = \pi_0\alpha_0 + \pi_0\beta_0 X \quad \text{a.s.} \tag{5}
\]
It follows that the coefficients $\gamma_{0,1} = \pi_0\alpha_0$ and $\gamma_{0,2} = \pi_0\beta_0$ can be identified from (5) if $\mathcal{X}$ is not reduced to a singleton. In addition, we have

\[
\begin{aligned}
E(Y^2|X) &= E[\{(1-Z)\varepsilon^* + Z(\alpha_0 + \beta_0 X + \varepsilon)\}^2 \mid X] \quad \text{a.s.} \\
&= E(1-Z)\,E\{(\varepsilon^*)^2\} + E(Z)\,E\{(\alpha_0 + \beta_0 X)^2 + \varepsilon^2 \mid X\} \quad \text{a.s.} \\
&= (1-\pi_0)(\sigma_0^*)^2 + \pi_0\{\alpha_0^2 + 2\alpha_0\beta_0 X + \beta_0^2 X^2 + \sigma_0^2\} \quad \text{a.s.} \\
&= (1-\pi_0)(\sigma_0^*)^2 + \pi_0(\alpha_0^2 + \sigma_0^2) + 2\pi_0\alpha_0\beta_0 X + \pi_0\beta_0^2 X^2 \quad \text{a.s.,}
\end{aligned}
\tag{6}
\]
where $\sigma_0^*$ and $\sigma_0$ are the standard deviations of $\varepsilon^*$ and $\varepsilon$, respectively. If $\mathcal{X}$ contains three points $x_1, x_2, x_3$ such that the vectors $\{(1, x_1, x_1^2), (1, x_2, x_2^2), (1, x_3, x_3^2)\}$ are linearly independent then, from (6), we can identify the coefficients $\gamma_{0,3} = (1-\pi_0)(\sigma_0^*)^2 + \pi_0(\alpha_0^2 + \sigma_0^2)$, $\gamma_{0,4} = 2\pi_0\alpha_0\beta_0$ and $\gamma_{0,5} = \pi_0\beta_0^2$. In other words, under the aforementioned conditions on $X$, we have

\[
\begin{cases}
\gamma_{0,1} = \pi_0\alpha_0 \\
\gamma_{0,2} = \pi_0\beta_0 \\
\gamma_{0,3} = (1-\pi_0)(\sigma_0^*)^2 + \pi_0(\alpha_0^2 + \sigma_0^2) \\
\gamma_{0,4} = 2\pi_0\alpha_0\beta_0 = 2\alpha_0\gamma_{0,2} \\
\gamma_{0,5} = \pi_0\beta_0^2 = \beta_0\gamma_{0,2}.
\end{cases}
\tag{7}
\]

From the above system of equations, we see that $\alpha_0$, $\beta_0$ and $\pi_0$ can be identified provided $\pi_0\beta_0 \neq 0$, that is, provided the unknown component actually exists and its slope is nonzero.

The latter condition will be assumed to hold in the rest of the paper.
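The inversion of system (7) can be checked numerically. The sketch below (the helper names are ours; the closed forms follow directly from (7)) maps $(\pi_0, \alpha_0, \beta_0, (\sigma_0^*)^2, \sigma_0^2)$ to $(\gamma_{0,1}, \dots, \gamma_{0,5})$ and back, illustrating that the Euclidean parameters are recovered exactly whenever $\pi_0\beta_0 \neq 0$:

```python
def gammas(pi0, alpha0, beta0, var_star, var):
    """Moment coefficients of system (7) as functions of the model parameters."""
    g1 = pi0 * alpha0
    g2 = pi0 * beta0
    g3 = (1 - pi0) * var_star + pi0 * (alpha0**2 + var)
    g4 = 2 * pi0 * alpha0 * beta0          # = 2 * alpha0 * g2
    g5 = pi0 * beta0**2                    # = beta0 * g2
    return g1, g2, g3, g4, g5

def invert(g1, g2, g3, g4, g5):
    """Recover (alpha0, beta0, pi0) from (7), assuming pi0 * beta0 != 0.
    The third equation is left aside: it involves the unknown variances."""
    alpha0 = g4 / (2 * g2)
    beta0 = g5 / g2
    pi0 = g2**2 / g5
    return alpha0, beta0, pi0
```

The round trip `invert(*gammas(...))` returns the starting values exactly, which is precisely the identifiability statement above.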

Let us now consider the functional part $F$ of the model. For any $\boldsymbol{\eta} = (\alpha, \beta) \in \mathbb{R}^2$, denote by $J(\cdot, \boldsymbol{\eta})$ the c.d.f. defined by
\[
J(t, \boldsymbol{\eta}) = \Pr(Y \le t + \alpha + \beta X), \qquad t \in \mathbb{R}. \tag{8}
\]
For any $t \in \mathbb{R}$, this can be rewritten as
\[
\begin{aligned}
J(t, \boldsymbol{\eta}) &= \int_{\mathbb{R}} F_{Y|X}(t + \alpha + \beta x \mid x)\,\mathrm{d}F_X(x) \\
&= (1-\pi_0)\int_{\mathbb{R}} F^*(t + \alpha + \beta x)\,\mathrm{d}F_X(x) + \pi_0\int_{\mathbb{R}} F\{t + (\alpha - \alpha_0) + (\beta - \beta_0)x\}\,\mathrm{d}F_X(x).
\end{aligned}
\]


For $\boldsymbol{\eta} = \boldsymbol{\eta}_0 = (\alpha_0, \beta_0)$, we then obtain
\[
J(t, \boldsymbol{\eta}_0) = (1-\pi_0)\int_{\mathbb{R}} F^*(t + \alpha_0 + \beta_0 x)\,\mathrm{d}F_X(x) + \pi_0 F(t), \qquad t \in \mathbb{R}.
\]
Now, for any $\boldsymbol{\eta} \in \mathbb{R}^2$, let $K(\cdot, \boldsymbol{\eta})$ be defined by
\[
K(t, \boldsymbol{\eta}) = \int_{\mathbb{R}} F^*(t + \alpha + \beta x)\,\mathrm{d}F_X(x), \qquad t \in \mathbb{R}. \tag{9}
\]
It follows that $F$ is identified since
\[
F(t) = \frac{1}{\pi_0}\left\{J(t, \boldsymbol{\eta}_0) - (1-\pi_0)K(t, \boldsymbol{\eta}_0)\right\}, \qquad t \in \mathbb{R}. \tag{10}
\]
The above equation is at the root of the derivation of an estimator for $F$.

4 Estimation

Let $P$ be the probability distribution of $(X, Y)$. For ease of exposition, we will frequently use the notation adopted in the theory of empirical processes in the sense of van der Vaart and Wellner (2000) or Kosorok (2008), for instance. Given a measurable function $f : \mathbb{R}^2 \to \mathbb{R}^k$, for some integer $k \ge 1$, $Pf$ will denote the integral $\int f\,\mathrm{d}P$. Also, the empirical measure obtained from the random sample $(X_i, Y_i)_{1\le i\le n}$ will be denoted by $\mathbb{P}_n = n^{-1}\sum_{i=1}^n \delta_{X_i, Y_i}$, where $\delta_{x,y}$ is the probability distribution that assigns a mass of 1 at $(x, y)$. The expectation of $f$ under the empirical measure is then $\mathbb{P}_n f = n^{-1}\sum_{i=1}^n f(X_i, Y_i)$, and the quantity $\mathbb{G}_n f = \sqrt{n}(\mathbb{P}_n f - Pf)$ is the empirical process evaluated at $f$. The arrow `$\leadsto$' will be used to denote weak convergence in the sense of Definition 1.3.3 of van der Vaart and Wellner (2000) and, for any set $S$, $\ell^\infty(S)$ will stand for the space of all bounded real-valued functions on $S$ equipped with the uniform metric. Key results and more details can be found, for instance, in van der Vaart (1998), van der Vaart and Wellner (2000) and Kosorok (2008).
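In code, this notation reduces to sample averages; the following throwaway helpers (the names are ours) mirror $\mathbb{P}_n f$ and $\mathbb{G}_n f$:

```python
import math

def P_n(f, sample):
    """Empirical expectation P_n f = n^{-1} * sum_i f(X_i, Y_i)."""
    return sum(f(x, y) for x, y in sample) / len(sample)

def G_n(f, sample, Pf):
    """Empirical process G_n f = sqrt(n) * (P_n f - P f); Pf must be supplied."""
    return math.sqrt(len(sample)) * (P_n(f, sample) - Pf)
```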

4.1 Estimation of the Euclidean parameter vector

To estimate the Euclidean parameter vector $(\alpha_0, \beta_0, \pi_0) \in \mathbb{R} \times \mathbb{R}\setminus\{0\} \times (0, 1]$, we first need to estimate the vector $\boldsymbol{\gamma}_0 = (\gamma_{0,1}, \dots, \gamma_{0,5}) \in \mathbb{R}^5$ whose components were expressed in terms of $\alpha_0$, $\beta_0$ and $\pi_0$ in the previous section. From (5) and (6), it is natural to consider the regression function
\[
d_n(\boldsymbol{\gamma}) = \mathbb{P}_n \varphi_{\boldsymbol{\gamma}}, \qquad \boldsymbol{\gamma} \in \mathbb{R}^5,
\]
where, for any $\boldsymbol{\gamma} \in \mathbb{R}^5$, $\varphi_{\boldsymbol{\gamma}} : \mathbb{R}^2 \to \mathbb{R}$ is defined by
\[
\varphi_{\boldsymbol{\gamma}}(x, y) = (y - \gamma_1 - \gamma_2 x)^2 + (y^2 - \gamma_3 - \gamma_4 x - \gamma_5 x^2)^2, \qquad x, y \in \mathbb{R}.
\]


As an estimator of $\boldsymbol{\gamma}_0$, we then naturally consider $\boldsymbol{\gamma}_n = \arg\min_{\boldsymbol{\gamma}} d_n(\boldsymbol{\gamma})$, which satisfies
\[
\dot{d}_n(\boldsymbol{\gamma}_n) = \mathbb{P}_n \dot{\varphi}_{\boldsymbol{\gamma}_n} = 0,
\]
where $\dot{\varphi}_{\boldsymbol{\gamma}}$, the gradient of $\varphi_{\boldsymbol{\gamma}}$ with respect to $\boldsymbol{\gamma}$, is given by
\[
\dot{\varphi}_{\boldsymbol{\gamma}}(x, y) = -2
\begin{pmatrix}
y - \gamma_1 - \gamma_2 x \\
x(y - \gamma_1 - \gamma_2 x) \\
y^2 - \gamma_3 - \gamma_4 x - \gamma_5 x^2 \\
x(y^2 - \gamma_3 - \gamma_4 x - \gamma_5 x^2) \\
x^2(y^2 - \gamma_3 - \gamma_4 x - \gamma_5 x^2)
\end{pmatrix}, \qquad x, y \in \mathbb{R}.
\]

Now, for any integers $p, q \ge 1$, define
\[
\overline{X^p Y^q} = \frac{1}{n}\sum_{i=1}^n X_i^p Y_i^q,
\]
and let
\[
\Gamma_n = 2
\begin{pmatrix}
1 & \overline{X} & 0 & 0 & 0 \\
\overline{X} & \overline{X^2} & 0 & 0 & 0 \\
0 & 0 & 1 & \overline{X} & \overline{X^2} \\
0 & 0 & \overline{X} & \overline{X^2} & \overline{X^3} \\
0 & 0 & \overline{X^2} & \overline{X^3} & \overline{X^4}
\end{pmatrix}
\quad \text{and} \quad
\boldsymbol{\theta}_n = 2
\begin{pmatrix}
\overline{Y} \\ \overline{XY} \\ \overline{Y^2} \\ \overline{XY^2} \\ \overline{X^2Y^2}
\end{pmatrix},
\]
which respectively estimate
\[
\Gamma_0 = 2
\begin{pmatrix}
1 & E(X) & 0 & 0 & 0 \\
E(X) & E(X^2) & 0 & 0 & 0 \\
0 & 0 & 1 & E(X) & E(X^2) \\
0 & 0 & E(X) & E(X^2) & E(X^3) \\
0 & 0 & E(X^2) & E(X^3) & E(X^4)
\end{pmatrix}
\quad \text{and} \quad
\boldsymbol{\theta}_0 = 2
\begin{pmatrix}
E(Y) \\ E(XY) \\ E(Y^2) \\ E(XY^2) \\ E(X^2Y^2)
\end{pmatrix}.
\]
The linear equation $\mathbb{P}_n \dot{\varphi}_{\boldsymbol{\gamma}_n} = 0$ can then equivalently be rewritten as $\Gamma_n \boldsymbol{\gamma}_n = \boldsymbol{\theta}_n$. Provided the matrices $\Gamma_n$ and $\Gamma_0$ are invertible, we can write $\boldsymbol{\gamma}_n = \Gamma_n^{-1}\boldsymbol{\theta}_n$ and $\boldsymbol{\gamma}_0 = \Gamma_0^{-1}\boldsymbol{\theta}_0$.
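Because $\Gamma_n$ is block diagonal (and the common factor 2 cancels), computing $\boldsymbol{\gamma}_n = \Gamma_n^{-1}\boldsymbol{\theta}_n$ only requires the empirical moments $\overline{X^pY^q}$, each an $O(n)$ sum, followed by one $2\times 2$ and one $3\times 3$ solve. A Python sketch of this computation (the helper names are ours):

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (tiny systems)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gamma_n(xs, ys):
    """O(n) moment estimator gamma_n = Gamma_n^{-1} theta_n (the factor 2 cancels).
    Gamma_n is block diagonal, so the 5-dimensional system splits into a 2x2
    block (regression of Y on 1, X) and a 3x3 block (Y^2 on 1, X, X^2)."""
    n = len(xs)
    m = lambda p, q: sum(x**p * y**q for x, y in zip(xs, ys)) / n  # \bar{X^p Y^q}
    X1, X2, X3, X4 = m(1, 0), m(2, 0), m(3, 0), m(4, 0)
    g12 = solve([[1.0, X1], [X1, X2]], [m(0, 1), m(1, 1)])
    g345 = solve([[1.0, X1, X2], [X1, X2, X3], [X2, X3, X4]],
                 [m(0, 2), m(1, 2), m(2, 2)])
    return g12 + g345
```

As a sanity check, on noiseless single-component data $Y = 1 + 2X$ (so $\pi_0 = 1$), one has $Y^2 = 1 + 4X + 4X^2$ and the estimator returns $(1, 2, 1, 4, 4)$ exactly, in agreement with (7).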

To obtain an estimator of (α0, β0, π0), we use the relationships induced by (5) and (6) and recalled in (7). Leaving the third equation aside because it involves the unknown standard deviation σ0 of ε, we obtain three possible estimators of α0:

\[
\alpha_n = \frac{\gamma_{n,1}\gamma_{n,5}}{\gamma_{n,2}^2}, \qquad \alpha_n = \frac{\gamma_{n,4}}{2\gamma_{n,2}}, \qquad \text{or} \qquad \alpha_n = \frac{\gamma_{n,4}^2}{4\gamma_{n,1}\gamma_{n,5}},
\]
three possible estimators of $\beta_0$:
\[
\beta_n = \frac{\gamma_{n,5}}{\gamma_{n,2}}, \qquad \beta_n = \frac{\gamma_{n,4}}{2\gamma_{n,1}}, \qquad \text{or} \qquad \beta_n = \frac{\gamma_{n,2}\gamma_{n,4}^2}{4\gamma_{n,5}\gamma_{n,1}^2},
\]
and three possible estimators of $\pi_0$:
\[
\pi_n = \frac{\gamma_{n,2}^2}{\gamma_{n,5}}, \qquad \pi_n = \frac{2\gamma_{n,1}\gamma_{n,2}}{\gamma_{n,4}}, \qquad \text{or} \qquad \pi_n = \frac{4\gamma_{n,1}^2\gamma_{n,5}}{\gamma_{n,4}^2}.
\]

There are therefore 27 possible estimators of (α0, β0, π0). Their asymptotics can be obtained under very reasonable conditions. Unfortunately, all 27 estimators turned out to behave quite poorly in small samples. This prompted us to look for alternative estimators within the “same class”.

We now describe an estimator of $(\alpha_0, \beta_0, \pi_0)$ that was obtained empirically and that behaves significantly better for small samples than the aforementioned ones. The new regression function under consideration is $d_n(\boldsymbol{\gamma}) = \mathbb{P}_n \varphi_{\boldsymbol{\gamma}}$, $\boldsymbol{\gamma} \in \mathbb{R}^8$, where, for any $(x, y) \in \mathbb{R}^2$,
\[
\varphi_{\boldsymbol{\gamma}}(x, y) = (y - \gamma_1 - \gamma_2 x)^2 + (y^2 - \gamma_3 - \gamma_4 x^2)^2 + (x - \gamma_5)^2 + (x^2 - \gamma_6)^2 + (x^3 - \gamma_7)^2 + (x^4 - \gamma_8)^2.
\]

Γn= 2

1 X 0 0 0 0 0 0

X X2 0 0 0 0 0 0

0 0 1 X2 0 0 0 0

0 0 X2 X4 0 0 0 0

0 0 0 0 1 0 0 0

0 0 0 0 0 1 0 0

0 0 0 0 0 0 1 0

0 0 0 0 0 0 0 1

and θn= 2

Y XY

Y2 X2Y2

X X2 X3 X4

,

which respectively estimate

Γ0 = 2

1 E(X) 0 0 0 0 0 0

E(X) E(X2) 0 0 0 0 0 0

0 0 1 E(X2) 0 0 0 0

0 0 E(X2) E(X4) 0 0 0 0

0 0 0 0 1 0 0 0

0 0 0 0 0 1 0 0

0 0 0 0 0 0 1 0

0 0 0 0 0 0 0 1

and θ0 = 2

E(Y) E(XY)

E(Y2) E(X2Y2)

E(X) E(X2) E(X3) E(X4)

.

Then, proceeding as previously, provided the matrices Γnand Γ0are invertible, the estimator γn = arg minγdn(γ) of γ0 = Γ−10 θ0 is given by γn = Γ−1n θn. To obtain an estimator of 0, β0, π0), we have, from the second term of the regression function, that

γ0,4 = cov(X2, Y2)

V(X2) = cov(X2, Y2) γ0,8γ0,62 ,


where the second equality comes from the fact that $\gamma_{0,6} = E(X^2)$ and $\gamma_{0,8} = E(X^4)$. Now, using (4), we find
\[
\operatorname{cov}(X^2, Y^2) = \pi_0\beta_0^2 V(X^2) + 2\pi_0\alpha_0\beta_0 \operatorname{cov}(X^2, X),
\]
which, combined with the fact that $\gamma_{0,1} = \pi_0\alpha_0$ and $\gamma_{0,2} = \pi_0\beta_0$, gives
\[
\operatorname{cov}(X^2, Y^2) = \gamma_{0,2}\beta_0(\gamma_{0,8} - \gamma_{0,6}^2) + 2\gamma_{0,1}\beta_0(\gamma_{0,7} - \gamma_{0,5}\gamma_{0,6}).
\]

This leads to the following estimator of $(\alpha_0, \beta_0, \pi_0)$:
\[
\beta_n = g_\beta(\boldsymbol{\gamma}_n) = \frac{\gamma_{n,4}}{\gamma_{n,2} + 2\gamma_{n,1}(\gamma_{n,7} - \gamma_{n,5}\gamma_{n,6})/(\gamma_{n,8} - \gamma_{n,6}^2)}, \qquad
\pi_n = g_\pi(\boldsymbol{\gamma}_n) = \frac{\gamma_{n,2}}{\beta_n}, \qquad
\alpha_n = g_\alpha(\boldsymbol{\gamma}_n) = \frac{\gamma_{n,1}}{\pi_n}.
\]

As we continue, the subsets of $\mathbb{R}^8$ on which the functions $g_\alpha$, $g_\beta$ and $g_\pi$ exist and are differentiable will be denoted by $\mathcal{D}_\alpha$, $\mathcal{D}_\beta$ and $\mathcal{D}_\pi$, respectively, and $\mathcal{D}_{\alpha,\beta,\pi}$ will stand for $\mathcal{D}_\alpha \cap \mathcal{D}_\beta \cap \mathcal{D}_\pi$.

To derive the asymptotic behavior of the estimator $(\alpha_n, \beta_n, \pi_n) = (g_\alpha(\boldsymbol{\gamma}_n), g_\beta(\boldsymbol{\gamma}_n), g_\pi(\boldsymbol{\gamma}_n))$, we consider the following assumptions:

A1. (i) $X$ has a finite fourth-order moment; (ii) $X$ has a finite eighth-order moment.

A2. $V(X) > 0$ and $V(X^2) > 0$.

Clearly, Assumption A1 (ii) implies Assumption A1 (i), and Assumption A2 implies that the matrix Γ0 is invertible.

The following result, proved in Appendix A, characterizes the asymptotic behavior of the estimator $(\alpha_n, \beta_n, \pi_n)$.

Proposition 4.1. Assume that $\boldsymbol{\gamma}_0 \in \mathcal{D}_{\alpha,\beta,\pi}$.

(i) Under Assumptions A1 (i) and A2, $(\alpha_n, \beta_n, \pi_n) \xrightarrow{\text{a.s.}} (\alpha_0, \beta_0, \pi_0)$.

(ii) Suppose that Assumptions A1 (ii) and A2 are satisfied and let $\Psi_{\boldsymbol{\gamma}}$ be the $3 \times 8$ matrix defined by
\[
\Psi_{\boldsymbol{\gamma}} =
\begin{pmatrix}
\dfrac{\partial g_\alpha}{\partial \gamma_1} & \cdots & \dfrac{\partial g_\alpha}{\partial \gamma_8} \\
\dfrac{\partial g_\beta}{\partial \gamma_1} & \cdots & \dfrac{\partial g_\beta}{\partial \gamma_8} \\
\dfrac{\partial g_\pi}{\partial \gamma_1} & \cdots & \dfrac{\partial g_\pi}{\partial \gamma_8}
\end{pmatrix}(\boldsymbol{\gamma}), \qquad \boldsymbol{\gamma} \in \mathcal{D}_{\alpha,\beta,\pi}.
\]
Then,
\[
\sqrt{n}\,(\alpha_n - \alpha_0, \beta_n - \beta_0, \pi_n - \pi_0) = \mathbb{G}_n(\Psi_{\boldsymbol{\gamma}_0}\Gamma_0^{-1}\dot{\varphi}_{\boldsymbol{\gamma}_0}) + o_P(1).
\]
As a consequence, $\sqrt{n}\,(\alpha_n - \alpha_0, \beta_n - \beta_0, \pi_n - \pi_0)$ converges in distribution to a centered normal random vector with covariance matrix $\Sigma = \Psi_{\boldsymbol{\gamma}_0}\Gamma_0^{-1}P(\dot{\varphi}_{\boldsymbol{\gamma}_0}\dot{\varphi}_{\boldsymbol{\gamma}_0}^\top)\Gamma_0^{-1}\Psi_{\boldsymbol{\gamma}_0}^\top$, which can be consistently estimated by $\Sigma_n = \Psi_{\boldsymbol{\gamma}_n}\Gamma_n^{-1}\mathbb{P}_n(\dot{\varphi}_{\boldsymbol{\gamma}_n}\dot{\varphi}_{\boldsymbol{\gamma}_n}^\top)\Gamma_n^{-1}\Psi_{\boldsymbol{\gamma}_n}^\top$ in the sense that $\Sigma_n \xrightarrow{\text{a.s.}} \Sigma$.

An immediate consequence of the previous result is that large-sample standard errors of $\alpha_n$, $\beta_n$ and $\pi_n$ are given by the square roots of the diagonal elements of the matrix $n^{-1}\Sigma_n$. The finite-sample performance of these estimators is investigated in Section 5, and they are used in the illustrations of Section 6.

4.2 Estimation of the functional parameter

To estimate the unknown c.d.f. $F$ of $\varepsilon$, it is natural to start from (10). For a known $\boldsymbol{\eta} = (\alpha, \beta) \in \mathbb{R}^2$, the term $J(\cdot, \boldsymbol{\eta})$ defined in (8) may be estimated by the empirical c.d.f. of the random sample $(Y_i - \alpha - \beta X_i)_{1\le i\le n}$, i.e.,
\[
J_n(t, \boldsymbol{\eta}) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(Y_i - \alpha - \beta X_i \le t), \qquad t \in \mathbb{R}.
\]
Similarly, since $F^*$ (the c.d.f. of $\varepsilon^*$) is known, a natural estimator of the term $K(t, \boldsymbol{\eta})$ defined in (9) is given by the empirical mean of the random sample $\{F^*(t + \alpha + \beta X_i)\}_{1\le i\le n}$, i.e.,
\[
K_n(t, \boldsymbol{\eta}) = \frac{1}{n}\sum_{i=1}^n F^*(t + \alpha + \beta X_i), \qquad t \in \mathbb{R}.
\]
To obtain estimators of $J(\cdot, \boldsymbol{\eta}_0)$ and $K(\cdot, \boldsymbol{\eta}_0)$, it is then natural to consider the plug-in estimators $J_n(\cdot, \boldsymbol{\eta}_n)$ and $K_n(\cdot, \boldsymbol{\eta}_n)$, respectively, based on the estimator $\boldsymbol{\eta}_n = (\alpha_n, \beta_n) = (g_\alpha, g_\beta)(\boldsymbol{\gamma}_n)$ of $\boldsymbol{\eta}_0$ proposed in the previous subsection.

We shall therefore consider the following nonparametric estimator of $F$:
\[
F_n(t) = \frac{1}{\pi_n}\left\{J_n(t, \boldsymbol{\eta}_n) - (1 - \pi_n)K_n(t, \boldsymbol{\eta}_n)\right\}, \qquad t \in \mathbb{R}. \tag{11}
\]
Note that $F_n$ is not necessarily a c.d.f. as it is not necessarily increasing and can be smaller than zero or greater than one. In practice, we shall consider the partially corrected estimator $(F_n \vee 0) \wedge 1$, where $\vee$ and $\wedge$ denote the maximum and minimum, respectively.

To derive the asymptotic behavior of the previous estimator, we consider the following additional assumptions on the p.d.f.s $f^*$ and $f$ of $\varepsilon^*$ and $\varepsilon$, respectively:

A3. (i) $f^*$ and $f$ exist and are bounded on $\mathbb{R}$; (ii) $(f^*)'$ and $f'$ exist and are bounded on $\mathbb{R}$.
