HAL Id: hal-00796198
https://hal.archives-ouvertes.fr/hal-00796198
Preprint submitted on 1 Mar 2013
Semiparametric estimation of a mixture of two linear regressions in which one component is known
Laurent Bordes, Ivan Kojadinovic, Pierre Vandekerkhove
To cite this version:
Laurent Bordes, Ivan Kojadinovic, Pierre Vandekerkhove. Semiparametric estimation of a mixture of two linear regressions in which one component is known. 2013. ⟨hal-00796198⟩
L. Bordes$^a$, I. Kojadinovic$^a$ and P. Vandekerkhove$^b$
$^a$Laboratoire de mathématiques et applications, UMR CNRS 5142
Université de Pau et des Pays de l'Adour, B.P. 1155, 64013 Pau Cedex, France
{laurent.bordes,ivan.kojadinovic}@univ-pau.fr
$^b$Laboratoire d'analyse et de mathématiques appliquées
UMR CNRS 8050, Université Paris-Est, Marne-la-Vallée, France, and Georgia Institute of Technology, UMI Georgia Tech CNRS 2958
G.W. Woodruff School of Mechanical Engineering, Atlanta, USA. pierre.vandekerkhove@univ-mlv.fr
January 15, 2013
Abstract
A new estimation method for the two-component mixture model introduced in Vandekerkhove (2012) is proposed. This model, which consists of a two-component mixture of linear regressions in which one component is entirely known while the proportion, the slope, the intercept and the error distribution of the other component are unknown, seems to be of interest for the analysis of large datasets produced from two-color ChIP-chip high-density microarrays. In spite of good performance for datasets of reasonable size, the method proposed in Vandekerkhove (2012) suffers from a serious drawback when the sample size becomes large, as it is based on the optimization of a contrast function whose pointwise computation requires $O(n^2)$ operations. The range of applicability of the method derived in this work is substantially larger as it is based on a method-of-moments estimator whose computation only requires $O(n)$ operations. From a theoretical perspective, the asymptotic normality of both the estimator of the Euclidean parameter vector and the semiparametric estimator of the c.d.f. of the error is proved under weak conditions not involving the zero-symmetry assumption typically used during the last decade. The finite-sample performance of these estimators is studied under various scenarios through Monte Carlo experiments. From a more practical perspective, the proposed method is applied to the tone data analyzed, among others, by Hunter and Young (2012), and to the ChIPmix data studied by Martin-Magniette et al. (2008). An extension of the considered model involving an unknown scale parameter for the first component is discussed in the final section.
1 Introduction
Practitioners are frequently interested in modeling the relationship between a random response variable $Y$ and a $d$-dimensional random explanatory vector $X$ by means of a linear regression model estimated from a random sample $(X_i, Y_i)_{1 \le i \le n}$ of $(X, Y)$. Quite often, the homogeneity assumption claiming that the linear regression coefficients are the same for all the observations $(X_1, Y_1), \ldots, (X_n, Y_n)$ is inadequate. To allow different parameters for different groups of observations, a Finite Mixture of Regressions (FMR) can be considered; see Leisch (2004) and Grün and Leisch (2006) for a nice overview.
Statistical inference for the fully parametric FMR model was first considered by Quandt and Ramsey (1978), who proposed an estimation method based on the moment generating function. An EM estimating approach was proposed by De Veaux (1989) in the case of two components. Variations of the latter approach were also considered in Jones and McLachlan (1992) and Turner (2000). Hawkins et al. (2001) studied the problem of determining the number of components in the parametric FMR model using methods derived from the likelihood equation. In Hurn et al. (2003), the authors proposed a Bayesian approach to estimate the regression coefficients and also considered an extension of the model in which the number of components is unspecified. Zhu and Zhang (2004) established the asymptotic theory for maximum likelihood estimators in parametric FMR models. More recently, Städler et al. (2010) proposed an $\ell_1$-penalized method based on a Lasso-type estimator for a high-dimensional FMR model with $d \gg n$.
As an alternative to parametric approaches to the estimation of a FMR model, some authors suggested the use of more flexible semiparametric approaches. This research direction finds its origin in the work of Hall and Zhou (2003), in which $d$-variate semiparametric mixture models of random vectors with independent components were considered. These authors showed in particular that, for $d \ge 3$, it is possible to identify a two-component model without parametrizing the distributions of the component random vectors. To the best of our knowledge, Leung and Qin (2006) were the first to estimate a FMR model semiparametrically. In the two-component case, they studied the situation in which the components are related by Anderson (1979)'s exponential tilt model. Hunter and Young (2012) studied the identifiability of an $m$-component semiparametric FMR model and numerically investigated an EM algorithm for estimating its parameters. Vandekerkhove (2012) proposed an M-estimation method for a two-component semiparametric mixture of regressions with symmetric errors in which one component is known. The latter approach was applied to data extracted from a high-density microarray and modeled in Martin-Magniette et al. (2008) by
means of a parametric FMR. The semiparametric approach of Vandekerkhove (2012) is of interest for two main reasons. Due to its semiparametric nature, the method makes it possible to detect complex structures in the error of the unknown regression component. It can additionally be regarded as a tool to assess the relevance of the usual EM-type Euclidean parameter estimation. Its main drawbacks, however, are that it is not theoretically valid when the errors are not symmetric and that its use is very computationally expensive for large datasets, as it requires the optimization of a contrast function whose pointwise evaluation requires $O(n^2)$ operations.
The object of interest of this paper is the two-component FMR model studied by Vandekerkhove (2012) in which one component is entirely known while the proportion, the slope, the intercept and the error distribution of the other component are unknown. The estimation of the Euclidean parameter vector is achieved through a method of moments. Semiparametric estimators of the c.d.f. and the p.d.f. of the error of the unknown component are proposed. The proof of the asymptotic normality of the Euclidean and functional estimators is not based on zero-symmetry-like assumptions frequently found in the literature but only involves finite moments of order eight for the explanatory variable and the boundedness of the p.d.f.s of the errors and their derivatives. The almost sure uniform consistency of the estimator of the p.d.f. of the unknown error is obtained under similar conditions. A consequence of these theoretical results is that, unlike for EM-type approaches, the estimation uncertainty can be assessed through large-sample standard errors for the Euclidean parameters and by means of an approximate confidence band for the c.d.f. of the unknown error. The latter is computed using an unconditional weighted bootstrap whose asymptotic validity is proved.
From a practical perspective, it is worth mentioning that the range of applicability of the resulting semiparametric estimation procedure is substantially larger than the one of Vandekerkhove (2012) as its computation only requires $O(n)$ operations. As a consequence, very large datasets can be easily processed. For instance, as shall be seen in Section 6, the estimation of the parameters of the model from the ChIPmix data considered in Martin-Magniette et al. (2008), consisting of $n = 176{,}343$ observations, took less than 30 seconds on one 2.4 GHz processor. The estimation of the same model from a subset of $n = 30{,}000$ observations using the method of Vandekerkhove (2012) took more than two days on a similar processor.
The paper is organized as follows. Section 2 is devoted to a detailed description of the model, while Section 3 is concerned with its identifiability through the moment method. The estimators of the Euclidean parameter vector and of the functional parameter are described in detail in Section 4. The finite-sample performance of the proposed estimation method is studied for various scenarios through Monte Carlo experiments in Section 5. In Section 6, the proposed method is applied to the tone data analyzed, among others, by Hunter and Young (2012), and to the ChIPmix data considered in Martin-Magniette et al. (2008). An extension of the FMR model under consideration involving an unknown scale parameter for the first component is discussed in the final section.
2 Problem and notation
Let $Z$ be a Bernoulli random variable with unknown parameter $\pi_0 \in [0,1]$, let $X$ be an $\mathcal{X}$-valued random variable with $\mathcal{X} \subset \mathbb{R}$, and let $\varepsilon^*, \varepsilon^{**}$ be two absolutely continuous centered real-valued random variables with finite variances and independent of $X$. Assume additionally that $Z$ is independent of $X$, $\varepsilon^*$ and $\varepsilon^{**}$. Furthermore, for fixed $\alpha_0^*, \beta_0^*, \alpha_0^{**}, \beta_0^{**} \in \mathbb{R}$, let $\tilde{Y}$ be the random variable defined by
$$\tilde{Y} = (1 - Z)(\alpha_0^* + \beta_0^* X + \varepsilon^*) + Z(\alpha_0^{**} + \beta_0^{**} X + \varepsilon^{**}),$$
i.e.,
$$\tilde{Y} = \begin{cases} \alpha_0^* + \beta_0^* X + \varepsilon^* & \text{if } Z = 0, \\ \alpha_0^{**} + \beta_0^{**} X + \varepsilon^{**} & \text{if } Z = 1. \end{cases}$$
The above display is the equation of a mixture of two linear regressions with $Z$ as the mixing variable.
Let $F^*$ and $F^{**}$ denote the c.d.f.s of $\varepsilon^*$ and $\varepsilon^{**}$, respectively. Furthermore, $\alpha_0^*$, $\beta_0^*$ and $F^*$ are assumed known while $\alpha_0^{**}$, $\beta_0^{**}$, $\pi_0$ and $F^{**}$ are assumed unknown. The aim of this work is to propose and study an estimator of $(\alpha_0^{**}, \beta_0^{**}, \pi_0, F^{**})$ based on $n$ i.i.d. copies $(X_i, \tilde{Y}_i)_{1 \le i \le n}$ of $(X, \tilde{Y})$. Now, define $Y = \tilde{Y} - \alpha_0^* - \beta_0^* X$, $\alpha_0 = \alpha_0^{**} - \alpha_0^*$ and $\beta_0 = \beta_0^{**} - \beta_0^*$, and notice that
$$Y = \begin{cases} \varepsilon^* & \text{if } Z = 0, \\ \alpha_0 + \beta_0 X + \varepsilon & \text{if } Z = 1, \end{cases} \qquad (1)$$
where, to simplify the notation, $\varepsilon = \varepsilon^{**}$ and $F = F^{**}$. It follows that the previous estimation problem is equivalent to the problem of estimating $(\alpha_0, \beta_0, \pi_0, F)$ from the observation of $n$ i.i.d. copies $(X_i, Y_i)_{1 \le i \le n}$ of $(X, Y)$.
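To make the data-generating mechanism of model (1) concrete, it can be simulated in a few lines. The sketch below is purely illustrative: the distributions of $X$, $\varepsilon^*$ and $\varepsilon$ and the parameter values are our own assumptions, not those used later in the paper.

```python
import numpy as np

# Illustrative simulation of model (1); all parameter values and
# distributions below are arbitrary assumptions for this sketch.
rng = np.random.default_rng(0)

n = 5000
pi0, alpha0, beta0 = 0.7, 1.0, 2.0  # mixing weight, intercept, slope

X = rng.normal(0.0, 1.0, size=n)         # explanatory variable
Z = rng.binomial(1, pi0, size=n)         # latent component labels
eps_star = rng.normal(0.0, 1.0, size=n)  # error of the known component
eps = rng.normal(0.0, 0.5, size=n)       # error of the unknown component

# Y = eps* if Z = 0, and Y = alpha0 + beta0 * X + eps if Z = 1.
Y = (1 - Z) * eps_star + Z * (alpha0 + beta0 * X + eps)
```

Note that, by (5) below, a plain least-squares fit of $Y$ on $X$ estimates the products $\pi_0\alpha_0$ and $\pi_0\beta_0$ rather than the component parameters themselves, which is precisely why the moment system of Section 3 is needed.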
As we continue, the unknown c.d.f.s of $X$ and $Y$ will be denoted by $F_X$ and $F_Y$, respectively. Also, for any $x \in \mathcal{X}$, the conditional c.d.f. of $Y$ given $X = x$ will be denoted by $F_{Y|X}(\cdot|x)$, and we have
$$F_{Y|X}(y|x) = (1 - \pi_0)F^*(y) + \pi_0 F(y - \alpha_0 - \beta_0 x), \quad y \in \mathbb{R}. \qquad (2)$$
It follows that, for any $x \in \mathcal{X}$, $f_{Y|X}(\cdot|x)$, the conditional p.d.f. of $Y$ given $X = x$, can be expressed as
$$f_{Y|X}(y|x) = (1 - \pi_0)f^*(y) + \pi_0 f(y - \alpha_0 - \beta_0 x), \quad y \in \mathbb{R}, \qquad (3)$$
where $f^*$ and $f$ are the p.d.f.s of $\varepsilon^*$ and $\varepsilon$, assuming that they exist on $\mathbb{R}$.
Note that, as shall be discussed in Section 7, it is possible to consider a slightly more general version of this model involving an unknown scale parameter for the first component.
This more elaborate model remains identifiable and estimation through the moment method is theoretically possible. From a practical perspective, however, estimation of this scale parameter through the moment method seems quite unstable, so much so that an alternative estimation method appears to be required.
3 Identifiability through the moment method
Since (1) is clearly equivalent to
$$Y = (1 - Z)\varepsilon^* + Z(\alpha_0 + \beta_0 X + \varepsilon), \qquad (4)$$
we immediately obtain that
$$E(Y|X) = \pi_0\alpha_0 + \pi_0\beta_0 X \quad \text{a.s.} \qquad (5)$$
It follows that the coefficients $\gamma_{0,1} = \pi_0\alpha_0$ and $\gamma_{0,2} = \pi_0\beta_0$ can be identified from (5) if $\mathcal{X}$ is not reduced to a singleton. In addition, we have
$$\begin{aligned}
E(Y^2|X) &= E[\{(1 - Z)\varepsilon^* + Z(\alpha_0 + \beta_0 X + \varepsilon)\}^2 \mid X] \quad \text{a.s.} \\
&= E(1 - Z)E\{(\varepsilon^*)^2\} + E(Z)E\{(\alpha_0 + \beta_0 X)^2 + \varepsilon^2 \mid X\} \quad \text{a.s.} \\
&= (1 - \pi_0)(\sigma_0^*)^2 + \pi_0\{\alpha_0^2 + 2\alpha_0\beta_0 X + \beta_0^2 X^2 + \sigma_0^2\} \quad \text{a.s.} \\
&= (1 - \pi_0)(\sigma_0^*)^2 + \pi_0(\alpha_0^2 + \sigma_0^2) + 2\pi_0\alpha_0\beta_0 X + \pi_0\beta_0^2 X^2 \quad \text{a.s.,}
\end{aligned} \qquad (6)$$
where $\sigma_0^*$ and $\sigma_0$ are the standard deviations of $\varepsilon^*$ and $\varepsilon$, respectively. If $\mathcal{X}$ contains three points $x_1, x_2, x_3$ such that the vectors $\{(1, x_1, x_1^2), (1, x_2, x_2^2), (1, x_3, x_3^2)\}$ are linearly independent then, from (6), we can identify the coefficients $\gamma_{0,3} = (1 - \pi_0)(\sigma_0^*)^2 + \pi_0(\alpha_0^2 + \sigma_0^2)$, $\gamma_{0,4} = 2\pi_0\alpha_0\beta_0$ and $\gamma_{0,5} = \pi_0\beta_0^2$. In other words, under the aforementioned conditions on $X$, we have
$$\begin{cases}
\gamma_{0,1} = \pi_0\alpha_0 \\
\gamma_{0,2} = \pi_0\beta_0 \\
\gamma_{0,3} = (1 - \pi_0)(\sigma_0^*)^2 + \pi_0(\alpha_0^2 + \sigma_0^2) \\
\gamma_{0,4} = 2\pi_0\alpha_0\beta_0 = 2\alpha_0\gamma_{0,2} \\
\gamma_{0,5} = \pi_0\beta_0^2 = \beta_0\gamma_{0,2}.
\end{cases} \qquad (7)$$
From the above system of equations, we see that $\alpha_0$, $\beta_0$ and $\pi_0$ can be identified provided $\pi_0\beta_0 \neq 0$, that is, provided the unknown component actually exists and its slope is nonzero.
The latter condition will be assumed to hold in the rest of the paper.
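As a quick sanity check on this identification argument, the first, second, fourth and fifth equations of (7) can be inverted in closed form: $\alpha_0 = \gamma_{0,4}/(2\gamma_{0,2})$, $\beta_0 = \gamma_{0,5}/\gamma_{0,2}$ and $\pi_0 = \gamma_{0,2}^2/\gamma_{0,5}$. The sketch below, with arbitrary illustrative parameter values, verifies this inversion numerically.

```python
# Sanity check of the identification argument based on (7).
# Parameter values are arbitrary; only pi0 * beta0 != 0 is required.
pi0, alpha0, beta0 = 0.7, 1.0, 2.0

# Forward map: identified coefficients (third equation of (7) left aside).
g1 = pi0 * alpha0              # gamma_{0,1}
g2 = pi0 * beta0               # gamma_{0,2}
g4 = 2 * pi0 * alpha0 * beta0  # gamma_{0,4} = 2 * alpha0 * gamma_{0,2}
g5 = pi0 * beta0 ** 2          # gamma_{0,5} = beta0 * gamma_{0,2}

# Inversion, valid whenever g2 = pi0 * beta0 is nonzero.
alpha_rec = g4 / (2 * g2)
beta_rec = g5 / g2
pi_rec = g2 ** 2 / g5
```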
Let us now consider the functional part $F$ of the model. For any $\eta = (\alpha, \beta) \in \mathbb{R}^2$, denote by $J(\cdot, \eta)$ the c.d.f. defined by
$$J(t, \eta) = \Pr(Y \le t + \alpha + \beta X), \quad t \in \mathbb{R}. \qquad (8)$$
For any $t \in \mathbb{R}$, this can be rewritten as
$$J(t, \eta) = \int_{\mathbb{R}} F_{Y|X}(t + \alpha + \beta x \mid x)\, dF_X(x) = (1 - \pi_0)\int_{\mathbb{R}} F^*(t + \alpha + \beta x)\, dF_X(x) + \pi_0 \int_{\mathbb{R}} F\{t + (\alpha - \alpha_0) + (\beta - \beta_0)x\}\, dF_X(x).$$
For $\eta = \eta_0 = (\alpha_0, \beta_0)$, we then obtain
$$J(t, \eta_0) = (1 - \pi_0)\int_{\mathbb{R}} F^*(t + \alpha_0 + \beta_0 x)\, dF_X(x) + \pi_0 F(t), \quad t \in \mathbb{R}.$$
Now, for any $\eta \in \mathbb{R}^2$, let $K(\cdot, \eta)$ be defined by
$$K(t, \eta) = \int_{\mathbb{R}} F^*(t + \alpha + \beta x)\, dF_X(x), \quad t \in \mathbb{R}. \qquad (9)$$
It follows that $F$ is identified since
$$F(t) = \frac{1}{\pi_0}\{J(t, \eta_0) - (1 - \pi_0)K(t, \eta_0)\}, \quad t \in \mathbb{R}. \qquad (10)$$
The above equation is at the root of the derivation of an estimator for $F$.
4 Estimation
Let $P$ be the probability distribution of $(X, Y)$. For ease of exposition, we will frequently use the notation adopted in the theory of empirical processes in the sense of van der Vaart and Wellner (2000) or Kosorok (2008), for instance. Given a measurable function $f : \mathbb{R}^2 \to \mathbb{R}^k$, for some integer $k \ge 1$, $Pf$ will denote the integral $\int f\, dP$. Also, the empirical measure obtained from the random sample $(X_i, Y_i)_{1 \le i \le n}$ will be denoted by $\mathbb{P}_n = n^{-1}\sum_{i=1}^n \delta_{X_i, Y_i}$, where $\delta_{x,y}$ is the probability distribution that assigns a mass of 1 at $(x, y)$. The expectation of $f$ under the empirical measure is then $\mathbb{P}_n f = n^{-1}\sum_{i=1}^n f(X_i, Y_i)$, and the quantity $\mathbb{G}_n f = \sqrt{n}(\mathbb{P}_n f - Pf)$ is the empirical process evaluated at $f$. The arrow '$\rightsquigarrow$' will be used to denote weak convergence in the sense of Definition 1.3.3 of van der Vaart and Wellner (2000) and, for any set $S$, $\ell^\infty(S)$ will stand for the space of all bounded real-valued functions on $S$ equipped with the uniform metric. Key results and more details can be found, for instance, in van der Vaart (1998), van der Vaart and Wellner (2000) and Kosorok (2008).
4.1 Estimation of the Euclidean parameter vector
To estimate the Euclidean parameter vector $(\alpha_0, \beta_0, \pi_0) \in \mathbb{R} \times \mathbb{R} \setminus \{0\} \times (0, 1]$, we first need to estimate the vector $\gamma_0 = (\gamma_{0,1}, \ldots, \gamma_{0,5}) \in \mathbb{R}^5$ whose components were expressed in terms of $\alpha_0$, $\beta_0$ and $\pi_0$ in the previous section. From (5) and (6), it is natural to consider the regression function
$$d_n(\gamma) = \mathbb{P}_n \varphi_\gamma, \quad \gamma \in \mathbb{R}^5,$$
where, for any $\gamma \in \mathbb{R}^5$, $\varphi_\gamma : \mathbb{R}^2 \to \mathbb{R}$ is defined by
$$\varphi_\gamma(x, y) = (y - \gamma_1 - \gamma_2 x)^2 + (y^2 - \gamma_3 - \gamma_4 x - \gamma_5 x^2)^2, \quad x, y \in \mathbb{R}.$$
As an estimator of $\gamma_0$, we then naturally consider $\gamma_n = \arg\min_\gamma d_n(\gamma)$, which satisfies
$$\dot{d}_n(\gamma_n) = \mathbb{P}_n \dot{\varphi}_{\gamma_n} = 0,$$
where $\dot{\varphi}_\gamma$, the gradient of $\varphi_\gamma$ with respect to $\gamma$, is given by
$$\dot{\varphi}_\gamma(x, y) = -2\begin{pmatrix} y - \gamma_1 - \gamma_2 x \\ x(y - \gamma_1 - \gamma_2 x) \\ y^2 - \gamma_3 - \gamma_4 x - \gamma_5 x^2 \\ x(y^2 - \gamma_3 - \gamma_4 x - \gamma_5 x^2) \\ x^2(y^2 - \gamma_3 - \gamma_4 x - \gamma_5 x^2) \end{pmatrix}, \quad x, y \in \mathbb{R}.$$
Now, for any integers $p, q \ge 1$, define
$$\overline{X^p Y^q} = \frac{1}{n}\sum_{i=1}^n X_i^p Y_i^q,$$
and let
$$\Gamma_n = 2\begin{pmatrix} 1 & \overline{X} & 0 & 0 & 0 \\ \overline{X} & \overline{X^2} & 0 & 0 & 0 \\ 0 & 0 & 1 & \overline{X} & \overline{X^2} \\ 0 & 0 & \overline{X} & \overline{X^2} & \overline{X^3} \\ 0 & 0 & \overline{X^2} & \overline{X^3} & \overline{X^4} \end{pmatrix} \quad \text{and} \quad \theta_n = 2\begin{pmatrix} \overline{Y} \\ \overline{XY} \\ \overline{Y^2} \\ \overline{XY^2} \\ \overline{X^2 Y^2} \end{pmatrix},$$
which respectively estimate
$$\Gamma_0 = 2\begin{pmatrix} 1 & E(X) & 0 & 0 & 0 \\ E(X) & E(X^2) & 0 & 0 & 0 \\ 0 & 0 & 1 & E(X) & E(X^2) \\ 0 & 0 & E(X) & E(X^2) & E(X^3) \\ 0 & 0 & E(X^2) & E(X^3) & E(X^4) \end{pmatrix} \quad \text{and} \quad \theta_0 = 2\begin{pmatrix} E(Y) \\ E(XY) \\ E(Y^2) \\ E(XY^2) \\ E(X^2 Y^2) \end{pmatrix}.$$
The linear equation $\mathbb{P}_n \dot{\varphi}_{\gamma_n} = 0$ can then equivalently be rewritten as $\Gamma_n \gamma_n = \theta_n$. Provided the matrices $\Gamma_n$ and $\Gamma_0$ are invertible, we can write $\gamma_n = \Gamma_n^{-1}\theta_n$ and $\gamma_0 = \Gamma_0^{-1}\theta_0$.
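A minimal sketch of this step (our own illustrative implementation, not the authors' code) assembles the empirical moments into $\Gamma_n$ and $\theta_n$ and solves the linear system; on data simulated from the model it recovers the coefficients of (7).

```python
import numpy as np

def gamma_hat(X, Y):
    """Moment estimator gamma_n = Gamma_n^{-1} theta_n of the five
    coefficients in (7). Illustrative sketch, O(n) in the sample size."""
    m = lambda p, q: np.mean(X ** p * Y ** q)  # empirical moment of X^p Y^q
    x1, x2, x3, x4 = m(1, 0), m(2, 0), m(3, 0), m(4, 0)
    Gamma = np.array([
        [1.0, x1, 0.0, 0.0, 0.0],
        [x1, x2, 0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, x1, x2],
        [0.0, 0.0, x1, x2, x3],
        [0.0, 0.0, x2, x3, x4],
    ])
    theta = np.array([m(0, 1), m(1, 1), m(0, 2), m(1, 2), m(2, 2)])
    return np.linalg.solve(Gamma, theta)  # the common factor 2 cancels

# Check on simulated data (arbitrary illustrative parameter values).
rng = np.random.default_rng(1)
n = 200_000
pi0, alpha0, beta0 = 0.7, 1.0, 2.0
X = rng.normal(size=n)
Z = rng.binomial(1, pi0, size=n)
Y = (1 - Z) * rng.normal(size=n) + Z * (alpha0 + beta0 * X + 0.5 * rng.normal(size=n))

# Estimates (pi0*alpha0, pi0*beta0, gamma_{0,3}, 2*pi0*alpha0*beta0, pi0*beta0**2).
g = gamma_hat(X, Y)
```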
To obtain an estimator of $(\alpha_0, \beta_0, \pi_0)$, we use the relationships induced by (5) and (6) and recalled in (7). Leaving the third equation aside because it involves the unknown standard deviation $\sigma_0$ of $\varepsilon$, we obtain three possible estimators of $\alpha_0$:
$$\alpha_n = \frac{\gamma_{n,1}\gamma_{n,5}}{\gamma_{n,2}^2}, \quad \alpha_n = \frac{\gamma_{n,4}}{2\gamma_{n,2}}, \quad \text{or} \quad \alpha_n = \frac{\gamma_{n,4}^2}{4\gamma_{n,1}\gamma_{n,5}},$$
three possible estimators of $\beta_0$:
$$\beta_n = \frac{\gamma_{n,5}}{\gamma_{n,2}}, \quad \beta_n = \frac{\gamma_{n,4}}{2\gamma_{n,1}}, \quad \text{or} \quad \beta_n = \frac{\gamma_{n,2}\gamma_{n,4}^2}{4\gamma_{n,5}\gamma_{n,1}^2},$$
and three possible estimators of $\pi_0$:
$$\pi_n = \frac{\gamma_{n,2}^2}{\gamma_{n,5}}, \quad \pi_n = \frac{2\gamma_{n,1}\gamma_{n,2}}{\gamma_{n,4}}, \quad \text{or} \quad \pi_n = \frac{4\gamma_{n,1}^2\gamma_{n,5}}{\gamma_{n,4}^2}.$$
There are therefore 27 possible estimators of (α0, β0, π0). Their asymptotics can be obtained under very reasonable conditions. Unfortunately, all 27 estimators turned out to behave quite poorly in small samples. This prompted us to look for alternative estimators within the “same class”.
We now describe an estimator of $(\alpha_0, \beta_0, \pi_0)$ that was obtained empirically and that behaves significantly better for small samples than the aforementioned ones. The new regression function under consideration is $d_n(\gamma) = \mathbb{P}_n \varphi_\gamma$, $\gamma \in \mathbb{R}^8$, where, for any $(x, y) \in \mathbb{R}^2$,
$$\varphi_\gamma(x, y) = (y - \gamma_1 - \gamma_2 x)^2 + (y^2 - \gamma_3 - \gamma_4 x^2)^2 + (x - \gamma_5)^2 + (x^2 - \gamma_6)^2 + (x^3 - \gamma_7)^2 + (x^4 - \gamma_8)^2.$$
Now, let
$$\Gamma_n = 2\begin{pmatrix} 1 & \overline{X} & 0 & 0 & 0 & 0 & 0 & 0 \\ \overline{X} & \overline{X^2} & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & \overline{X^2} & 0 & 0 & 0 & 0 \\ 0 & 0 & \overline{X^2} & \overline{X^4} & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix} \quad \text{and} \quad \theta_n = 2\begin{pmatrix} \overline{Y} \\ \overline{XY} \\ \overline{Y^2} \\ \overline{X^2 Y^2} \\ \overline{X} \\ \overline{X^2} \\ \overline{X^3} \\ \overline{X^4} \end{pmatrix},$$
which respectively estimate
$$\Gamma_0 = 2\begin{pmatrix} 1 & E(X) & 0 & 0 & 0 & 0 & 0 & 0 \\ E(X) & E(X^2) & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & E(X^2) & 0 & 0 & 0 & 0 \\ 0 & 0 & E(X^2) & E(X^4) & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix} \quad \text{and} \quad \theta_0 = 2\begin{pmatrix} E(Y) \\ E(XY) \\ E(Y^2) \\ E(X^2 Y^2) \\ E(X) \\ E(X^2) \\ E(X^3) \\ E(X^4) \end{pmatrix}.$$
Then, proceeding as previously, provided the matrices $\Gamma_n$ and $\Gamma_0$ are invertible, the estimator $\gamma_n = \arg\min_\gamma d_n(\gamma)$ of $\gamma_0 = \Gamma_0^{-1}\theta_0$ is given by $\gamma_n = \Gamma_n^{-1}\theta_n$. To obtain an estimator of $(\alpha_0, \beta_0, \pi_0)$, we have, from the second term of the regression function, that
$$\gamma_{0,4} = \frac{\mathrm{cov}(X^2, Y^2)}{V(X^2)} = \frac{\mathrm{cov}(X^2, Y^2)}{\gamma_{0,8} - \gamma_{0,6}^2},$$
where the second equality comes from the fact that $\gamma_{0,6} = E(X^2)$ and $\gamma_{0,8} = E(X^4)$. Now, using (4), we find
$$\mathrm{cov}(X^2, Y^2) = \pi_0\beta_0^2 V(X^2) + 2\pi_0\alpha_0\beta_0\,\mathrm{cov}(X^2, X),$$
which, combined with the fact that $\gamma_{0,1} = \pi_0\alpha_0$ and $\gamma_{0,2} = \pi_0\beta_0$, gives
$$\mathrm{cov}(X^2, Y^2) = \gamma_{0,2}\beta_0(\gamma_{0,8} - \gamma_{0,6}^2) + 2\gamma_{0,1}\beta_0(\gamma_{0,7} - \gamma_{0,5}\gamma_{0,6}).$$
This leads to the following estimator of $(\alpha_0, \beta_0, \pi_0)$:
$$\beta_n = g_\beta(\gamma_n) = \frac{\gamma_{n,4}}{\gamma_{n,2} + 2\gamma_{n,1}(\gamma_{n,7} - \gamma_{n,5}\gamma_{n,6})/(\gamma_{n,8} - \gamma_{n,6}^2)}, \quad \pi_n = g_\pi(\gamma_n) = \frac{\gamma_{n,2}}{\beta_n}, \quad \alpha_n = g_\alpha(\gamma_n) = \frac{\gamma_{n,1}}{\pi_n}.$$
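The maps $g_\beta$, $g_\pi$ and $g_\alpha$ are straightforward to implement. The sketch below is illustrative: it applies them to the population value $\gamma_0$ computed for arbitrary parameter values and a standard normal $X$ (so that $E(X) = E(X^3) = 0$, $E(X^2) = 1$, $E(X^4) = 3$ and $\mathrm{cov}(X^2, X) = 0$), and recovers $(\alpha_0, \beta_0, \pi_0)$ exactly.

```python
import numpy as np

def euclidean_params(g):
    """Map a gamma vector of length 8 to (alpha, beta, pi) via the
    closed-form functions g_alpha, g_beta, g_pi; the denominators
    below must be nonzero."""
    g1, g2, g3, g4, g5, g6, g7, g8 = g
    beta = g4 / (g2 + 2.0 * g1 * (g7 - g5 * g6) / (g8 - g6 ** 2))
    pi = g2 / beta
    alpha = g1 / pi
    return alpha, beta, pi

# Population gamma_0 for arbitrary illustrative parameters and X ~ N(0, 1).
pi0, alpha0, beta0 = 0.7, 1.0, 2.0
var_star, var_eps = 1.0, 0.25  # variances of eps* and eps

g4 = pi0 * beta0 ** 2  # cov(X^2, Y^2) / V(X^2), since cov(X^2, X) = 0 here
EY2 = (1 - pi0) * var_star + pi0 * (alpha0 ** 2 + var_eps) + pi0 * beta0 ** 2
g3 = EY2 - g4 * 1.0    # gamma_{0,3} = E(Y^2) - gamma_{0,4} E(X^2)
gamma0 = np.array([pi0 * alpha0, pi0 * beta0, g3, g4, 0.0, 1.0, 0.0, 3.0])

alpha_n, beta_n, pi_n = euclidean_params(gamma0)
```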
As we continue, the subsets of $\mathbb{R}^8$ on which the functions $g_\alpha$, $g_\beta$ and $g_\pi$ exist and are differentiable will be denoted by $\mathcal{D}_\alpha$, $\mathcal{D}_\beta$ and $\mathcal{D}_\pi$, respectively, and $\mathcal{D}_{\alpha,\beta,\pi}$ will stand for $\mathcal{D}_\alpha \cap \mathcal{D}_\beta \cap \mathcal{D}_\pi$.
To derive the asymptotic behavior of the estimator $(\alpha_n, \beta_n, \pi_n) = (g_\alpha(\gamma_n), g_\beta(\gamma_n), g_\pi(\gamma_n))$, we consider the following assumptions:

A1. (i) $X$ has a finite fourth-order moment; (ii) $X$ has a finite eighth-order moment.

A2. $V(X) > 0$ and $V(X^2) > 0$.

Clearly, Assumption A1 (ii) implies Assumption A1 (i), and Assumption A2 implies that the matrix $\Gamma_0$ is invertible.
The following result, proved in Appendix A, characterizes the asymptotic behavior of the estimator (αn, βn, πn).
Proposition 4.1. Assume that $\gamma_0 \in \mathcal{D}_{\alpha,\beta,\pi}$.

(i) Under Assumptions A1 (i) and A2, $(\alpha_n, \beta_n, \pi_n) \xrightarrow{a.s.} (\alpha_0, \beta_0, \pi_0)$.

(ii) Suppose that Assumptions A1 (ii) and A2 are satisfied and let $\Psi_\gamma$ be the $3 \times 8$ matrix defined by
$$\Psi_\gamma = \begin{pmatrix} \frac{\partial g_\alpha}{\partial \gamma_1} & \cdots & \frac{\partial g_\alpha}{\partial \gamma_8} \\ \frac{\partial g_\beta}{\partial \gamma_1} & \cdots & \frac{\partial g_\beta}{\partial \gamma_8} \\ \frac{\partial g_\pi}{\partial \gamma_1} & \cdots & \frac{\partial g_\pi}{\partial \gamma_8} \end{pmatrix}(\gamma), \quad \gamma \in \mathcal{D}_{\alpha,\beta,\pi}.$$
Then,
$$\sqrt{n}(\alpha_n - \alpha_0, \beta_n - \beta_0, \pi_n - \pi_0) = -\mathbb{G}_n(\Psi_{\gamma_0}\Gamma_0^{-1}\dot{\varphi}_{\gamma_0}) + o_P(1).$$
As a consequence, $\sqrt{n}(\alpha_n - \alpha_0, \beta_n - \beta_0, \pi_n - \pi_0)$ converges in distribution to a centered normal random vector with covariance matrix $\Sigma = \Psi_{\gamma_0}\Gamma_0^{-1}P(\dot{\varphi}_{\gamma_0}\dot{\varphi}_{\gamma_0}^\top)\Gamma_0^{-1}\Psi_{\gamma_0}^\top$, which can be consistently estimated by $\Sigma_n = \Psi_{\gamma_n}\Gamma_n^{-1}\mathbb{P}_n(\dot{\varphi}_{\gamma_n}\dot{\varphi}_{\gamma_n}^\top)\Gamma_n^{-1}\Psi_{\gamma_n}^\top$ in the sense that $\Sigma_n \xrightarrow{a.s.} \Sigma$.
An immediate consequence of the previous result is that large-sample standard errors of $\alpha_n$, $\beta_n$ and $\pi_n$ are given by the square roots of the diagonal elements of the matrix $n^{-1}\Sigma_n$. The finite-sample performance of these estimators is investigated in Section 5, and they are used in the illustrations of Section 6.
4.2 Estimation of the functional parameter
To estimate the unknown c.d.f. $F$ of $\varepsilon$, it is natural to start from (10). For a known $\eta = (\alpha, \beta) \in \mathbb{R}^2$, the term $J(\cdot, \eta)$ defined in (8) may be estimated by the empirical c.d.f. of the random sample $(Y_i - \alpha - \beta X_i)_{1 \le i \le n}$, i.e.,
$$J_n(t, \eta) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(Y_i - \alpha - \beta X_i \le t), \quad t \in \mathbb{R}.$$
Similarly, since $F^*$ (the c.d.f. of $\varepsilon^*$) is known, a natural estimator of the term $K(t, \eta)$ defined in (9) is given by the empirical mean of the random sample $\{F^*(t + \alpha + \beta X_i)\}_{1 \le i \le n}$, i.e.,
$$K_n(t, \eta) = \frac{1}{n}\sum_{i=1}^n F^*(t + \alpha + \beta X_i), \quad t \in \mathbb{R}.$$
To obtain estimators of $J(\cdot, \eta_0)$ and $K(\cdot, \eta_0)$, it is then natural to consider the plug-in estimators $J_n(\cdot, \eta_n)$ and $K_n(\cdot, \eta_n)$, respectively, based on the estimator $\eta_n = (\alpha_n, \beta_n) = (g_\alpha, g_\beta)(\gamma_n)$ of $\eta_0$ proposed in the previous subsection.

We shall therefore consider the following nonparametric estimator of $F$:
$$F_n(t) = \frac{1}{\pi_n}\{J_n(t, \eta_n) - (1 - \pi_n)K_n(t, \eta_n)\}, \quad t \in \mathbb{R}. \qquad (11)$$
Note that $F_n$ is not necessarily a c.d.f. as it is not necessarily increasing and can be smaller than zero or greater than one. In practice, we shall consider the partially corrected estimator $(F_n \vee 0) \wedge 1$, where $\vee$ and $\wedge$ denote the maximum and minimum, respectively.
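Estimator (11) also translates directly into code. The following sketch is an illustrative implementation under the assumption that the known error $\varepsilon^*$ is standard normal; for simplicity it is evaluated at the true parameter values rather than at $(\alpha_n, \beta_n, \pi_n)$.

```python
import math
import numpy as np

def F_n(t_grid, X, Y, alpha, beta, pi, F_star):
    """Plug-in estimator (11) of the unknown error c.d.f., clipped to
    [0, 1] as in the partially corrected estimator (F_n v 0) ^ 1."""
    t = np.asarray(t_grid)[:, None]
    resid = (Y - alpha - beta * X)[None, :]
    J = np.mean(resid <= t, axis=1)                             # J_n(t, eta)
    K = np.mean(F_star(t + alpha + beta * X[None, :]), axis=1)  # K_n(t, eta)
    return np.clip((J - (1.0 - pi) * K) / pi, 0.0, 1.0)

# Standard normal c.d.f., used here as the assumed known error c.d.f. F*.
Phi = np.vectorize(lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0))))

# Simulated data with arbitrary illustrative parameter values; the unknown
# error is N(0, 0.5^2), so its c.d.f. at 0 equals 0.5.
rng = np.random.default_rng(2)
n = 50_000
pi0, alpha0, beta0 = 0.7, 1.0, 2.0
X = rng.normal(size=n)
Z = rng.binomial(1, pi0, size=n)
Y = (1 - Z) * rng.normal(size=n) + Z * (alpha0 + beta0 * X + 0.5 * rng.normal(size=n))

grid = np.linspace(-2.0, 2.0, 9)
F_hat = F_n(grid, X, Y, alpha0, beta0, pi0, Phi)  # true parameters, for illustration
```

In practice, $\alpha_n$, $\beta_n$ and $\pi_n$ from Section 4.1 would replace the true values, at the cost of the additional plug-in error handled in the asymptotic analysis.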
To derive the asymptotic behavior of the previous estimator, we consider the following additional assumptions on the p.d.f.s $f^*$ and $f$ of $\varepsilon^*$ and $\varepsilon$, respectively:

A3. (i) $f^*$ and $f$ exist and are bounded on $\mathbb{R}$; (ii) $(f^*)'$ and $f'$ exist and are bounded on $\mathbb{R}$.