HAL Id: hal-00796296
https://hal.archives-ouvertes.fr/hal-00796296
Preprint submitted on 3 Mar 2013
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of
sci-entific research documents, whether they are
pub-lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destinée au dépôt et à la diffusion de documents
scientifiques de niveau recherche, publiés ou non,
émanant des établissements d’enseignement et de
recherche français ou étrangers, des laboratoires
publics ou privés.
An asymptotic test for Quantitative Trait Locus
detection in presence of missing genotypes
Charles-Elie Rabier
To cite this version:
Charles-Elie Rabier. An asymptotic test for Quantitative Trait Locus detection in presence of missing
genotypes. 2013. �hal-00796296�
An asymptotic test for Quantitative Trait Locus
detection in presence of missing genotypes
Charles-Elie Rabier
Summary. We consider the likelihood ratio test (LRT) process related to the test
of the absence of QTL (a QTL denotes a quantitative trait locus, i.e. a gene with quantitative effect on a trait) on the interval [0, T ] representing a chromosome. The originality is in the fact that some genotypes are missing. We give the asymptotic distribution of this LRT process under the null hypothesis that there is no QTL on [0, T ]and under local alternatives with a QTL at t?on [0, T ]. We show that the LRT
process is asymptotically the square of a “ non-linear interpolated and normalized Gaussian process ”. We have an easy formula in order to compute the supremum of the square of this interpolated process. We prove that the threshold is exactly the same as in the classical situation without missing genotypes.
Keywords: Gaussian process, Likelihood Ratio Test, Mixture models, Nuisance parameters present only under the alternative, QTL detection.
1. Introduction
We study a backcross population: A × (A × B), where A and B are purely homozygous lines and we address the problem of detecting a Quantitative Trait Locus, so-called QTL (a gene inuencing a quantitative trait which is able to be measured) on a given chromosome. The trait is observed on n individuals (progenies) and we denote by Yj, j = 1, ..., n, the observations, which we will
assume to be Gaussian, independent and identically distributed (i.i.d.). The mechanism of genetics, or more precisely of meiosis, implies that among the two chromosomes of each individual, one is purely inherited from A while the other (the recombined one), consists of parts originated from A and parts originated from B, due to crossing-overs.
The chromosome will be represented by the segment [0, T ]. The distance on [0, T ]is called the genetic distance, it is measured in Morgans (see for instance Wu et al. [2007] or Siegmund and Yakir [2007]). The genome X(t) of one in-dividual takes the value +1 if, for example, the recombined chromosome is originated from A at location t and takes the value −1 if it is originated from B . We use the Haldane modeling that can be represented as follows: X(0) is a random sign and X(t) = X(0)(−1)N (t) where N(.) is a standard Poisson process
on [0, T ]. Calculations on the Poisson distribution show that r(t, t0) := P(X(t)X(t0) = −1) = P(|N (t) − N (t0)| odd) = 1 2 (1 − e −2|t−t0|), we set in addition ¯ r(t, t0) = 1 − r(t, t0).
We assume an analysis of variance model for the quantitative trait :
Y = µ + X(t?) q + σε (1)
where ε is a Gaussian white noise and t∗ is the true location of the QTL.
Usually, in the classical problem of detecting a QTL on a chromosome, the genome information is available only at xed locations t1= 0 < t2< ... < tK =
T, called genetic markers. So, usually an observation is (Y, X(t1), ..., X(tK)) ,
The originality of this paper is that we consider the classical problem, but this time, we consider two real thresholds S− and S+ with S− 6 S+ and the
genotype of one individual is available if and only if the phenotype Y belongs to the interval S−6 Y 6 S+ . If we call X(t) the random variable such as
X(t) = (
X(t) if Y ∈ [S− , S+]
0 otherwise , then, in our problem, one observation will be now
Y, X(t1), ..., X(tK) .
Note that with our notations :
• when Y ∈ [S−, S+], we have X(t1) = X(t1), ..., X(tK) = X(tK).
• when Y /∈ [S− , S+], we have X(t1) = 0, ..., X(tK) = 0, which means that
the genome information is missing at the marker locations.
Note also that, in this paper, the word genotype" will refer to the genome information at markers locations.
We will observe n observations Yj, Xj(t1), ..., Xj(tK)i.i.d.
It can be proved (see Section 2) that Y, X(t1), ..., X(tK)obeys to a
mix-ture model with known weights, times a function g(.) which does not depend of the parameters µ, q and σ :
p(t∗) f (µ+q,σ)(y) 1y∈[S−,S+] + {1 − p(t ∗)} f (µ−q,σ)(y) 1y∈[S−,S+] + 1 2 f(µ+q,σ)(y) 1y /∈[S−,S+] + 1 2 f(µ−q,σ)(y) 1y /∈[S−,S+] g(.) (2)
where f(m,σ) is the Gaussian density with parameters (m, σ) and where the
function p(t) is fully given in Section 2.
As said before, the challenge is that t∗ is unknown. So, at every location
t ∈ [0, T ], we perform a Likelihood Ratio Test (LRT), Λn(t), of the hypothesis
“q = 0”. It leads to a LRT process Λn(.)and taking as test statistic the maximum
of this process comes down to perform a LRT in a model when the localisation of the QTL is an extra parameter.
In the classical problem of detecting a QTL on a chromosome, that is to say in the oracle situation where all the individuals are genotyped, the asymptotic distribution of the LRT statistic has been given under some approximations by Rebaï et al. [1995], Rebaï et al. [1994], Cierco [1998], Azaïs and Cierco-Ayrolles [2002], Azaïs and Wschebor [2009], Chang et al. [2009]. Recently, Azaïs et al. [2012] have shown that the distribution of the LRT statistic is asymptotically that of the maximum of the square of a non linear normalized interpolated process.
Until now, in QTL detection, dierent strategies have been studied in order to reduce costs of genotyping. A very famous one is called selective genotyping. It consists of genotyping only the individuals who present an extreme phenotype (i.e. the individuals for which Y 6 S− or Y > S+). Selective genotyping
has been studied theoretically by many authors : for instance Lebowitz and al. [1987], Lander and Botstein [1989], Darvasi and Soller [1992], Muranty and
Gonet [1997], Rabier [2012a]... However, in all these articles, the focus is only on one xed location of the genome. Recently, in Rabier [2012b], we focused on the whole chromosome, and we proved that the distribution of the LRT statistic was asymptotically that of the maximum of the square of a non linear normalized interpolated process.
In this paper, our goal is not to focus on tools for reducing costs due to geno-typing (as for selective genogeno-typing), but to help geneticists to analyze data in the case of missing genotypes. According to Arends et al. [2010], in an ideal world all datasets would be complete (with the genotype for every individual at every marker determined), however in the real world datasets are often incomplete". As a consequence, the originality of this paper is in the fact that we study a problem which has never been studied theoretically before : the detection of a QTL on a chromosome when only the genotypes of the non extreme individuals (i.e. the individuals for which the phenotypes Y belong to the interval [S−, S+])
are available. The main result of the paper (Theorems 1 and 2) is that the distri-bution of the LRT statistic is asymptotically that of the maximum of the square of a non linear normalized interpolated process. This is a generalization of the results obtained by Azaïs et al. [2012] only for the oracle situation. Under the null hypothesis, despite the missing genotypes, our process is exactly the same as the one obtained by Azaïs et al. [2012]. However, under the alternative, we show that the mean functions of the two processes are not the same anymore.
Some important results are also introduced in Theorem 3 and Lemma 3. In Theorem 3, we give the Asymptotic Relative Eciency (ARE) with respect to the oracle situation. In Lemma 3, we present an easy formula (see also formula 21) to compute the maximum of the square of the non linear interpolated process. This formula is original. Usually when we look for a QTL on a chromosome with missing genotypes, we have to compute an EM algorithm at each location, so it is quite challenging. With our formula, we don't need to perform any EM algorithm and we only have to focus on given locations on the chromosome. Note that in this paper, we also prove that the extreme phenotypes (for which the genotypes are missing) don't bring any extra information for statistical inference. This result is complementary to the one obtained in Rabier [2012b], where I show that, under selective genotyping, the non extreme phenotypes don't bring any information for statistical inference.
To conclude, we will illustrate our theoretical results with the help of simu-lated data. Note that, according to Theorem 1 and 2, the threshold (i.e. critical value) in our study is exactly the same as the classical threshold used in the oracle situation. So, in order to obtain our threshold, the Monte Carlo Quasi Monte-Carlo methods of Azaïs et al. [2012], based on Genz [1992] is still suitable here. This is an alternative to the permutation method proposed by Churchill and Doerge [1994], which is very time consuming and not easy to compute here because of the missing genotypes.
We refer to the book of Van der Vaart [1998] for elements of asymptotic statistics used in proofs.
2. Main results : two genetic markers
To begin, we suppose that there are only two markers (K = 2) located at 0 and T : 0 = t1< t2= T. We look for a QTL located at t?∈ [t1, t2]. As said before,
since t? is unknown, we have to consider every locations t ∈ [t
1, t2]. So, let's
consider a location t ∈ [t1, t2], and let's suppose t = t?.
Notations : For (i, i0) ∈ {−1, 1}2, Qi,i0
t is the quantity such as
Qi,it 0 = PX(t) = 1
X(t1) = i, X(t2) = i0 .
Using Bayes rules, we have Q1,1t = ¯ r(t1, t) ¯r(t, t2) ¯ r(t1, t2) , Q1,−1t = ¯ r(t1, t) r(t, t2) r(t1, t2) (3) Q−1,1t = r(t1, t) ¯r(t, t2) r(t1, t2) , Q−1,−1t =r(t1, t) r(t, t2) ¯ r(t1, t2) . We can remark that we have
Q−1,−1t = 1 − Q 1,1 t and Q −1,1 t = 1 − Q 1,−1 t .
Notations : Pt{l | i}is the quantity such as ∀ l ∈ {−1, 0, 1} and ∀ i ∈ {−1, 1}
Pt{l | i} = P(X(t) = l | X(t) = i).
In order to compute the likelihood, we have to study the dierent probability distributions. To begin, let's compute P(Y ∈ [y , y + dy] ∩ X(t1) = 1 ∩ X(t2) =
1)for instance. We have, according to Bayes rules (we remind that we consider t = t?), P(Y ∈ [y , y + dy] ∩ X(t1) = 1 ∩ X(t2) = 1) = X i∈{−1,1} P(Y ∈ [y , y + dy] | X(t) = i) P(X(t) = i ∩ X(t1) = 1 ∩ X(t2) = 1) . Besides,
P(Y ∈ [y , y + dy] | X(t) = i) = P(Y ∈ [y , y + dy] ∩ X 6= 0 | X(t) = i) P(X(t) 6= 0 | X(t) = i)
= f(µ+iq,σ)(y) 1y∈[S−,S+]
Pt{i | i} and P(X(t) = i ∩ X(t1) = 1 ∩ X(t2) = 1) = P(X(t) 6= 0 ∩ X(t) = i ∩ X(t1) = 1 ∩ X(t2) = 1) = Pt{i | i} P(X(t) = i ∩ X(t1) = 1 ∩ X(t2) = 1) = 1 2 Pt{1 | 1} r(t1, t) r(t, t2)1i=1 + 1 2 Pt{−1 | −1} r(t1, t) r(t, t2)1i=−1 . It comes, using formula (3),
P(Y ∈ [y , y + dy] ∩ X(t1) = 1 ∩ X(t2) = 1) = 1 2 f(µ+q,σ)(y) 1y∈[S−,S+] r(t1, t2) Q 1,1 t + 1 2 f(µ−q,σ)(y) 1y∈[S−,S+]r(t1, t2) Q −1,−1 t .
In the same way, after some calculations, we nd P(Y ∈ [y , y + dy] ∩ X(t1) = 1 ∩ X(t2) = −1) = 1 2 f(µ+q,σ)(y) 1y∈[S−,S+] r(t1, t2) Q 1,−1 t + 1 2 f(µ−q,σ)(y) 1y∈[S−,S+] r(t1, t2) Q −1,1 t , P(Y ∈ [y , y + dy] ∩ X(t1) = −1 ∩ X(t2) = 1) = 1 2 f(µ+q,σ)(y) 1y∈[S−,S+] r(t1, t2) Q −1,1 t + 1 2 f(µ−q,σ)(y) 1y∈[S−,S+] r(t1, t2) Q 1,−1 t , P(Y ∈ [y , y + dy] ∩ X(t1) = −1 ∩ X(t2) = −1) = 1 2 f(µ+q,σ)(y) 1y∈[S−,S+] r(t1, t2) Q −1,−1 t + 1 2 f(µ−q,σ)(y) 1y∈[S−,S+]r(t1, t2) Q 1,1 t .
Finally, when the genotype is missing (i.e. the phenotype is extreme), we nd P(Y ∈ [y , y + dy] ∩ X(t1) = 0 ∩ X(t2) = 0)
=1
2 f(µ+q,σ)(y) 1y /∈[S−,S+] +
1
2 f(µ−q,σ)(y) 1y /∈[S−,S+] .
Let's dene the quantity p(t) such as p(t) = Q1,1t 1X(t 1)=11X(t2)=1 + Q 1,−1 t 1X(t1)=11X(t2)=−1 + Q−1,1t 1X(t 1)=−11X(t2)=1 + Q −1,−1 t 1X(t1)=−11X(t2)=−1 (4)
and let θ = (q, µ, σ) be the parameter of the model at t xed. As a consequence, the likelihood of the triplet Y, X(t1), X(t2)
with respect to the measure λ ⊗ N ⊗ N, λ being the Lebesgue measure, N the counting measure on N, is ∀t ∈ [t1, t2]:
Lt(θ) = p(t) f(µ+q,σ)(y)1y∈[S−,S+]+ {1 − p(t)} f(µ−q,σ)(y)1y∈[S−,S+] (5)
+ 1 2 f(µ+q,σ)(y)1y /∈[S−,S+] + 1 2 f(µ−q,σ)(y)1y /∈[S−,S+] g(t) where the function
g(t) = 1 2 n ¯ r(t1, t2) 1X(t1)=11X(t2)=1 + r(t1, t2) 1X(t1)=11X(t2)=−1 o +1 2 n r(t1, t2) 1X(t1)=−11X(t2)=1 + ¯r(t1, t2) 1X(t1)=−11X(t2)=−1 o + 1X(t 1)=01X(t2)=0
can be removed because it does not depend on the parameters. Note that for t = t?, we nd our formula (2) of the introduction where p(t?) is described in
formula (4).
Notations : γ, γ+ and γ− are respectively the quantities
PH0(Y /∈ [S−, S+]), PH0(Y > S+)and PH0(Y < S−).
B = σ2 1 − γ − z
γ+ ϕ(zγ+) + z1−γ− ϕ(z1−γ−)
, where ϕ(x) and z
α denote
respectively the density of a standard normal distribution taken at the point x, and the quantile of order 1 − α of a standard normal distribution.
Our main result is the following
Theorem 1. Suppose that the parameters (q, µ, σ2) vary in a compact and
that σ2 is bounded away from zero. Let H
0 be the null hypothesis q = 0 and
dene the following local alternative
Hat? :the QTL is located at the position t? with eect q = a/
√
nwhere a 6= 0 . With the previous dened notations,
Sn(.) ⇒ U (.) , Λn(.) F.d. −→ U2(.) , sup Λ n(.) L −→ sup U2(.)
as n tends to innity, under H0 and Hat? where :
• Sn(.)is the score process
• ⇒ is the weak convergence, F.d.→ is the convergence of nite-dimensional distributions and L
−→ is the convergence in distribution • U (.)is the Gaussian process with unit variance such as :
U (t) = p α(t)U (t1) + β(t)U (t2) V {α(t)U (t1) + β(t)U (t2)}
(6) where
Cov {U(t1), U(t2)} = ρ(t1, t2) = exp(−2|t1− t2|)
α(t) = Q1,1t − Q−1,1t , β(t) = Q1,1t − Q1,−1t and with expectation :
• under H0, m(t) = 0, • under Hat? mt?(t) = α(t) mt?(t1) + β(t) mt?(t2) p V {α(t)U (t1) + β(t)U (t2)} where mt?(t1) = a√B ρ(t1, t?) σ2 , mt?(t2) = a√B ρ(t?, t 2) σ2 .
In the sense of this equation, U(.) will be called a "non linear normalized inter-polated process". We can see that under the null hypothesis, despite the missing genotypes, U(.) is exactly the same process as the process Z(.) of Theorem 2.1 of Azaïs et al. [2012] obtained for the oracle situation. However, under the al-ternative, the mean functions of the two processes are not the same anymore : the mean functions are proportional of a factor √B/σ. Note also that U(.) is the generalization of Z(.). Indeed, if we choose S− = −∞and S+ = +∞, that
is to say the genotypes of all the individuals are available, the factor √B/σ is equal to 1, and U(.) is the same process as Z(.).
Fisher Information Matrix
Let lt(θ)be the loglikelihood. We rst compute the Fisher information at a point
θ0 that belongs to H0. We have
∂lt ∂q |θ0 = y − µ σ2 {2p(t) − 1} 1y∈[S−,S+] , (7) ∂lt ∂µ |θ0 = y − µ σ2 , ∂lt ∂σ |θ0= − 1 σ + (y − µ)2 σ3 . Then, EH0 ( ∂lt ∂q |θ0 2) = EH0 ( y − µ σ2 2 {2p(t) − 1}21y∈[S−,S+] ) . Let's introduce two key lemmas :
Lemma 1. We have the following relationship :
{2p(t) − 1} 1y∈[S−,S+]= α(t)X(t1) + β(t)X(t2) with α(t) = Q1,1 t − Q −1,1 t and β(t) = Q 1,1 t − Q 1,−1 t .
To prove this lemma, use formula (4) and check that both sides coincide when y ∈ [S−, S+].
Lemma 2. Let W ∼ N(µ, σ2), then :
i) E n (W − µ)21W ∈[S −, S+] o = σ2− σ2 P(W /∈ [S−, S+]) − σ (S+− µ) ϕ S +−µ σ + σ (S−− µ) ϕ S −−µ σ ii) En(W − µ)1W ∈[S −, S+] o = −σ ϕ S +−µ σ + σ ϕ S −−µ σ
To prove this lemma, use integration by parts.
According to i) of Lemma 2, we have B = EH0(y − µ)
21
y∈[S−,S+]
. It comes, according to Lemma 1 :
EH0 ( ∂lt ∂q |θ0 2) (8) = EH0 " y − µ σ2 2 α(t)X(t1) + β(t)X(t2) 2 # = EH0 " y − µ σ2 2 {α(t)X(t1) + β(t)X(t2)} 2 1y∈[S−,S+] # = EH0 ( y − µ σ2 2 1y∈[S−,S+] ) EH0 h {α(t)X(t1) + β(t)X(t2)}2 i = Bnα2(t) + β2(t) + 2α(t)β(t)e−2(t2−t1)o/σ4.
To conclude, after some calculations, we nd Iθ0= Diag Bnα2(t) + β2(t) + 2α(t)β(t)e−2(t2−t1)o/σ4, 1 σ2 , 2 σ2 . (9) Only the computation of EH0
n − ∂lt
∂q∂µ |θ0o and EH0
n − ∂lt
∂q∂σ |θ0o, were not easy.
Let's prove now why these two terms are equal to zero. We have ∂lt
∂q∂µ |θ0= −
2p(t) − 1
σ2 1y∈[S−,S+] .
It comes, using Lemma 1, EH0 − ∂lt ∂q∂µ |θ0 = − 1 σ2 EH0α(t)X(t1) + β(t)X(t2) = − 1 σ2 EH0[α(t)X(t1) + β(t)X(t2) | y ∈ [S−, S+] ] PH0(y ∈ [S−, S+]) = − 1 σ2 EH0{α(t)X(t1) + β(t)X(t2)} PH0(y ∈ [S−, S+]) = 0. Besides, ∂lt ∂q∂σ |θ0= − 2 σ3(y − µ) {2p(t) − 1} 1y∈[S−,S+] . It comes EH0 ∂l t ∂q∂σ |θ0 (10) = − 2 σ3 EH0(y − µ)1y∈[S−,S+] EH0{α(t)X(t1) + β(t)X(t2)} = 0.
It concludes the proof for the Fisher Information matrix. Study of the score process under H0
Since the Fisher Information matrix is diagonal, the score statistic (for n obser-vations) of the hypothesis q = 0 will be dened as
Sn(t) = ∂ln t ∂q |θ0 r V ∂ln t ∂q |θ0 .
Now using formula (7) and using Lemma 1, it is clear that ∂lnt ∂q |θ0 = n X j=1 yj− µ σ2 {2pj(t) − 1} 1yj∈[S−,S+] =α(t) σ n X j=1 εj Xj(t1) + β(t) σ n X j=1 εjXj(t2) (11)
this proves that U(.) is a non linear interpolated process.
On the other hand, according to formula (7) and Lemma 1, we have ∀k = 1, 2 : Sn(tk) = n X j=1 σεjXj(tk) √ n B . We have : Eσε X(tk) = E σ ε 1y∈[S−,S+]| X(tk) = 1 P {X(tk) = 1} − E σ ε 1y∈[S−,S+]| X(tk) = −1 P {X(tk) = −1} = E σ ε 1y∈[S−,S+] /2 − E σ ε 1y∈[S−,S+] /2 = 0 . Besides : E h σ2ε2X(tk) 2i = E(σ2ε21y∈[S−,S+]) = B .
According to the Central Limit Theorem, it comes Sn(tk)
L
−→ N (0, 1) .
Let's compute the covariance of the score statistics on markers, i.e. Cov {Sn(t1), Sn(t2)}.
Since E (y − µ)21 y∈[S−,S+] = B, we have : E {Sn(t1)Sn(t2)} = 1 B E(y − µ) 2X(t 1) X(t2) 1y∈[S−,S+] = 1 B E(y − µ) 21 y∈[S−,S+] E {X(t1)X(t2)} = e−2(t2−t1) .
As a consequence, Cov {Sn(t1), Sn(t2)} = e−2(t2−t1). The weak convergence of
the score process, Sn(.), is then a direct consequence of (11), the convergence of
(Sn(t1), Sn(t2))and the Continuous Mapping Theorem.
Study under the local alternative
Let's consider a local alternative dened by t∗ and q = a/√n.
It remains to compute the asymptotic distribution of Sn(.) under this
al-ternative. Since we have already proved that Sn(.) is a non linear interpolated
process (see formula 11), we only need to compute the distribution of Sn(t1)and
Sn(t2) under the alternative. The mean function of the process is obviously a
non linear interpolated function (same interpolation as previously). So, let's consider the score statistic at location tk ∀k = 1, 2. We have
Sn(tk) = n X j=1 (yj− µ) Xj(tk) √ n B = n X j=1 qXj(t?) Xj(tk) √ n B + n X j=1 σεjXj(tk) √ n B .
We will see, that we can apply the Law of Large Numbers for the rst term and the Central Limit Theorem for the second term. To begin, let's focus on the second term. So, rst we compute
Eσε X(tk) (12) = 1 2 Eσε X(tk) | X(t ?) = 1 + 1 2 Eσε X(tk) | X(t ?) = −1 . We have Eσε X(tk) | X(t?) = 1 = Eσε 1y∈[S−,S+] | X(t ?) = 1 ¯r(t k, t?) − Eσε 1y∈[S−,S+] | X(t ?) = 1 r(t k, t?) = Eσε 1y /∈[S−,S+] | X(t ?) = 1 e−2|tk−t?| .
Besides, according to ii) of Lemma 2, Eσε 1y∈[S−,S+] | X(t ?) = 1 = −σϕ S+− µ − q σ + σϕ S−− µ − q σ . It comes Eσε X(tk) | X(t?) = 1 = e−2|tk−t ?| σ −ϕ S+− µ − q σ + ϕ S−− µ − q σ . (13) In the same way, after some calculations, we obtain
Eσε X(tk) | X(t?) = −1 = e−2|tk−t ?| σ ϕ S+− µ + q σ − ϕ S−− µ + q σ . (14) Since we consider q small, using a Taylor expansion at rst order, we obtain for instance : ϕ S−− µ + q σ =√1 2π e −1 2 S−− µ σ 2 1 − (S−− µ) q σ2 + o(q) . Finally, using Taylor expansions in formulae (13) and (14), we have :
Eσε X(tk) = e−2|tk−t ?| q −zγ+ϕ(zγ+) + z1−γ− ϕ(z1−γ−) + o(q) . It comes E n X j=1 σεj Xj(tk) √ n B → e −2|tk−t?| √ B a −zγ+ ϕ(zγ+) + z1−γ−ϕ(z1−γ−) . (15) We have now just to remark that
E σε X(tk) 2 = Eσ2ε21y∈[S−,S+] = Eσ2ε21 y∈[S−,S+]| X(t ?) = 1 /2 + E σ2ε21 y∈[S−,S+]| X(t ?) = −1 /2 → B/2 + B/2 → B.
It comes V n X j=1 σεj Xj(tk) √ n B → 1, and according to the Central Limit Theorem
n X j=1 σεj Xj(tk) √ n B L −→ N e −2|tk−t?| √ B a −zγ+ ϕ(zγ+) + z1−γ−ϕ(z1−γ−) , 1 . (16) Let us focus now on the rst term of the score statistic. We have
EX(t?) X(tk) (17) = 1 2 Pt?{1 | 1} EX(tk) | X(t ?) = 1 − 1 2 Pt?{−1 | −1} EX(tk) | X(t ?) = −1 = 1 2 e −2|tk−t?|(P t?{1 | 1} + Pt?{−1 | −1}) .
Using Taylor expansion and after some work on integrals, we have : Pt?{1 | 1} = −Φ S−− µ σ + q σϕ S−− µ σ + Φ S+− µ σ − q σϕ S+− µ σ + o(q) where Φ(.) is the cumulative distribution of the standard normal distribution.
Note that we can replace q by −q in order to obtain the expression of Pt?{−1 | −1}.
It comes EX(t?) X(tk) = e−2|tk−t ?| −Φ S−− µ σ + Φ S+− µ σ + o(q) = e−2|tk−t?|(1 − γ) + o(q) .
As a consequence, according to the Law of Large Numbers,
n X j=1 qXj(t?) Xj(tk) √ n B → a e−2|tk−t?|(1 − γ) √ B . (18)
Finally, using formulae (16) and (18), we obtain Sn(tk) L −→ N (a √ B σ2 e −2|tk−t?|, 1) , (19)
which concludes the proof.
Study of the supremum of the LRT process Let ln
t(θ) be the log likelihood for n observations log likelihood. Let ltn(bθ) be
the maximized log likelihood and let ln
t(bθ|H0) be the maximized log likelihood
under H0, with bθ|H0 = 0, Y =P Yj/n, 1/nP(Yj− Y )
2(the genetic markers are useless under H0). The likelihood ratio statistics will be dened as
on n independent observations.
Since the model with t xed is regular, it is easy to prove that for xed t Λn(t) = Sn2(t) + oP(1)
under the null hypothesis. Our goal is now to prove that the rest above is uniform in t.
Let us consider now t as an extra parameter. Let t∗, θ∗be the true parameter
that will be assumed to belong to H0. Note that t∗makes no sense for θ belonging
to H0. It is easy to check that at H0the Fisher information relative to t is zero
so that the model is not regular.
It can be proved that assumptions 1, 2 and 3 of Azaïs et al. [2009] holds. So, we can apply Theorem 1 of Azaïs et al. [2009] and we have
sup (t,θ) lt(θ) − lt∗(θ∗) = sup d∈D 1 √ n n X j=1 d(Xj) 2 1d(Xj)>0 + oP(1) (20)
where the observation Xj stands for Yj, Xj(t1), Xj(t2)and where D is the set
of scores dened in Azaïs et al. [2009], see also Gassiat [2002] and Azaïs et al. [2006]. A similar result is true under H0 with a set D0. Let us precise the sets
of scores D and D0. This sets are dened at the sets of scores of one parameter
families that converge to the true model pt∗,θ∗ and that are dierentiable in
quadratic mean. It is easy to see that
D =n hW, l 0 t(θ∗)i p V(hW, lt0(θ∗)i) , W ∈ R3, t ∈ [t1, t2] o
where l0 is the gradient with respect to θ. In the same manner
D0=
n hW, l0t(θ∗)i p
V(hW, l0t(θ∗)i)
, W ∈ R2o,
where now the gradient is taken with respect to µ and σ only. Of course this gradient does not depend on t.
Using the transform W → −W in the expressions of the sets of score, we see that the indicator function can be removed in formula (20). Then, since the Fisher information matrix is diagonal (see formula (9)) , it is easy to see that
sup d∈D 1 √ n n X j=1 d(Xj) 2 − sup d∈D0 1 √ n n X j=1 d(Xj) 2 = sup t∈[t1,t2] 1 √ n n X j=1 ∂lt ∂q(Xj) |θ0 r V n ∂lt ∂q(Xj) |θ0 o 2 .
This is exactly the desired result. Note that the model with t∗ xed is
dieren-tiable in quadratic mean, this implies that the alternative denes a contiguous sequence of alternatives. By Le Cam's rst lemma, relation (20) remains true under the alternative.
Remark : According to the Law of Large Numbers, under the null hypoth-esis H0 and under the local alternative Hat?, 1
nP 1yj∈[S+,S−]→ 1 − γ. So, 1 − γ
corresponds asymptotically to the percentage of individuals genotyped. In the same way, γ+ (resp. γ−) corresponds asymptotically to the percentage of
non-genotyped individuals in the right tail (resp. the left tail) of the distribution. 3. An easy way to perform the statistical test
Since U(.) is a "non linear normalized interpolated process", we can use Lemma 2.2 of Azaïs et al. [2012] in order to compute easily the supremum of U2(.). Note
that this lemma is suitable here because we have exactly the same interpolation as in Theorem 2.1 of Azaïs et al. [2012]. It comes
max t∈[t1,t2] {α(t)U (t1) + β(t)U (t2)} 2 α2(t) + β2(t) + 2ρ(t 1, t2)α(t)β(t) = max U2(t1) , U2(t2) , U2(t 1) + U2(t2) − 2ρ(t1, t2)U (t1)U (t2) 1 − ρ2(t 1, t2) 1U (t2) U (t1) ∈ ] ρ(t1,t2) , ρ(t1,t2)1 [ . (21)
Note that since under H0, the process U(.) is exactly the same process as the
pro-cess Z(.) obtained by Azaïs et al. [2012], we will have exactly the same threshold as the one under the oracle situation (i.e. all the individuals genotyped). So, the Monte-Carlo Quasi Monte-Carlo method of Azaïs et al. [2012] and based on Genz [1992], is still suitable here.
Let's focus now on the data analysis. Which test statistic should we use in order to make the data analysis easy ? Indeed, when we focus only on one location of the genome which is a marker location, performing a LRT or a Wald test is time consuming : an EM algorithm is required to obtain the maximum likelihood estimators. So, since we focus here on the whole chromosome, we have to propose the easiest statistical test for geneticists.
As a consequence, ∀k = 1, 2 , let's dene now the test statistic Tn(tk) such
as Tn(tk) = Pn j=1(Yj− Y ) Xj(tk) q Pn j=1(Yj− Y )21Yj∈[S−,S+] . We introduce the following lemma.
Lemma 3. Let Tn(.)be the process such as
Tn(t) = α(t)Tn(t1) + β(t)Tn(t2) pα2(t) + β2(t) + 2ρ(t 1, t2)α(t)β(t) , then Tn(.) ⇒ U (.) and Tn2(.) ⇒ U 2(.) .
Then, for the data analysis, we just have to consider as a test statistic sup T2
n(.), which can be obtained easily using formula (21) and replacing U(t1)
and U(t2) by respectively Tn(t1) and Tn(t2). Note that, according to Lemma
3, this test has the same asymptotic properties as the test based on the test statistic sup Λn(.), which corresponds to a LRT on the whole chromosome.
On the other hand, a consequence of Lemma 3 is that the extreme phenotypes (for which the genotypes are missing) don't bring any information for statistical
inference. Indeed, our test statistics Tn(t) are based only on the non extreme
phenotypes, as soon as we replace the empirical mean Y by ˆµ, an estimator √
n consistent based only on the non extreme phenotypes (ˆµ can be obtained by the method of moments for instance). This result is complementary to the one obtained in Rabier [2012b], where I show that, under selective genotyping, the non extreme phenotypes (i.e. Y ∈ [S−, S+] in the case of the selective
genotyping) don't bring any information for statistical inference. Proof. Lemma 3
For k = 1, 2, we dene ˜T (tk)such as
˜ Tn(tk) = Pn j=1(Yj− Y ) Xj(tk) √ n B .
To begin, in order to make the proof easier, let's consider that we are under H0.
Since Y = µ + OP(1/ √ n), we have ˜ Tn(tk) = Pn j=1(Yj− µ) Xj(tk) √ n B + OP 1 √ n Pn j=1Xj(tk) √ n B .
Let's focus on the second term under H0. We have
EX(tk) = E [X(tk) | Y ∈ [S−, S+]] P(Y ∈ [S−, S+]) = E [X(tk)] (1 − γ) = 0. By Prohorov, it comes Pn j=1Xj(tk) = OP(1/ √ n). It comes ˜Tn(tk) = Sn(tk) + OP(1/ √ n)and as a consequence ˜Tn(tk) = Sn(tk) +
oP(1). As said before, the model with t∗xed is dierentiable in quadratic mean,
this implies that the alternative denes a contiguous sequence of alternatives. By Le Cam's rst lemma, the remainder converges also to 0 in probability under the alternative. So, if we apply the Multivariate Central Limit Theorem, we have now ˜Tn(t1), ˜Tn(t2) L
−→ (U (t1), U (t2))whatever the hypothesis. We set
in addition ˆ B = 1 n n X j=1 (Yj− Y )21Yj∈[S−,S+] .
We have the relationship (Tn(t1), Tn(t2)) =
q B ˆ B ˜Tn(t1), ˜Tn(t2). Since ˆB L −→ Bwhatever the hypothesis, according to Slutsly and then Continuous Mapping theorem, we have qB
ˆ B
L
−→ 1. Using Slutsky, it comes (Tn(t1), Tn(t2)) L
−→ (U (t1), U (t2)). To conclude the proof, we just have to use the Continuous
Map-ping Theorem : Tn(.) ⇒ U (.)and obviously Tn2(.) ⇒ U2(.).
4. Several markers : the “Interval Mapping‘’ of Lander and Botstein [1989] in presence of missing genotypes
In that case suppose that there are K markers 0 = t1 < t2 < ... < tK = T.
positions, and the result will be prolonged by continuity at the markers positions. For t ∈ [t1, tK]\TK where TK= {t1, ..., tK}, we dene t` and tras :
t`= sup {tk∈ TK : tk< t} , tr= inf {tk∈ TK : t < tk} .
In other words, t belongs to the Marker interval" (t`, tr).
Theorem 2. We have the same result as in Theorem 1, provided that we make some adjustments and that we redene U(.) in the following way :
• in the denition of α(t) and β(t), t1 becomes t` and t2 becomes tr
• under the null hypothesis, the process U(.) considered at marker positions is the "squeleton" of an Ornstein-Uhlenbeck process: the stationary Gaussian process with covariance ρ(tk, tk0) = exp(−2|tk− tk0|)
• at the other positions, U(.) is obtained from U(t`)and U(tr)by
interpola-tion and normalizainterpola-tion using the funcinterpola-tions α(t) and β(t) • at the marker positions, the expectation is such as mt?(tk) =a
√
B ρ(tk,t?)
σ2
• at other positions, the expection is obtained from mt?(t`) and mt?(tr) by
interpolation and normalization using the functions α(t) and β(t). The proof of the theorem is the same the proof of Theorem 1 since for a position t, we can limit our attention to the interval (t`, tr). Note that it is due
to Haldane model with Poisson increments. Another key point for the proof, is that when t?does not belong to the marker interval (t`, tr), we can still use the
section Study under the alternative" of the proof of Theorem 1 .
Another important point is that since for a position t we can limit our attention to the interval (t`, tr), Lemma 3 and formula (21) are still true here. We just
have to replace t1 and t2 by t`and trin order to have the good expressions. As
a consequence, we can easily compute sup T2 n(.).
We introduce now our Theorem 3.
Theorem 3. Let κ be the Asymptotic Relative Eciency (ARE) with respect to the oracle situation where all the genotypes are known. Then, we have
i) κ = 1 − γ − zγ+ ϕ(zγ+) + z1−γ− ϕ(z1−γ−)
ii) κ reaches its maximum for γ+= γor γ−= γ
iii) κ > 1 − γ ⇔ zγ+ϕ(zγ+) > z1−γ−ϕ(z1−γ−).
According to i) of Theorem 3, the ARE with respect to the oracle situation, does not depend on the constant a linked to the QTL eect, and does not depend on the location of the QTL t?. On the other hand, according to ii) of Theorem 3,
if only a percentage 1 − γ of genotypes is available in the population considered, the eciency of our test is maximum when all the missing genotypes are located in the right tail of the distribution (i.e. γ+ = γ). Obviously, by symmetry, the
eciency of our test is also maximum when all the missing genotypes are located in the left tail of the distribution (i.e. γ−= γ). Note also, that according to iii),
our test can reduce costs due to genotyping when zγ+ϕ(zγ+) > z1−γ− ϕ(z1−γ−).
However, this condition is very restrictive due to the properties of the Gaussian distribution.
Proof. The proof of i) is obvious since the mean function of the process U (.)and the one of the process Z(.) corresponding to the oracle situation, are proportional of a factor √B/σ. Let's now prove that the maximum is reached for γ− = γ, that is to say γ+ = 0, since γ = γ+ + γ−. Note that without
loss generality, it will also prove that the maximum is reached for γ+ = γ and
γ− = 0. We have to answer the following question : how must we choose γ+
and γ− to maximize the eciency ? We remind that γ++ γ− = γ and that
ϕ(.)and Φ(.) denote respectively the density and the cumulative distribution of the standard normal distribution. Let u(.) be the function such as : u(zγ+) =
Φ−1γ − 1 + Φ(zγ+)
. Then, z
1−γ−= u(zγ+).
Let k1(.)be the following function : k1(zγ+) = zγ+ϕ(zγ+) − u(zγ+) ϕu(zγ+)
. In order to maximize κ, we have to minimize the function k1(.). Let k10(.), u0(.)
and ϕ0(.)be respectively the derivative of k
1(.), u(.) and ϕ(.). We have :
k01(zγ+) = ϕ(zγ+) + zγ+ϕ 0(z γ+) − u 0(z γ+) ϕu(zγ+) − u(zγ+) u 0(z γ+) ϕ 0u(z γ+) , u0(zγ+) = ϕ(zγ+) ϕ(z1−γ−) . As a consequence, k10(zγ+) = ϕ(zγ+)(z 2 γ−− z 2 γ+) . If zγ+ = +∞, then k 0
1(zγ+) = 0. It can been proved that γ+ = 0corresponds
to a minimum for k1(.). As a result, the eciency κ reaches its maximum when
γ−= γ.
5. Applications
In this Section, we propose to illustrate the theoretical results obtained in this paper. For all the following applications, we will consider statistical tests at the 5%level. If we call hn(tk, tk+1) = T2 n(tk) + Tn2(tk+1) − 2ρ(tk, tk+1)Tn(tk)Tn(tk+1) 1 − ρ2(t k, tk+1) 1Tn(tk+1) Tn(tk) ∈]ρ(tk,tk+1), 1 ρ(tk,tk+1)[ , as explained before, an easy way to perform our statistical test is to use the test
statistic
Mn= maxTn2(t1), Tn2(t2), hn(t1, t2), ..., Tn2(tK−1), Tn2(tK), hn(tK−1, tK) .
Our rst result is that the threshold (i.e. critical value) is the same if only the genotypes of the non extreme individuals (i.e. the individuals for which Y ∈ [S−, S+]) are available or if all the genotypes are available (i.e. the oracle
situation). So, the Monte-Carlo Quasi Monte-Carlo method, proposed by Azaïs et al. [2012] (based on Genz [1992]) for the oracle situation, is still suitable here to obtain our threshold. Note that in Azaïs et al. [2012], the authors show
that their method gives better results than the method of Feingold and al. [1993] based on Siegmund [1985], and the method of Rebaï et al. [1994] based on Davies [1977] and Davies [1987]. This way, in Tables 1 and 2, we propose to check on simulated data, the fact that the threshold is the same as in the oracle situation. First, in Table 1, we consider a sparse map : a chromosome of length T = 1M, with two genetic markers located at each extremity. For such a conguration, if we choose a 5% level, the corresponding threshold is 4.89. We consider γ = 0.2. In other words we have 20% of missing genotypes. Besides, we consider dierent values for the percentage γ+ of indivuals not genotyped in the right tail of the
distribution. We can see that, whatever the value of γ+, the Percentage of False
Positives is close to the true level of the test (i.e. 5% ) even for small values of n (see n = 50). Then, in Table 2, we consider a more dense genetic map. We still consider a chromosome of length T = 1M, but 6 genetic markers are now equally spaced every 20cM. We can remark that, as previously, the Percentage of False Positives is close to 5%.
Let's now focus on the alternative hypothesis. To begin, in Table 3, we consider the sparse map and the same value of γ as previously. For the QTL eect q, we consider a = 4 : we remind that q = a/√n. We focus on dierent locations t? of the QTL and dierent values of γ
+. We present the Theoretical
Power based on 100000 paths of the asymptotic process, and also the Empirical Power (in brackets) obtained for n = 1000. We can see that the Theoretical Power and the Empirical Power are very close whatever the values of t? and
γ+ are. Besides, as expected (cf. Theorem 3), we can see that the Theoretical
Power is maximum when the missing genotypes are all located in the right tail of the distribution (i.e. γ+ = γ). Note that we would have obtained the same
result for γ−= γ (i.e. left tail). Finally, in Table 4, we consider the dense map,
and we change the value of a : a = 6. We obtain the same kind of conclusions as before. This result was expected since all the theoretical results obtained in this paper, are suitable for any kind of genetic map.
To conclude, we present in this paper easy ways to analyze data in presence of missing genotypes. That's why it must be interesting for geneticists.
H H H H H γ+ n 1000 200 100 50 γ/4 4.98% 5.43% 4.60% 4.79% γ/2 5.10% 4.87% 5.17% 4.94% γ 5.18% 5.31% 4.49% 4.61%
Fig. 1. Percentage of False Positives as a function of the number of individuals n and
the percentage of individuals γ+not genotyped in the right tail. The chromosome is of
length T = 1M and 2 markers are located at each extremity (γ = 0.2, 1 − γ = 0.8, a = 0, σ = 1, µ = 0, 10000 samples of n individuals).
H H H H H γ+ n 1000 200 100 50 γ/4 5.08% 4.65% 4.59% 4.48% γ/2 5.01% 4.70% 4.56% 4.44% γ 5.06% 4.67% 4.28% 4.20%
Fig. 2. Percentage of False Positives as a function of the number of individuals n and
the percentage of individuals γ+not genotyped in the right tail. The chromosome is of
length T = 1M and 6 markers are equally spaced every 20cM (γ = 0.2, 1 − γ = 0.8, a = 0, σ = 1, µ = 0, 10000 samples of n individuals). H H H H H γ+ t? 5cM 12cM 18cM γ/4 60.46% 58.88% 62.88% (60.75%) (58.43%) (62.18%) γ/2 56.17% 54.57% 58.44% (56.68%) (54.82%) (58.16%) γ 76.61% 74.79% 79.10% (76.34%) (74.71%) (78.60%)
Fig. 3. Theoretical power and Empirical Power (in brackets) as a function of the location
of the QTL t?and the percentage γ
+of individuals non genotyped in the right tail. The
chromosome is of length T = 20cM and 2 markers are located at each extremity (γ = 0.2, 1 − γ = 0.8, a = 4, σ = 1, µ = 0, 10000 samples of n = 1000 individuals, 100000 paths for the Theoretical Power).
H H H H H γ+ t? 12cM 35cM 48cM 77cM γ/4 83.99% 86.39% 84.91% 87.87% (83.47%) (85.91%) (84.20%) (87.19%) γ/2 80.14% 82.75% 80.96% 84.00% (79.62%) (81.88%) (80.37%) (83.25%) γ 95.36% 96.23% 95.48% 96.85% (94.87%) (96.07%) (94.83%) (96.89%)
Fig. 4. Theoretical power and Empirical Power (in brackets) as a function of the location
of the QTL t?and the percentage γ
+of individuals non genotyped in the right tail. The
chromosome is of length T = 1M and 6 markers are equally spaced every 20cM (γ = 0.2, 1 − γ = 0.8, a = 6, σ = 1, µ = 0, 10000 samples of n = 1000 individuals, 100000 paths for the Theoretical Power).
Charles-Elie Rabier (rabier@stat.wisc.edu)
University of Wisconsin-Madison, Statistic and Botany departments, Medical Science Center, 1300 University Ave., Madison, WI 53706-1532. References
Arends, D., Prins, P., Broman, K., Jansen, R. (2010). Tutorial - Multiple-QTL Mapping (MQM) Analysis. Technical Report.
Azaïs, J. M. and Cierco-Ayrolles, C. (2002). An asymptotic test for quantitative gene detection. Ann. I. H. Poincaré, 38, 6, 1087-1092.
Azaïs, J. M., Gassiat, E., Mercadier, C. (2006). Asymptotic distribution and local power of the likelihood ratio test for mixtures. Bernoulli, 12(5), 775-799. Azaïs, J. M., Gassiat, E., Mercadier, C. (2009). The likelihood ratio test for general mixture models with possibly structural parameter. ESAIM, 13, 301-327.
Azaïs, J. M. and Wschebor, M. (2009). Level sets and extrema of random pro-cesses and elds. Wiley, New-York.
Azaïs, J. M., Delmas, C., Rabier, C-E (2012). Likelihood Ratio Test process for Quantitative Trait Locus detection. hal-00483171, in revision for Statistics. Chang, M. N., Wu, R., Wu, S. S., Casella, G. (2009). Score statistics for
map-ping quantitative trait loci. Statistical Application in Genetics and Molecular Biology, 8(1), 16.
Cierco, C. (1998). Asymptotic distribution of the maximum likelihood ratio test for gene detection. Statistics, 31, 261-285.
Churchill, G.A, Doerge, R.W. (1994). Empirical threshold values for quantitative trait mapping. Genetics, 138, 963-971.
Darvasi, D., Soller, M. (1992). Selective genotyping for determination of linkage between a marker locus and a quantitative trait locus. Theoretical and Applied Genetics, 85, 353-359.
Davies, R.B. (1977). Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika, 64, 247-254.
Davies, R.B. (1987). Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika, 74, 33-43.
Feingold, E., Brown, P.O., Siegmund, D. (1993). Gaussian models for genetic linkage analysis using complete high-resolution maps of idendity by descent. Am. J. Human. Genet., 53, 234-251.
Gassiat, E. (2002). Likelihood ratio inequalities with applications to various mixtures. Ann. I. H. Poincaré, 6, 897-906.
Genz, A. (1992). Numerical computation of multivariate normal probabilities. J. Comp. Graph. Stat., 141-149.
Haldane, J.B.S (1919). The combination of linkage values and the calculation of distance between the loci of linked factors. Journal of Genetics, 8, 299-309. Lander, E.S., Botstein, D. (1989). Mapping mendelian factors underlying
quan-titative traits using RFLP linkage maps. Genetics, 138, 235-240.
Lebowitz, R.J., Soller, M., Beckmann, J.S. (1987). Trait-based analyses for the detection of linkage between marker loci and quantitative trait loci in crosses between inbred lines. Theoretical and Applied Genetics, 73, 556-562.
Muranty, H., Gonet, B. (1997). Selective genotyping for location and estima-tion of the eect of the eect of a quantitative trait locus. Biometrics, 53, 629-643.
Rabier, C-E. (2012a). On statistical inference for selective genotyping. hal-00658583.
Rabier, C-E. (2012b). On stochastic processes for Quantitative Trait Locus mapping under selective genotyping. hal-00675414.
Rebaï, A., Gonet, B., Mangin, B. (1994). Approximate thresholds of interval mapping tests for QTL detection. Genetics, 138, 235-240.
Rebaï, A., Gonet, B., Mangin, B. (1995). Comparing power of dierent meth-ods for QTL detection. Biometrics, 51, 87-99.
Siegmund, D. (1985). Sequential analysis : tests and condence intervals. Springer, New York.
Siegmund, D., Yakir, B. (2007). The statistics of gene mapping. Springer, New York.
Van der Vaart, A.W. (1998) Asymptotic statistics, Cambridge Series in Statis-tical and Probabilistic Mathematics.
Wu, R., MA, C.X., Casella, G. (2007) Statistical Genetics of Quantitative Traits, Springer