HAL Id: hal-00610615
https://hal.archives-ouvertes.fr/hal-00610615
Preprint submitted on 23 Jul 2011
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
A new method for Quantitative Trait Loci Detection
Charles-Elie Rabier, Céline Delmas
To cite this version:
Charles-Elie Rabier, Céline Delmas. A new method for Quantitative Trait Loci Detection. 2010.
⟨hal-00610615⟩
A new method for Quantitative Trait Loci detection
Charles-Elie Rabier
Institut de Mathématiques de Toulouse, Toulouse, France.
INRA UR631, Auzeville, France.
Céline Delmas
INRA UR631, Auzeville, France.
Summary. We consider the likelihood ratio test (LRT) process related to the test of the absence of QTL on the interval [0, T ] representing a chromosome (a QTL denotes a quantitative trait locus, i.e. a gene with quantitative effect on a trait). We give the asymptotic distribution of this LRT process under the general alternative that there exist m QTL on [0, T ]. This theoretical result allows us to propose to estimate the number of QTL and their positions using the LASSO. Our method does not require the choice of cofactors contrary to Composite Interval Mapping (CIM). Besides, our method is not affected by interactions.
Keywords: Gaussian process, Likelihood Ratio Test, Mixture models, Nuisance parameters present only under the alternative, QTL detection, $\chi^2$ process.
1. Introduction
We study a backcross population: $A \times (A \times B)$, where $A$ and $B$ are purely homozygous lines, and we address the problem of detecting Quantitative Trait Loci, so-called QTL (genes influencing a quantitative trait which is able to be measured), on a given chromosome. The trait is observed on $n$ individuals (progenies) and we denote by $Y_j$, $j = 1, ..., n$, the observations, which we will assume to be independent and identically distributed (iid). The mechanism of genetics, or more precisely of meiosis, implies that among the two chromosomes of each individual, one is purely inherited from $A$ while the other (the "recombined" one) consists of parts originating from $A$ and parts originating from $B$, due to crossing-overs. The Haldane (1919) modelling assumes that crossovers occur as a Poisson process. Using the Haldane (1919) distance and modelling, each chromosome will be represented by a segment $[0, T]$. The distance on $[0, T]$ is called the genetic distance (which is measured in Morgans). In a famous article, Lander and Botstein (1989) proposed, with the help of genetic markers, to scan the chromosome, performing a likelihood ratio test (LRT) of the absence of a QTL at every location $t \in [0, T]$. It leads to a "likelihood ratio test process" $\Lambda_n(.)$, and then a natural statistic is the supremum of such a process. This method is called "interval mapping". There have been many papers related to the supremum of the LRT process. For example, we can mention Feingold et al. (1993), Churchill and Doerge (1994), Rebaï et al. (1994), Rebaï et al. (1995), Cierco (1998), Piepho (2001), Chang et al. (2009), Rabier (2010).
The problem is that considering the supremum of the process as a test statistic is appropriate when there is only one QTL on the chromosome, but it becomes inappropriate when there are several QTL located on the chromosome. Besides, generally geneticists have no [...] a more general approach has to be considered. When multiple QTL occur on the same chromosome, they simultaneously affect the LRT process. For instance, when two QTL are located in two different marker intervals that are close but not adjacent, a peak is often found between these two marker intervals: it is a ghost QTL (Martinez and Curnow (1992)). Jansen (1993) and Zeng (1994) independently proposed the "Composite Interval Mapping", which consists in combining interval mapping on two flanking markers and multiple regression analysis on other markers (Wu et al. (2007)). This way, the QTL not located in the marker interval tested no longer affect the LRT process. Their effects are removed by the multiple regression analysis. However, the choice of markers as cofactors is very complicated. It is still an open question today. Until now, there has been no mathematical proof which could help us choose the set of markers rigorously. In this context, the aim of our paper is to propose an alternative to "Composite Interval Mapping", that is to say a new method which does not require the choice of cofactors.
As mentioned before, in Rabier (2010), the authors suppose that there is no more than one QTL on the chromosome (it is located at $t^\star \in [0, T]$). They show that the LRT process is asymptotically the square of a "non linear interpolated process" that is centered under $H_0$ (ie. no QTL on the chromosome) and shifted by a mean function under the alternative. This mean function depends on the QTL effect and its location $t^\star$. In this paper, we generalize these results to the general alternative that there exist $m$ QTL on $[0, T]$ at $t^\star_1, \cdots, t^\star_m$ with additive effects $q_1, \cdots, q_m$. The main difference between the alternative of only one QTL and the general alternative is in the distribution of the trait $Y$. When there is only one QTL at $t^\star \in [0, T]$, the trait $Y$, conditionally on the information brought by the genetic markers located on the chromosome, follows a mixture model with known weights:

$$p(t^\star)\, f_{(\mu+q,\sigma)}(.) + \{1 - p(t^\star)\}\, f_{(\mu-q,\sigma)}(.) \tag{1}$$

where $f_{(\mu,\sigma)}(.)$ denotes a Gaussian density with mean $\mu$ and variance $\sigma^2$. $(\mu, q, \sigma)$ are the unknown parameters.
When there are $m$ QTL segregating, the distribution of the trait $Y$ is a mixture of $2^m$ components of the form:

$$\sum_{\alpha=1}^{2^m} w_\alpha\, f_{(M_\alpha,\sigma)}(.)$$

where the $w_\alpha$'s and the $M_\alpha$'s are known functions of the unknown parameters $\mu$, $m$, $t^\star_1, ..., t^\star_m$, $q_1, ..., q_m$.

In this context, we show that under the general alternative, the LRT process is still asymptotically the square of a "non linear interpolated process". However, the mean function depends this time on the number of QTL, their positions and their effects. This theoretical result allows us to propose a new method to estimate the number of QTL and their positions using the LASSO. Note that in this paper, as in Broman and Speed (2002), the focus is mainly on the estimation of the number of QTL and their positions, rather than on the estimation of the QTL effects. Nevertheless, the effects can be obtained easily with the method that we propose.
The originality of our paper is twofold. First, with our asymptotic study of the LRT process under the general alternative, we are now able to explain mathematically some strange [...] between two true QTL. Secondly, the originality is in the fact that we propose a new method to find QTL. Our method is very easy to implement and does not require the choice of markers as cofactors, which is a major drawback of Composite Interval Mapping. Besides, we prove that our method is not affected by interactions. With the help of simulated data, we show that our method performs better than the Composite Interval Mapping, which is widely used in the genetic community. We refer to the book of Van der Vaart (1998) for elements of asymptotic statistics used in proofs.
2. Model and Notations
The chromosome is the segment $[0, T]$. $K$ genetic markers are located on the chromosome, one at each extremity. $t_1 = 0 < t_2 < ... < t_K = T$ are the locations of the markers. The "genome information" at $t$ will be denoted $X(t)$. The Haldane (1919) model, which assumes that crossovers occur as a Poisson process, can be written mathematically: let $N(t)$ be a standard Poisson process, the law of $X(t)$ is $\frac{1}{2}(\delta_1 + \delta_{-1})$ and $X(t) = (-1)^{N(t)} X(t_1)$. The Haldane (1919) function $r : [0, T]^2 \to [0, \frac{1}{2}]$ is such that:

$$r(t, t') = P(X(t)X(t') = -1) = P(|N(t) - N(t')| \text{ odd}) = \frac{1}{2}\left(1 - e^{-2|t - t'|}\right)$$

$\bar{r}(t, t')$ will be the function equal to $1 - r(t, t')$. $r(t, t')$ denotes the probability of recombination between two loci (ie. positions) located at $t$ and $t'$. $\bar{r}(t, t')$ denotes the probability of no recombination. Note that a recombination occurs if there is an odd number of crossovers between the two loci.

We are interested in a quantitative trait $Y$ which is affected by several QTL located on the chromosome. $m$ will refer to the number of QTL and $q_s$ to the QTL effect of the $s$th QTL. Its position will be called $t^\star_s$. We impose $0 < t^\star_1 < ... < t^\star_m < T$ and we will suppose that the QTL effects are additive and that there is no interaction between them. In this context, the quantitative trait $Y$ verifies:

$$Y = \mu + \sum_{s=1}^{m} X(t^\star_s)\, q_s + \sigma \varepsilon$$

where $\varepsilon$ is a Gaussian white noise.

Besides, the "genome information" is available only at the locations of genetic markers, that is to say at $t_1, t_2, ..., t_K$. We denote by $X_j(t)$ the value of the variable $X(t)$ for the $j$th observation. So, in fact, our observation on each individual is $(Y_j, X_j(t_1), ..., X_j(t_K))$. These observations are supposed to be iid.
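The model of this section can be simulated directly, which is a convenient way to check one's reading of the notation. The following is a minimal sketch, not the authors' code: the function name `simulate_backcross` and its arguments are hypothetical, positions are assumed to be in Morgans, and the sketch uses the fact that, under the Haldane model, the sign of $X$ flips between two consecutive loci with probability $r(t, t')$, independently across disjoint intervals.

```python
import numpy as np

def simulate_backcross(n, markers, qtl_pos, q, mu=0.0, sigma=1.0, rng=None):
    """Simulate (Y_j, X_j(t_1), ..., X_j(t_K)) for n individuals (illustrative sketch).

    markers, qtl_pos: positions in Morgans; q: additive QTL effects q_1..q_m.
    """
    rng = np.random.default_rng(rng)
    markers = np.asarray(markers, dtype=float)
    qtl_pos = np.asarray(qtl_pos, dtype=float)
    q = np.asarray(q, dtype=float)

    # Walk along the chromosome through all loci (markers and QTL) in order.
    loci = np.concatenate([markers, qtl_pos])
    order = np.argsort(loci)
    gaps = np.diff(loci[order])
    rec = 0.5 * (1.0 - np.exp(-2.0 * gaps))  # Haldane function r between neighbours

    x_sorted = np.empty((n, loci.size))
    x_sorted[:, 0] = rng.choice([-1.0, 1.0], size=n)  # law (delta_1 + delta_{-1}) / 2
    for k, r_k in enumerate(rec, start=1):
        flip = rng.random(n) < r_k  # odd number of crossovers in the gap
        x_sorted[:, k] = np.where(flip, -x_sorted[:, k - 1], x_sorted[:, k - 1])

    # Undo the sort so that columns match the order [markers, QTL].
    x = np.empty_like(x_sorted)
    x[:, order] = x_sorted
    x_markers = x[:, :markers.size]
    x_qtl = x[:, markers.size:]

    # Y = mu + sum_s X(t*_s) q_s + sigma * eps; the QTL genotypes are not observed.
    y = mu + x_qtl @ q + sigma * rng.standard_normal(n)
    return y, x_markers

y, x_markers = simulate_backcross(
    n=200, markers=np.linspace(0.0, 1.0, 6), qtl_pos=[0.3, 0.7], q=[0.5, 0.5], rng=0
)
print(y.shape, x_markers.shape)
```

Only the trait and the marker genotypes are returned, mirroring the statement that the genome information is available at marker locations only.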
3. LRT process under the alternative of only one QTL located on [0, T ] (Rabier (2010))
Before establishing the general result of this paper, we should first focus on the work of Rabier (2010), that is to say the case where there is only one QTL lying on $[0, T]$ (ie. $m = 1$). It will be a good way to introduce the LRT process and will make the reading of our paper easier. In order to sum up this previous work, we will consider the same elements and notations used by the authors. As said previously, the authors focus on the chromosome, performing a likelihood ratio test (LRT) of the absence of a QTL at every location $t \in [0, T]$.

We consider values of the parameter $t$ that are distinct from the marker positions, and the result will be extended by continuity at the marker positions. For $t \in [t_1, t_K] \setminus T_K$ where $T_K = \{t_1, ..., t_K\}$, we define $t^\ell$ and $t^r$ as:

$$t^\ell = \sup\{t_k \in T_K : t_k < t\}, \qquad t^r = \inf\{t_k \in T_K : t < t_k\}$$

In other words, $t$ belongs to the "marker interval" $(t^\ell, t^r)$. We define $p(t)$ as the weight such that

$$p(t) = P\left\{X(t) = 1 \mid X(t^\ell), X(t^r)\right\}.$$

By the Bayes rule,

$$p(t) = Q_t^{1,1}\, 1_{X(t^\ell)=1} 1_{X(t^r)=1} + Q_t^{1,-1}\, 1_{X(t^\ell)=1} 1_{X(t^r)=-1} + Q_t^{-1,1}\, 1_{X(t^\ell)=-1} 1_{X(t^r)=1} + Q_t^{-1,-1}\, 1_{X(t^\ell)=-1} 1_{X(t^r)=-1} \tag{2}$$

where:

$$Q_t^{1,1} = \frac{\bar{r}(t^\ell, t)\, \bar{r}(t, t^r)}{\bar{r}(t^\ell, t^r)}, \qquad Q_t^{1,-1} = \frac{\bar{r}(t^\ell, t)\, r(t, t^r)}{r(t^\ell, t^r)}, \qquad Q_t^{-1,-1} = 1 - Q_t^{1,1} \quad \text{and} \quad Q_t^{-1,1} = 1 - Q_t^{1,-1}$$

Let $\theta = (q, \mu, \sigma)$ be the parameter of the model at $t$ fixed and $\theta_0 = (0, \mu, \sigma)$ the true value of the parameter under $H_0$. The likelihood of the triplet $\left(Y, X(t^\ell), X(t^r)\right)$ with respect to the measure $\lambda \otimes N \otimes N$, $\lambda$ being the Lebesgue measure and $N$ the counting measure on $\mathbb{N}$, is, $\forall t \in [t^\ell, t^r]$:

$$L(\theta, t) = \left[\, p(t)\, f_{(\mu+q,\sigma)}(y) + \{1 - p(t)\}\, f_{(\mu-q,\sigma)}(y) \right] g(t) \tag{3}$$

where $g(t)$ is a function independent of $\theta$.

The likelihood $L_n(\theta, t)$ for $n$ observations is obtained by the product of $n$ terms as above. $\hat{\theta} = (\hat{q}, \hat{\mu}, \hat{\sigma})$ will be the maximum likelihood estimator (MLE) of $\theta$.

Under $H_0$, there is no QTL lying on the interval $[0, T]$. Besides, under $H_1$, it is supposed that there is only one location where the QTL lies (ie. $m = 1$). In order to deal with this alternative, the location of the QTL, $t^\star$ ($t^\star \in [0, T]$), has to be added in the definition of $H_1$. So, the alternative hypothesis can be written:

$H_{a t^\star}$: "the QTL is located at the position $t^\star$ with effect $q = a/\sqrt{n}$ where $a \in \mathbb{R}^\star$"

In this context, the authors show that the LRT process, $\Lambda_n(.)$, converges weakly to the square of a "non linear interpolated process". It means that the LRT statistic at each point can easily be deduced from the Wald or score statistics calculated at the marker positions. Besides, this "non linear interpolated process" is centered under $H_0$ and shifted by a mean function $m_{t^\star}(t)$ under $H_{a t^\star}$. This mean function depends on the location of the QTL $t^\star$, the position tested $t$ and the parameter $a$ linked to the QTL effect. It is also a "non linear interpolated function" (same interpolation as the process). Then, since they suppose that there is only one QTL on $[0, T]$, the authors have a closed-form formula (due to the interpolation) to compute the supremum of $\Lambda_n(.)$.
4. LRT process under the general alternative of m QTL on [0, T ]
In the previous Section, it has been supposed that there was only one QTL lying on the interval $[0, T]$. As a consequence, the test statistic used was a natural statistic, that is to say the supremum of the process. The interest is now in studying the same process as previously, $\Lambda_n(.)$, but in the presence of several QTL on the interval $[0, T]$. In this case, the goal is no longer to perform a test, but to be able to run a model selection in order to estimate the number of QTL and their locations.

Let us denote by $\vec{t}^\star$ the quantity referring to the locations of the QTL. $H_{a\vec{t}^\star}$ will be the following assumption:

$H_{a\vec{t}^\star}$: "there are $m$ QTL located respectively at $t^\star_1, ..., t^\star_m$ and with effects $q_1 = a_1/\sqrt{n}, ..., q_m = a_m/\sqrt{n}$ where $(a_1, ..., a_m) \in (\mathbb{R}^\star)^m$"

We recall that we suppose that the QTL effects are additive and that there is no interaction between them. We will consider values $t$, $t^\star_1, ..., t^\star_m$ of the parameters that are distinct from the marker positions, and the result will be extended by continuity at the marker positions.
4.1. Results
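Theorem. With the previously defined notations,

$$S_n(.) \Rightarrow Z^\star(.), \qquad \Lambda_n(.) \overset{F.d.}{\to} \{Z^\star(.)\}^2$$

as $n$ tends to infinity, under $H_0$ and $H_{a\vec{t}^\star}$, where:
• $S_n(.)$ is the score process for $n$ observations
• $\Rightarrow$ is the weak convergence and $\overset{F.d.}{\to}$ is the convergence of finite-dimensional distributions
• $Z^\star(.)$ is a Gaussian process with unit variance.
• $Z^\star(.)$ is the continuous and the "non linear interpolated process" such that:

$$Z^\star(t) = \left\{ \alpha(t)\, Z^\star(t^\ell) + \beta(t)\, Z^\star(t^r) \right\} \Big/ \sqrt{E\left[ \{2p(t) - 1\}^2 \right]}$$

The mean function of $Z^\star(.)$:
• under $H_0$, $m(t) = 0$
• under $H_{a\vec{t}^\star}$,

$$m_{\vec{t}^\star}(t) = \left\{ \alpha(t)\, m_{\vec{t}^\star}(t^\ell) + \beta(t)\, m_{\vec{t}^\star}(t^r) \right\} \Big/ \sqrt{E\left[ \{2p(t) - 1\}^2 \right]}$$

The different quantities are:

$$\alpha(t) = Q_t^{1,1} + Q_t^{1,-1} - 1, \qquad \beta(t) = Q_t^{1,1} - Q_t^{1,-1},$$

$$\mathrm{Cov}\left\{ Z^\star(t^\ell), Z^\star(t^r) \right\} = e^{-2(t^r - t^\ell)},$$

$$m_{\vec{t}^\star}(t^\ell) = \sum_{s=1}^{m} a_s\, e^{-2|t^\star_s - t^\ell|} / \sigma, \qquad m_{\vec{t}^\star}(t^r) = \sum_{s=1}^{m} a_s\, e^{-2|t^r - t^\star_s|} / \sigma,$$

and

$$E\left[ \{2p(t) - 1\}^2 \right] = \{\alpha(t)\}^2 + \{\beta(t)\}^2 + 2\, \alpha(t)\, \beta(t)\, e^{-2(t^r - t^\ell)}.$$

The proof is given in Section 7.1.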
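The mean function of the theorem is fully explicit, so it can be evaluated numerically. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the names `haldane_r` and `mean_function` are hypothetical, positions are in Morgans, $\sigma$ defaults to 1, and the value at a marker is obtained from the marker formula of the theorem (the extension by continuity).

```python
import numpy as np

def haldane_r(d):
    """Haldane recombination probability r for a genetic distance d (Morgans)."""
    return 0.5 * (1.0 - np.exp(-2.0 * np.abs(d)))

def mean_function(t, markers, qtl_pos, a, sigma=1.0):
    """Mean m(t) of Z*(t) for QTL at qtl_pos with scaled effects a (illustrative)."""
    markers = np.asarray(markers, dtype=float)
    qtl_pos = np.asarray(qtl_pos, dtype=float)
    a = np.asarray(a, dtype=float)

    def at_marker(tk):
        # m(t_k) = sum_s a_s exp(-2 |t*_s - t_k|) / sigma
        return np.sum(a * np.exp(-2.0 * np.abs(qtl_pos - tk))) / sigma

    # At a marker position, use the marker formula directly.
    close = np.isclose(markers, t)
    if close.any():
        return at_marker(markers[close][0])

    tl = markers[markers < t].max()  # left flanking marker t^l
    tr = markers[markers > t].min()  # right flanking marker t^r

    # Q_t^{1,1} and Q_t^{1,-1} from the Bayes rule, with rbar = 1 - r.
    q11 = (1 - haldane_r(t - tl)) * (1 - haldane_r(tr - t)) / (1 - haldane_r(tr - tl))
    q1m1 = (1 - haldane_r(t - tl)) * haldane_r(tr - t) / haldane_r(tr - tl)

    alpha = q11 + q1m1 - 1.0
    beta = q11 - q1m1
    # E[{2p(t) - 1}^2] = alpha^2 + beta^2 + 2 alpha beta exp(-2 (t^r - t^l))
    second_moment = alpha**2 + beta**2 + 2.0 * alpha * beta * np.exp(-2.0 * (tr - tl))
    return (alpha * at_marker(tl) + beta * at_marker(tr)) / np.sqrt(second_moment)

# Assumed setting: T = 1 Morgan, 6 markers every 0.2 Morgan, QTL at 0.3 and 0.7.
markers = np.linspace(0.0, 1.0, 6)
grid = np.linspace(0.0, 1.0, 101)
for effects in ([4.0, 4.0], [4.0, 6.0]):
    m = np.array([mean_function(t, markers, [0.3, 0.7], effects) for t in grid])
    print(effects, "supremum located near t =", grid[np.argmax(m)])
```

Evaluating the function on a grid, as in the usage lines, gives a quick way to see where the signal peaks for a given configuration of QTL positions and effects.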
4.2. Illustration of the theorem and of the Ghost QTL phenomenon
[Figure 1: left panel "Process $Z^\star(.)$", right panel "Process $\{Z^\star(.)\}^2$"; horizontal axis $t$ (cM).]
Fig. 1. A path under $H_0$ of the processes $Z^\star(.)$ and $\{Z^\star(.)\}^2$ ($T = 100$cM, 6 markers equally spaced every 20cM)

[Figure 2: left panel "$m = 1$", with curves for ($t^\star_1 = 70$cM, $a_1 = 4$), ($t^\star_1 = 30$cM, $a_1 = 4$) and ($t^\star_1 = 70$cM, $a_1 = 6$); right panel "$m = 2$, $t^\star_1 = 30$cM, $t^\star_2 = 70$cM, $a_1 = 4$", with curves for $a_2 = 4$ and $a_2 = 6$; horizontal axis $t$ (cM).]
Fig. 2. Mean function $m_{\vec{t}^\star}(t)$ as a function of the number $m$ of QTL, their positions $t^\star_s$, and the parameters $a_s$ linked to the QTL effects ($T = 100$cM, 6 markers equally spaced every 20cM)

[Figure 3: left panel "Process $Z^\star(.)$", right panel "Process $\{Z^\star(.)\}^2$", each with curves for $a_2 = 4$ and $a_2 = 6$; horizontal axis $t$ (cM).]
Fig. 3. Same path of $Z^\star(.)$ and $\{Z^\star(.)\}^2$ as under $H_0$ but under $H_{a\vec{t}^\star}$ ($m = 2$, $t^\star_1 = 30$cM, $t^\star_2 = 70$cM, $a_1 = 4$, $T = 100$cM, 6 markers equally spaced every 20cM)
In order to illustrate the theorem, we will consider a genetic map which consists of a chromosome of size $T = 100$cM with 6 markers equally spaced every 20cM. Figure 1 refers to the absence of QTL on the chromosome. On the left-side, a path of the process $Z^\star(.)$ is represented under $H_0$. As there is not any QTL, it corresponds only to noise. Besides, we can observe the interpolation obtained between genetic markers. The same path corresponding to the process $\{Z^\star(.)\}^2$ has been added on the right-side: in genetics, we call this path "a likelihood profile". It is usually this path that we obtain when we analyze data. Note that many authors, instead of computing the process $\Lambda_n(.)$, focus on the LOD process, $LOD_n(.)$, where $LOD_n(.) = \Lambda_n(.) / \{2 \log(10)\}$.

Figure 2 represents the signal. On the left-side, we present some mean functions $m_{\vec{t}^\star}(t)$ when only one QTL ($m = 1$) is located on the chromosome. As expected, the supremum of these interpolated functions is obtained at the location of the QTL. Besides, the larger the QTL effect is, the stronger the signal is. On the right-side, the focus is on $m_{\vec{t}^\star}(t)$ when $m = 2$. According to the theorem, $m_{\vec{t}^\star}(t)$ is obtained by summing the mean functions corresponding to the case $m = 1$. As a consequence, the functions $m_{\vec{t}^\star}(t)$ of the graph of the right-side are easily obtained from those of the graph of the left-side. Let us focus on the curve in solid line. The two QTL are located respectively at $t^\star_1 = 30$cM and $t^\star_2 = 70$cM. So, the marker interval (40cM, 60cM) is adjacent to the two marker intervals where the QTL are located. As a result, we can observe on the graph that the biggest peak is obtained in the interval (40cM, 60cM) and that the supremum is obtained in the middle of this marker interval, at 50cM. Note that it is obtained exactly at 50cM since we consider exactly the same effect ($a_1 = a_2 = 4$) and since there is symmetry due to the location of the QTL and the length of the chromosome. If we now consider a larger effect for the second QTL ($a_2 = 6$) located at $t^\star_2 = 70$cM (dashed line), we can observe almost the same two peaks in the intervals (40cM, 60cM) and (60cM, 80cM). Besides, the supremum of the mean function is obtained at 52cM. It is like a barycenter: weights are assigned to the QTL according to their effects, so the signal and the location of the supremum are affected by these weights.
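The reading of a path as noise plus signal, and of the likelihood profile and LOD as transformations of that path, can be sketched at the marker positions, where no interpolation is needed. This is a minimal illustration and not the authors' code: the helper names are hypothetical, positions are in Morgans, $\sigma = 1$, and the covariance $e^{-2|t_i - t_j|}$ is assumed to hold between any pair of markers (the theorem states it for the two markers flanking $t$; the extension follows from $E[X(t)X(t')] = 1 - 2r(t, t')$ under the Haldane model).

```python
import numpy as np

def marker_covariance(markers):
    """Assumed covariance of Z* between markers: exp(-2 |t_i - t_j|)."""
    markers = np.asarray(markers, dtype=float)
    return np.exp(-2.0 * np.abs(markers[:, None] - markers[None, :]))

def signal_at_markers(markers, qtl_pos, a, sigma=1.0):
    """Mean of Z* at each marker: sum_s a_s exp(-2 |t*_s - t_k|) / sigma."""
    markers = np.asarray(markers, dtype=float)
    qtl_pos = np.asarray(qtl_pos, dtype=float)
    a = np.asarray(a, dtype=float)
    decay = np.exp(-2.0 * np.abs(markers[:, None] - qtl_pos[None, :]))
    return (decay * a[None, :]).sum(axis=1) / sigma

rng = np.random.default_rng(1)
markers = np.linspace(0.0, 1.0, 6)  # 6 markers every 0.2 Morgan (20 cM)

# Noise: one draw of Z* at the markers under H0 (centered, unit variance).
noise = rng.multivariate_normal(np.zeros(markers.size), marker_covariance(markers))

# Signal: mean function at the markers for two QTL at 30 cM and 70 cM.
signal = signal_at_markers(markers, [0.3, 0.7], [4.0, 4.0])

z_alt = noise + signal                # path under the alternative
lrt = z_alt ** 2                      # likelihood profile at the markers
lod = lrt / (2.0 * np.log(10.0))      # LOD = LRT / {2 log(10)}
print(np.round(lod, 2))
```

Keeping the same `noise` draw while changing the effects passed to `signal_at_markers` reproduces the idea of comparing several alternatives along one and the same path.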
Figure 3 is the analogue of Figure 1 under the alternative of 2 QTL located at $t^\star_1 = 30$cM and $t^\star_2 = 70$cM. As in Figure 1, the path of the process $Z^\star(.)$ is on the left-side whereas the one corresponding to $\{Z^\star(.)\}^2$ is on the right-side. According to the theorem, in order to obtain the path of $Z^\star(.)$ under $H_{a\vec{t}^\star}$, we have to sum the path of $Z^\star(.)$ under $H_0$ (ie. the noise) and the mean function $m_{\vec{t}^\star}(t)$ (ie. the signal). In other words, the path of $Z^\star(.)$ under $H_{a\vec{t}^\star}$ has been obtained by adding the path of $Z^\star(.)$ presented in Figure 1 and the mean function of the graph of the right-side of Figure 2. Note that on the right-side of Figure 3, the likelihood profile (ie. the path of $\{Z^\star(.)\}^2$) has easily been obtained by computing the square of $Z^\star(.)$. We can observe in Figure 3 that, when the effects of the two QTL are the same (ie. the solid lines), the biggest peak is obtained between 40cM and 60cM, which is a marker interval where there is no QTL: such a peak is called a ghost QTL (Martinez and Curnow (1992)). It was expected since the supremum of the signal was obtained at 50cM.
Note that when we increase the effect of the second QTL (ie. the dashed lines), the biggest peak is obtained in the marker interval (60cM, 80cM), which is the interval that contains the second QTL. It is due to the noise, since the signal is almost the same in the intervals (40cM, 60cM) and (60cM, 80cM) whereas the values of $Z^\star(.)$ are larger under $H_0$ in the marker interval (