A new method for Quantitative Trait Loci Detection


HAL Id: hal-00610615

https://hal.archives-ouvertes.fr/hal-00610615

Preprint submitted on 23 Jul 2011

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Charles-Elie Rabier, Céline Delmas

To cite this version:

Charles-Elie Rabier, Céline Delmas. A new method for Quantitative Trait Loci Detection. 2010.

⟨hal-00610615⟩

Charles-Elie Rabier

Institut de Mathématiques de Toulouse, Toulouse, France.

INRA UR631, Auzeville, France.

Céline Delmas

INRA UR631, Auzeville, France.

Summary. We consider the likelihood ratio test (LRT) process related to the test of the absence of QTL on the interval [0, T] representing a chromosome (a QTL denotes a quantitative trait locus, i.e. a gene with quantitative effect on a trait). We give the asymptotic distribution of this LRT process under the general alternative that there exist m QTL on [0, T]. This theoretical result allows us to propose to estimate the number of QTL and their positions using the LASSO. Our method does not require the choice of cofactors contrary to Composite Interval Mapping (CIM). Besides, our method is not affected by interactions.

Keywords: Gaussian process, Likelihood Ratio Test, Mixture models, Nuisance parameters present only under the alternative, QTL detection, $\chi^2$ process.

1. Introduction

We study a backcross population: $A \times (A \times B)$, where $A$ and $B$ are purely homozygous lines, and we address the problem of detecting Quantitative Trait Loci, so-called QTL (genes influencing a quantitative trait which is able to be measured), on a given chromosome. The trait is observed on $n$ individuals (progenies) and we denote by $Y_j$, $j = 1, ..., n$, the observations, which we will assume to be independent and identically distributed (iid). The mechanism of genetics, or more precisely of meiosis, implies that among the two chromosomes of each individual, one is purely inherited from $A$ while the other (the "recombined" one) consists of parts originating from $A$ and parts originating from $B$, due to crossing-overs. The Haldane (1919) modelling assumes that crossovers occur as a Poisson process. Using the Haldane (1919) distance and modelling, each chromosome will be represented by a segment $[0, T]$. The distance on $[0, T]$ is called the genetic distance (which is measured in Morgans).

In a famous article, Lander and Botstein (1989) proposed, with the help of genetic markers, to scan the chromosome, performing a likelihood ratio test (LRT) of the absence of a QTL at every location $t \in [0, T]$. It leads to a "likelihood ratio test process" $\Lambda_n(.)$, and a natural statistic is then the supremum of such a process. This method is called "interval mapping". There have been many papers related to the supremum of the LRT process. For example, we can mention Feingold et al. (1993), Churchill and Doerge (1994), Rebaï et al. (1994), Rebaï et al. (1995), Cierco (1998), Piepho (2001), Chang et al. (2009), Rabier (2010).

The problem is that considering the supremum of the process as a test statistic is appropriate when there is only one QTL on the chromosome, but it becomes inappropriate when there are several QTL located on the chromosome. Besides, geneticists generally have no […] a more general approach has to be considered. When multiple QTL occur on the same chromosome, they simultaneously affect the LRT process. For instance, when two QTL are located in two different marker intervals that are close but not adjacent, a peak is often found between these two marker intervals: it is a ghost QTL (Martinez and Curnow (1992)). Jansen (1993) and Zeng (1994) independently proposed "Composite Interval Mapping", which consists in combining interval mapping on two flanking markers and multiple regression analysis on other markers (Wu et al. (2007)). This way, the QTL not located in the marker interval tested no longer affect the LRT process. Their effects are removed by the multiple regression analysis. However, the choice of markers as cofactors is very complicated. It is still an open question today. Until now, there has been no mathematical proof which could help us choose the set of markers rigorously. In this context, the aim of our paper is to propose an alternative to "Composite Interval Mapping", that is to say a new method which does not require the choice of cofactors.

As mentioned before, in Rabier (2010), the authors suppose that there is no more than one QTL on the chromosome (it is located at $t^\star \in [0, T]$). They show that the LRT process is asymptotically the square of a "non linear interpolated process", centered under $H_0$ (i.e. no QTL on the chromosome) and uncentered, with a mean function, under the alternative. This mean function depends on the QTL effect and its location $t^\star$. In this paper, we generalize these results to the general alternative that there exist $m$ QTL on $[0, T]$ at $t_1^\star, \cdots, t_m^\star$ with additive effects $q_1, \cdots, q_m$.

The main difference between the alternative of only one QTL and the general alternative is in the distribution of the trait $Y$. When there is only one QTL at $t^\star \in [0, T]$, the trait $Y$, conditionally on the information brought by the genetic markers located on the chromosome, obeys a mixture model with known weights:

$$p(t^\star)\, f_{(\mu+q,\sigma)}(.) + \{1 - p(t^\star)\}\, f_{(\mu-q,\sigma)}(.) \tag{1}$$

where $f_{(\mu,\sigma)}(.)$ denotes a Gaussian density with mean $\mu$ and variance $\sigma^2$. $(\mu, q, \sigma)$ are the unknown parameters.

When there are $m$ QTL segregating, the distribution of the trait $Y$ is a mixture of $2^m$ components of the form:

$$\sum_{\alpha=1}^{2^m} w_\alpha\, f_{(M_\alpha,\sigma)}(.)$$

where the $w_\alpha$'s and the $M_\alpha$'s are known functions of the unknown parameters $\mu$, $m$, $t_1^\star, ..., t_m^\star$, $q_1, ..., q_m$.

In this context, we show that under the general alternative, the LRT process is still asymptotically the square of a "non linear interpolated process". However, the mean function depends this time on the number of QTL, their positions and their effects. This theoretical result allows us to propose a new method to estimate the number of QTL and their positions using the LASSO. Note that in this paper, as in Broman and Speed (2002), the focus is mainly on the estimation of the number of QTL and their positions, rather than on the estimation of the QTL effects. Nevertheless, the effects can be obtained easily with the method that we propose.

The originality of our paper is twofold. First, with our asymptotic study of the LRT process under the general alternative, we are now able to explain mathematically some strange […] between two true QTL. Secondly, the originality is in the fact that we propose a new method to find QTL. Our method is very easy to implement and does not require the choice of markers as cofactors, which is a major drawback of Composite Interval Mapping. Besides, we prove that our method is not affected by interactions. With the help of simulated data, we show that our method performs better than Composite Interval Mapping, which is widely used in the genetics community. We refer to the book of Van der Vaart (1998) for the elements of asymptotic statistics used in the proofs.

2. Model and Notations

The chromosome is the segment $[0, T]$. $K$ genetic markers are located on the chromosome, one at each extremity. $t_1 = 0 < t_2 < ... < t_K = T$ are the locations of the markers. The "genome information" at $t$ will be denoted $X(t)$. The Haldane (1919) model, which assumes that crossovers occur as a Poisson process, can be written mathematically: let $N(t)$ be a standard Poisson process, the law of $X(t)$ is $\frac{1}{2}(\delta_1 + \delta_{-1})$ and $X(t) = (-1)^{N(t)} X(t_1)$. The Haldane (1919) function $r : [0, T]^2 \longmapsto [0, \frac{1}{2}]$ is such that:

$$r(t, t') = P(X(t)X(t') = -1) = P(|N(t) - N(t')| \text{ odd}) = \frac{1}{2}\left(1 - e^{-2|t - t'|}\right)$$

$\bar{r}(t, t')$ will be the function equal to $1 - r(t, t')$.

$r(t, t')$ denotes the probability of recombination between two loci (i.e. positions) located at $t$ and $t'$. $\bar{r}(t, t')$ denotes the probability of absence of recombination. Note that a recombination occurs if there is an odd number of crossovers between the two loci.
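As a minimal sketch, the Haldane recombination function can be evaluated numerically as follows. The function names `haldane_r` and `haldane_rbar` are illustrative and do not come from the paper; distances are assumed to be expressed in Morgans (1 Morgan = 100 cM).

```python
import math


def haldane_r(t, t_prime):
    """Recombination probability r(t, t') = (1 - exp(-2|t - t'|)) / 2.

    Positions are assumed to be in Morgans.
    """
    return 0.5 * (1.0 - math.exp(-2.0 * abs(t - t_prime)))


def haldane_rbar(t, t_prime):
    """Probability of no recombination, equal to 1 - r(t, t')."""
    return 1.0 - haldane_r(t, t_prime)


if __name__ == "__main__":
    # Two loci 20 cM (0.2 Morgan) apart.
    print(haldane_r(0.0, 0.2), haldane_rbar(0.0, 0.2))
```

The recombination probability is zero for identical loci and increases towards 1/2 as the genetic distance grows, which is why $r$ takes values in $[0, \frac{1}{2}]$.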

We are interested in a quantitative trait $Y$ which is affected by several QTL located on the chromosome. $m$ will refer to the number of QTL and $q_s$ to the QTL effect of the $s$th QTL. Its position will be called $t_s^\star$. We impose $0 < t_1^\star < ... < t_m^\star < T$ and we will suppose that the QTL effects are additive and that there is no interaction between them. In this context, the quantitative trait $Y$ satisfies:

$$Y = \mu + \sum_{s=1}^{m} X(t_s^\star)\, q_s + \sigma \varepsilon$$

where $\varepsilon$ is a Gaussian white noise.

Besides, the "genome information" is available only at the locations of the genetic markers, that is to say at $t_1, t_2, ..., t_K$. We denote by $X_j(t)$ the value of the variable $X(t)$ for the $j$th observation. So, in fact, our observation on each individual is $(Y_j, X_j(t_1), ..., X_j(t_K))$.

These observations are supposed to be iid.
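The model of this section can be simulated with a short sketch such as the one below. It is only an illustration under stated assumptions: the function name `simulate_backcross`, its arguments and the default values are hypothetical, positions are in Morgans, and the genome information is generated as a Markov chain along the sorted loci, flipping sign between consecutive loci with the Haldane recombination probability.

```python
import numpy as np


def simulate_backcross(n, markers, qtl_pos, qtl_eff, mu=0.0, sigma=1.0, seed=0):
    """Simulate (Y_j, X_j(t_1), ..., X_j(t_K)) for n individuals.

    markers, qtl_pos: positions in Morgans; qtl_eff: effects q_s.
    Returns (y, x_markers) with the genome information coded as +1 / -1.
    """
    rng = np.random.default_rng(seed)
    markers = np.asarray(markers, dtype=float)
    qtl_pos = np.asarray(qtl_pos, dtype=float)
    qtl_eff = np.asarray(qtl_eff, dtype=float)

    # Simulate X jointly on markers and QTL positions, in increasing order.
    pos = np.concatenate([markers, qtl_pos])
    order = np.argsort(pos, kind="stable")
    sorted_pos = pos[order]

    # Haldane recombination probability between consecutive loci.
    r = 0.5 * (1.0 - np.exp(-2.0 * np.diff(sorted_pos)))

    x = np.empty((n, sorted_pos.size))
    x[:, 0] = rng.choice([-1.0, 1.0], size=n)  # law (delta_1 + delta_{-1}) / 2
    for k in range(1, sorted_pos.size):
        flip = rng.random(n) < r[k - 1]  # odd number of crossovers
        x[:, k] = np.where(flip, -x[:, k - 1], x[:, k - 1])

    # Undo the sort to separate marker columns from QTL columns.
    x_all = np.empty_like(x)
    x_all[:, order] = x
    x_markers = x_all[:, : markers.size]
    x_qtl = x_all[:, markers.size:]

    # Y = mu + sum_s X(t_s*) q_s + sigma * epsilon
    y = mu + x_qtl @ qtl_eff + sigma * rng.standard_normal(n)
    return y, x_markers
```

Only the trait and the marker genotypes are returned, which mirrors the fact that the genome information is not observed at the QTL positions.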

3. LRT process under the alternative of only one QTL located on [0, T ] (Rabier (2010))

Before establishing the general result of this paper, we should first focus on the work of Rabier (2010), that is to say the case where there is only one QTL lying on $[0, T]$ (i.e. $m = 1$). It will be a good way to introduce the LRT process and will make the reading of our paper easier. In order to sum up this previous work, we will consider the same elements and notations used by the authors. As said previously, the authors focus on the […] chromosome, performing a likelihood ratio test (LRT) of the absence of a QTL at every location $t \in [0, T]$.

We consider values of the parameter $t$ that are distinct from the marker positions, and the result will be extended by continuity at the marker positions. For $t \in [t_1, t_K] \setminus T_K$ where $T_K = \{t_1, ..., t_K\}$, we define $t^\ell$ and $t^r$ as:

$$t^\ell = \sup\{t_k \in T_K : t_k < t\}, \qquad t^r = \inf\{t_k \in T_K : t < t_k\}$$

In other words, $t$ belongs to the "marker interval" $(t^\ell, t^r)$. We define $p(t)$ as the weight such that $p(t) = P\{X(t) = 1 \mid X(t^\ell), X(t^r)\}$.

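By the Bayes rule,

$$p(t) = Q_t^{1,1}\, 1_{X(t^\ell)=1} 1_{X(t^r)=1} + Q_t^{1,-1}\, 1_{X(t^\ell)=1} 1_{X(t^r)=-1} + Q_t^{-1,1}\, 1_{X(t^\ell)=-1} 1_{X(t^r)=1} + Q_t^{-1,-1}\, 1_{X(t^\ell)=-1} 1_{X(t^r)=-1} \tag{2}$$

where:

$$Q_t^{1,1} = \frac{\bar{r}(t^\ell, t)\, \bar{r}(t, t^r)}{\bar{r}(t^\ell, t^r)}, \qquad Q_t^{1,-1} = \frac{\bar{r}(t^\ell, t)\, r(t, t^r)}{r(t^\ell, t^r)}, \qquad Q_t^{-1,-1} = 1 - Q_t^{1,1} \quad \text{and} \quad Q_t^{-1,1} = 1 - Q_t^{1,-1}$$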
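The weights of equation (2) can be computed directly from the Haldane function. The sketch below is illustrative only: the names `rbar`, `r`, `q_weights` and `p_weight` are hypothetical, positions are assumed to be in Morgans, and the flanking marker genotypes are coded as +1 / -1.

```python
import numpy as np


def rbar(a, b):
    """Probability of no recombination between loci a and b (Haldane)."""
    return 0.5 * (1.0 + np.exp(-2.0 * np.abs(a - b)))


def r(a, b):
    """Probability of recombination between loci a and b (Haldane)."""
    return 1.0 - rbar(a, b)


def q_weights(t, t_left, t_right):
    """Return (Q^{1,1}, Q^{1,-1}, Q^{-1,1}, Q^{-1,-1}) at t in [t_left, t_right]."""
    q11 = rbar(t_left, t) * rbar(t, t_right) / rbar(t_left, t_right)
    q1m1 = rbar(t_left, t) * r(t, t_right) / r(t_left, t_right)
    return q11, q1m1, 1.0 - q1m1, 1.0 - q11


def p_weight(t, t_left, t_right, x_left, x_right):
    """p(t) = P(X(t) = 1 | X(t_left), X(t_right)) for genotypes coded +1 / -1."""
    q11, q1m1, qm11, qm1m1 = q_weights(t, t_left, t_right)
    x_left = np.asarray(x_left)
    x_right = np.asarray(x_right)
    return np.where(
        x_left == 1,
        np.where(x_right == 1, q11, q1m1),
        np.where(x_right == 1, qm11, qm1m1),
    )
```

At the left marker the weight reduces to the indicator of $X(t^\ell) = 1$, and in the middle of a marker interval with recombinant flanking markers it equals 1/2, as expected by symmetry.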

Let $\theta = (q, \mu, \sigma)$ be the parameter of the model at $t$ fixed and $\theta_0 = (0, \mu, \sigma)$ the true value of the parameter under $H_0$. The likelihood of the triplet $(Y, X(t^\ell), X(t^r))$ with respect to the measure $\lambda \otimes N \otimes N$, $\lambda$ being the Lebesgue measure and $N$ the counting measure on $\mathbb{N}$, is, $\forall t \in [t^\ell, t^r]$:

$$L(\theta, t) = \left[ p(t)\, f_{(\mu+q,\sigma)}(y) + \{1 - p(t)\}\, f_{(\mu-q,\sigma)}(y) \right] g(t) \tag{3}$$

where $g(t)$ is a function independent of $\theta$.

The likelihood $L_n(\theta, t)$ for $n$ observations is obtained by the product of $n$ terms as above. $\hat{\theta} = (\hat{q}, \hat{\mu}, \hat{\sigma})$ will be the maximum likelihood estimator (MLE) of $\theta$.
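A minimal sketch of the log-likelihood corresponding to equation (3) is given below. The function name `log_likelihood` is hypothetical, and the term involving $g(t)$ is dropped since it does not depend on $\theta$ and therefore plays no role in the maximization or in a likelihood ratio.

```python
import numpy as np


def log_likelihood(theta, y, p):
    """Log of L_n(theta, t), up to the additive term sum(log g(t)).

    theta: (q, mu, sigma); y: trait values; p: weights p(t), one per individual.
    """
    q, mu, sigma = theta
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)

    def gauss(mean):
        # Gaussian density with the given mean and standard deviation sigma.
        return np.exp(-0.5 * ((y - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

    mixture = p * gauss(mu + q) + (1.0 - p) * gauss(mu - q)
    return float(np.sum(np.log(mixture)))
```

When $q = 0$ the two components coincide and the expression reduces to an ordinary Gaussian log-likelihood, whatever the weights, which corresponds to the null hypothesis of no QTL.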

Under $H_0$, there is no QTL lying on the interval $[0, T]$. Besides, under $H_1$, it is supposed that there is only one location where the QTL lies (i.e. $m = 1$). In order to deal with this alternative, the location of the QTL, $t^\star$ ($t^\star \in [0, T]$), has to be added in the definition of $H_1$. So, the alternative hypothesis can be written:

$H_{at^\star}$: "the QTL is located at the position $t^\star$ with effect $q = a/\sqrt{n}$ where $a \in \mathbb{R}$"

In this context, the authors show that the LRT process, $\Lambda_n(.)$, converges weakly to the square of a "non linear interpolated process". It means that the LRT statistic at each point can easily be deduced from the Wald or score statistics calculated at the marker positions. Besides, this "non linear interpolated process" is centered under $H_0$ and uncentered, with a mean function $m_{t^\star}(t)$, under $H_{at^\star}$. This mean function depends on the location of the QTL $t^\star$, the position tested $t$ and the parameter $a$ linked to the QTL effect. It is also a "non linear interpolated function" (same interpolation as the process). Then, since they suppose that there is only one QTL on $[0, T]$, the authors have a closed formula (due to the interpolation) to compute the supremum of $\Lambda_n(.)$.

4. LRT process under the general alternative of m QTL on [0, T ]

In the previous Section, it has been supposed that there was only one QTL lying on the interval $[0, T]$. As a consequence, the test statistic used was a natural statistic, that is to say the supremum of the process. The interest is now in studying the same process as previously, $\Lambda_n(.)$, but in the presence of several QTL on the interval $[0, T]$. In this case, the goal is not to perform a test anymore, but to be able to run a model selection in order to estimate the number of QTL and their locations.

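Let us denote by $\vec{t}^\star$ the quantity referring to the locations of the QTL. $H_{a\vec{t}^\star}$ will be the following assumption:

$H_{a\vec{t}^\star}$: "there are $m$ QTL located respectively at $t_1^\star, ..., t_m^\star$ and with effects $q_1 = a_1/\sqrt{n}, ..., q_m = a_m/\sqrt{n}$ where $(a_1, ..., a_m) \in \mathbb{R}^{m\star}$"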

We recall that we suppose that the QTL effects are additive and that there is no interaction between them. We will consider values $t$, $t_1^\star, ..., t_m^\star$ of the parameters that are distinct from the marker positions, and the result will be extended by continuity at the marker positions.

4.1. Results

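Theorem. With the previously defined notations,

$$S_n(.) \Rightarrow Z^\star(.), \qquad \Lambda_n(.) \overset{F.d.}{\longrightarrow} \{Z^\star(.)\}^2$$

as $n$ tends to infinity, under $H_0$ and $H_{a\vec{t}^\star}$, where:

• $S_n(.)$ is the score process for $n$ observations

• $\Rightarrow$ is the weak convergence and F.d. is the convergence of finite-dimensional distributions

• $Z^\star(.)$ is a Gaussian process with unit variance.

• $Z^\star(.)$ is the continuous and "non linear interpolated process" such that:

$$Z^\star(t) = \frac{\alpha(t)\, Z^\star(t^\ell) + \beta(t)\, Z^\star(t^r)}{\sqrt{E\left[\{2p(t) - 1\}^2\right]}}$$

The mean function of $Z^\star(.)$:

under $H_0$, $m(t) = 0$

under $H_{a\vec{t}^\star}$,

$$m_{\vec{t}^\star}(t) = \frac{\alpha(t)\, m_{\vec{t}^\star}(t^\ell) + \beta(t)\, m_{\vec{t}^\star}(t^r)}{\sqrt{E\left[\{2p(t) - 1\}^2\right]}}$$

The different quantities are:

$$\alpha(t) = Q_t^{1,1} + Q_t^{1,-1} - 1, \qquad \beta(t) = Q_t^{1,1} - Q_t^{1,-1},$$

$$\mathrm{Cov}\left\{Z^\star(t^\ell), Z^\star(t^r)\right\} = e^{-2(t^r - t^\ell)},$$

$$m_{\vec{t}^\star}(t^\ell) = \sum_{s=1}^{m} a_s\, e^{-2|t_s^\star - t^\ell|} / \sigma, \qquad m_{\vec{t}^\star}(t^r) = \sum_{s=1}^{m} a_s\, e^{-2|t^r - t_s^\star|} / \sigma,$$

and

$$E\left[\{2p(t) - 1\}^2\right] = \{\alpha(t)\}^2 + \{\beta(t)\}^2 + 2\, \alpha(t)\, \beta(t)\, e^{-2(t^r - t^\ell)}.$$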

The proof is given in Section 7.1.
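The mean function of the theorem is straightforward to evaluate numerically. The sketch below is an illustration under stated assumptions, not the authors' code: the function names `interpolation_weights` and `mean_function` are hypothetical, positions are in Morgans, $\sigma = 1$ by default, and a tested position that coincides with a marker is handled through the continuity of the interpolation (the weights then reduce to $(\alpha, \beta) = (1, 0)$ or $(0, 1)$).

```python
import numpy as np


def interpolation_weights(t, markers):
    """Return (k, alpha, beta, norm) for each tested position t.

    k is the index of the left flanking marker, alpha and beta are the
    interpolation weights, and norm is sqrt(E[{2p(t) - 1}^2]).
    """
    markers = np.asarray(markers, dtype=float)
    t = np.atleast_1d(np.asarray(t, dtype=float))
    k = np.clip(np.searchsorted(markers, t, side="right") - 1, 0, markers.size - 2)
    tl, tr = markers[k], markers[k + 1]

    def rbar(a, b):
        # Probability of no recombination under the Haldane model.
        return 0.5 * (1.0 + np.exp(-2.0 * np.abs(a - b)))

    q11 = rbar(tl, t) * rbar(t, tr) / rbar(tl, tr)
    q1m1 = rbar(tl, t) * (1.0 - rbar(t, tr)) / (1.0 - rbar(tl, tr))
    alpha = q11 + q1m1 - 1.0
    beta = q11 - q1m1
    norm = np.sqrt(alpha ** 2 + beta ** 2 + 2.0 * alpha * beta * np.exp(-2.0 * (tr - tl)))
    return k, alpha, beta, norm


def mean_function(t, markers, qtl_pos, a, sigma=1.0):
    """Mean function of the limiting process for m QTL at qtl_pos with parameters a."""
    markers = np.asarray(markers, dtype=float)
    qtl_pos = np.asarray(qtl_pos, dtype=float)
    a = np.asarray(a, dtype=float)
    # Mean at each marker: sum_s a_s * exp(-2 |t_s* - t_k|) / sigma.
    dist = np.abs(markers[:, None] - qtl_pos[None, :])
    m_markers = (a * np.exp(-2.0 * dist)).sum(axis=1) / sigma
    k, alpha, beta, norm = interpolation_weights(t, markers)
    return (alpha * m_markers[k] + beta * m_markers[k + 1]) / norm


if __name__ == "__main__":
    markers = np.linspace(0.0, 1.0, 6)  # 6 markers every 20 cM on T = 100 cM
    grid = np.linspace(0.0, 1.0, 101)   # tested positions every 1 cM
    m = mean_function(grid, markers, qtl_pos=[0.3, 0.7], a=[4.0, 4.0])
    print(grid[np.argmax(m)])           # location of the supremum of the signal
```

With two QTL of equal effect at 30 cM and 70 cM, the supremum of this signal falls in the middle marker interval, which contains no QTL.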

4.2. Illustration of the theorem and of the Ghost QTL phenomenon

[Figure 1: two panels plotted against t (cM). Left panel: process $Z^\star(.)$. Right panel: process $\{Z^\star(.)\}^2$.]

Fig. 1. A path under $H_0$ of the processes $Z^\star(.)$ and $\{Z^\star(.)\}^2$ (T = 100cM, 6 markers equally spaced every 20cM)

[Figure 2: two panels plotted against t (cM). Left panel: $m = 1$, with curves for $t_1^\star$ = 70cM and $a_1$ = 4, $t_1^\star$ = 30cM and $a_1$ = 4, $t_1^\star$ = 70cM and $a_1$ = 6. Right panel: $m = 2$, $t_1^\star$ = 30cM, $t_2^\star$ = 70cM, $a_1$ = 4, with curves for $a_2$ = 4 and $a_2$ = 6.]

Fig. 2. Mean function $m_{\vec{t}^\star}(t)$ as a function of the number $m$ of QTL, their positions $t_s^\star$, and the parameters $a_s$ linked to the QTL effects (T = 100cM, 6 markers equally spaced every 20cM)

[Figure 3: two panels plotted against t (cM), each with curves for $a_2$ = 4 and $a_2$ = 6. Left panel: process $Z^\star(.)$. Right panel: process $\{Z^\star(.)\}^2$.]

Fig. 3. Same path of $Z^\star(.)$ and $\{Z^\star(.)\}^2$ as under $H_0$ but under $H_{a\vec{t}^\star}$ ($m = 2$, $t_1^\star$ = 30cM, $t_2^\star$ = 70cM, $a_1$ = 4, T = 100cM, 6 markers equally spaced every 20cM)

In order to illustrate the theorem, we will consider a genetic map which consists of a chromosome of size $T = 100$ cM with $6$ markers equally spaced every $20$ cM. Figure 1 refers to the absence of QTL on the chromosome. On the left-hand side, a path of the process $Z^\star(.)$ is represented under $H_0$. As there is no QTL, it corresponds only to noise. Besides, we can observe the interpolation obtained between genetic markers. The same path corresponding to the process $\{Z^\star(.)\}^2$ has been added on the right-hand side: in genetics, we call this path "a likelihood profile". It is usually this path that we obtain when we analyze data. Note that many authors, instead of computing the process $\Lambda_n(.)$, focus on the LOD process, $LOD_n(.)$, where $LOD_n(.) = \Lambda_n(.)/\{2 \log(10)\}$.
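The conversion between the two scales is a simple rescaling, as in the following sketch. The function name `lrt_to_lod` is illustrative, and the logarithm is taken to be the natural logarithm, which is what makes the LOD score a base-10 logarithm of the likelihood ratio.

```python
import math


def lrt_to_lod(lrt):
    """Convert a likelihood ratio test statistic into a LOD score."""
    return lrt / (2.0 * math.log(10.0))


if __name__ == "__main__":
    # An LRT statistic of 2 * log(10) corresponds to a LOD score of 1.
    print(lrt_to_lod(2.0 * math.log(10.0)))
```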

Figure 2 represents the signal. On the left-hand side, we present some mean functions $m_{\vec{t}^\star}(t)$ when only one QTL ($m = 1$) is located on the chromosome. As expected, the supremum of these interpolated functions is obtained at the location of the QTL. Besides, the larger the QTL effect is, the stronger the signal is. On the right-hand side, the focus is on $m_{\vec{t}^\star}(t)$ when $m = 2$. According to the theorem, $m_{\vec{t}^\star}(t)$ is obtained by summing the mean functions corresponding to the case $m = 1$. As a consequence, the functions $m_{\vec{t}^\star}(t)$ of the right-hand graph are easily obtained from those of the left-hand graph. Let us focus on the curve in solid line. The two QTL are located respectively at $t_1^\star = 30$ cM and $t_2^\star = 70$ cM. So, the marker interval ($40$ cM, $60$ cM) is adjacent to the two marker intervals where the QTL are located. As a result, we can observe on the graph that the biggest peak is obtained in the interval ($40$ cM, $60$ cM) and that the supremum is obtained in the middle of this marker interval, at $50$ cM. Note that it is obtained exactly at $50$ cM since we consider exactly the same effect ($a_1 = a_2 = 4$) and there is symmetry due to the locations of the QTL and the length of the chromosome. If we now consider a larger effect for the second QTL ($a_2 = 6$) located at $t_2^\star = 70$ cM (dashed line), we can observe almost the same two peaks in the intervals ($40$ cM, $60$ cM) and ($60$ cM, $80$ cM). Besides, the supremum of the mean function is obtained at $52$ cM. It is like a barycenter: some weights are assigned to the QTL as a function of their effects, so the signal and the location of the supremum are affected by these weights.

Figure 3 is the analogue of Figure 1 under the alternative of $2$ QTL located at $t_1^\star = 30$ cM and $t_2^\star = 70$ cM. As in Figure 1, the path of the process $Z^\star(.)$ is on the left-hand side whereas the one corresponding to $\{Z^\star(.)\}^2$ is on the right-hand side. According to the theorem, in order to obtain the path of $Z^\star(.)$ under $H_{a\vec{t}^\star}$, we have to sum the path of $Z^\star(.)$ under $H_0$ (i.e. the noise) and the mean function $m_{\vec{t}^\star}(t)$ (i.e. the signal). In other words, the path of $Z^\star(.)$ under $H_{a\vec{t}^\star}$ has been obtained by adding the path of $Z^\star(.)$ presented in Figure 1 and the mean function of the right-hand graph of Figure 2. Note that on the right-hand side of Figure 3, the likelihood profile (i.e. the path of $\{Z^\star(.)\}^2$) has easily been obtained by computing the square of $Z^\star(.)$. We can observe in Figure 3 that, when the effects of the two QTL are the same (i.e. the solid lines), the biggest peak is obtained between $40$ cM and $60$ cM, which is a marker interval where there is no QTL: such a peak is called a ghost QTL (Martinez and Curnow (1992)). It was expected since the supremum of the signal was obtained at $50$ cM.

Note that when we increase the effect of the second QTL (i.e. the dashed lines), the biggest peak is obtained in the marker interval ($60$ cM, $80$ cM), which is the interval which contains the second QTL. It is due to the noise, since the signal is almost the same in the intervals ($40$ cM, $60$ cM) and ($60$ cM, $80$ cM) whereas the values of $Z^\star(.)$ are larger under $H_0$ in the marker interval ($60$ cM, $80$ cM) than in the interval ($40$ cM, $60$ cM).
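The construction "noise plus signal" described above can be reproduced with a short simulation. This is only a sketch under stated assumptions: the names `rbar` and `simulate_profile` are hypothetical, positions are in Morgans, and the marker values of the limiting process are drawn as a centered Gaussian vector with covariance $e^{-2|t_k - t_{k'}|}$ between any two markers, which extends to all pairs the covariance that the theorem states for the two flanking markers.

```python
import numpy as np


def rbar(a, b):
    """Probability of no recombination under the Haldane model."""
    return 0.5 * (1.0 + np.exp(-2.0 * np.abs(a - b)))


def simulate_profile(grid, markers, qtl_pos=(), a=(), sigma=1.0, seed=0):
    """Return (noise, signal) on the grid; the process is noise + signal."""
    rng = np.random.default_rng(seed)
    grid = np.asarray(grid, dtype=float)
    markers = np.asarray(markers, dtype=float)
    qtl_pos = np.asarray(qtl_pos, dtype=float)
    a = np.asarray(a, dtype=float)

    # Assumed covariance of the process at the markers (see lead-in).
    cov = np.exp(-2.0 * np.abs(markers[:, None] - markers[None, :]))
    z_markers = rng.multivariate_normal(np.zeros(markers.size), cov)

    # Mean at the markers; all zeros when there is no QTL.
    dist = np.abs(markers[:, None] - qtl_pos[None, :])
    m_markers = (a * np.exp(-2.0 * dist)).sum(axis=1) / sigma

    # Non linear interpolation between flanking markers.
    k = np.clip(np.searchsorted(markers, grid, side="right") - 1, 0, markers.size - 2)
    tl, tr = markers[k], markers[k + 1]
    q11 = rbar(tl, grid) * rbar(grid, tr) / rbar(tl, tr)
    q1m1 = rbar(tl, grid) * (1.0 - rbar(grid, tr)) / (1.0 - rbar(tl, tr))
    alpha, beta = q11 + q1m1 - 1.0, q11 - q1m1
    norm = np.sqrt(alpha ** 2 + beta ** 2 + 2.0 * alpha * beta * np.exp(-2.0 * (tr - tl)))

    noise = (alpha * z_markers[k] + beta * z_markers[k + 1]) / norm
    signal = (alpha * m_markers[k] + beta * m_markers[k + 1]) / norm
    return noise, signal


if __name__ == "__main__":
    markers = np.linspace(0.0, 1.0, 6)
    grid = np.linspace(0.0, 1.0, 101)
    noise, signal = simulate_profile(grid, markers, [0.3, 0.7], [4.0, 4.0], seed=1)
    profile = (noise + signal) ** 2  # likelihood profile
    print(grid[np.argmax(profile)])
```

Running it with the same seed and different values of the second effect keeps the noise path fixed and changes only the signal, which is the comparison made between the solid and dashed lines.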
