HAL Id: hal-00893973
https://hal.archives-ouvertes.fr/hal-00893973
Submitted on 1 Jan 1992
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of
sci-entific research documents, whether they are
pub-lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destinée au dépôt et à la diffusion de documents
scientifiques de niveau recherche, publiés ou non,
émanant des établissements d’enseignement et de
recherche français ou étrangers, des laboratoires
publics ou privés.
Identification of a major gene in F1 and F2 data when
alleles are assumed fixed in the parental lines
Llg Janss, Jhj van der Werf
To cite this version:
Original
article
Identification
of
a
major
gene
in
F
1
and
F
2
data when alleles
are
assumed
fixed
in the
parental
lines
LLG
Janss, JHJ Van Der Werf
Wageningen Agricultural University,
Department
of
AnimalBreeding
PO Box 338 6700 AHWageningen,
The Netherlands(Received
9August
1991; accepted
27August 1992)
Summary - A maximum likelihood method is described to
identify
amajor
geneusing
F
2
,
andoptionally Fi,
data of anexperimental
cross. A model which assumed fixation at themajor
locus inparental
lines wasinvestigated by
simulation. Forlarge
data sets(1000 observations)
the likelihood ratio test was conservative andyielded
a type I errorof
3%,
at a nominal level of 5%. The power of the test reached > 95% for additive andcompletely
dominant effects of 4 and 2 residual SDsrespectively.
For smaller data sets,power decreased. In this model
assuming fixation, polygenic
effects may beignored,
but onvarious other
points
the model ispoorly
robust. WhenF
l
data were included any increase in variance fromF
i
toF
2
biased parameter estimates and led toputative
detection ofa
major
gene. When allelessegregated
inparental lines,
parameter estimates were alsobiased,
unless the average allelefrequency
wasexactly
0.5. The model usesonly
thenon-normality
of the distribution due to themajor
gene and corrections fornon-normality
dueto other sources cannot be made. Use of data and models in which alleles
segregate
inparents, eg
F
3
data,
willgive
better robustness and power.cross
/
major gene/
maximum likelihood/
hypothesis testingRésumé - Identification d’un gène majeur en Fi et F2 quand les allèles sont supposés fixés dans les lignées parentales. Cet article décrit une méthode de maximum de vraisemblance pour
identifier
ungène majeur
àpartir
de donnéesF
2
,
et éventuellementF
l
,
d’un croisementexpérimental.
Un modèle supposant un locusmajeur
avec des allèlesfixés
dans leslignées parentales
est étudié à l’aide de simulations. Pour desfichiers
degrande
taille (1 000 observations),
le test du rapport de vraisemblance est conservateur,avec une erreur de
première espèce
de,i%,
à un niveau nominal de 5%. Lapuissance
du testd’identification
d’ungène majeur
atteintplus
de 95% pour deseffets additifs
et de dominancede
4 et
2écarts-types respectivement.
Pour desfichiers
de tailleplus petite,
la
puissance
baisserapidement.
Dans le modèle utilisé la variancepolygénique peut
êtrenégligée
mais sur d’autrespoints
le modèle est peu robuste. Si des donnéesF
i
sontincluses,
touteaugmentation
de la variance entre Fl et F2 introduit un biais sur lesparamètres
dans les
lignées parentales,
lesparamètres
estimés sontégalement
biaisés si lafréquence
allédique
moyenne n’est pas exactement de0,5.
Finalement,
le modèle n’utilise que la nonnormalité de la distribution due au
gène majeur,
et ne peut pascorriger
pour une nonnormalité due à d’autres raisons. L’utilisation d’un modèle ou les allèles
ségrègent
chezles parents, par
exemple
sur des donnéesF
3
,
doit améliorer la robustesse et lapuissance
du test.croisement
/
gène majeur/
maximum de vraisemblance/
test d’hypothèseINTRODUCTION
In animal
breeding,
crosses are used to combine favourable characteristics into onesynthetic
line. It is useful to detect amajor
gene as soon aspossible
in such aline,
because selection could be carried out more
efficiently,
orrepeated
backcrosses bemade. Once a
major
gene has been identified it can also be used forintrogression
in other lines.Major
genes can be identifiedusing
maximum likelihoodmethods,
such assegregation analysis
(Elston
andStewart,
1971;
Morton andMacLean,
1974).
Segregation
analysis
is a universal method and can beapplied
inpopulations
wherealleles
segregate
inparents.
However,
whenapplied
toF
l
,
F
2
or backcross dataassuming
fixation of alleles inparental
lines,
genotypes
ofparents
are assumedknown and all
equal
and thisanalysis
leads to thefitting
of a mixture distributionwithout
accounting
forfamily
structure.Fitting
of mixture distributions has beenproposed
when pure line and backcrossdata as well as
F
i
and2
F
data areavailable,
and whenparental
lines arehomozygous
for all loci
(Elston
andStewart,
1973;
Elston,
1984).
Statisticalproperties
of thismethod, however,
were notdescribed,
and severalassumptions
may not hold. Forexample,
not much is knownconcerning
the power of this method whenonly
F
2
data areavailable,
which is often the case whendeveloping
asynthetic
line.Furthermore, homozygosity
at all loci inparental
lines is not tenable inpractical
animal
breeding.
Here it is assumed that many alleles of smalleffect,
so-calledpolygenes,
aresegregating
in theparental
lines. Alleles at themajor
locus areassumed fixed.
F
l
data couldpossibly
beincluded,
but this is notnecessarily
moreinformative because
F
i
andF
2
generations
may have different means and variances due tosegregating
polygenes.
The aim of this paper is to
investigate
by
simulation some of the statisticalproperties
offitting
mixturedistributions,
such asType
I error, power of the likelihood ratio test and bias ofparameter
estimates whenusing only
F
2
data. Tostudy
theproperties
of themajor
genemodel,
polygenic
variance is not estimated.MODELS USED FOR SIMULATION
A
base-population
of F
1
individuals wassimulated, although
theF
1
generation
may not have had observed records. Consider asingle
locus A with allelesA
l
andA
2
,
where
A
l
hasfrequencies
fp
and1
m
in thepaternal
and maternal line.Genotype
frequencies,
values and numeration aregiven
forF
l
individuals as:Genotypes
ofF
1
animals were allocatedaccording
to thefrequencies
given
aboveusing
uniform random numbers. For theF
2
generation,
genotype
probabilities
werecalculated
given
theparents’
genotypes
using
Mendelian transmissionprobabilities
andassuming
randommating
and no selection. A random environmentalcomponent
e
i was simulated and added to the
genotype.
The observation en individuali(F
l
orF
z
)
withgenotype
r(y.L )
is:with ei distributed
N(O,
0
&dquo;2).
Polygenic
effects are assumed to benormally
dis-tributed. For base individualspolygenic
values weresampled
fromN(O, a 9 2),
wherea§
is
thepolygenic
variance. Norecords
were simulated forF
i
individuals whenpolygenic
effects were included. ForF
2
2 offspring,
phenotypic
observationsy’Ù
weresimulated as:
where Oi
is the Mendeliansampling
term,
sampled
fromN(O,
Q9/
2),
ap and a&dquo;, arepaternal
and maternalpolygenic
values and eijis distributedN(0,
!2).
Additionally,
data were simulated with nomajor
gene orpolygenic
effect:where
ei is distributed./V(0,o!).
A balancedfamily
structure wassimulated,
withan
equal
number ofdams,
nested withinsire,
and anequal
number ofoffspring
foreach dam. Random variables were
generated
by
the IMSL routines GGUBFS foruniform variables and
GGNQF
for normal variables(Imsl,
1984).
MODELS USED FOR ANALYSIS
The test for the presence of a
major
gene is based oncomparing
the likelihood of a model with and without amajor
gene.Polygenic
effects are not included in themodel,
and the model without amajor
gene therefore contains random environmentonly. Apart
frommajor
gene or nomajor
gene, models can account foronly
F
2
data,
Model
for F
2
data with environmentonly
For
F
2
data,
with nobservations,
the model can be written:The
logarithm
of thejoint
likelihood for allobservations,
assuming normality
and uncorrelated errors, is:
Maximising
[5]
withrespect
toQ
andQ
Z
yields
as the maximum likelihood(ML)
estimate for
themean, /3
=E
i
y
i/
n,
and the ML estimate for the variance is!2
=&dquo;E.i(
Yi
-
íJ)
2
/n.
Model for
F
i
and F
2
data with environmentonly
Data on
F
i
andF
2
arecombined,
with nl + n2 = N observations. The observationon
animal j
fromgeneration
i(i
=1, 2)
is:where
!32
is the mean forgeneration
i. Observations forF
I
andF
2
are assumed tohave
equal
environmental variance. Thejoint
log-likelihood
isgiven
as:The
ML estimates for_,O
i
aresimply
the observed means for eachgeneration,
ie
í
3
1
=E!yl!/nl,
andj2
=£;y
2
;/n
2
.
The ML estimate for the variance isModel with
major
gene and environment forF
2
dataWhen alleles are assumed fixed in
parental
lines,
allF
i
individuals are knownto be
heterozygous.
If nopolygenic
effects areconsidered,
this means that allF
2
2individuals have the same
expectation,
andconditioning
onparents
is redundant.In the likelihood for such
data,
summuations over theparents’ possible
genotypes
can be omitted and families can bepooled.
The model isgiven
as:In
[9]
G
i
is thegenotype
of individuali, P
r
denotes theprior
probability
thatG
i
= r, whichequals
1/4, 1/2
and1/4
for r =1,
2 and 3(or
A
I
A
l
, A
l
A
2
andA
2
A
2
).
The total number ofF
2
individuals isgiven
as n, and the functionf
isgiven
as:Model with
major
gene and environment forF
i
andF
2
dataIn the
F
l
generation
only
onegenotype
occurs; henceF
l
data are distributed around asingle
mean, with a varianceequal
to the residual variance in theF
2
generation.
Due topossible
heterosis shownby
thepolygenes
aseparate
mean ismodelled,
but thepossible heterogeneity
in variance causedby
polygenes
is not accounted for. The model forindividual j
fromgeneration
i forgenotype
r is:where
/3i
is a fixed effect forgeneration
i. Model[11]
isoverparameterised
becausegenotype
means and 2general
means are modelled. We chose toput
/?
2
= 0. Inthat case the mean of
F
i
individuals,
which all have knowngenotype
r =2,
can bewritten as !F1 =
U2
+,3
1
.
Thejoint
log-likelihood
forF
i
andF
2
data,
using
!,F1 is:where nl and n2 are
number of
observations in theF
l
andF
2
generation.
The MLestimate for pfi is
equal
to!31
in!6).
ML estimates for
!C,.(r
=1, 2, 3)
andQ
2
in models
[8]
and[11]
cannot begiven
explicitly.
Theseparameters
were estimatedby
minimising
minuslog-likelihood
L
2
in
[9]
andL2
in
!12!,
using
aquasi-Newton
minimisation routine. Areparameter-isation was made
using
the difference betweenhomozygotes
t =A3
- iii, and a
relative dominance coefficient d =
(!2 - !i)/t,
as in Morton and MacLean(1974).
By
experience,
thisparameterisation
was found moreappropriate
than theparam-eterisation
using
3 means iLl, !2 and J.l3, because convergence isgenerally
reached faster due to smallersampling
covariances between the estimates. The mean waschosen as the
midhomozygote value: a
=1 /2p
i
+1/2/!3.
Parameters t and d are easier to
interpret
than 3 means, and therefore results arealso
presented using
theseparameters.
Parameter t indicates themagnitude
of themajor
gene effect and can beexpressed
eitherabsolutely
or in units of the residualstandard deviation. Parameter t was constrained to be
positive,
which isarbitrary
because the likelihood for the
parameters p,
t and d isequal
to the likelihood forthe
parameters
p, -t and(1-d).
Parameter d was estimated in the interval[0,1].
Problems were detected when this constraint was not
used,
because t could becomezero,
leading
toinfinitely large
estimates for d. This occurredfrequently
when theeffects where small and dominant. Minimisation
by
IMSL routine ZXMIN(Imsl,
1984)
specified
3significant
digits
in the estimatedparameters
as the convergenceHYPOTHESIS
TESTING
The null
hypothesis
(H
o
)
is &dquo;nomajor
geneeffect&dquo;,
whereas the alternativehypothesis
(H
i
)
is &dquo;amajor
gene effect ispresent&dquo;.
Thelog-likelihoods
L
I
in[5]
andL
2
in[9]
are the likelihoods for eachhypothesis
whenonly
F
2
data arepresent.
When
F
l
data are included the likelihoodsLi
in
[7]
andL*
in
[12]
apply.
A likelihoodratio test is used to
accept
orreject H
o
.
Twice thelogarithm
of the likelihood ratiois
given
as:Two
important
aspects
of any test are thetype
I andtype
II errors. Thetype
I error is the
percentage
of cases in whichH
o
isrejected, although
it is true. TheH
o
model is simulatedby
(3!.
Thetype
II error is thepercentage
of cases in whichH
l
isrejected, although
it was true.Here,
thetype
II error is notused,
but itscomplement,
the power, which is thepercentage
of cases in whichH
l
isaccepted,
when
H
l
is true. TheH
I
model is simulatedby
model(1!.
Fixation of alleles inparental
lines is simulatedby taking
fp
= 1 andf
m
= 0.Type
I errorThe distribution of T when
H
o
is true isexpected asymptotically
tobe x2
with
2
degrees
offreedom,
because theH
I
model has 2parameters
more than theH
o
model
(Wilks, 1938).
Since inpractice
data sets arealways
of finitesize,
it isinteresting
to know whether and when the distribution of Tis closeenough
to theexpected asymptotic distribution,
so thatquantiles
from ax
2
distribution
can beused as critical values.
Type
I errors were estimated for data sets of 100 up to 2 000observations,
simulating
1 000replicates
for each size of data set. Three critical values wereused,
corresponding
to nominal levels of10,
5 and1%.
The nominal level is defined as theexpected
error rate, based on theasymptotic
distribution. Exact binomialprobabilities
were used to test whether the estimates differedsignificantly
from the nominal level. When the observed number of
significant replicates
does notdiffer
significantly,
a x
z
distribution is considered suitable toprovide
critical values.Also,
when the observed number is lower thanexpected
theasymptotic
distributionmight
remain useful. The nominaltye
I error is in that case an upper bound forthe real
type
I error.Power of the test and estimated
parameters
The power is
investigated
for additive(d
=0.5)
andcompletely
dominant(d
=1)
effects,
with a residual variance of100,
and tvarying
from10-40,
ie from 1 to4 SDs. The additive
genetic
variance causedby
this locusequals
t
2
/8,
when t isabsolute.
Heritability
in the narrow sense therefore varies from 0.11-0.67. Each dataset contained 1 000
observations,
and each situation wasrepeated
100 times. Thepower of the test for smaller data sets was
investigated
for onerelatively
small effectRobustness
Investigation
of thetype
I error and the power considered situations where eitherH
o
orH
l
wastrue,
satisfying
allassumptions
in the models. The robustness of thistest and usefulness of the
assumption
of fixation inparents
forparameter
estimationwas
investigated
for situations which violate 2assumptions:
- when there is
a covariance between error terms. This was induced
by
simulationof
polygenic
varianceby
model(2].
The total variance was held constant at100,
sothat the power of the test could not
change
due to achange
in totalvariance;
- when fixation of alleles is
not the case. The data were simulated
by
model(1],
in which
fp
andf
m
were notequal
to 0 and1,
resulting
insegregation
of allelesin the
F
l
parents.
Firstly,
3situations
were simulated where the average allelefrequency
remains 0.5. In that caseonly
theassumption
that allF
1
parents
areheterozygous
was violated.Secondly,
3 situations were simulated where the averageallele
frequency
was not 0.5. In that case, theassumption
thatgenotype
frequencies
in
F
2
are1/4, 1/2,
and1/4
was also violated.Inclusion
of F
l
dataA
major
gene which startssegregating
in theF
2
notonly
renders the distributionnon-normal,
but also increases thephenotypic
variance in theF
2
relative to theF
i
.
WhenF
i
data areincluded,
this increase in variance may be taken assupplementary
evidence,
apart
from anynon-normality,
for the existence of amajor
gene.Assessing
the relative
importance
of the 2 sources of information is useful so as tojudge
the robustness of the model
including
F
i
data. The effects onnon-normality
and increasedF
2
variance due to themajor
gene should therefore bedistinguished.
Thiswas
accomplished by simulating
different residual variances inF
I
andF
2
.
Four situations wereinvestigated, combining
all combinations ofnon-normality
inF
2
and increased variance inF
2
(table I).
Ingeneral,
500F
i
and 1000F
2
observationswere simulated. For situation
3,
data sets with 1000F
I
and 1000F
2
observationswere also
investigated.
Data for situations 1 and 3 were simulatedby
model(3],
RESULTS
Type
I error andparameter
estimates under the nullhypothesis
Estimated
type
I errors, based on 1 000replicates,
have beengiven
in table II fordifferent sizes of the data set. Estimates
decreased,
and more or less stabilised whenthe size of the data set exceeded 1 000
observations,
especially
for a nominal levelof
10%,
which were most accurate. For theselarge
data sets,however,
thetype
Ierrors were too low
(P
<0.01),
which means that critical values obtained from aX
’2
distribution wouldprovide
a too conservative test. Forexample,
application
of theX2
95-percentile
to data sets with 1 000 observations will not result in theexpected
type
I error of5%,
but rather in atype
I error of x53%.
When no
major
gene effect waspresent,
stil on average a considerable effectcould be found. Parameter estimates for the
major
gene model have beengiven
intable
III,
simulating
just
anormally
distributed error effect with variance 100. Theempirical
standard deviation for estimated t-valuesranged
between7(N
=100)
and
5(N
=2000) (not
intable).
Theaverage estimate for t is therefore
biased,
and many of the individual estimates weresignificantly
different from zero if a t-test wasapplied.
The average estimated d is0.5,
which isexpected
because the simulateddistribution was
symmetrical.
Parameter estimates and power of the test
Results for the different situations studied under a
major
gene model are in table IV.The
x) 95-percentile
was used as critical value for the test. The power reached over95%
for additive effects(d
=0.5)
with a t-value of40,
which is 4 a(residual
standard
deviations).
Forcompletely
dominant effects(d
=1),
100%
power was
reached for an effect of t = 20
(2a).
Phenotypic
distributions for these 2 cases areunimodal,
although
not normal(fig
1).
For small
genetic
effects(t
!10,
ie1
Q
)
t wasoverestimated,
inparticular
whend = 1 and was underestimated for d = 0.5. For d =
0.5,
average estimates for t and d differed from the simulated values
by
<1%
when the power reached near100%.
For d =
1, however,
the bias in t was still10%
when thepower had reached
100%.
This bias reduced
gradually,
and was <1%
for agenetic
effect of t = 40.In
figure
2 power of the test isdepicted
forvarying
sizes of the data set. Two additive effects werechosen,
with t = 25 and t = 35. Eachpoint
in thefigure
is on average of 100
replicates.
The power increased withincreasing
number ofobservations.
Increasing
the number of observations > 1000 gaverelatively
lessof observations this
graph
isexpected
to level off at thetype
I error(nominally
5%),
butsampling
makes results somewhat erratic.Robustness when
ignoring polygenic
varianceData
following
model[2]
were simulated with d = 0.5 and t = 35 and differentproportions
ofpolygenic
and residual variance. The data set contained 20 sires with 5 dams each and 10offspring
perdam;
each situation wasrepeated
100 times.Estimated parameters and
resulting
power are in table V. Parameter estimates fort and
d,
and the power of the test were not affected when apart
of the variance waspolygenic.
The total estimated variance wasequal
to the sum of simulatedvariances.
Robustness when
ignoring segregation
in theparental
linesData
following
model[1]
were simulated with d =0.5,
t =35,
Q
2
= 100 and variousvalues for
fp
andfm.
Thegenotype
probabilities
inparents
(F
1
)
andoffspring
(F
2
)
have beengiven
in table VI. For the first 3situations,
genotype
probabilities
in theF
j
were1/=1, 1/2
and1/4,
as assumed under the fixationassumption.
For thefrequency
was not 0.5 on average.High
average allelefrequencies
weresimulated,
but because
only
additive effects areconsidered,
results areequally
valid for lowallele
frequencies.
The power remainedequal,
aslong
asgenotype
probabilities
inF
2
remained1/4, 1/2
and1/4,
and parameter estimates are unbiased(table VII).
In case the allele
frequency
did not average0.5, however,
parameter
estimates werebiased. The power of the test
increased,
because in this situation the distributionbecame skewed. The situation with d = 0.5 and t = 35 for data where the gene is fixed in
parental
lines(table
IV),
with a power of82%,
may serve as a reference.Inclusion
of F,
dataFive
hundred,
or1 000,
F
1
observations were alsosimulated,
with additivemajor
gene effects
(table VIII).
With nomajor
gene effect(t
= 0 and hencea
7
71 2
=
0),
andwith
equal
variances inF
i
andF
2
(situation 1)
the average estimated t was muchsmaller than in the model
using only
F
2
data(table III).
In the second situationgiven
major
gene variance of 50. Whenusing
only
data,
F
2
the test had a power ofonly
12%
for detection of an additive effect of t = 20(table IV).
Whenincluding
F
i
data, however,
the power was100%
(table VIII).
From the situations 3 and 4 considered in tableVIII, however,
it becomesapparent
that whenF
I
data wereincluded,
themajor
gene was detectedonly by
its effect onvariance,
considering
apower near the
type
I error rate as irrelevant. When the variance inF
2
increasedby
50%,
but when in fact nomajor
gene waspresent,
amajor
gene was found in100%
of the cases. For smaller increases of the variance(10%)
major
genes were stilldetected,
and theprobability
of detection increased with the size of the data set(alternative
3* with moreF
i
observations).
Amajor
gene wastotally undetectable,
on the otherhand,
when the total variance inF
i
wasequal
to the total variance inF
2
(situation
4).
This shows that theability
to detect amajor
gene can even beworsened when
F
I
data are included. Ifonly
F
2
data were used amajor
gene withsimilar effect was detected in
12%
of the cases(table IV).
DISCUSSION AND CONCLUSIONS
Type
I errorNominal levels for
type
I errors were based on Wilks(1938)
whoproved
asymptotic
convergence of the likelihood ratio test statistic to
a
X
2
distribution.Type
I errorsdecreased and stabilised for
larger
datasets,
asexpected.
The estimatedtype
Ierrors,
however,
weresignificantly
too low. It isunlikely
that thetype
I error,after
having
firstdecreased,
would increase for evenlarger
data sets as studiedhere. It can be concluded
therefore,
thattype
I errors aresignificantly
lower thanexpected
in theasymptotic
case, and that forlarge
data sets the likelihood ratiotest is conservative. It has been
investigated
whether the constraint used on thedominance coefficient could have caused the too low
type
I errors.However,
this was not the case, because even with noconstraint,
too lowtype
I errors were foundFor the
investigation
of power we have chosen to use the theoreticalasymptotic
quantiles, although they
were shown togive
a conservative test. The nominal level for thetype
I error is then an upperbound,
and theexperimenter
still has areasonably
good
idea of the risk ofmaking
atype
I error. When the actualtype
I error would be above the
expected level, however,
the test would become of lessuse.
A second reason for still
using
theoreticalasymptotic quantiles
is thatadapting
the test is difficult and of littlepractical
use. Adifficulty
is,
forinstance,
thatestimated
quantiles
would besubject
tosampling
and the obtainedpoint
estimate is thereforeonly expected
togive
the correct test.Therefore,
2experimenters
investigating
the sametest,
will find different critical values and the testapplied
willdepend
on theexperimenter.
Also inpractice
such aprocedure
would be difficultto
apply
since the calculatedquantile
wouldonly
hold for the same model and datasets of similar size and structure.
Power of the test
Using
only F
2
data,
the power of this test was poor for additive effects(dominance
coefficient =
0.5).
This can beexplained by
theresulting
symmetrical
distributionwhich is similar to the distribution under
H
o
.
In this case, thegenetic
effect hasto be about 4Q to be
detectable,
whichcorresponds
to aheritability
of 0.67in
theF
2
generation.
When the dominance coefficient is1,
an effect of 2Q was detectable.These results are based on data sets with 1000
observations,
but it was shown thatthe power decreased
dramatically
for smaller data sets.Power increased when
F
I
data was included in theanalysis,
and additive effects of 2a could be detected. In that case the increase in variance inF
2
,
causedby
the
major
gene, was taken as animportant
indication for the presence of amajor
gene. The power to detect a
major
gene inF
2
data may also increase if alleles werenot fixed in the
parental lines,
oralternatively F
3
,
instead ofF
2
,
data were used. Thiscorresponds
more to the situation in a usualpopulation,
wherebetween-family
variation will arise. For
F
3
data,
forexample,
when pure lines werehomozygous,
the allele
frequency
will be0.5,
andparents
will be inHardy-Weinberg equilibrium.
For such asituation,
LeRoy
(1989)
found a power of25%
for an additive effect of 2u in a data set of 400 observations(20
sires with 20 half-siboffspring each).
Infigure
2,
the power for a data set of similar size can be seen to beonly !
10%
for an even
larger
effect of2.5!(t
=25).
This indicates that an increase in powermay
be
expected
when theF
3
generation
isobserved,
despite
the facts that moreparameters
have to beestimated,
and thatparents’
genotypes
are nolonger
known.The power for detection of a
major
gene is related to theunexplained
variancein the model of
analysis.
The inclusion of fixed andpolygenic
effects will therefore make themajor
gene easier todetect, provided
that all these effects can beaccurately
estimated.Parameter estimates
For additive effects simulated
(d
=0.5),
bias for the average estimatedgenetic
effectFor dominant effects
(d
=1),
however,
t was overestimatedby
10%
when the powerfor detection of a
major
gene reached100%.
This overestimate isprobably
related tothe underestimate for
d,
which resulted from theapplied
constraint. Asmentioned,
this constraint was
applied
toprevent
t fromgoing
to zero, at whichpoint
d tendedto go to
infinity.
When such a constraint was notapplied
with,
forinstance,
aneffect of t = 10 and d =
1,
analyses
gave in 100
replicates
an average estimated dof 2.93. This is an average overestimate of !
200%.
The average estimateusing
theconstraint was
0.93,
showing
that indeed better estimates were obtained under therestriction,
even when the true value was on the border of the allowedparameter
space. In
practice,
of course, overdominance cannot be excluded andparameter
estimates could be
compared
with and without this constraint. Asmall,
near zero,estimate for t and a
large
estimate for d wouldsuggest
apossible
overestimationof d.
For very small or absent
effects,
the ML estimates wereconsiderably
biased.In this
situation,
theasymptotic properties
of MLestimates,
ieconsistency,
arefar from
being
attained. In the absence of amajor
gene, average estimates werepresented
forincreasing
size of the data set. This showed that the average estimatedecreased,
and willprobably
reach the true value when the number of observations isvery much
larger.
Bias of ML estimates in finitesamples
also resulted insignificant
t-values when no effect was present. This indicates that the presence of a
major
gene should not bejudged
by
the estimates and their standard errors. The standarderrors discussed here were
empirical
standard errors. Inpractice
such standard errors will have to be obtainedusing
the inverse of an estimated Hessianmatrix,
or some otherquadratic approximation
of the likelihood curve near theoptimum.
Using
the estimated Hessianmatrices,
we foundroughly
the same standard errors,although they
were not very accurate. In ourstudy,
thequasi-Newton algorithm
was started close to theoptimum
and notenough
iterations are then carried out toestimate the Hessian matrix
accurately.
Robustness of model and testInclusion of
F
l
data results in apoorly
robust test when differences in variances would arise between theF
i
andF
2
due to other causes than amajor
gene. Anincrease in variance from
F
i
toF
2
can result in aputative major
genebeing
detected. An increase in variance of
10%,
forinstance,
gave25%
false detections when 1000F
i
and 1000F
2
observations were combined. Such increases are notunlikely,
dueto,
forinstance,
polygenes.
Themajor
gene test is thenmerely
atest for
homogeneous
variance inF
1
andF
z
.
The inclusion ofF
l
data could alsoworsen the detection of a
major
gene, when the environmental variance inF
2
wasless. Therefore any differences in
variance,
due to other causes than themajor
geneeffect,
will bias theparameter
estimates. Also in a model that allows forsegregation,
such biases will remain.
’
It was shown that the model is robust when
polygenic
effects wereignored.
This can be
explained
by
the fact that the test usesonly
thenon-normality
of thedistribution as a criterion. It must be noted however
that,
whenpolygenic
effectscan be
accurately estimated,
including
apolygenic
effect in the model will increaseAnother
aspect
of robustness concerns theassumption
of fixed alleles inparental
lines. It was shown that
parameter
estimates were not biased when allelessegre-gated,
aslong
as the averagefrequency
in the 2 lines was 0.5. In that case theas-sumed
fitting
proportions
1/4, 1/2
and1/4
are still correct. If the averagefrequency
in
parental
lines differed from0.5,
t was underestimatedand,
because skewness wasintroduced,
estimates for d deviated from 0.5. This second situation is morelikely
to occur than the situation where the average
frequency
isexactly
0.5. Becauseit could be difficult to
justify
the fixationassumption
apriori,
application
of amore
general
model that allows forsegregation
inparental
linesmight
have to be considered.A final
aspect
of robustness concernsnon-normality
of the distribution not dueto a
major
gene. As
stated earlier a mixture distribution is fitted and the detectionof a
major
gene inF
z
data,
assuming
fixation,
reliessolely
on thenon-normality
caused
by
themajor
gene. This means that in factonly
asignificant
non-normality
is proven. The method would therefore be
poorly
robustagainst
anynon-normality
due to another cause. The robustness
might
beimproved using
data in whichalleles
segregate
inparents.
This isguaranteed
inF
3
data,
but may also arise inF
a
data,
when alleles were not fixed inparental
lines. Ifsegregation
inparents
is the case, evidence for amajor
gene is nolonger only
in thenon-normality
of the overalldistribution,
but also for instance inheterogeneous
withinfamily
variances. Therefore a model that allows forsegregation
is notonly preferred
to increasepower, but is also
preferred
toimprove
robustness.ACKNOWLEDGMENTS
This research was
supported
financially
by
the Dutch Product Board for Livestock andMeat,
the DutchPig
HerdbookSociety,
BovarBV,
EuribridBV,
Fomeva BV and the VOC Nieuw-Dalland BV. ProfsBrascamp
and Grossman and Drs Van Arendonk and Van Putten areacknowledged
forhelpful
comments and editorialsuggestions.
Comments of 2 referees havehelped
to shorten themanuscript
and toimprove
the discussion section.REFERENCES
Elston
RC,
Stewart J(1971)
Ageneral
model for thegenetic
analysis
ofpedigree
data. Hum Hered21,
523-542Elston
RC,
Stewart J(1973)
Theanalysis
ofquantitative
traits forsimple genetic
models fromparental,
F
1
and backcross data. Genetics73,
695-711 1Elston RC
(1984)
Thegenetic analysis
ofquantitative
trait differences between twohomozygous
lines. Genetics108,
733-744INISL
(1984)
Library Reference
Manual Edition 9.2. International and StatisticalLibraries, Houston,
TX ,Le
Roy
P(1989)
Methodes de detection degenes
majeurs;
application
aux animauxdomestiques.
DoctoralThesis,
Universite deParis-Sud,
Centred’Orsay
Morton
NE,
MacLean CJ(1974)
Analysis
offamily
resemblance III.Complex
segregation
ofquantitative
traits. Am J Hum Genet26,
489-503Wilks SS