Identification of a major gene in F1 and F2 data when alleles are assumed fixed in the parental lines

(1)

HAL Id: hal-00893973

https://hal.archives-ouvertes.fr/hal-00893973

Submitted on 1 Jan 1992

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

Identification of a major gene in F1 and F2 data when

alleles are assumed fixed in the parental lines

Llg Janss, Jhj van der Werf

To cite this version:

(2)

Original

article

Identification

of

a

major

_gene

in

_F

₁

and

_F

₂

data when alleles

are

assumed

fixed

in the

parental

lines

LLG

Janss, JHJ Van Der Werf

Wageningen Agricultural University,

Department

of

Animal

_Breeding

PO Box 338 6700 AH

Wageningen,

The Netherlands

(Received

9

_August

_{1991; accepted}

27

_{August 1992)}

Summary - A maximum likelihood method is described to

_identify

a

_major

_gene

_using

F

2 ,

and

_{optionally Fi,}

data of an

_experimental

cross. A model which assumed fixation at the

_major

locus in

_parental

lines was

_{investigated by}

simulation. For

large

data sets

(1000 observations)

the likelihood ratio test was conservative and

_yielded

a _typeI error

of

_3%,

at a nominal level of 5%. The power of the test reached > 95% for additive and

completely

dominant effects of 4 and 2 residual SDs

_{respectively.}

For smaller data sets,

power decreased. In this model

assuming fixation, polygenic

effects may be

ignored,

but on

various other

_points

the model is

_poorly

robust. When

F

l

data were included any increase in variance from

_F

_i

to

_F

₂

biased parameter estimates and led to

putative

detection of

a

_major

_{gene. When alleles}

_segregated

in

_{parental lines,}

_parameterestimates were also

biased,

unless the average allele

frequency

was

_exactly

0.5. The model uses

_only

the

non-normality

of the distribution due to the

_major

_{gene and}corrections for

_{non-normality}

due

to other sources cannot be made. Use of data and models in which alleles

_segregate

in

parents, eg

F

3 data,

will

give

better robustness and _power.

cross

_/

_major_gene

_/

maximum likelihood

_/

hypothesis testing

Résumé - Identification d’un _gènemajeur en Fi et F2 quand les allèles sont supposés fixés dans les lignées parentales. Cet article décrit une méthode de maximum de vraisemblance _pour

_identifier

un

_{gène majeur}

à

_partir

de données

F

2 ,

et éventuellement

F

l

,

d’un croisement

_{expérimental.}

Un modèle supposant un locus

majeur

avec des allèles

fixés

dans les

_{lignées parentales}

est étudié à l’aide de simulations. Pour des

_fichiers

de

grande

taille (1 000 observations),

le test du rapport de vraisemblance est conservateur,

avec une erreur de

_{première espèce}

de

,i%,

à un niveau nominal de 5%. La

_puissance

du test

_{d’identification}

d’un

_{gène majeur}

atteint

_plus

de 95% pour des

effets additifs

et de dominance

de

4 et

2

_{écarts-types respectivement.}

Pour des

_fichiers

de taille

plus petite,

la

_puissance

baisse

_rapidement.

Dans le modèle utilisé la variance

_{polygénique peut}

être

négligée

mais sur d’autres

_points

le modèle est peu robuste. Si des données

F

i

sont

_incluses,

toute

_augmentation

de la variance entre Fl et F2 introduit un biais sur les

paramètres

(3)

dans les

_{lignées parentales,}

les

paramètres

estimés sont

_également

biaisés si la

_fréquence

allédique

moyenne n’est pas exactement de

0,5.

Finalement,

le modèle n’utilise que la non

normalité de la distribution due au

_{gène majeur,}

et ne _peut_pas

_corriger

_pourune non

normalité due à d’autres raisons. L’utilisation d’un modèle ou les allèles

_ségrègent

chez

les parents, par

exemple

sur des données

_F

₃

_,

doit améliorer la robustesse et la

_puissance

du test.

croisement

_/

gène majeur

/

maximum de vraisemblance

_/

test _{d’hypothèse}

INTRODUCTION

In animal

_breeding,

crosses are used to combine favourable characteristics into one

synthetic

line. It is useful to detect a

_major

_geneas soon as

possible

in such a

line,

because selection could be carried out more

_efficiently,

or

_repeated

backcrosses be

made. Once a

_major

_{gene has been}identified it can also be used for

_{introgression}

in other lines.

Major

genes can be identified

_using

maximum likelihood

_methods,

such as

segregation analysis

(Elston

and

_Stewart,

_1971;

Morton and

_MacLean,

_1974).

Segregation

analysis

is a universal method and can be

_applied

in

_populations

where

alleles

_segregate

in

_parents.

_However,

when

_applied

to

_F

_l

_,

F

2

or backcross data

assuming

fixation of alleles in

_parental

_lines,

_genotypes

of

_parents

are assumed

known and all

_equal

and this

_analysis

leads to the

_fitting

of a mixture distribution

without

_accounting

for

_family

structure.

Fitting

of mixture distributions has been

proposed

when pure line and backcross

data as well as

_F

_i

and

2 F

data are

_available,

and when

_parental

lines are

_homozygous

for all loci

_(Elston

and

_Stewart,

_1973;

_Elston,

_1984).

Statistical

_properties

of this

method, however,

were not

_described,

and several

_assumptions

_maynot hold. For

example,

not much is known

concerning

the power of this method when

only

F

2

data are

_available,

which is often the case when

developing

a

_synthetic

line.

Furthermore, homozygosity

at all loci in

_parental

lines is not tenable in

_practical

animal

_breeding.

Here it is assumed that many alleles of small

effect,

so-called

polygenes,

are

_segregating

in the

_parental

lines. Alleles at the

_major

locus are

assumed fixed.

_F

_l

data could

_possibly

be

_included,

but this is not

_necessarily

more

informative because

F

i

and

F

2 generations

may have different means and variances due to

_segregating

_polygenes.

The aim of this _{paper is}to

_investigate

_by

simulation some of the statistical

properties

of

_fitting

mixture

_{distributions,}

such as

_Type

I _error,_powerof the likelihood ratio test and bias of

_parameter

estimates when

_{using only}

F

2

data. To

study

the

_properties

of the

_major

gene

model,

polygenic

variance is not estimated.

(4)

MODELS USED FOR SIMULATION

A

_{base-population}

of F

1

individuals was

_{simulated, although}

the

_F

₁

_generation

may not have had observed records. Consider a

single

locus A with alleles

A

l

and

A

2 ,

where

_A

_l

has

_frequencies

_fp

and

₁

_m

in the

_paternal

and maternal line.

_Genotype

frequencies,

values and numeration are

_given

for

_F

_l

individuals as:

Genotypes

of

_F

₁

animals were allocated

according

to the

_frequencies

_given

above

using

uniform random numbers. For the

F

2 generation,

genotype

probabilities

were

calculated

given

the

_parents’

_genotypes

_using

Mendelian transmission

_{probabilities}

and

_assuming

random

_mating

and no selection. A random environmental

component

e

i was simulated and added to the

_genotype.

The observation en individual

i(F

l

or

F

z

)

with

_genotype

_{r(y.L )}

is:

with ei distributed

N(O,

0 &dquo;2).

Polygenic

effects are assumed to be

_normally

dis-tributed. For base individuals

_polygenic

values were

sampled

from

N(O, a 9 2),

where

a§

is

the

_polygenic

variance. No

records

were simulated for

F

i

individuals when

polygenic

effects were included. For

F

2 2 offspring,

phenotypic

observations

y’Ù

were

simulated as:

where Oi

is the Mendelian

_sampling

_term,

sampled

from

_N(O,

_Q9/

_2),

_apand a&dquo;, are

paternal

and maternal

_polygenic

values and _e_ijis distributed

_N(0,

_!2).

_{Additionally,}

data were simulated with no

_major

_geneor

_polygenic

effect:

where

ei is distributed

./V(0,o!).

A balanced

family

structure was

_simulated,

with

an

_equal

number of

dams,

nested within

_sire,

and an

equal

number of

offspring

for

each dam. Random variables were

_generated

_by

the IMSL routines GGUBFS for

uniform variables and

GGNQF

for normal variables

_(Imsl,

_1984).

MODELS USED FOR ANALYSIS

The test _{for the presence}of a

_major

_{gene is}based on

_comparing

the likelihood of a model with and without a

_major

_gene.

Polygenic

effects are not included in the

model,

and the model without a

_major

_genetherefore contains random environment

only. Apart

from

_major

gene or no

major

_gene,models can account for

_only

F

2 data,

(5)

Model

for F

2

data with environment

_only

For

_F

₂

_data,

with n

_{observations,}

the model can be written:

The

_logarithm

of the

_joint

likelihood for all

observations,

assuming normality

and uncorrelated errors, is:

Maximising

[5]

with

_respect

to

_Q

and

Q

Z

_yields

as the maximum likelihood

(ML)

_{estimate for}

the

mean, /3

=

E

i

y

i/

n,

and the ML estimate for the variance is

!2

=

&dquo;E.i(

Yi

-

íJ)

2 /n.

Model for

F

i

and F

2

data with environment

_only

Data on

F

i

and

F

2

are

_combined,

with nl + n2 = N observations. The observation

on

_{animal j}

from

_generation

_i(i

=

1, 2)

is:

where

_!32

is the mean for

_generation

i. Observations for

_F

_I

and

_F

₂

are assumed to

have

_equal

environmental variance. The

_joint

_{log-likelihood}

is

_given

as:

The

ML estimates for

_{_,O}

_i

are

simply

the observed means for each

generation,

ie

í

3

1

=

E!yl!/nl,

and

j2

=

£;y

2 ;/n

2 .

The ML estimate for the variance is

Model with

_major

_geneand environment for

_F

₂

data

When alleles are assumed fixed in

_parental

_lines,

all

_F

_i

individuals are known

to be

_{heterozygous.}

If no

polygenic

effects are

_considered,

this means that all

_F

₂

2

individuals have the same

_expectation,

and

_conditioning

on

_parents

is redundant.

In the likelihood for such

_data,

summuations over the

parents’ possible

_genotypes

can be omitted and families can be

_pooled.

The model is

_given

as:

(6)

In

_[9]

_G

_i

is the

_genotype

of individual

i, P

r

denotes the

_prior

_probability

that

G

i

= _r,which

equals

1/4, 1/2

and

1/4

for _r=

_1,

2 and 3

(or

A

I

A

l

, A

l

A

2

and

A

2 A

2 ).

The total number of

F

2

individuals is

_given

as n, and the function

f

is

given

as:

Model with

_major

_geneand environment for

_F

_i

and

_F

₂

data

In the

F

l

generation

only

one

_genotype

occurs; hence

F

l

data are distributed around a

_single

_mean,with a variance

_equal

to the residual variance in the

F

2 generation.

Due to

_possible

heterosis shown

_by

the

_polygenes

a

_separate

mean is

_modelled,

but the

_{possible heterogeneity}

in variance caused

_by

_polygenes

is not accounted for. The model for

_{individual j}

from

_generation

i for

_genotype

r is:

where

_/3i

is a fixed effect for

_generation

i. Model

_[11]

is

_{overparameterised}

because

genotype

means and 2

_general

means are modelled. We chose to

_put

_/?

₂

= _{0. In}

that case the mean of

_F

_i

_individuals,

which all have known

_genotype

r =

2,

can be

written as _!F1=

U2

+,3

1 .

The

joint

log-likelihood

for

F

i

and

F

2 data,

using

!,F1 is:

where _n_l and n2 are

_{number of}

observations in the

F

l

and

F

2 generation.

The ML

estimate _{for p}_fi_is

_equal

to

!31

in

_!6).

ML estimates for

_!C,.(r

=

1, 2, 3)

and

Q

2 in models

_[8]

and

_[11]

cannot be

_given

explicitly.

These

_parameters

were estimated

_by

_minimising

minus

_{log-likelihood}

_L

₂

in

_[9]

and

_L2

_in

_!12!,

using

a

_quasi-Newton

minimisation routine. A

reparameter-isation was made

_using

the difference between

_homozygotes

t =

A3

- iii, and a

relative dominance coefficient d =

(!2 - !i)/t,

_asin Morton and MacLean

(1974).

By

experience,

this

_{parameterisation}

was found more

_appropriate

than the

param-eterisation

_using

3 means _i_Ll_,_!2and _J_.l₃_,because _{convergence is}

_generally

reached faster due to smaller

_sampling

covariances between the estimates. The mean was

chosen as the

_{midhomozygote value: a}

=

1 /2p

i

₊

1/2/!3.

Parameters t and d are easier to

_interpret

than 3 means, and therefore results are

also

_{presented using}

these

_parameters.

Parameter t indicates the

_magnitude

of the

major

gene effect and can be

_expressed

either

_absolutely

or in units of the residual

standard deviation. Parameter t was constrained to be

_positive,

which is

arbitrary

because the likelihood for the

_{parameters p,}

t and d is

_equal

to the likelihood for

the

_parameters

_{p, -t}and

_(1-d).

Parameter d was estimated in the interval

[0,1].

Problems were detected when this constraint was not

_used,

because t could become

zero,

leading

to

_{infinitely large}

estimates for d. This occurred

_frequently

when the

effects where small and dominant. Minimisation

_by

IMSL routine ZXMIN

_(Imsl,

1984)

specified

3

_significant

_digits

in the estimated

_parameters

as the convergence

(7)

HYPOTHESIS

TESTING

The null

_hypothesis

_(H

_o

₎

is &dquo;no

_major

gene

effect&dquo;,

whereas the alternative

hypothesis

(H

i

)

is &dquo;a

_major

_geneeffect is

_{present&dquo;.}

The

_{log-likelihoods}

_L

_I

in

_[5]

and

L

2

in

_[9]

are the likelihoods for each

hypothesis

when

only

F

2

data are

present.

When

_F

_l

data are included the likelihoods

_Li

_in

[7]

and

_L*

_in

[12]

_apply.

A likelihood

ratio test is used to

_accept

or

_{reject H}

_o

_.

Twice the

_logarithm

of the likelihood ratio

is

_given

as:

Two

_important

_aspects

of any test are the

_type

I and

_type

II errors. The

_type

I error is the

_percentage

of cases in which

H

o

is

_{rejected, although}

it is true. The

H

o

model is simulated

_by

_(3!.

The

_type

II error is the

_percentage

of cases in which

H

l

is

_{rejected, although}

it was true.

_Here,

the

_type

II error is not

_used,

but its

complement,

the _power,which is the

_percentage

of cases in which

_H

_l

is

_accepted,

when

H

l

is true. The

H

I

model is simulated

_by

model

_(1!.

Fixation of alleles in

parental

lines is simulated

_{by taking}

_fp

= 1 and

f

m

= 0.

Type

I error

The distribution of T when

_H

_o

is true is

_{expected asymptotically}

to

_{be x2}

with

2

_degrees

of

_freedom,

because the

_H

_I

model has 2

_parameters

more than the

H

o

model

_{(Wilks, 1938).}

Since in

_practice

data sets are

_always

of finite

_size,

it is

interesting

to know whether and when the distribution of Tis close

enough

to the

expected asymptotic distribution,

so that

quantiles

from a

x

2 distribution

can be

used as critical values.

_Type

I errors were estimated for data sets of 100 up to 2 000

observations,

simulating

1 000

_replicates

for each size of data set. Three critical values were

_used,

_{corresponding}

to nominal levels of

_10,

5 and

1%.

The nominal level is defined as the

_expected

error _rate,based on the

asymptotic

distribution. Exact binomial

_{probabilities}

were used to test whether the estimates differed

_{significantly}

from the nominal level. When the observed number of

_{significant replicates}

does not

differ

_{significantly,}

_{a x}

_z

distribution is considered suitable to

_provide

critical values.

Also,

when the observed number is lower than

_expected

the

_asymptotic

distribution

might

remain useful. The nominal

_tye

I error is in that case an upper bound for

the real

type

I error.

Power of the test and estimated

_parameters

The power is

_investigated

for additive

_(d

=

0.5)

and

_completely

dominant

(d

=

1)

effects,

with a residual variance of

_100,

and t

_varying

from

_10-40,

ie from 1 to

4 SDs. The additive

_genetic

variance caused

_by

this locus

equals

t

2 /8,

when t is

absolute.

_Heritability

in the narrow sense therefore varies from 0.11-0.67. Each data

set contained 1 000

_{observations,}

and each situation was

_repeated

100 times. The

power of the test for smaller data sets was

_investigated

for one

_relatively

small effect

(8)

Robustness

Investigation

of the

_type

I error and the _powerconsidered situations where either

H

o

or

_H

_l

was

_true,

_satisfying

all

_assumptions

in the models. The robustness of this

test and usefulness of the

_assumption

of fixation in

parents

for

_parameter

estimation

was

_investigated

for situations which violate 2

_assumptions:

- when there is

a covariance between error terms. This was induced

_by

simulation

of

_polygenic

variance

_by

model

_(2].

The total variance was held constant at

_100,

so

that the _{power of the}test could not

_change

due to a

_change

in total

_variance;

- when fixation of alleles is

not the case. The data were simulated

_by

model

_(1],

in which

_fp

and

_f

_m

were not

_equal

to 0 and

_1,

_resulting

in

_segregation

of alleles

in the

F

l

parents.

Firstly,

3

situations

were simulated where the average allele

frequency

remains 0.5. In that case

_only

the

_assumption

that all

F

1 parents

are

heterozygous

was violated.

_Secondly,

3 situations were simulated where the average

allele

_frequency

was not 0.5. In that case, the

assumption

that

_genotype

_frequencies

in

_F

₂

are

_{1/4, 1/2,}

and

_1/4

was also violated.

Inclusion

_{of F}

_l

data

A

_major

_genewhich starts

_segregating

in the

F

2

not

_only

renders the distribution

non-normal,

but also increases the

_phenotypic

variance in the

_F

₂

relative to the

_F

_i

_.

When

F

i

data are

_included,

this increase in variance may be taken as

_{supplementary}

evidence,

apart

from _any

_{non-normality,}

for the existence of a

major

_gene.

Assessing

the relative

_importance

of the 2 sources of information is useful so as to

_judge

the robustness of the model

_including

_F

_i

data. The effects on

_{non-normality}

and increased

_F

₂

variance due to the

_major

_{gene should therefore be}

_{distinguished.}

This

was

_{accomplished by simulating}

different residual variances in

_F

_I

and

F

2 .

Four situations were

_{investigated, combining}

all combinations of

_{non-normality}

in

_F

₂

and increased variance in

_F

₂

_{(table I).}

In

_general,

500

_F

_i

and 1000

_F

₂

observations

were simulated. For situation

3,

data sets with 1000

F

I

and 1000

F

2

observations

were also

_{investigated.}

Data for situations 1 and 3 were simulated

_by

model

(3],

(9)

RESULTS

Type

I error and

_parameter

estimates under the null

_hypothesis

Estimated

type

I _errors,based on 1 000

replicates,

have been

_given

in table II for

different sizes of the data set. Estimates

decreased,

and more or less stabilised when

the size of the data set exceeded 1 000

observations,

especially

for a nominal level

of

_10%,

which were most accurate. For these

_large

data sets,

however,

the

type

I

errors were too low

_(P

<

_0.01),

which means that critical values obtained from a

X

’2

distribution would

_provide

a too conservative test. For

_example,

_application

of the

X2

95-percentile

to data sets with 1 000 observations will not result in the

_expected

type

I error of

_5%,

but rather in a

_type

I error of x5

3%.

When no

_major

_geneeffect was

_present,

stil on _averagea considerable effect

could be found. Parameter estimates for the

_major

gene model have been

given

in

table

_III,

_simulating

_just

a

_normally

distributed error effect with variance 100. The

empirical

standard deviation for estimated t-values

_ranged

between

_7(N

=

100)

and

_5(N

=

2000) (not

_in

_table).

The

average estimate for t is therefore

biased,

and many of the individual estimates were

_{significantly}

different from zero if a t-test was

applied.

The average estimated d is

0.5,

which is

_expected

because the simulated

distribution was

_symmetrical.

Parameter estimates and _powerof the test

Results for the different situations studied under a

_major

_genemodel are in table IV.

The

_{x) 95-percentile}

was used as critical value for the test. The power reached over

95%

for additive effects

_(d

=

0.5)

with _at-value of

40,

which is 4 a

(residual

standard

_deviations).

For

_completely

dominant effects

_(d

=

1),

100%

power was

reached for an effect of t = 20

(2a).

Phenotypic

distributions for these 2 _{cases are}

unimodal,

although

not normal

_(fig

_1).

For small

_genetic

effects

_(t

!

10,

ie

₁

_Q

₎

t was

_{overestimated,}

in

_particular

when

(10)

d = 1 and was underestimated for d = 0.5. For d =

_0.5,

average estimates for t and d differed from the simulated values

_by

<

1%

when the power reached near

100%.

For d =

_{1, however,}

the bias in t was still

10%

when the

power had reached

100%.

This bias reduced

_gradually,

and was <

1%

for a

_genetic

effect of t = 40.

In

_figure

_{2 power}of the test is

_depicted

for

_varying

sizes of the data set. Two additive effects were

chosen,

with t = 25 and t = 35. Each

point

in the

figure

is on _averageof 100

replicates.

The _powerincreased with

increasing

number of

observations.

_Increasing

the number of observations > _{1000 gave}

_relatively

less

(11)

of observations this

_graph

is

_expected

to level off at the

_type

I error

_(nominally

5%),

but

_sampling

makes results somewhat erratic.

Robustness when

_{ignoring polygenic}

variance

Data

_following

model

_[2]

were simulated with d = 0.5 and _t= 35 and different

proportions

of

_polygenic

and residual variance. The data set contained 20 sires with 5 dams each and 10

_offspring

per

dam;

each situation was

repeated

100 times.

Estimated _parametersand

_resulting

_powerare in table V. Parameter estimates for

t and

d,

and the power of the test were not affected when a

_part

of the variance was

_polygenic.

The total estimated variance was

_equal

to the sum of simulated

variances.

Robustness when

_{ignoring segregation}

in the

parental

lines

Data

_following

model

_[1]

were simulated with d =

_0.5,

_{t =}

_35,

Q

2

= ₁₀₀and various

values for

_fp

and

_fm.

The

_genotype

_{probabilities}

in

_parents

_(F

₁

₎

and

_offspring

_(F

₂

₎

have been

_given

in table VI. For the first 3

_situations,

_genotype

_{probabilities}

in the

F

j

were

1/=1, 1/2

and

1/4,

as assumed under the fixation

_assumption.

For the

(12)

(13)

frequency

was not 0.5 on _average.

_High

_{average allele}

_frequencies

were

_simulated,

but because

_only

additive effects are

_considered,

results are

_equally

valid for low

allele

_frequencies.

The power remained

equal,

as

_long

as

_genotype

probabilities

in

F

2

remained

_{1/4, 1/2}

and

_1/4,

and parameter estimates are unbiased

_{(table VII).}

In case the allele

_frequency

did not _average

_{0.5, however,}

_parameter

estimates were

biased. The _powerof the test

_increased,

because in this situation the distribution

became skewed. The situation with d = 0.5 and t = 35 for data where the gene is fixed in

_parental

lines

_(table

_IV),

with a _powerof

82%,

_mayserve as a reference.

Inclusion

_{of F,}

data

Five

_hundred,

or

_{1 000,}

F

1

observations were also

_simulated,

with additive

_major

gene effects

(table VIII).

With no

_major

_geneeffect

_(t

= 0 and hence

a

7 71 2

=

0),

and

with

_equal

variances in

_F

_i

and

F

2 (situation 1)

the _averageestimated t was much

smaller than in the model

_{using only}

F

2

data

_{(table III).}

In the second situation

(14)

given

major

gene variance of 50. When

using

only

data,

F

2

the test had a _powerof

only

12%

for detection of an additive effect of t = 20

_{(table IV).}

When

_including

F

i

data, however,

the _powerwas

100%

(table VIII).

From the situations 3 and 4 considered in table

_{VIII, however,}

it becomes

_apparent

that when

F

I

data were

included,

the

_major

_genewas detected

_{only by}

its effect on

_variance,

_considering

a

power near the

_type

I error rate as irrelevant. When the variance in

_F

₂

increased

by

50%,

but when in fact no

_major

_genewas

_present,

a

_major

_genewas found in

100%

of the cases. For smaller increases of the variance

(10%)

_major

_geneswere still

detected,

and the

_probability

of detection increased with the size of the data set

(alternative

3* with more

_F

_i

observations).

A

_major

_genewas

_{totally undetectable,}

on the other

_hand,

when the total variance in

F

i

was

equal

to the total variance in

F

2 (situation

4).

This shows that the

_ability

to detect a

_major

_genecan even be

worsened when

_F

_I

data are included. If

_only

_F

₂

data were used a

_major

_genewith

similar effect was detected in

12%

of the cases

_{(table IV).}

DISCUSSION AND CONCLUSIONS

Type

I error

Nominal levels for

_type

I errors were based on Wilks

₍₁₉₃₈₎

who

_proved

_asymptotic

convergence of the likelihood ratio test statistic to

_a

_X

₂

distribution.

_Type

I errors

decreased and stabilised for

_larger

data

_sets,

as

_expected.

The estimated

_type

I

errors,

however,

were

_{significantly}

too low. It is

_unlikely

that the

_type

I error,

after

_having

first

_decreased,

would increase for even

larger

data sets as studied

here. It can be concluded

_therefore,

that

_type

I errors are

significantly

lower than

expected

in the

_asymptotic

_case,and that for

_large

data sets the likelihood ratio

test is conservative. It has been

_investigated

whether the constraint used on the

dominance coefficient could have caused the too low

_type

I errors.

_However,

this was not the case, because even with no

_constraint,

too low

_type

I errors were found

(15)

For the

_{investigation}

of power we have chosen to use the theoretical

asymptotic

quantiles, although they

were shown to

_give

a conservative test. The nominal level for the

type

I error is then an _upper

_bound,

and the

experimenter

still has a

reasonably

good

idea of the risk of

_making

a

_type

I error. When the actual

_type

I error would be above the

_{expected level, however,}

the test would become of less

use.

A second reason for still

_using

theoretical

_{asymptotic quantiles}

is that

_adapting

the test is difficult and of little

_practical

use. A

difficulty

_is,

for

_instance,

that

estimated

_quantiles

would be

_subject

to

_sampling

and the obtained

_point

estimate is therefore

_{only expected}

to

_give

the correct test.

_Therefore,

2

_{experimenters}

investigating

the same

_test,

will find different critical values and the test

_applied

will

depend

on the

_{experimenter.}

Also in

_practice

such a

procedure

would be difficult

to

_apply

since the calculated

_quantile

would

_only

hold for the same model and data

sets of similar size and structure.

Power of the test

Using

only F

2 data,

the power of this test was poor for additive effects

(dominance

coefficient =

0.5).

This can be

_{explained by}

the

_resulting

_symmetrical

distribution

which is similar to the distribution under

_H

_o

_.

In this case, the

genetic

effect has

to be about 4Q to be

_detectable,

which

_corresponds

to a

_heritability

of 0.67

in

the

F

2 generation.

When the dominance coefficient is

_1,

an effect of 2Q was detectable.

These results are based on data sets with 1000

observations,

but it was shown that

the power decreased

dramatically

for smaller data sets.

Power increased when

_F

_I

data was included in the

_analysis,

and additive effects of 2a could be detected. In that case the increase in variance in

F

2 ,

caused

by

the

_major

_gene,was taken as an

_important

indication for the _presenceof a

_major

gene. The power to detect a

_major

_{gene in}

F

2

data _mayalso increase if alleles were

not fixed in the

_{parental lines,}

or

_{alternatively F}

₃

_,

instead of

_F

₂

_,

data were used. This

corresponds

more to the situation in a usual

_population,

where

_{between-family}

variation will arise. For

_F

₃

_data,

for

_example,

when _purelines were

_homozygous,

the allele

_frequency

will be

_0.5,

and

_parents

will be in

_{Hardy-Weinberg equilibrium.}

For such a

_situation,

Le

_Roy

₍₁₉₈₉₎

found a _powerof

25%

for an additive effect of 2u in a data set of 400 observations

₍₂₀

sires with 20 half-sib

_{offspring each).}

In

_figure

_2,

_{the power}for a data set of similar size can be seen to be

_{only !}

10%

for an even

_larger

effect of

2.5!(t

=

25).

This indicates that _anincrease in power

may

be

expected

when the

F

3 generation

is

observed,

despite

the facts that more

parameters

have to be

estimated,

and that

_parents’

_genotypes

are no

_longer

known.

The power for detection of a

_major

_geneis related to the

_unexplained

variance

in the model of

_analysis.

The inclusion of fixed and

_polygenic

effects will therefore make the

_major

_geneeasier to

_{detect, provided}

that all these effects can be

accurately

estimated.

Parameter estimates

For additive effects simulated

_(d

=

0.5),

bias for the average estimated

genetic

effect

(16)

For dominant effects

_(d

=

1),

however,

_t_wasoverestimated

by

10%

when the power

for detection of a

_major

_genereached

100%.

This overestimate is

_probably

related to

the underestimate for

_d,

which resulted from the

_applied

constraint. As

_mentioned,

this constraint was

_applied

to

_prevent

t from

_going

to _zero,at which

_point

d tended

to go to

_infinity.

When such a constraint was not

_applied

_with,

for

_instance,

an

effect of t = 10 and d =

_1,

analyses

gave in 100

replicates

an _{average estimated}d

of 2.93. This is an _averageoverestimate of !

200%.

The _averageestimate

_using

the

constraint was

_0.93,

_showing

that indeed better estimates were obtained under the

restriction,

even when the true value was on the border of the allowed

_parameter

space. In

practice,

of _course,overdominance cannot be excluded and

parameter

estimates could be

_compared

with and without this constraint. A

small,

near _zero,

estimate for t and a

_large

estimate for d would

_suggest

a

possible

overestimation

of d.

For very small or absent

_effects,

the ML estimates were

considerably

biased.

In this

_situation,

the

_{asymptotic properties}

of ML

_estimates,

ie

_consistency,

are

far from

_being

attained. In the absence of a

_major

_{gene, average estimates}were

presented

for

_increasing

size of the data set. This showed that the average estimate

decreased,

and will

_probably

reach the true value when the number of observations is

very much

larger.

Bias of ML estimates in finite

_samples

also resulted in

_significant

t-values when no effect was _present.This indicates that the _presenceof a

_major

_gene should not be

_judged

_by

the estimates and their standard errors. The standard

errors discussed here were

_empirical

standard errors. In

_practice

such standard errors will have to be obtained

_using

the inverse of an estimated Hessian

_matrix,

or some other

_{quadratic approximation}

of the likelihood curve near the

optimum.

Using

the estimated Hessian

_matrices,

we found

_roughly

the same standard _errors,

although they

were not very accurate. In our

study,

the

quasi-Newton algorithm

was started close to the

_optimum

and not

_enough

iterations are then carried out to

estimate the Hessian matrix

_accurately.

Robustness of model and test

Inclusion of

F

l

data results in a

_poorly

robust test when differences in variances would arise between the

_F

_i

and

_F

₂

due to other causes than a

_major

_{gene. An}

increase in variance from

_F

_i

to

_F

₂

can result in a

_{putative major}

_gene

being

detected. An increase in variance of

_10%,

for

_instance,

_gave

25%

false detections when 1000

_F

_i

and 1000

_F

₂

observations were combined. Such increases are not

unlikely,

due

_to,

for

_instance,

_polygenes.

The

_major

_genetest is then

_merely

a

test for

_homogeneous

variance in

_F

₁

and

_F

_z

_.

The inclusion of

_F

_l

data could also

worsen the detection of a

_major

_gene,when the environmental variance in

F

2

was

less. Therefore any differences in

variance,

due to other causes than the

_major

gene

effect,

will bias the

_parameter

estimates. Also in a model that allows for

_segregation,

such biases will remain.

’

It was shown that the model is robust when

_polygenic

effects were

_ignored.

This can be

_explained

_by

the fact that the test uses

_only

the

_{non-normality}

of the

distribution as a criterion. It must be noted however

_that,

when

_polygenic

effects

can be

_{accurately estimated,}

_including

a

_polygenic

effect in the model will increase

(17)

Another

_aspect

of robustness concerns the

_assumption

of fixed alleles in

_parental

lines. It was shown that

_parameter

estimates were not _{biased when alleles}

segre-gated,

as

_long

as the average

_frequency

in the 2 lines was 0.5. In that case the

as-sumed

_fitting

_proportions

_{1/4, 1/2}

and

_1/4

are still correct. _{If the average}

_frequency

in

_parental

lines differed from

_0.5,

t was underestimated

and,

because skewness was

introduced,

estimates for d deviated from 0.5. This second situation is more

_likely

to occur than the situation where the _average

frequency

is

exactly

0.5. Because

it could be difficult to

_justify

the fixation

_assumption

a

_priori,

_application

of a

more

_general

model that allows for

_segregation

in

_parental

lines

_might

have to be considered.

A final

_aspect

of robustness concerns

_{non-normality}

of the distribution not due

to a

_major

gene. As

stated earlier a mixture distribution is fitted and the detection

of a

_major

_{gene in}

F

z

data,

_assuming

fixation,

relies

_solely

on the

_{non-normality}

caused

_by

the

_major

_{gene. This}means that in fact

_only

a

_significant

_{non-normality}

is proven. The method would therefore be

poorly

robust

against

any

non-normality

due to another cause. The robustness

_might

be

_{improved using}

data in which

alleles

_segregate

in

_parents.

This is

_guaranteed

in

_F

₃

_data,

but _mayalso arise in

F

a

data,

when alleles were not fixed in

_parental

lines. If

_segregation

in

_parents

is the case, evidence for a

_major

_{gene is}no

_{longer only}

in the

_{non-normality}

of the overall

distribution,

but also for instance in

_{heterogeneous}

within

_family

variances. Therefore a model that allows for

_segregation

is not

_{only preferred}

to increase

power, but is also

preferred

to

_improve

robustness.

ACKNOWLEDGMENTS

This research was

_supported

_financially

_by

the Dutch Product Board for Livestock and

Meat,

the Dutch

_Pig

Herdbook

_Society,

Bovar

BV,

Euribrid

BV,

Fomeva BV and the VOC Nieuw-Dalland BV. Profs

_Brascamp

and Grossman and Drs Van Arendonk and Van Putten are

_acknowledged

for

_helpful

comments and editorial

_suggestions.

Comments of 2 referees have

_helped

to shorten the

manuscript

and to

_improve

the discussion section.

REFERENCES

Elston

_RC,

Stewart J

₍₁₉₇₁₎

A

_general

model for the

_genetic

_analysis

of

_pedigree

data. Hum Hered

_21,

523-542

Elston

_RC,

Stewart J

₍₁₉₇₃₎

The

_analysis

of

_quantitative

traits for

_{simple genetic}

models from

_parental,

F

1

and backcross data. Genetics

73,

695-711 1

Elston RC

₍₁₉₈₄₎

The

_{genetic analysis}

of

_quantitative

trait differences between two

homozygous

lines. Genetics

_108,

733-744

INISL

₍₁₉₈₄₎

Library Reference

Manual Edition 9.2. International and Statistical

Libraries, Houston,

TX ,

Le

_Roy

P

₍₁₉₈₉₎

Methodes de detection de

_genes

_majeurs;

_application

aux animaux

domestiques.

Doctoral

Thesis,

Universite de

Paris-Sud,

Centre

_d’Orsay

Morton

_NE,

MacLean CJ

₍₁₉₇₄₎

_Analysis

of

_family

resemblance III.

_Complex

segregation

of

_quantitative

traits. Am J Hum Genet

_26,

489-503

Wilks SS

₍₁₉₃₈₎

The

_large

_sample

distribution of the likelihood ratio for