HAL Id: hal-00893978
https://hal.archives-ouvertes.fr/hal-00893978
Submitted on 1 Jan 1993
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of
sci-entific research documents, whether they are
pub-lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destinée au dépôt et à la diffusion de documents
scientifiques de niveau recherche, publiés ou non,
émanant des établissements d’enseignement et de
recherche français ou étrangers, des laboratoires
publics ou privés.
Inference about multiplicative heteroskedastic
components of variance in a mixed linear Gaussian
model with an application to beef cattle breeding
M San Cristobal, Jl Foulley, E Manfredi
To cite this version:
Original
article
Inference about
multiplicative
heteroskedastic
components
of variance
in
a mixed
linear
Gaussian
model with
anapplication
to
beef cattle
breeding
M
San
Cristobal
JL
Foulley
E Manfredi
1
INRA, Station de Genetique Quantitative et Appliquée,
78352 Jouy-en-Josas Cedex;
2
INRA, Station d’Amelioration G6n6tique des Animaux, BP, 27, 31326 Castanet-Tolosan Cedex, France
(Received
28 April 1992 ; accepted 23 September1992)
Summary - A statistical method for identifying meaningful sources of heterogeneity of residual and genetic variances in mixed linear Gaussian models is presented. The method is
based on a structural linear model for log variances. Inference about dispersion parameters is based on the
marginal
likelihood after integrating out location parameters. A likelihood ratio test using themarginal
likelihood is also proposed to test forhypotheses
about sources of variation involved. A Bayesian extension of the estimation procedure of the dispersion parameters is presented which consists of determining the mode of theirmarginal
posterior distribution using log inverted chi-square or Gaussian distributions as priors. Procedurespresented in the paper are illustrated with the analysis of muscle development scores
at weaning of 8575 progeny of 142 sires in the Maine-Anjou breed. In this
analysis,
heteroskedasticity is found, both for the sire and residual components of variance.
heteroskedasticity / mixed linear model / Bayesian technique
R.ésumé - Inférence sur une hétérogénéité multiplicative des composantes de la
variance dans un modèle linéaire mixte gaussien: application à la sélection des
bovins à viande. Une méthode statistique est présentée, capable d’identifier les sources
significatives d’hétérogénéité de variances résiduelles et génétiques dans un modèle linéaire mixte gaussien. La méthode est
fondée
sur un modèle structurel de décomposition dulogarithme des variances. L’inférence concernant les paramètres de dispersion est basée sur la vraisemblance marginale obtenue après intégration des paramètres de position. Un
*
Correspondence
and reprints**
test du rapport des vraisemblances utilisant la vraisemblance marginale est aussi proposé
afin de tester des hypothèses sur différentes sources de variation. Une extension bayésienne
de la procédure d’estimation des paramètres de dispersion est présentée; elle consiste en la maximisation de leur distribution marginale a posteriori, pour des distributions a priori
log x
2
inverse ou gaussienne. Les procédures présentées dans ce papier sont illustrées parl’analyse de notes de pointages sur le développement musculaire au sevrage de 8 575 jeunes veaux de race Maine-Anjou, issus de 142 pères. Dans cette analyse, une hétéroscédasticité a été trouvée sur les composantes père et résiduelle de la variance.
hétéroscédasticité / modèles linéaires mixtes / techniques bayésiennes
INTRODUCTION
One of the main concerns of
quantitative geneticists
lies in evaluation of individualsfor selection. The statistical framework to achieve that is
nowadays
the mixed linearmodel
(Searle, 1971),
usually
under theassumptions
ofnormality
andhomogeneity
of variances. The estimation of the location parameters is
performed
with BLUE-BLUP(Best
Linear UnbiasedEstimation-Prediction),
leading
to the well-known Mixed ModelEquations
(MME)
of Henderson(1973),
and REML(acronym
for REstricted -or REsidual- MaximumLikelihood)
turns out to be the method ofchoice for
estimating
variance components(Patterson
andThompson,
1971):
However,
heterogeneous
variances are often encountered inpractice,
eg for milkyield
in cattle(Hill
etal,
1983; Meinert etal,
1988;Dong
andMao, 1990;
Visscher etal,
1991;Weigel,
1992)
for meat traits in swine(Tholen,
1990)
and forgrowth
performance
in beef cattle(Garrick
etal,
1989).
Thisheterogeneity
ofvariances,
also calledheteroskedasticity
(McCullogh, 1985),
can be due to manyfactors,
egmanagement
level,
genotype x environment interactions,segregating major
genes,preferential
treatments(Visscher
etal,
1991).
Ignoring heterogeneity
of variance may reduce thereliability
ofranking
andselection
procedures
although,
in cattle for instance, dam evaluation islikely
tobe more affected than sire evaluation
(Hill,
1984;Vinson,
1987; Winkelman andSchaeffer,
1988).
To overcome this
problem,
3 main alternatives arepossible.
First,
atransfor-mation of data can be
performed
in order to match the usualassumption
ofho-mogeneity
of variance. Alog
transformation wasproposed by
several authors inquantitative
genetics
(see
eg Everett andKeown,
1984; De Veer and VanVleck,
1987; Short et
al, 1990,
for milkproduction
traits incattle).
However,
while ge-netic variances tend tostabilize,
residual variances oflog-transformed
records arelarger
in herds with the lowestproduction
level(De
Veer and VanVleck,
1987; Boldman andFreeman,
1990; Visscher etal,
1991).
’The second alternative is to
develop
robust methods which are insensitive tomoderate
heteroskedasticity
(Brown, 1982).
The last choice is to take
heteroskedasticity
into account. Factors(eg
region,
herd,
year,parity,
sex)
toadjust
forheterogeneous
variances can be identified. Butsuch a stratification generates a very
large
number of cells(800 000
levels of herd x year in the French Holsteinfile)
with obviousproblems
ofestimability.
Hence,
via a
modelling
(or structural)
approach
so as to reduce the parameter space,by
appropriate
identification andtesting
ofmeaningful
sources of variation of suchvariances.
The model for the variance components is described in the Model section. Model
fitting
and estimation of parameters based onmarginal
likelihoodprocedures
arepresented
in the Estimationof Parameters,
followedby
a test statistic inHypothesis
Testing.
ABayesian
alternative to maximummarginal
likelihood estimation ispresented
in ABayesian
Approach
to a Mixed Model Structure In the Numericalapplication
section,
data on French beef cattle areanalyzed
to illustrate theprocedures
given
in the paper.Finally,
some comments on themethodology
aremade in the Discussion and Conclusion.
MODEL
Following Foulley
et al(1990,
1992)
and Gianola et al(1992),
thepopulation
isassumed to be stratified into I
subpopulations,
or strata(indexed
by
i =1, 2, ... ,
I)
with an
(n
i
x1)
data vector yi,sampled
froma
normal distributionhaving
meani ii and variance
R.
i
=a2 ei I&dquo;
i
.
Given ii
i
andR
i
Following
Henderson(1973),
the vector IIi isdecomposed according
to a linearmixed model structure:
where
i
X
and Z; are(n
i
x p)
andx q
(n
i
i
)
incidencematrices,
corresponding
to fixedJ3 (p
x 1 )
and randomu
i
(q
i
x 1 )
effectsrespectively.
Fixed effects can be factors orcovariates,
but it is assumed in thefollowing
that,
without loss ofgenerality, they
represent factors.In the animal
breeding
context, ui is the vector ofgenetic
meritspertaining
tobreeding
individuals used(sires
spread by
artificialinsemination)
or present(males
and
females)
in stratum i. These individuals are related via the so-called numeratorrelationship
matrixA
i
,
which is assumed known andpositive
definite(of
rankq
i
).
Elements of ui are not
usually
the same from one stratum to another. A borderline case is the &dquo;animal&dquo; model((auaas
andPollak,
1980)
where animalswith records are
completely
different from one herd to another.Nevertheless,
such individuals aregenetically
related across herds.Therefore,
model[3]
has tobe refined to take into account covariances among elements of different
u!s.
Asproposed by
Gianola et al(1992),
this can beaccomplished by relating
Ui to awith A
being
the overallrelationship
matrix of rank q,relating the
q breeding
Ianimals involved in the whole
population,
with q xL
qj.i=l
Thus,
S
i
is an incidence matrix with 0 and 1 elementsrelating the q
i
levels ofu
* present in the ith
subpopulation
to the whole vector(q
x1)
of u elements. Forinstance, if stratification is made
by
herdlevel,
the matricesS
i
andS
i’
(i !
i’)
do not share any non-zero elements in theircolumns,
since animalsusually
haverecords
only
in one herd. On the contrary, in a siremodel,
agiven
sire k may haveprogeny in 2 different herds
(i,
i’)
thusresulting
in ones in both kth columns ofS
i
and
Si.
Notice that in this
model,
any genotype x stratum interaction is dueentirely
toscaling
(Gianola
etal,
1992).
Formulae
[2],
(3!,
[4]
and[5]
define the model for means; a further step consistsin
modelling
variance components{!e! !i=1,...1
and
{Q!. },!=1,...t
in a similar way,ie
using
a structural model.’ ’
The
approach
taken here comes from thetheory
ofgeneralized
linear modelsinvolving
the use of a link function so as to express the transformed parameters with a linearpredictor
(McCullagh
andNelder,
1989).
For variances, a commonand convenient choice is the
log
link function(Aitkin,
1987;
Box andMeyer, 1986;
Leonard,
1975;
Nair andPregibon,
1988):
where
wey
andw’ .
are incidence row vectors of sizek
e
andk
u
,
respectively,
corresponding
todispersion
parameters fg and !u. These incidence vectors canbe a subset of the factors for the mean in
(2!,
but exogeneous information is alsoallowed.
Equations
[6]
and[7]
define the variance component models. These models can be rewritten in a more compact form as follows.Let
y =
(y!,..., y!,...,
y’)’
be the n x 1 vector of data for the wholepopulation,
I
with n
= ! ni,
i=lIIi
xil
3
+
0&dquo;&dquo;izisiu*
11
=
(II!,... ,11:,... ,
ll
’)’
be the mean vector of y,I
R
= ®
Ri be the variance-covariance matrix of y, with ?representing
thei=l
direct sum
(Searle, 1982).
Equation
[1]
can then be rewritten as:In the same way,
[2]
becomes:X!
the(n
i
xp)
incidence matrix defined in!2J;
Z= (Z1 ,...,ZZ ,...,ZI ) ,
Z!
=o,,,iZ
i
S
i
the(n
i
xq)
&dquo;incidence&dquo; matrixpertaining
tou
*
,
T = (X, Z
*
)
and0 =
Q3’,
U
*
’ I’.
The vector 0 includes p + q location parameters. The matrix T can be viewed as an &dquo;incidence&dquo;
matrix,
but whichdepends
here on thedispersion
parameters Tu
through
the variancesQ
ua.
Both variance models can also be
compactly
written as:The ke+ ku
dispersion
parameters !e and y! can be concatenated into a vector(
T
=(T!, T!)’
withcorresponding
incidence matrix W =W
e
EÐ W
u’
Thedispersion
model then reduces to:
where
a
2
=(CF e 2&dquo; cF! 2’ )’
and1n a
2
is
asymbolic
notation for(In a;
1
’
...Inaejl 2
In a!1 ’ , .. , In a![)’.
ESTIMATION OF PARAMETERS
In
sampling theory,
a way to eliminate nuisance parameters is to use themarginal
likelihood
(Kalbfleisch, 1986). &dquo;Roughly
speaking,
thesuggestion
is to break the data in two parts, one, part whose distributiondepends only
on the parameter ofinterest,
and another part whose distribution may welldepend
on the parameterof interest but which
will,
inaddition, depend
on the nuisance parameter.!...!
This second part
will,
ingeneral,
contain information about the parameter ofinterest,
but in such a way that this information isinextricably
mixed up withthe nuisance
parameter&dquo;
(Barnard, 1970).
Patterson andThompson
(1971)
usedthis
approach
forestimating
variance components in mixed linear Gaussian models. Their derivations were based on error contrasts. Thecorresponding
estimator(the
so-called
REML)
takes into account the loss indegrees
of freedom due to theAlternatively,
Harville(1974)
proved
that REML can be obtainedusing
thenon-informative
Bayesian paradigm.
According
to the definition ofmarginalization
inBayesian
inference(Box
andTiao, 1973;
Robert,
1992),
nuisance parameters areeliminated
by
integrating
them out of thejoint posterior
density.
Keeping
in mind that thesampling
and the non-informativeBayesian approaches
give
rise to the same estimationequations,
we have chosen theBayesian techniques
for reasons of coherence and
simplicity.
The parameters of interest are here the
dispersion
parameters r, and the locationparameters 6 appear to be nuisance parameters. Inference is hence based on the
log marginal
likelihoodL(
T
; y)
of r:
An
estimator y
of T isgiven
by
the mode ofL(
T
;
y):
where r is a compact part of
R
k
e
+k
u.
This maximization can be
performed using
a resultby
Foulley
et al(1990, 1992)
which avoids theintegration
in[13].
Details can be found in theA
PP
endix.
Thisprocedure
results in an iterativealgorithm. Numerically,
let[t]
denote the iterationt; the current estimate
9
[Hl]
of r
iscomputed
from thefollowing
system:where
i
lt]
is the current estimate at iteration t, W the incidence matrix defined in!12!,
Q
M
is theweight
matrixdepending
on0
and onê
[t]
,
which are the solution andthe inverse coefficient matrix
respectively
of the current system in 0(this
system is describednext),
z!
is the score vectordepending
on6
andC!.
Elements of
Q
l
’
l
andi
lt
)
aregiven
in theAppendix.
The second system is:
where
i
[t]
is the &dquo;incidence&dquo; matrix T defined in[9]
and evaluated at T =y
[t]
;
ft- 1
1
’]
is theweight
matrix evaluated at T= y[
t]
,
with R defined as in[8];
E-C
0
0
1
)
and takes into account theprior
distribution of u* in!5!.
Regarding
computations
involved in!15!,
2 types ofalgorithms
can be considered as in San Cristobal(1992).
A second orderalgorithm
(Newton-Raphson type)
convergesrapidly
andgives
estimates of standard errors ofy,
butcomputing
timecan be excessive with the
large
data setstypical
of animalbreeding problems.
As shown inFoulley
et al(1990),
a first orderalgorithm
can beeasily
obtainedby
approximating
the(a
matrix in[15]
by
its expectation component(Qa!,E
in theappendix
notations).
This EM(Expectation-Maximization;
Dempster
etal,
1977)
algorithm
converges moreslowly,
but needs fewer calculations at each iterationand,
on the
whole,
less total CPU time forlarge
data sets.HYPOTHESIS TESTING
An
adequate modelling
ofheteroskedasticity
in variance componentsrequires
aprocedure
forhypothesis
testing.
LetH
o
:
H !
= 0 be the nullhypothesis
withH
being
a full(row)
rank matrix with row sizeequal
to the number oflinearly
independent
estimable functions of Tdefining
H
o
,
andH
1
its alternative. Forexample,
one can be interested intesting
thehypothesis
ofhomogeneity
of residualvariances
H
o
:
u2 e
i
=exp
(-y,,)
= Const for all i.Letting
Ye =
f
7
R
,
&dquo;fe2-’7R. - - -, Yel&dquo;f
R
}
f
with &dquo;fRbeing
thedispersion
parameter for the residual variance in the firststratum taken as reference.
H
o
can beexpressed
ase
e
r
H
= 0, or(H
e
, 0
h
= 0with
He
=(O(
I
-I
)x
l
,,
I
-1
).
Let
M
o
andNft
be the modelscorresponding
toH
o
andH
1
,
respectively.
SinceP(
YIT
)
=e u
themarginal
likelihood can beinterpreted
as a likelihoodof error contrasts
(Harville, 1974),
hence the likelihood ratio test based on themarginal
likelihood can beapplied:
Under
H
o
, A
isasymptotically
distributedaccording
toa
X
2
withdegrees
of freedomequal
to the rank of H. In the normal case,explicit
calculation ofL(
T
;
y)
is
analytically
feasible:A BAYESIAN APPROACH TO A MIXED MODEL STRUCTURE
One can be interested to
generalise
Henderson’s BLUP for subclass means(
11
=T9)
to
dispersion
parameters(ln a
2
=W7 )
ieproceed
asif
T
had a mixed modelstructure
(Garrick
and VanVleck,
1987).
To overcome thedifficulty
of a realisticinterpretation
of fixed and random effects forconceptual populations
of variancesfrom a
frequentist
(sampling)
perspective,
one canalternatively
useBayesian
procedures.
It is then necessary toplace
suitableprior
distributions ondispersion
In linear Gaussian
methodology,
theoretical considerationsregarding conjugate
priors
or fiducial arguments lead to the use of the inverted gamma distribution as aprior
for a variancea
2
(Cox
andHinkley,
1974;
Robert,
1992).
Such adensity
depends
onhyperparameters
77 ands
2
.
The former conveys the so-calleddegrees
of
belief,
and the latter is a location parameter. The ideasbriefly exposed
in thefollowing
are similar to those described inFoulley
et al(1992).
Hence,
aprior
density for y
=ln
Q
2
can
be obtained as alog
invertedgamma
density.
As a matter offact,
it is moreinteresting
to consider theprior
distribution of v =&dquo;y —
T
°,
withq°
= Ins
2
,
iewhere
r(.)
refers to the gamma function.Let us consider a K-dimensional &dquo;random&dquo; factor v such that
Vk
1
77k
(k
= 1, ...K)
is distributed as a
log
inverted gamma’
)
1]k
(
l
InG-
Since the levels of each randomfactor are
usually exchangeable,
it is assumed that 1]k = 1] for every k in{1, ... K}:
For v
k
in[20]
smallenough,
the kernel of theproduct
ofindependent
distributionshaving
densities as in[19]
can beapproximated (using
aTaylor expansion
of[19]
about v
equal
to0)
by
a Gaussiankernel,
leading
to thefollowing
prior
for v:As
explained by Foulley
et al(1992),
thisparametrization
allowsexpression
of the T vector ofdispersion
parameters under a mixed model type form.Briefly,
from
[19]
onehas
1
=1
°
+ vor
1 =
P
’oS
+ v if one writes the locationparameter
-to
= InS
2
as a linear function of some vector 8 ofexplanatory
variables(p’
being
a row incidence vectorof coefficients).
Extending
thiswriting
to several classificationsin v leads to the
following general
expression:
where P and
Q
are incidence matricescorresponding
to fixed effects E and random effects v,respectively,
with[20]
or[21]
asprior
distribution for v.Regarding dispersion
parameters T, it is thenpossible
toproceed
as Henderson(1973)
did for location parameters 11, ie describe them with a mixed modelstructure.
Again,
as illustratedby
formula[22],
the statistical treatment of thismodel can be
conveniently implemented
via theBayesian
paradigm.
In
fact,
equations
[22]
define a model on residual variances:where
Pe, Pu, Qe, Qu
are incidence matricescorresponding respectively
to fixed effects5
e
, <!
and random effects ve =(v!, Ve2, ... ,
v!,...)’,
Vu =(v!, V
u2,
’
’ ’ ,
vu,
;
...)!
with,
for thejth
and kth random classification in ve and vurespectively,
Let 11 =
(11!, 11!)’
with 11e ={77ej}
and
1 1
u =
{77Uk}
be the vectors ofhyperparameters
introduced in the variance component models[23], [24],
[25]
and[26].
Anempirical Bayes procedure
is chosen to estimate the parameters. Thehyperparameters,
11(or § =
(!e, !u)’)
are estimatedby
the mode of themarginal
likelihood of these
hyperparameters
(Berger,
1985; Robert,
1992):
Then,
thedispersion
parameters are obtainedby
the mode of theposterior
density of
T
given
thehyperparameters equal
to their estimates:or
similarly
for
t.
Maximization in
[27]
and[28]
can beperformed
with aNewton-Raphson
or an EMalgorithm, following
ideas in the Estimationof
parameters,Unfortunately,
thealgorithm
derived from[27]
iscomputationally demanding,
since it involvesdigamma
andtrigamma
functions. On the otherhand,
an EMalgorithm
derivedfrom
[28]
has the same form as the EM-REMLalgorithm
for variance components.It
just
involves the solution and the inverse coefficient matrix of the system in Tat iteration
(t).
This latter system is similar to(15),
but it takes into account theinformative prior on the
dispersion
parameters. In the case of a Gaussianprior,
thissystem can be written as
where
r
is the matrixI- (!) = ( 0
i
.
)
evaluated
at the current estimateI
of
!,
tanking
into account thepriors
viaA(!) =
Var(v’, v’)
=A
e
?
A!
withA,
=0!1!,
andA!, _
(1)
I
K.,,.
i ’ k
NUMERICAL APPLICATION
Sires of French beef breeds are
routinely
evaluated for musculardevelopment
(MD)
based on
phenotypic performance
of their male and female progeny.Qualified
personnel subjectively classify
the calves at about 8 months of age, with MDscores
ranging
from 0 to 100. Variance components and siregenetic
values arethen estimated
by
applying
classicalprocedures,
ie REML and BLUP(Henderson,
1973;
Thompson,
1979),
to a mixed modelincluding
the random sire effect and aset of fixed effects described in table I. The second factor listed in table
I,
conditionscore
(&dquo;Condsc&dquo;),
accounts for theprevious
environmental conditions( eg
nutrition viafatness)
in which calves have been raised.Some factors among those described in table I may induce
heterogeneous
variances. In
particular,
different classifiers areexpected
to generate notonly
different MD means, but different MD variances as well.
Thus,
the usual sire modelwith
assumption
ofhomogeneous
variances may beinadequate.
Thishypothesis
wastested on the
Maine-Anjou
breed. After elimination of twins and furtherediting
described in table
I,
theMaine-Anjou
file includedperformance
records on 8 575progeny out of 142 sires
(&dquo;Sire&dquo;)
recorded in 5regions
(&dquo;Region&dquo;)
and 7 years(&dquo;Year&dquo;).
Other factors taken into account were: sex of calves(&dquo;Sex&dquo;),
age atscoring
(&dquo;Age&dquo;),
claving
parity
(&dquo;Parity&dquo;),
month of birth(&dquo;Month&dquo;)
and classifier( &dquo;Classi&dquo; ).
In most strata defined as combinations of levels of theprevious factors,
only
one observation was present.Preliminary analysis
A
histogram
of the MD variable can be found infigure
1. The distribution of MDseems close to
normality,
with a fairPP-plot
(although
the use of thisprocedure
is somewhatcontroversial),
and skewness and kurtosis coefficients were estimated asnull
hypothesis,
while others did notreject
it,namely
Geary’s
u, Pearson’s tests forskewness and kurtosis
(Morice, 1972)
at the 1% level.Bartlett’s test for
homogeneity
of variances wascomputed
for each of the first8 factors described in table I. Results in table IIa indicate strong evidence for heteroskedastic variances among subclasses of each factor considered in this data set.
The usual sire model with all factors from table I in the mean
model,
and variance componentsestimated
by
EM-REML,
wasfitted,
leading
to estimates6d
=70.1l,a,2,
=6.91,
andh
2
=46fl /(6d
+
3!)
=
0.36. Note that this model isequivalent,
in ournotation,
to thehomogeneous
model in fg and Yu.Search for a model for the variances
The
following
additive mean modelM
B
was considered as truethroughout
the wholeanalysis
A forward selection of factors strategy was chosen to find a
good
variance modelMy
but in 2 stages; a backward selection strategy would have been difficult toimplement
because of thelarge
number of models to compare and the small amount of information in some stratagenerated by
those models.(i)
sincea2
represents
> 90% of the totalvariation,
it was decided to model that componentfirst, assuming
the ru- parthomogenous;
’(ii)
the &dquo;best&dquo;T
u
-model
was thereafter chosen whilekeeping unchanged
the&dquo;best&dquo;
T
e-model.
The different nested models were fitted
using
the maximummarginal
likelihoodratio test
(MLRT) A
described in!17J.
During
the first stage(i),
thehomogeneous
sire variance wasestimated,
forcomputational
ease, with an EM-REMLalgorithm,
and the Te parameter estimateswere calculated as in
Foulley
et al(1992).
This strategyleads,
of course, to thesame results as those obtained with the
algorithm
described in the Estimationof
parameters.
The first step consisted of
choosing
the best one-factor variance model from resultspresented
in table lib. The next steps, ie the choice of anadequate
2-factormodel,
and then of a 3-factormodel,
etc, are summarised in table III.Finally,
theThe model can also be
simplified
aftercomparing
estimates of factorlevels,
andthen
collapsing
these levels if there are notsignificantly
different. For the(ii)
stage, the &dquo;best&dquo;r
u
-model
was the model(see
tableIV):
We were not able to reach convergence of the iterative
procedure
for the models(Mo,
M.y
e
,
Classi)
and(Mo,
M!(,,
Region),
although
some levels of the Classi factor werecollapsed.
Thisphenomenon
is related to a strong unbalance of thedesign:
forinstance,
one classifier noted the calves ofonly
4 sires,making
quite
impossible
acoherent estimation of
Classi-heterogeneous
sire variances. The other factors(except
Year)
had nosignificant
effect on the variation of the sire variances. Because ofimbalance,
the modelgave
unsatisfactory
results egheritability
estimates greater than one.Results
Estimates of the
dispersion
parameters for the selected modeldesignated
hereas
(MB, M!.!, M!.!)
are shown in table Va. Asexpected,
theT
,-estimates
of the(Mo, M&dquo;
f
c’
homogeneity)
model,
ie of the bestr
e
-model
withonly
onegenetic
variance,
arequite
similar to the-estimates
e
T
of the(Mo,
M
7e
,
My&dquo;)
model(table
Va).
In contrast,T
e
-estimates
of the(Mo, M&dquo;
f
c’
homogeneity)
model,
with:M!c :
Classi(random)
+ Condsc + Year(random)
+ Month(random)
[35]
are different for the &dquo;random&dquo; factors
(see
tableVb).
Estimated
!e,Year
= 0.009 and!e,Month
= 0.0024respectively,
oralternatively
using
% values ofthe coefficient of variation for
ae,
(!e
CV
2
)
Class
i
e
,
CV
=14.5%, CV,,
Yar
= 9.5%and
CV
e
,
Month
= 4.9%respectively.
Infact,
the smaller the cell size(n
i
),
and thesmaller
CV,
thegreater
theshrinkage
of thesample
estimates(6
f
)
toward themean variance
(3 )
since theregression
coefficient toward this mean in theequa-tion
Q2
= õ’2
+ b(6
i
2
_
õ’2)
isapproximately
b =n
d[
n
i
+(2/CV!)]
with!7 =
2/CV
2
:
see also Visscher and Hill
(1992).
The
genetic
variation in heifers turns out to be less than one half what it is inbulls even
though
thephenotypic
variance wasvirtually
the same. This may be dueto the fact that classifiers do not score
exactly
the same trait in males(muscling)
asin females
(size and/or fatness).
It may also suggest that theregime
of male calvesis
supplemented
with concentrate.Location parameters are
compared
infigures
2a-d under differentdispersion
models,
through
scatterplots
of estimates of standardized sire merits(u
*
).
Indexesbased on &dquo;subclass means&dquo;
(V
i
=y
i
,
i = 1, ...I,
withhomogeneous
variances)
andthose based on the &dquo;sire model&dquo; under the
homogeneity
of varianceassumption
arefar away from each other
(see
fig
2a).
Figure
2a isjust
a reference ofdiscrepancy,
which illustrates the
impact
of the BLUPmethodology.
Whenheterogeneity
isintroduced among residual
variances,
sires’genetic
values do not vary toomuch,
asshown in
figure
2b.Modelling
of thegenetic
variances has alarger
impact
on the siregenetic
values(see
figure
2c)
thanmodelling
of residual variances.Finally,
theBayesian
treatment ofr
e
-parameters by introducing
random effects in the model(M
B
,
M&dquo;
YJ
does not have any influence on the siregenetic
merits(fig 3d).
Evaluation of sires can be biased if true
heterogeneity
of variance is not taken intoaccount. As shown in table
VI,
sire number 13 went down from the 16th to the 24thposition because
his calves were scoredmostly by
classifier no 1 who uses alarge
scale of notation
(see
T
-estimates
in tableV).
On the otherhand,
sire 103 went upfrom the 25th to the 14th
place
since thecorresponding
Classi and Condsc levels have low residual variance(for
the other factor levelsrepresented,
the varianceswere at the
average).
For the same reason, the siregenetic
merits were also affectedby
modelling
In ad.
The difference ingenetic
merit for sire 56(1.40 vs
1.74 under the homoskedastic and the residual heteroskedastic modelsrespectively)
is alsoexplained by
the fact that the calves of this sire were scoredexclusively by
classifier no 12 and in 1983(Year
=1).
Due tomodelling
Q
u,
this sire went downagain
(from
1.74 to 1.63 under the full heteroskedasticmodel)
because all its progenyare females with a lower
Q
u
component
than in males. Otherthings being equal,
a reduction in the
oru
variance results in alarger
ratio,
orequivalently
a smallerheritability
andconsequently
in ahigher shrinkage
of the estimatedbreeding
valuetoward the mean. In other
words,
if a decrease ingenetic
variance isignored,
siresabove the mean are overevaluated and sires below the mean are underevaluated.
Hypothesis checking
The
estimated
sire variance was7.08.
Variances of theClassi,
Year and Month factors are,After
modelling
residualvariances,
the distribution of standardized residualsbecame closer to
normality,
in terms of skewness andespecially
kurtosis. Thisphenomenon
was observed in the wholesample
and also in thesubsamples
definedby
the levels of the factor considered in re. On the otherhand,
normality
of the residuals was stable in thesubsamples
definedby
the factors absent from the re-model.
Normality
of the distribution of the standardized sire values in terms ofkurto-sis and
PP-plot
wascontinuously
damaged
at each step of the variancemodelling:
estimated kurtosis was0.61,
0.72 and0.90,
for thehomoskedastic,
residualhet-eroskedastic and
fully
heteroskedastic modelsrespectively.
Moreover,
skewness forthe 142 sire
genetic
meritsimproved
slightly during
that process: -0.09, -0.003and -0.03 for the same models
respectively.
Computational
aspectsProgrammes
were written in Fortran 77 on an IBM 3090by implementing
anEM
algorithm corresponding
to[15].
The convergence was fast: 15-20cycles
for heteroskedasticT e-models
with<7J
estimatedby
EM-R.EML((i) stage),
and 15-40cycles
forfully
heteroskedasticT
-models
or heteroskedasticT
e
-models
with randomeffects. CPU time was between 2-5 min per model fit
(estimation
of parameters andcomputation
of thelog marginal
likelihood.DISCUSSION AND CONCLUSION
This paper is an extension to u-components of variances of the
approach developed
by Foulley
et al(1992)
to considerheterogeneity
in residual variancesusing
astructural model to describe
dispersion
parameters, in a similar way asusually
done on subclass means.
In that respect, our main concern focuses on ways to render models as
parsi-monious as
possible
so as to reduce the number of parameters needed to assessheteroskedasticity
of variances. Aninteresting
feature of thisprocedure
is to assess,through
a kind ofanalysis
of variance, the effects of factorsmarginally
orjointly.
For instance, one can test
heterogeneity
of sire variances among breeds of damsbe related to a
segregating
majorgene)
can be tested whiletaking
into account theinfluence of other nuisance factors
(season, nutrition...).
However,
the power of thelikelihood ratio test for
detecting
heterogeneity
of variance can be a real issue inmany
practical
instances.From the
genetic point
of view, theapproach
isquite
general
since it can deal withheterogeneity
among within and betweenfamily
components ofvariances,
or amonggenetic
and environmental variances. Factors involved for u and e components of variance may be different or the same,making
the methodespecially
flexible. Ourmodelling
allows one to assume(or
eventest)
whether the ratios of variances orheritabilities are constant over levels of some
single
factor orcombination
of factors(Visscher
andHill,
1992).
If a constantheritability
or ratio of variances a =or
2
i/
or
2
among strata is
assumed,
the model involves the parameters y and aonly,
andreduces to
In o, ei 2
=we!re
withoru 2i
replaced
by
a;
j
0:
in the likelihood function. Theshrinkage
estimator for the variancesproposed by
eg, Gianola et al(1992),
follows the same idea of the
Bayesian
estimator described in theBayesian approach
section. When a Gaussian
prior
density
isemployed
for thedispersion
parametersY
, the
hyperparameter
6 acts
as a shrinker. But theBayesian approach
for a directshrinkage
of variance components assumes thatheterogeneity
in such components(residual
and ucomponents)
is dueonly
to one factor. Theapproach presented
in this paper is more
general
since it can cope with morecomplex
structures of stratification which may differ from one component to the other.Moreover,
its mixed model structure allows greatflexibility
toadjust
variances in relation to the amount of information for factors in themodel;
eg when dataprovide
little information for some factors
(or levels)
or considerable forothers,
ourprocedure
behaves like BLUP(or James-Stein)
ie shrink estimates ofdispersion
parameters toward zero if there is littleinformation;
only
with sufficient informationcan the estimate deviate. For
instance,
ourmethodology provides
asimple
andrational
procedure
to shrink herd variances(whatever
they
are,genetic,
residualor
phenotypic)
towards differentpopulation
values(eg
regions,
asproposed by
Wiggans
andVanRaden,
1991)
due to poor accuracy of within herd orherd-year
variances
(Brotherstone
andHill,
1986).
It then suffices to use a hierarchical(linear)
mixed model for herd
log-variances
and take thepopulation
factor( eg region)
asfixed and herd as random within that factor. An illustration of the
flexibility
andfeasibility
of ourprocedure
wasrecently
given
by Weigel
(1992)
inanalyzing
sourcesof
heterogeneous
variances for milk and fatyield
in US Holsteins.Coming
back to the case of aunique
factor of variation for the sirevariances,
one can think of asimpler model,
such as yZ!,! =m
i + uij +
e2!!(i
= 1, ...I; j
=1, ... J; k
=l, ...
n2! ),
where J.li is the mean effect of environmenti,
uij is the(random)
effect ofthe jth
sire in the ithenvironment,
such that ui ={u2!}! !
N(O,a!,A),
and e2!! is the residual effectpertaining
to the kth calf of thejth
sire in the ith environment.
Usually
in such hierarchicalmodels,
it is assumed that Cov(u
i
,
Ui’
)
= 0 if i=A
i’. On the contrary, ourmodelling procedure
via thechange
in variablesu!
=Qu! u2!
(see
!4) )
takes into account covariances amongthe same
(or
genetically
related)
sires used in differentherds,
ieCov (ui, uj, )
=a u , a u
&dquo;
A!i! (Aii!
=relationship
matrixpertaining to
ui andu
i
, )
and so recoversthe inter-block information. The loss in power in
hypothesis
testing
due toignoring
Although
thispresentation
is restricted to asingle
random factoru
*
,
it can begeneralized
to amultiple
random factor situation. If such factors areuncorrelated,
the extension is
straightforward.
When covariancesexist,
one maysimply
assume, asproposed
by
Quaas
et al(1989),
thatheterogeneity
in covariances is due toscaling.
This means, forinstance,
that in a sire(s
i
)-maternal
grand
sire(t!)
model Yijk =
X!. k 13 + Si
+t!
+e2!k,
one will modelas
h’
2
a th 2
aspreviously,
and assume thatthe covariance is Q9t,, =
pa sha t h
for stratum h. If the model is
parameterized
interms of direct ao and maternal am effects as follows
through
the transformations
i =
ao
i/
2
and t
j
=ao!/4
+a
m;
/2,
one can set thegenetic
correlation pa to aconstant, ie
a
ao
=P
a
a
aoh
O
a
,
n
-
Notice that this condition is notequivalent
tothe
previous
one, except ifa
aoh
/!d&dquo;,h
does notdepend
on h.Although
themethodology
isappealing,
attention must be drawn to thefeasi-bility
of the method. The firstproblem
is the inversion of the coefficient matrix in[16]
required
for thecomputation
of the variance system(15].
In animalbreeding
applications,
this matrix isusually
verylarge.
Thislimiting
factor isalready
becom-ing
lessimportant
due to constant progress incomputing
software and hardware.The
technique
ofabsorption
isusually
used to reduce the size of matrices to invert.Another
approach
is toapproximate
the inverse. One can, for instance, use aTaylor
seriesexpansion
of order N for a square invertible matrix Awhere the square matrix
A
o
is a matrix close to A and is, of course, easy to invert, andwhere 11
-11
denotes some norm on the space of invertible matrices. Methodsviewed in Boichard et al
(1992)
can alsohelp
toapproximate A-
1
inparticular
cases such as sparse matrices, &dquo;animal
model&dquo;,
etc.Statistical power for likelihood ratio tests was
investigated
for detection ofheterogeneous
variances in the usualdesigns
ofquantitative genetics
and animalbreeding.
Resultsgiven
by
Visscher(1992)
and Shaw(1991)
indicategenerally
low power values fordetecting heterogeneity
ingenetic
variance.According
toShaw,
a nested
design
of 900 individuals out of 100 sire familiesprovides
a power of 0.5 forgenetic
variancesdiffering
by
a factor of 2.5. Thisclearly
indicates theminimum
requirements
insample
size andfamily
numbers which should be metbefore
carrying
out such ananalysis
and the limits therein. Therefore it seemsunrealistic to model
genetic
variances inpractice
according
to more than 1 or 2factors,
and itmight
be wise to consider some of them as random if little informationis
provided
by
the data in each level of such factors.Although
statistical constraints are satisfied for estimates of the parameters ofthe model
(positive
variance estimates, intra-class correlations within[-1, + l]J,
some
genetic
constraints such as about theheritability
estimate(h
2
=4a!/(a!+ae)
for a sire
model)
ranging
within[0,1]
are notimposed by
our model. This can bedealt with
by
choosing
appropriate
prior distributions on thedispersion
parametersthat would take this constraint into account, but this
procedure
appears to beextremely complicated. Fortunately,
theunconstrained
solutions are the constrainedmaximization
procedure
under constraints must beperformed
or theposterior
distribution under the constraint can be obtained from the unconstrained
posterior
distribution
multiplied by
a corrective factor(Box
andTiao,
1973).
Thisproblem
does not occur with an &dquo;animalmodel&dquo;,
but can arise when a &dquo;sire model&dquo; isused,
and is not
specifically
related toheteroskedasticity.
From a statistical point of view, the
procedure
uses the concept of variance function(Davidian
andCarroll,
1987)
as an extension todispersion
parameters of the link function. Our presentation focuses on thelog
link function which is themost common choice in this field
(see
for instance SanCristobal,
1992,
for a review of variancemodels)
&dquo;forphysical
and numerical reasons&dquo;(Nair
andPregibon,
1988).
Following
Davidian and Carroll(1987)
orDuby
et al(1975),
thequestion
can beasked whether or not variances vary
according
to means or location parameters. Inthe
Maine-Anjou
data, however,
it does not seem to be the case, thusvalidating
our choice in
!10!.
It would be
interesting
to extend our method to afully generalized
linear mixedmodel on means and on variances with or without common parameters between
the mean model and the variance model. Numerical
integration
or Gibbssampling
procedures
would then berequired
although
approximate methods of inferencecan also be used for such models
(Breslow
andClayton,
1992;Firth,
1992).
Statistical
problems arising
with common parameters arealready highlighted by
vanHouwelingen
(1988).
With a
fully
fixed effect variancemodel,
techniques
of estimation andhypothesis
testing
fordispersion
parameterspresented
here are those of the classicaltheory
oflikelihood inference
(likelihood
and likelihood ratiotest),
except that themarginal
likelihood functionL(
Y
;y)
waspreferred
to the usual likelihoodL(13,
T;y),
in thelight
of ideas behind REML estimators of variance components. This test reducesto Bartlett’s test
(Bartlett, 1937)
for a one classification model in variances andunder a saturated fixed model on the means
(ie
i
ji
=y
i
,
i = 1, ...I).
Unfortunately,
Bartlett’s test is known to be sensitive to
departure
fromnormality
(Box, 1953).
Simulations are needed tostudy
the robustness of this test and othercompeting
tests. From a
Bayesian
perspective,
theBayes
factor isusually applied
forhypothesis
testing
(see
Robert, 1992,
for adiscussion).
Theposterior
Bayes
factor(Aitkin,
1991)
could also be used to comparedispersion
models,
but numericalintegration
would then berequired
(see
theexpression
of the likelihood in!18!).
In this paper, focus was on an
appropriate
way to modelheterogeneous
vari-ances, but the initial motivation was a bestfitting
of location parameters(animal
evaluation for animal
breeders).
This difficultproblem
offeedback,
also related tothe Behrens-Fisher
problem,
has to be solved in ourparticular approach.
Moreover,
a greatresearch
perspective
is open on theimportant
andcomplicated
question of the