Original article

Restricted maximum likelihood estimation for animal models using derivatives of the likelihood

K Meyer, SP Smith

Animal Genetics and Breeding Unit, University of New England, Armidale, NSW 2351, Australia

(Received 21 March 1995; accepted 9 October 1995)
Summary - Restricted maximum likelihood estimation using first and second derivatives of the likelihood is described. It relies on the calculation of derivatives without the need for large matrix inversion, using an automatic differentiation procedure. In essence, this is an extension of the Cholesky factorisation of a matrix. A reparameterisation is used to transform the constrained optimisation problem imposed in estimating covariance components to an unconstrained problem, thus making the use of Newton-Raphson and related algorithms feasible. A numerical example is given to illustrate calculations. Several modified Newton-Raphson and method of scoring algorithms are compared for applications to analyses of beef cattle data, and contrasted to a derivative-free algorithm.
restricted maximum likelihood / derivative / algorithm / variance component estimation

Résumé - Restricted maximum likelihood estimation for animal models by differentiation of the likelihood. This article describes a method of restricted maximum likelihood estimation using the first and second derivatives of the likelihood. The method is based on an automatic differentiation procedure which does not require the inversion of large matrices. It is, in essence, an extension of the Cholesky decomposition applied to a matrix. A reparameterisation is used which transforms the constrained optimisation problem raised by the estimation of variance components into an unconstrained problem, making the use of Newton-Raphson and related algorithms feasible. The calculations are illustrated with a numerical example. Several algorithms, of the Newton-Raphson type or following the method of scoring, are applied to the analysis of beef cattle data; these algorithms are compared with each other and with a derivative-free algorithm.
INTRODUCTION
Maximum likelihood estimation of (co)variance components generally requires the numerical solution of a constrained nonlinear optimisation problem (Harville, 1977). Procedures to locate the minimum or maximum of a function are classified according to the amount of information from derivatives of the function utilised; see, for instance, Gill et al (1981). Methods using both first and second derivatives are fastest to converge, often showing quadratic convergence, while search algorithms not relying on derivatives are generally slow, ie, require many iterations and function evaluations.
Early applications of restricted maximum likelihood (REML) estimation to animal breeding data used a Fisher's method of scoring type algorithm, following the original paper by Patterson and Thompson (1971) and Thompson (1973). This requires expected values of the second derivatives of the likelihood to be evaluated, which proved computationally highly demanding for all but the simplest analyses. Hence expectation-maximization (EM) type algorithms gained popularity and found widespread use for analyses fitting a sire model. Effectively, these use first derivatives of the likelihood function. Except for special cases, however, they required the inverse of a matrix of size equal to the number of random effects fitted, eg, number of sires times number of traits, which severely limited the size of analyses feasible.
For analyses under the animal model, Graser et al (1987) thus proposed a derivative-free algorithm. This only requires factorising the coefficient matrix of the mixed-model equations rather than inverting it, and can be implemented efficiently using sparse matrix techniques. Moreover, it is readily extendable to animal models including additional random effects and to multivariate analyses (Meyer, 1989, 1991).
Multi-trait animal model analyses fitting additional random effects using a derivative-free algorithm have been shown to be feasible. However, they are computationally highly demanding, the number of likelihood evaluations required increasing exponentially with the number of (co)variance components to be estimated simultaneously. Groeneveld et al (1991), for instance, reported that 56 000 evaluations were required to reach a change in likelihood smaller than 10^-7 when estimating 60 covariance components for five traits. While judicious choice of starting values and search strategies (eg, temporary maximisation with respect to a subset of the parameters only), together with exploitation of special features of the data structure, might reduce demands markedly for individual analyses, it remains true that derivative-free maximisation in high dimensions is very slow to converge.
This makes a case for REML algorithms using derivatives of the likelihood for multivariate, multidimensional animal model analyses. Misztal (1994) recently presented a comparison of rates of convergence of derivative-free and derivative algorithms, concluding that the latter had the potential to be faster in almost all cases, and in particular that their convergence rate depended little on the number of traits considered. Large-scale animal model applications using an EM type algorithm (Misztal, 1990) or even a method of scoring algorithm (Ducrocq, 1993) have been reported, obtaining the large matrix inverse (or its trace) required using sparse matrix techniques. This paper describes REML estimation for animal models using first and second derivatives of the likelihood function, computed without inverting large matrices.

DERIVATIVES OF THE LIKELIHOOD
Consider the linear mixed model

[1]  y = Xb + Zu + e

where y, b, u and e denote the vectors of observations, fixed effects, random effects and residual errors, respectively, and X and Z are the incidence matrices pertaining to b and u. Let V(u) = G, V(e) = R and Cov(u, e') = 0, so that V(y) = V = ZGZ' + R. Assuming a multivariate normal distribution, ie, y ~ N(Xb, V), the log of the REML likelihood (log L) is (eg, Harville, 1977)

[2]  log L = -1/2 [const + log |V| + log |X*'V^-1 X*| + y'Py]

where X* denotes a full-rank submatrix of X.
REML algorithms using derivatives have generally been derived by differentiating [2]. However, as outlined previously (Graser et al, 1987; Meyer, 1989), log L can be rewritten as

[3]  log L = -1/2 [const + log |R| + log |G| + log |C| + y'Py]

where C is the coefficient matrix in the mixed-model equations (MME) pertaining to [1] (or a full-rank submatrix thereof), and P is the matrix

P = V^-1 - V^-1 X* (X*'V^-1 X*)^-1 X*'V^-1
Alternative forms of the derivatives of the likelihood can then be obtained by differentiating [3] instead of [2]. Let θ denote the vector of parameters to be estimated, with elements θ_i, i = 1, ..., p. The first and second partial derivatives of the log likelihood are then

[4]  ∂ log L/∂θ_i = -1/2 [∂ log |R|/∂θ_i + ∂ log |G|/∂θ_i + ∂ log |C|/∂θ_i + ∂(y'Py)/∂θ_i]

[5]  ∂² log L/∂θ_i∂θ_j = -1/2 [∂² log |R|/∂θ_i∂θ_j + ∂² log |G|/∂θ_i∂θ_j + ∂² log |C|/∂θ_i∂θ_j + ∂²(y'Py)/∂θ_i∂θ_j]

Graser et al (1987) show how the last two terms in [3], log |C| and y'Py, can be evaluated in a general way for all models of form [1] by carrying out a series of Gaussian elimination steps on the coefficient matrix in the MME augmented by the vector of right-hand sides and a quadratic in the data vector. Depending on the model of analysis and the structure of G and R, the other two terms required in [3], log |R| and log |G|, can usually be evaluated exploiting that structure (Meyer, 1989, 1991), generally requiring only matrix operations proportional to the number of traits considered. Derivatives of these four terms can be evaluated analogously.
Calculating log |C| and y'Py and their derivatives

The mixed-model matrix (MMM) or augmented coefficient matrix pertaining to [1] is

[6]  M = [ C  r ; r'  y'R^-1 y ]

where r is the vector of right-hand sides in the MME. Using general matrix results, the derivatives of log |C| are

[7]  ∂ log |C|/∂θ_i = tr(C^-1 ∂C/∂θ_i)

[8]  ∂² log |C|/∂θ_i∂θ_j = tr(C^-1 ∂²C/∂θ_i∂θ_j) - tr(C^-1 (∂C/∂θ_i) C^-1 (∂C/∂θ_j))

Partitioned matrix results give log |M| = log |C| + log(y'Py), ie,

[9]  y'Py = |M| / |C|

(Smith, 1995). This gives corresponding derivatives [10] and [11] for y'Py. Obviously, these expressions ([7], [8], [10] and [11]), involving the inverses of the large matrices M and C, are computationally intractable for any sizable animal model analysis.
However, the Gaussian elimination procedure with diagonal pivoting advocated by Graser et al (1987) is only one of several ways to 'factor' a matrix. An alternative is a Cholesky decomposition. This lends itself readily to the solution of large positive definite systems of linear equations using sparse matrix storage schemes. Appropriate Fortran routines are given, for instance, by George and Liu (1981), and have been used successfully in derivative-free REML applications. Let L, with elements l_ij (l_ij = 0 for j > i), denote the Cholesky factor of M, ie, M = LL'. The determinant of a triangular matrix is simply the product of its diagonal elements. Hence, with M also denoting the size of M,

[13]  log |C| = 2 Σ_{i=1}^{M-1} log l_ii

and, from [9],

[14]  y'Py = l_MM^2
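As a minimal numerical sketch of [13] and [14] (not the authors' implementation), the following builds a small, dense mixed-model matrix for an invented data set and checks the two identities against the 'large inverse' expressions the text seeks to avoid; a dense factorisation stands in for the sparse routines discussed here.

import numpy as np

rng = np.random.default_rng(1)
n, nf, nr = 8, 2, 4                              # records, fixed effects, random effect levels
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = np.kron(np.eye(nr), np.ones((n // nr, 1)))   # two records per random effect level
y = rng.normal(size=n)
G = 0.4 * np.eye(nr)                             # V(u), made up
R = 0.6 * np.eye(n)                              # V(e), made up

Rinv, Ginv = np.linalg.inv(R), np.linalg.inv(G)
W = np.hstack([X, Z])
C = W.T @ Rinv @ W
C[nf:, nf:] += Ginv                              # MME coefficient matrix
r = W.T @ Rinv @ y                               # right-hand sides

# augmented mixed-model matrix M = [C r; r' y'R^-1 y], as in [6]
M = np.block([[C, r[:, None]], [r[None, :], np.array([[y @ Rinv @ y]])]])

L = np.linalg.cholesky(M)                        # M = L L'
d = np.diag(L)
logC = 2.0 * np.sum(np.log(d[:-1]))              # [13]: log |C|
yPy = d[-1] ** 2                                 # [14]: y'Py

assert np.isclose(logC, np.linalg.slogdet(C)[1])
assert np.isclose(yPy, y @ Rinv @ y - r @ np.linalg.solve(C, r))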
Smith (1995) describes algorithms, outlined below, which allow the derivatives of the Cholesky factor of a matrix to be evaluated while carrying out the factorisation, provided the derivatives of the original matrix are specified. Differentiating [13] and [14] then gives the derivatives of log |C| and y'Py as simple functions of the diagonal elements of the Cholesky factor and its derivatives.

Calculating log |R| and its derivatives
analysis
for q traits and let y be orderedaccording
to traits within animals.Assuming
that error covariances between measurements ondifferent animals are zero, R is
blockdiagonal
foranimals,
where is N the number of animals which have
records,
and E
+
denotes the direct matrix sum(Searle,
1982).
Hencelog
IRI
as well as its derivatives can be determinedby
considering
one animal at a time.combinations of traits recorded
(assuming
single
records pertrait),
eg, W = 3 forq = 2 with combinations trait 1
only,
trait 2only
and both traits. For animal i which has combination of traits w,R
i
isequal
toE
w
,
the submatrix of E obtainedby deleting
rows and columnspertaining
tomissing
records. As outlinedby Meyer
(1991),
thisgives
where
N
w
represents
the number of animalshaving
records for combination of traitsw.
Correspondingly,
Consider the case where the parameters to be estimated are the (co)variance components due to random effects and residual errors (rather than, for example, heritabilities and correlations), so that V is linear in θ, ie, V = Σ_{i=1}^{p} θ_i ∂V/∂θ_i. Defining D_i = ∂E_w/∂θ_i, with elements d_kl = 1 if the klth element of E_w is equal to θ_i and d_kl = 0 otherwise, this then gives

[23]  ∂ log |R|/∂θ_i = Σ_{w=1}^{W} N_w tr(E_w^-1 D_i)

[24]  ∂² log |R|/∂θ_i∂θ_j = -Σ_{w=1}^{W} N_w tr(E_w^-1 D_i E_w^-1 D_j)

Let e^rs denote the rsth element of E_w^-1. For θ_i = e_kl and θ_j = e_mn, [23] and [24] then simplify to [25] and [26], simple functions of the elements of the E_w^-1. For q = 1 and R = σ²_E I, [25] and [26] become N σ_E^-2 and -N σ_E^-4, respectively (for θ_i = θ_j = σ²_E). Extensions for models with repeated records are straightforward.
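As an illustration only (not the paper's code), a short sketch of [16], [23] and [24] for a hypothetical two-trait example; the residual covariance matrix E and the counts N_w are invented for the purpose.

import numpy as np

E = np.array([[1.0, 0.3],
              [0.3, 2.0]])                     # residual covariances, q = 2 (made up)
combos = {1: [0], 2: [1], 3: [0, 1]}           # trait combinations w
N_w = {1: 40, 2: 25, 3: 100}                   # animals per combination (made up)

def D(k, l, idx):
    """Derivative of E_w with respect to e_kl, restricted to the traits in idx."""
    d = np.zeros((2, 2))
    d[k, l] = d[l, k] = 1.0
    return d[np.ix_(idx, idx)]

# log |R| = sum_w N_w log |E_w|                                              [16]
logR = sum(N_w[w] * np.linalg.slogdet(E[np.ix_(i, i)])[1] for w, i in combos.items())

# first and second derivatives with respect to e_11 and (e_11, e_12)        [23], [24]
d1 = d2 = 0.0
for w, idx in combos.items():
    Ew_inv = np.linalg.inv(E[np.ix_(idx, idx)])
    Di, Dj = D(0, 0, idx), D(0, 1, idx)
    d1 += N_w[w] * np.trace(Ew_inv @ Di)                 # d log|R| / d e_11
    d2 -= N_w[w] * np.trace(Ew_inv @ Di @ Ew_inv @ Dj)   # d2 log|R| / d e_11 d e_12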
Hence, once the inverses of the matrices of residual covariances for all combinations of numbers of traits recorded occurring in the data have been obtained (of maximum size equal to the maximum number of traits recorded per animal, and also required to set up the MMM), evaluation of log |R| and its derivatives requires only scalar manipulations in addition.

Calculating log |G| and its derivatives
and its derivativesTerms
arising
from the covariance matrix of randomeffects, G,
can often bedetermined in a similar way,
exploiting
the structure of G. Thisdepends
on therandom effects fitted.
Meyer
(1989, 1991)
describeslog
IGI
for various cases.Define T with elements
t
ij
of size rq x rq as the matrix of covariances betweenrandom effects where r is the number of random factors in the model
(excluding
e).
Forillustration,
let u consist of a vector of animalgenetic
effects a and some uncorrelated additional randomeffect(s)
c withN
c
levels pertrait, ie,
u’ =(a’c’).
In the
simplest
case, a consists of the direct additivegenetic
effects for each animaland
trait, ie,
it haslength
qN
A
whereN
A
denotes the total number of animals in theanalysis, including
parents
without records. In other cases, amight
include asecond
genetic
effect for each animal andtrait,
such as a maternal additivegenetic
effect,
which may be correlated to the directgenetic
effects. Anexample
for c is a common environmental effect such as a litter effect.With a and c
uncorrelated,
T can bepartitioned
intocorresponding diagonal
blocks
T
A
andT
C
,
so thatwhere A is the numerator
relationship
betweenanimals, F,
often assumed to be theidentity
matrix,
describes the correlation structureamongst
the levels of c, andx denotes the direct matrix
product
(Searle, 1982).
Thisgives
(Meyer,
1991)
Noting that all ∂²T/∂θ_i∂θ_j = 0 (for V linear in θ), derivatives are

[29]  ∂ log |G|/∂θ_i = N_A tr(T_A^-1 D_A) + N_C tr(T_C^-1 D_C)

[30]  ∂² log |G|/∂θ_i∂θ_j = -N_A tr(T_A^-1 D_A T_A^-1 D*_A) - N_C tr(T_C^-1 D_C T_C^-1 D*_C)

where D_A = ∂T_A/∂θ_i and D_C = ∂T_C/∂θ_i are again matrices with elements 1 if t_kl = θ_i and zero otherwise (and D*_A, D*_C the corresponding matrices for θ_j). As above, all second derivatives for θ_i and θ_j not pertaining to the same random factor (eg, c) or to two correlated factors (such as direct and maternal genetic effects) are zero. Furthermore, all derivatives of log |G| with respect to residual covariance components are zero.
Further simplifications analogous to [25] and [26] can be derived. For instance, for a model fitting animals as the only random effect (r = 1), T is the matrix of additive genetic covariances a_ij with i, j = 1, ..., q. For θ_i = a_kl and θ_j = a_mn, this gives [31] and [32], with a^rs denoting the rsth element of T^-1. For q = 1 and a_11 = σ²_A, [31] and [32] reduce to N_A σ_A^-2 and -N_A σ_A^-4, respectively.
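A small self-contained check of [29] and [30] against numerical differentiation of N_A log |T| (illustration only; the matrix T and the count N_A below are invented):

import numpy as np

N_A = 500
T = np.array([[1.0, 0.4],
              [0.4, 0.8]])                 # additive genetic covariance matrix (made up)

def D(k, l):
    d = np.zeros_like(T)
    d[k, l] = d[l, k] = 1.0                # dT / d a_kl
    return d

Tinv = np.linalg.inv(T)
d1 = N_A * np.trace(Tinv @ D(0, 1))                      # [29] for theta_i = a_12
d2 = -N_A * np.trace(Tinv @ D(0, 1) @ Tinv @ D(0, 0))    # [30] for a_12, a_11

# finite-difference check of the first derivative
eps = 1e-6
num = N_A * (np.linalg.slogdet(T + eps * D(0, 1))[1] - np.linalg.slogdet(T)[1]) / eps
assert abs(d1 - num) < 1e-2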
Derivatives of the mixed model matrix
As emphasised above, calculation of the derivatives of the Cholesky factor of M requires the corresponding derivatives of M to be evaluated. Fortunately, these have the same structure as M and can be evaluated while setting up M, replacing G and R by their derivatives. For θ_i and θ_j equal to residual (co)variances, the derivatives of M are of the form

[33]  [X Z y]' Q_R [X Z y]

with Q_R standing in turn for

[34]  ∂R^-1/∂θ_i = -R^-1 (∂R/∂θ_i) R^-1

and

[35]  ∂²R^-1/∂θ_i∂θ_j

for first and second derivatives, respectively.
As outlined above, R is blockdiagonal for animals, with submatrices E_w. Hence, the matrices Q_R have the same blockdiagonal structure, with submatrices

[36]  ∂E_w^-1/∂θ_i = -E_w^-1 D_i E_w^-1

and (for V linear in θ, so that ∂²R/∂θ_i∂θ_j = 0)

[37]  ∂²E_w^-1/∂θ_i∂θ_j = E_w^-1 D_i E_w^-1 D_j E_w^-1 + E_w^-1 D_j E_w^-1 D_i E_w^-1

Consequently, the derivatives of M with respect to the residual (co)variances can be set up in the same way as the 'data part' of M. In addition to calculating the matrices E_w^-1 for the W combinations of records per animal occurring in the data, all derivatives of the E_w^-1 for residual components need to be evaluated. The extra calculations required, however, are trivial, requiring matrix operations proportional to the maximum number of records per animal only to obtain the terms in [36] and [37].
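A minimal sketch of [36] and [37] for one block E_w, with a finite-difference check of the first derivative; the matrices are invented for illustration.

import numpy as np

Ew = np.array([[1.0, 0.3],
               [0.3, 2.0]])                       # residual covariances for one combination
Di = np.zeros((2, 2)); Di[0, 1] = Di[1, 0] = 1.0  # dE_w / d e_12
Dj = np.zeros((2, 2)); Dj[1, 1] = 1.0             # dE_w / d e_22

Einv = np.linalg.inv(Ew)
Q1 = -Einv @ Di @ Einv                                             # [36]
Q2 = Einv @ Di @ Einv @ Dj @ Einv + Einv @ Dj @ Einv @ Di @ Einv   # [37]

eps = 1e-6
num = (np.linalg.inv(Ew + eps * Di) - Einv) / eps
assert np.allclose(Q1, num, atol=1e-3)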
Analogously, for θ_i and θ_j equal to elements of T, only the part of M containing G^-1 contributes, and the derivatives of M are of the form [38], with Q_G standing for

[39]  ∂G^-1/∂θ_i = -G^-1 (∂G/∂θ_i) G^-1

for first derivatives, and

[40]  ∂²G^-1/∂θ_i∂θ_j = G^-1 (∂G/∂θ_i) G^-1 (∂G/∂θ_j) G^-1 + G^-1 (∂G/∂θ_j) G^-1 (∂G/∂θ_i) G^-1

for second derivatives. As above, further simplifications are possible depending on the structure of G. For instance, for G as in [27] and ∂²G/∂θ_i∂θ_j = 0, the Q_G reduce to direct products of A^-1 or F^-1 with matrices involving T_A^-1 or T_C^-1 and the corresponding derivative matrices D_A or D_C.
Expected values of second derivatives of log L

Differentiating [2] gives second derivatives of log L, with expected values (Harville, 1977) given by [45]. Again, for V linear in θ, ∂²V/∂θ_i∂θ_j = 0. From [5], and noting that ∂P/∂θ_i = -P (∂V/∂θ_i) P, it follows that the expected values of the second derivatives are essentially (sign ignored) equal to the observed values minus the contribution from the data, and thus can be evaluated analogously. With second derivatives of y'Py not required, computational requirements are reduced somewhat, as only the first M - 1 rows of ∂²M/∂θ_i∂θ_j need to be evaluated and factored.
AUTOMATIC DIFFERENTIATION
Calculation of the derivatives of the likelihood as described above relies on the fact that the derivatives of the Cholesky factor of a matrix can be obtained 'automatically', provided the derivatives of the original matrix can be specified.
Smith (1995) describes a so-called forward differentiation, which is a straightforward expansion of the recursions employed in the Cholesky factorisation of a matrix M. Operations to determine the latter are typically carried out sequentially by rows. Let L, of size N, be initialised to M. First, the pivot (diagonal element, which must be greater than an operational zero) is selected for the current row k. Secondly, the off-diagonal elements for the row ('lead column') are adjusted (L_jk for j = k + 1, ..., N), and thirdly the elements in the remaining part of L (L_ij for j = k + 1, ..., N and i = j, ..., N) are modified ('row operations'). After all N rows have been processed, L contains the Cholesky factor of M.
Pseudo-code given by Smith (1995) for the calculation of the Cholesky factor and its first and second derivatives is summarised in table I. It can be seen that the operations to evaluate a second derivative require the respective elements of the two corresponding first derivatives. This imposes severe constraints on the memory requirements of the algorithm. While it is most efficient to evaluate the Cholesky factor and all its derivatives together, considerable space can be saved by computing the second derivatives one at a time. This can be done by holding all the first derivatives in memory or, if core space is the limiting factor, storing first derivatives on disk (after evaluating them individually as well) and reading in only the two required. Hence, the minimum memory requirement for REML using first and second derivatives is 4 × L, compared to L for a derivative-free algorithm. Smith (1995) stated that, using forward differentiation, each first derivative required not more than twice the work required to evaluate log L only, and that the work needed to determine a second derivative would be at most four times that needed to calculate log L.
In addition, Smith (1995) described a 'backward differentiation' scheme, so named because it reverses the order of steps in the forward differentiation. It is applicable for cases where we want to evaluate a scalar function of L, f(L), in our case log |C| + log y'Py, which is a function of the diagonal elements of L (see [13] and [14]). It requires computing a (lower triangular) matrix W which, on completion of the backward differentiation, contains the derivatives of f(L) with respect to the elements of M. First derivatives of f(L) can then be evaluated one at a time as tr(W ∂M/∂θ_r).
The pseudo-code given by Smith (1995) for the backward differentiation is shown in table II. Calculation of W requires about twice as much work as one likelihood evaluation and, once W is evaluated, calculating individual derivatives (step 3 in table II) requires only somewhat more work than calculation of one derivative by forward differentiation. Smith (1995) also described the calculation of second derivatives by backward differentiation (pseudo-code not shown here). Amongst other calculations, this involves one evaluation of a matrix W as described above for each parameter, and requires another work array of size L in addition to space to store at least one matrix of derivatives of M. Hence the minimum memory requirement for this algorithm is 3 × L + M (M and L differing by the fill-in created during the factorisation). Smith (1995) claimed that the total work required to evaluate all second derivatives for p parameters was no more than 6p times that for a likelihood evaluation.

MAXIMISING THE LIKELIHOOD
Methods to locate the maximum of the likelihood function in the context of variance component estimation are reviewed, for instance, by Harville (1977) and Searle et al (1992; Chapter 8). Most utilise the gradient vector, ie, the vector of first derivatives of the likelihood function, to determine the direction of search.

Using second derivatives

One of the oldest and most widely used methods to optimise a non-linear function is the Newton-Raphson (NR) algorithm. It requires the Hessian matrix of the function, ie, the matrix of second partial derivatives of the (log) likelihood, with iterates

[46]  θ^(t+1) = θ^t - (H^t)^-1 g^t

where H^t = {∂² log L/∂θ_i∂θ_j} and g^t = {∂ log L/∂θ_i} are the Hessian matrix and gradient vector of log L, respectively, both evaluated at θ = θ^t. While the NR algorithm can be quick to converge, in particular for functions resembling a quadratic function, it is known to be sensitive to poor starting values (Powell, 1970). Unlike other algorithms, it is not guaranteed to converge, though global convergence has been shown for some cases using iterative partial maximisation (Jensen et al, 1991).
In practice, so-called extended or modified NR algorithms have been found to be more successful. Jennrich and Sampson (1976) suggested step halving, applied successively until the likelihood is found to increase, to avoid 'overshooting'. More generally, the change in estimates for the tth iterate in [46] is given by

[47]  Δθ^t = B^t g^t

For the extended NR, B^t = -τ^t (H^t)^-1, where τ^t is a step-size scaling factor. The optimum for τ^t can be determined readily as the value which results in the largest increase in likelihood, using a one-dimensional maximisation technique (Powell, 1970). This relies on the direction of search given by H^-1 g generally being a 'good' direction and on the fact that, for -H positive definite, there is always a step size which will increase the likelihood. Alternatively, the use of

[48]  B^t = -(H^t - κ I)^-1

has been suggested (Marquardt, 1963) to improve the performance of the NR algorithm. This results in a step intermediate between a NR step (κ = 0) and a method of steepest ascent step (κ large). Again, κ can be chosen to maximise the increase in log L, though for large values of κ the step size is small, so that there is no need to include a search step in the iteration (Powell, 1970).
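A schematic sketch of one modified NR update in the sense of [46]-[48], with simple step halving as the one-dimensional search; loglik, gradient and hessian are hypothetical placeholders for the REML quantities discussed above, not functions defined in the paper.

import numpy as np

def modified_nr_step(theta, loglik, gradient, hessian, kappa=0.0, max_halvings=10):
    """One Newton-Raphson step with optional Marquardt shift and step halving."""
    g = gradient(theta)
    H = hessian(theta)
    B = -np.linalg.inv(H - kappa * np.eye(len(theta)))   # [48]; kappa = 0 gives plain NR
    direction = B @ g                                     # [47] with tau = 1
    f0, tau = loglik(theta), 1.0
    for _ in range(max_halvings):
        candidate = theta + tau * direction
        if loglik(candidate) > f0:                        # accept the first improving step
            return candidate
        tau *= 0.5                                        # step halving (Jennrich and Sampson, 1976)
    return theta                                          # no improvement found

On the reparameterised scale discussed below, such a step can be taken without checking parameter-space boundaries.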
Often expected values of the second derivatives of log L are easier to calculate than the observed values. Replacing -H by the information matrix results in Fisher's method of scoring (MSC). It can be extended or modified in the same way as the NR algorithm (Harville, 1977). Jennrich and Sampson (1976) and Jennrich and Schluchter (1986) compared NR and MSC, showing that the MSC was generally more robust against a poor choice of starting values than the NR, though it tended to require more iterations. They thus recommended a scheme using the MSC initially and switching to NR after a few rounds of iteration, when the increase in log L between steps was less than one.

Using first derivatives only
Other methods, so-called variable-metric or Quasi-Newton procedures, essentially use the same strategies, but replace B by an approximation of the Hessian matrix.
An interesting variation has recently been presented by Johnson and Thompson (1995). Noting that the trace terms in the observed and expected information are of opposite sign, so that the two differ only by a term involving the second derivatives of y'Py (see [5] and [45]), they suggested using the average of the observed and expected information (AI) to approximate the Hessian matrix. Since it requires only the second derivatives of y'Py to be evaluated, each iterate is computationally considerably less demanding than a 'full' NR or MSC step, and the same modifications as described above for the NR (see [47] and [48]) can be applied. Initial comparisons of the rate of convergence and computer time required showed the AI algorithm to be highly advantageous over both derivative-free and EM-type procedures (Johnson and Thompson, 1995).
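In the notation above, with V_i = ∂V/∂θ_i and V linear in θ, the observed information, the expected information and their average can be written as follows; this is a standard result consistent with Johnson and Thompson (1995), given here for reference only.

\[
-\mathrm{E}\!\left[\frac{\partial^2 \log L}{\partial\theta_i\,\partial\theta_j}\right]
= \tfrac{1}{2}\,\mathrm{tr}\!\left(P V_i P V_j\right),
\qquad
-\frac{\partial^2 \log L}{\partial\theta_i\,\partial\theta_j}
= -\tfrac{1}{2}\,\mathrm{tr}\!\left(P V_i P V_j\right) + \mathbf{y}' P V_i P V_j P \mathbf{y},
\]
\[
\text{so that the average information is}\quad
\mathcal{I}^{AI}_{ij} = \tfrac{1}{2}\,\mathbf{y}' P V_i P V_j P \mathbf{y},
\]

which involves only the data-dependent (y'Py) part.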
Constraining parameters

All the Newton-type algorithms described above perform an unconstrained optimisation. In estimating (co)variance components, however, we require variances to be non-negative, correlations to be in the range from -1 to 1 and, for more than two traits, for them to be consistent with each other; more generally, estimated covariance matrices need to be (semi-)positive definite. As shown by Hill and Thompson (1978), the probability of obtaining parameter estimates out of bounds depends on the magnitude of the correlations (population values) and increases rapidly with the number of traits considered, in particular for genetic covariance matrices.
For univariate analyses, a common approach has been to set negative estimates of variance components to zero and continue iterating. Partial sweeping of B to handle boundary constraints, monitoring and restraining the size of pivots relative to the corresponding (original) diagonal elements, has been recommended (Jennrich and Sampson, 1976; Jennrich and Schluchter, 1986).
Harville (1977; section 6.3) distinguished between three types of techniques to modify unconstrained optimisation procedures to accommodate constraints on the parameter space. Firstly, penalty techniques operate on a function which is close to log L except at the boundaries, where it assumes large negative values which effectively serve as a barrier, deflecting a further search in that direction. Secondly, gradient projection techniques are suitable for linear inequality constraints. Thirdly, it may be feasible to transform the parameters to be estimated so that maximisation on the new scale is unconstrained. Box (1965) demonstrated for several examples that the computational effort to solve constrained problems can be reduced markedly by eliminating constraints, so that one of the more powerful unconstrained methods, preferably with quadratic convergence, can be employed. For univariate analyses, an obvious way to eliminate non-negativity constraints is to maximise log L with respect to standard deviations and to square these on convergence (Harville, 1977). Alternatively, we could estimate logarithmic values of variance components instead of the variances. This seems preferable to taking square roots where, on backtransforming, a largish negative estimate might become a substantial positive estimate of the corresponding variance. Harville (1977) cautioned, however, that such transformations may result in the introduction of additional stationary points on the likelihood surface, and thus should be used only in conjunction with optimisation techniques ensuring an increase in the likelihood at each step.
Further motivation for the use of transformations has been provided by the scope to reduce computational effort or to improve convergence by making the shape of the likelihood function on the new scale more quadratic.
For multivariate analyses with one random effect and equal design matrices, for instance, a canonical transformation allows estimation to be broken down into a series of corresponding univariate analyses (Meyer, 1985); see Jensen and Mao (1988) for a review. Harville and Callanan (1990) considered different forms of the likelihood for both NR and MSC and demonstrated how they affected convergence behaviour. In particular, a 'linearisation' was found to reduce the number of iterates required to reach convergence considerably. Thompson and Meyer (1986) showed how a reparameterisation aimed at making variables less correlated could speed up an expectation-maximisation algorithm dramatically.
For univariate analyses of repeated measures data with several random effects, Lindstrom and Bates (1988) suggested maximisation of log L with respect to the non-zero elements of the Cholesky decomposition of the covariance matrix of random effects, in order to remove constraints on the parameter space and to improve stability of the NR algorithm. In addition, they chose to estimate the error variance directly, operating on the profile likelihood of the remaining parameters. For several examples, the authors found consistent convergence of the NR algorithm when implemented this way, even for an overparameterised model. Recently, Groeneveld (1994) examined the effect of this reparameterisation for large-scale multivariate analyses using derivative-free REML, reporting substantial improvements in speed of convergence for both direct search (downhill Simplex) and Quasi-Newton algorithms.
IMPLEMENTATION
Reparameterisation
For our analyses, the parameters to be estimated are the non-zero elements of the lower triangle of the two covariance matrices E and T, with T potentially consisting of independent diagonal blocks, depending on the random effects fitted. Let U_r, r = 1, 2, ..., denote these matrices (or blocks). The Cholesky decomposition of each U_r, U_r = L_Ur L_Ur', has the same structure. Estimating the non-zero elements of the matrices L_Ur then ensures positive definite matrices U_r on transforming back to the original scale (Lindstrom and Bates, 1988).
However, for the U_r representing covariance matrices, it is suggested to apply a secondary transformation to the diagonal elements of L_Ur, estimating the logarithmic values of these diagonal elements and thus, as discussed above for univariate analyses of variance components, effectively forcing them to be greater than zero.
Let v denote the vector of parameters on the new scale. In order to maximise log L with respect to the elements of v, we need to transform its derivatives accordingly.
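A small sketch of this reparameterisation (illustration only): mapping a covariance matrix to the vector v of Cholesky elements with log-transformed diagonals, and back. The ordering of v below is one convenient choice, not necessarily the one used by the authors.

import numpy as np

def cov_to_v(U):
    """Lower-triangle Cholesky elements of U, with diagonals replaced by their logs."""
    L = np.linalg.cholesky(U)
    v = []
    for i in range(U.shape[0]):
        for j in range(i + 1):
            v.append(np.log(L[i, j]) if i == j else L[i, j])
    return np.array(v)

def v_to_cov(v, q):
    """Inverse map: rebuild L (exponentiating diagonals) and return U = L L'."""
    L = np.zeros((q, q))
    k = 0
    for i in range(q):
        for j in range(i + 1):
            L[i, j] = np.exp(v[k]) if i == j else v[k]
            k += 1
    return L @ L.T

U = np.array([[1.0, 0.5], [0.5, 2.0]])
assert np.allclose(v_to_cov(cov_to_v(U), 2), U)   # round trip
# any real-valued v maps back to a positive definite U, so no constraints are needed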
Lindstrom and Bates (1988) describe briefly how to obtain the gradient vector and Hessian matrix for a Cholesky matrix L_Ur from those for the matrix U_r (see also the corrections in Lindstrom and Bates (1994)). More generally, for a one-to-one transformation (Zacks, 1971), the gradient for the ith element of v is

[52]  ∂ log L/∂v_i = Σ_k (∂θ_k/∂v_i)(∂ log L/∂θ_k)

where J, with elements ∂θ_k/∂v_j, is the Jacobian of θ with respect to v. Similarly,

[53]  ∂² log L/∂v_i∂v_j = Σ_k (∂²θ_k/∂v_i∂v_j)(∂ log L/∂θ_k) + [J' {∂² log L/∂θ_k∂θ_l} J]_ij

with the first part of [53] equal to the ith element of (∂J'/∂v_j)(∂ log L/∂θ) and the second part equal to the ijth element of J' {∂² log L/∂θ_k∂θ_l} J.
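A sketch of [52] and [53] in matrix form. The mapping theta_of_v is a hypothetical callable (for instance, built from the reparameterisation sketch above); here the Jacobian and its derivatives are obtained by finite differences rather than the analytic forms [54]-[56] discussed next.

import numpy as np

def transform_derivatives(theta_of_v, v, grad_theta, hess_theta, eps=1e-4):
    """Gradient and Hessian of log L with respect to v, via [52] and [53].

    theta_of_v: callable mapping the new parameters v to the original parameters theta
    grad_theta, hess_theta: derivatives of log L with respect to theta at theta_of_v(v)
    """
    p = len(v)
    t0 = theta_of_v(v)
    J = np.column_stack([(theta_of_v(v + eps * e) - t0) / eps
                         for e in np.eye(p)])              # J[k, j] = d theta_k / d v_j
    grad_v = J.T @ grad_theta                              # [52]
    hess_v = J.T @ hess_theta @ J                          # second part of [53]
    for i in range(p):
        for j in range(p):
            d2theta = (theta_of_v(v + eps * (np.eye(p)[i] + np.eye(p)[j]))
                       - theta_of_v(v + eps * np.eye(p)[i])
                       - theta_of_v(v + eps * np.eye(p)[j]) + t0) / eps**2
            hess_v[i, j] += d2theta @ grad_theta           # first part of [53]
    return grad_v, hess_v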
Consider one covariance matrix U_r at a time, dropping the subscript r for convenience, and let u_st and l_wx denote the elements of U and L_U, respectively. From U = LL', it follows that

[54]  u_st = Σ_{k=1}^{min(s,t)} l_sk l_tk

where min(s, t) is the smaller value of s and t. Hence, the ijth element of J, for θ_i = u_st and v_j = l_wx, is

[55]  ∂u_st/∂l_wx

For w ≠ x and s, t ≥ x, this is non-zero only if at least one of s and t is equal to w, which leaves three cases to consider: s = t = w gives 2 l_wx; s = w and t ≠ w gives l_tx; and t = w and s ≠ w gives l_sx. For example, for q = 3 traits, the six elements of U are (u11, u12, u13, u22, u23, u33) and v = (log(l11), l21, l31, log(l22), l32, log(l33)).
Similarly, the ijth element of {∂²θ_k/∂v_i∂v_j} (see [53]), for θ_k = u_st, v_i = l_wx and v_j = l_yz, is

[56]  ∂²u_st/∂l_wx ∂l_yz

Allowing for the additional adjustments due to the log transformation of the diagonals, [56] is non-zero in five cases. For the above example, this gives the first two derivatives of J.
component
cannot beestimated,
forinstance,
the error covariance between two traits measured on different sets of animals. On thenew
parameter
may be non-zero. This can be accommodatedby, again,
notestimating
thisparameter
butcalculating
its new value from the estimates of theother
parameters,
similar to the way a correlation is fixed to a certain valueby
only
estimating
therespective
variances and thencalculating
the new covariance’estimate’ from them.
For example, consider three traits with the covariance between traits 2 and 3, u23, not estimable. Setting u23 = 0 gives a non-zero value on the reparameterised scale of l32 = -l21 l31 / l22. Only the other five parameters (l11, l21, l31, l22 and l33) are then estimated (ignoring the log transformation of the diagonals here for simplicity), while an 'estimate' of l32 is calculated from those of l21, l31 and l22 according to the relationship above. On transforming back to the original scale, this ensures that the covariance between traits 2 and 3 remains fixed at zero.
Sparse matrix storage

Calculation of log L for use in derivative-free REML estimation under an animal model has been made feasible for practically sized data sets through the use of sparse matrix storage techniques. These adapt readily to include the calculation of derivatives of L. One possibility is the use of so-called linked lists (see, for instance, Tier and Smith (1989)). George and Liu (1981) describe several other strategies, using a 'compressed' storage scheme when applying a Cholesky decomposition to a large, symmetric, positive definite, sparse matrix.
sparse matrix.Elements of the matrices of derivatives of L are subsets of elements of
L, ie,
existonly
for non-zero elements of L. Thus the samesystem
ofpointers
can be used for Land all its
derivatives, reducing
overheads forstorage
and calculation of addresses of individual elements(though
at the expense ofreserving
space for zero derivativescorresponding
to non-zerol2!).
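A minimal sketch of this idea using a compressed sparse column layout (scipy is used only for illustration; the implementation discussed in the text uses the Fortran routines of George and Liu (1981)): the factor and each derivative matrix share one set of index arrays and differ only in their numerical values.

import numpy as np
from scipy.sparse import csc_matrix

# sparsity pattern of the Cholesky factor L (structure only; values arbitrary here)
indptr  = np.array([0, 2, 4, 5])          # column pointers
indices = np.array([0, 2, 1, 2, 2])       # row indices of non-zeros
L_values = np.array([2.0, 0.5, 1.5, 0.3, 1.2])

# derivative matrices reuse the same indptr/indices; only the value arrays differ
dL_values = [np.zeros_like(L_values) for _ in range(3)]   # eg, three parameters

L  = csc_matrix((L_values, indices, indptr), shape=(3, 3))
dL = [csc_matrix((v, indices, indptr), shape=(3, 3)) for v in dL_values]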
Moreover, schemes aimed at reducing the 'fill-in' during the factorisation of M should also reduce the computational requirements for determining derivatives.
Matrix storage required in evaluating derivatives of log L can be considerable: for p parameters to be estimated, there are p first and p(p+1)/2 second derivatives, ie, up to 1 + p(p+3)/2 times as much space as for calculating log L only can be required.
Even for analyses with one random factor (animals) only, this becomes prohibitive very quickly, amounting to a factor of 28 for q = 2 traits and p = 6 parameters, and 91 for q = 3 and p = 12. However, while the matrices ∂L/∂θ_i are needed to evaluate second derivatives, the matrices ∂²L/∂θ_i∂θ_j are not required after ∂² log |C|/∂θ_i∂θ_j and ∂²y'Py/∂θ_i∂θ_j have been calculated from their diagonal elements. Hence, while it is most efficient to evaluate all derivatives of each l_ij simultaneously, second derivatives can be determined one at a time after L and its first derivatives have been determined. This can be done by setting up and processing each ∂²M/∂θ_i∂θ_j individually, thus reducing the memory required dramatically.
Software
Estimation was carried out on the reparameterised scale, ie, with respect to the elements of the Cholesky decomposition of the covariance matrices (and logarithmic values of their diagonals), as described above, to remove constraints on the parameter space. Prior to estimation, the ordering of rows and columns 1 to M - 1 in M was determined using the minimum degree re-ordering performed by George and Liu's (1981) subroutine GENQMD, and their subroutine SBMFCT was used to establish the symbolic factorisation of M (all M rows and columns) and the associated compressed storage pointers, allocating space for all non-zero elements of L. In addition, the use of the average information matrix was implemented.
However, this was done merely for the comparison of convergence rates, without making use of the fact that only the derivatives of y'Py were required.
For each iterate, the optimal step size or scaling factor (see [47] and [48]) was determined by carrying out a one-dimensional, derivative-free search. This was done using a quadratic approximation of the likelihood surface, allowing for up to five approximation steps per iterate. Other techniques, such as a simple step-halving procedure, would be suitable alternatives.
Both procedures described by Smith (1995) to carry out the automatic differentiation of the Cholesky factor of a matrix were implemented. Forward differentiation was set up to evaluate L and all its first and second derivatives as efficiently as possible, ie, holding all matrices of derivatives in core and evaluating them simultaneously. In addition, it was set up to reduce memory requirements, ie, calculating the Cholesky factor and its first derivatives together and subsequently evaluating second derivatives individually. Backward differentiation was implemented storing all first derivatives of M in core, but setting up and processing one second derivative of M at a time.
EXAMPLES
To illustrate the calculations, consider the test data for a bivariate analysis given by Meyer (1991). Fitting a simple animal model, there are six parameters. Table III summarises intermediate results for the likelihood and its derivatives for the starting values used by Meyer (1991), and gives estimates from the first round of iteration for simple or modified NR or MSC algorithms.
Without reparameterisation, the (unmodified) NR algorithm produced estimates out of bounds of the parameter space, while the MSC performed quite well for this first iterate. Continuing to use expected values of second derivatives, though, estimates failed to converge. Figure 1 shows the corresponding change in log L over rounds of iteration. For this small example, with a starting value for the additive genetic covariance (σ_A12) very different from the eventual estimate, a NR or MSC algorithm optimising the step size (see equation [47]) failed to locate a suitable step size (which increased log L) in the first iterate. Essentially, all algorithms had reached about the same point on the likelihood surface by the fifth iterate, with very little change in log L in subsequent rounds. Convergence was considered achieved when the norm of the gradient vector was less than 10^-4.
This was a rather strict criterion: changes in estimates and likelihood values between iterates were usually very small before it was met. Convergence was reached using the observed information, compared to eight iterates and 34 likelihood evaluations for the average information, while the MSC (expected information) failed in this case. Optimising the step size (see [47]), 7 + 40, 10 + 62 and 10 + 57 iterates + likelihood evaluations were required using the observed, expected and average information, respectively.
As illustrated in Figure 2, use of the 'average information' yielded a very similar convergence pattern to that of the NR algorithm. In comparison, using an EM algorithm for this example, 80 rounds of iteration were required before the change in log L between iterates was less than 10^-4, and 142 iterates were needed before the average change in estimates was less than 0.01%.
Footnotes to table III: a σ_Aij: additive genetic covariances; σ_Eij: residual covariances. b Log likelihood: values given differ from those given by Meyer (1991) by a constant offset of 3.466, due to absorption of single-link parents without records ('pruning'). c Components of first derivatives of log L; see text for definition. d First derivative of log L with respect to θ_i. e Second derivative of log L with respect to θ_i. f Expected value of second derivative of log L with respect to θ_i. Second derivatives of log L with respect to θ_i and θ_j: observed values.
Table IV gives characteristics of the data structure and model of analysis for applied examples of multivariate analyses of beef cattle data. The first is a bivariate analysis fitting an animal model with maternal permanent environmental effects as an additional random effect, ie, estimating nine covariance components. The second shows the analysis of three weight traits in Zebu Cross cattle (analysed previously; see Meyer (1994a)) under three models of analysis, estimating up to 24 parameters.
For each data set and model, the computational requirements to obtain derivatives of the likelihood were determined using forward and backward differentiation.
Footnotes to table IV: a NR: Newton-Raphson; MSC: method of scoring; MIX: starting as MSC, switching to NR when the change in log likelihood between iterates drops below 1.0; and DF: derivative-free algorithm. b CB: cannon bone length; HH: hip height; WW: weaning weight; YW: yearling weight; FW: final weight. c 1: simple animal model; 2: animal model fitting dams' permanent environmental effect; and 5: animal model fitting both genetic and permanent environmental effects. d A: forward differentiation, processing all derivatives simultaneously; B: forward differentiation, processing second derivatives one at a time; C: backward differentiation. 0: no reparameterisation, unmodified; S: reparameterised scale, optimising step size, see [47] in text; T: reparameterised scale, modifying diagonals as described by Smith (1995).
If the memory available allowed, forward differentiation was carried out for all derivatives simultaneously, as well as considering one matrix ∂²M/∂θ_i∂θ_j at a time. Table IV gives computing times in minutes for calculations carried out on a DEC Alpha Chip (DIGITAL) machine running under OSF/1 (rated at about 200 Mflops). Clearly, the backward differentiation is most competitive, with the time required to calculate 24 first and 300 second derivatives being 'only' about 120 times that to obtain log L only.
Starting values used were usually 'good', using estimates of variance components from corresponding univariate analyses throughout and deriving initial guesses for covariance components from these and literature values for the correlations concerned. A maximum of 20 iterates was carried out, applying a very stringent convergence criterion of a change in log L of less than 10^-6 between iterates, or a value for the norm of the gradient vector of less than 10^-4 as above.
The numbers of iterates and additional likelihood evaluations required for each algorithm, given in table IV, show small and inconsistent differences between them. On the whole, starting with a MSC algorithm and switching to NR when the change in log L became small, and applying Marquardt's (1963) modification (see [48]), tended to be more robust (ie, achieve convergence when other algorithm(s) failed) for starting values not too close to the eventual estimates. Conversely, these variants tended to be slower to converge than an NR optimising the step size (see [47]) for starting values very close to the estimates.
However, once the derivatives of log L have been evaluated, it is computationally relatively inexpensive (requiring only simple likelihood evaluations) to compare and switch between algorithms, ie, it should be feasible to select the best procedure to be used for each analysis individually.
Comparisons with EM algorithms were not carried out for these examples. However, Misztal (1994) claimed that each sparse matrix inversion required for an EM step took about three times as long as one likelihood evaluation. This would mean that each iterate using second derivatives for the three-trait analysis estimating 24 highly correlated (co)variance components would require about the same time as 40 EM iterates, or that the second derivative algorithm would be advantageous if the EM algorithm required more than 462 iterates to converge.

DISCUSSION
For univariate analyses, Meyer (1994b) found no advantage in using Newton-type algorithms over derivative-free REML. However, comparing total cpu time per analysis, using derivatives appears to be highly advantageous over a derivative-free algorithm for multivariate analyses, the more so the larger the number of parameters to be estimated. Furthermore, for analyses involving larger numbers of parameters (18 or 24), the derivative-free algorithm converged several times to a local maximum, a problem observed previously by Groeneveld and Kovac (1990), while the Newton-type algorithm appeared not to be affected.
Modifications of the NR algorithm, combined with a local search for the optimum step size, improved its convergence rate. This was achieved primarily by safeguarding against steps in the 'wrong' direction in early iterates, thus making the search for the maximum of the likelihood function more robust