HAL Id: hal-00894192
https://hal.archives-ouvertes.fr/hal-00894192
Submitted on 1 Jan 1998
Original article

Restricted maximum likelihood estimation of covariances in sparse linear models

Arnold Neumaier (a), Eildert Groeneveld (b)

(a) Institut für Mathematik, Universität Wien, Strudlhofgasse 4, 1090 Vienna, Austria
(b) Institut für Tierzucht und Tierverhalten, Bundesforschungsanstalt für Landwirtschaft, Höltystr. 10, 31535 Neustadt, Germany

(Received 16 December 1996; accepted 30 September 1997)
Abstract - This paper discusses the restricted maximum likelihood (REML) approach for the estimation of covariance matrices in linear stochastic models, as implemented in the current version of the VCE package for covariance component estimation in large animal breeding models. The main features are: 1) the representation of the equations in an augmented form that simplifies the implementation; 2) the parametrization of the covariance matrices by means of their Cholesky factors, thus automatically ensuring their positive definiteness; 3) explicit formulas for the gradients of the REML function for the case of large and sparse model equations with a large number of unknown covariance components and possibly incomplete data, using the sparse inverse to obtain the gradients cheaply; 4) use of model equations that make separate formation of the inverse of the numerator relationship matrix unnecessary. Many large-scale breeding problems were solved with the new implementation, among them an example with more than 250 000 normal equations and 55 covariance components, taking 41 h CPU time on a Hewlett Packard 755. © Inra/Elsevier, Paris

restricted maximum likelihood / variance component estimation / missing data / sparse inverse / analytical gradients

Résumé - Restricted maximum likelihood estimation of covariances in sparse linear systems. This paper discusses the restricted maximum likelihood (REML) approach for the estimation of covariance matrices in linear models, as applied by the VCE software in animal genetics. The main features are: 1) the representation of the equations in an augmented form that simplifies the computations; 2) the reparametrization of the variance-covariance matrices by means of their Cholesky factors, which ensures their positive definiteness; 3) explicit formulas for the gradients of the REML function in the case of large, sparse systems of equations with a large number of unknown covariance components and possibly missing data, using sparse inverses to obtain the gradients economically; 4) the use of model equations that dispense with the separate formation of the inverse of the relationship matrix. Large-scale genetic problems have been solved with the new version, among them an example with more than 250 000 normal equations and 55 covariance components, requiring 41 h of CPU time on a Hewlett Packard 755. © Inra/Elsevier, Paris

restricted maximum likelihood / variance component estimation / missing data / sparse inverse / analytical gradient

1. INTRODUCTION
Best linear unbiased prediction of genetic merit [25] requires the covariance structure of the model elements involved. In practical situations, these are usually unknown and must be estimated.
During recent years restricted maximum likelihood (REML) [22, 42] has emerged as the method of choice in animal breeding for variance component estimation [15-17, 34-36]. Initially, the expectation maximization (EM) algorithm [6] was used for the optimization of the REML objective function [26, 47]. In 1987 Graser et al. [14] introduced derivative-free optimization, which in the following years led to the development of rather general computing algorithms and packages [15, 28, 29, 34] that were mostly based on the simplex algorithm of Nelder and Mead [40]. Kovac [29] made modifications that turned it into a stable algorithm that no longer converged to noncritical points, but this did not improve its inherent inefficiency for increasing dimensions. Ducos et al. [7] used for the first time the more efficient quasi-Newton procedure, approximating gradients by finite differences. While this procedure was faster than the simplex algorithm, it was also less robust for higher-dimensional problems because the covariance matrix could become indefinite, often leading to false convergence. Thus, for lack of robustness and/or excessive computing time, often only subsets of the covariance matrices could be estimated simultaneously.
A comparison of different packages [45] confirmed the general observation of Gill [13] that simplex-based optimization algorithms suffer from lack of stability, sometimes converging to noncritical points while breaking down completely at more than three traits. On the other hand, the quasi-Newton procedure with optimization on the Cholesky factor as implemented in a general purpose VCE package [18] was stable and much faster than any of the other general purpose algorithms. While this led to a speed-up of between two for small problems and (for some examples) 200 times for larger ones as compared to the simplex procedure, approximating gradients on the basis of finite differences was still exceedingly costly for higher-dimensional problems [17].

It is well known that optimization algorithms generally perform better with analytic gradients if the latter are cheaper to compute than finite difference approximations.
In this paper we derive, in the context of a general statistical model, cheap analytical gradients for problems with a large number p of unknown covariance components, using sparse matrix techniques. With hardly any additional storage requirements, the cost of a combined function and gradient evaluation is only a small multiple (independent of p) of the cost of a function evaluation alone, a substantial advantage over finite difference gradients.
Misztal and Perez-Enciso [39] investigated the use of sparse matrix techniques in the context of an EM algorithm, which is known to have much worse convergence properties as compared to quasi-Newton (see also Thompson et al. [48] for an improvement in its space complexity), using an LDL^T factorization and the Takahashi inverse [9]; no results in a REML application were given. Recent papers by Wolfinger et al. [50] (based again on the W transformation) and Meyer [36] (based on the simpler REML objective formulation of Graser et al. [14]) also provide gradients (and even Hessians), but there a gradient computation needs a factor of O(p) more work and space than in our approach, where the complete gradient is found with hardly any additional space and with (depending on the implementation) two to four times the work for a function evaluation.
Meyer [37] used her analytic second derivatives in a Newton-Raphson algorithm for optimization. Because the optimization was not restricted to positive definite covariance matrix approximations (as it is in our algorithm), she found the algorithm to be markedly less robust than the (already not very robust) simplex algorithm, even for univariate models.
We test the usefulness of our new formulas by integrating them into the VCE covariance component estimation package for animal (and plant) breeding models [17]. Here the gradient routine is combined with a quasi-Newton optimization method and with a parametrization of the covariance parameters by the Cholesky factor that ensures definiteness of the covariance matrix. In the past, this combination was most reliable and had the best convergence properties of all techniques used in this context [45]. Meanwhile, VCE is being used widely in animal and even plant breeding.
In the past, the largest animal breeding problem ever solved ([21], using a quasi-Newton procedure with optimization on the Cholesky factor) comprised 233 796 linear unknowns and 55 covariance components and required 48 days of CPU time on a 100 MHz HP 9000/755 workstation. Clearly, speeding up the algorithm is of paramount importance. In our preliminary implementation of the new method (not yet optimized for speed), we successfully solved this (and an even larger problem of more than 257 000 unknowns) in only 41 h of CPU time, a speed-up factor of nearly 28 with respect to the finite difference approach.
The new VCE implementation is available free of charge from the ftp site ftp://192.108.34.1/pub/vce3.2/. It has been applied successfully throughout the world to hundreds of animal breeding problems, with comparable performance advantages [1-3, 19, 21, 38, 46, 49].
In section 2 we fix notation for linear stochastic models and mixed model equations, define the REML objective function, and review closed formulas for its gradient and Hessian. In sections 3 and 4 we discuss a general setting for practical large-scale modeling, and derive an efficient way for the calculation of REML function values and gradients for large and sparse linear stochastic models. All our results are completely general, not restricted to animal breeding. However, for the formulas used in our implementation, it is assumed that the covariance matrices to be estimated are block diagonal, with no restrictions on the (distinct) diagonal blocks. The final section 5 applies the method to a simple demonstration case and several large animal breeding problems.
2. LINEAR STOCHASTIC MODELS AND RESTRICTED LOGLIKELIHOOD
Many applications (including those to animal breeding) are based on the generalized linear stochastic model

    y = X β + Z u + η,   cov(u) = G,   cov(η) = D,                    (1)

with fixed effects β, random effects u and noise η. Here cov(u) denotes the covariance matrix of a random vector u with zero mean. Usually, G and D are block diagonal, with many identical blocks. By combining the two noise terms, the model is seen to be equivalent to the simple model y = Xβ + η', where η' is a random vector with zero mean and (mixed model) covariance matrix V = ZGZ^T + D. Usually, V is huge and no longer block diagonal, leading to hardly manageable normal equations involving the inverse of V. However, Henderson [24] showed that the normal equations are equivalent to the mixed model equations

    [ X^T D^{-1} X    X^T D^{-1} Z          ] [ β̂ ]   [ X^T D^{-1} y ]
    [ Z^T D^{-1} X    Z^T D^{-1} Z + G^{-1} ] [ û  ] = [ Z^T D^{-1} y ].      (2)

This formulation avoids the inverse of the mixed model covariance matrix V and is the basis of most modern methods for obtaining estimates of u and β in equation (1). Fellner [10] observed that Henderson's mixed model equations are the normal equations of an augmented model of the simple form

    A x = b + e,   cov(e) = C,                                        (3)

where

    A = [ X  Z ],   x = [ β ],   b = [ y ],   C = [ D  0 ].
        [ 0  I ]        [ u ]        [ 0 ]        [ 0  G ]
Thus, without loss in generality, we may base our algorithms on the simple model (3), with a covariance matrix C that is typically block diagonal. This automatically produces the formulas that previously had to be derived in a less transparent way by means of the W transformation; cf. [5, 11, 23, 50].
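To make Fellner's observation concrete, here is a minimal numerical sketch (toy dimensions and names of our own choosing, dense matrices for readability) checking that the normal equations of the augmented model (3) coincide with Henderson's mixed model equations (2):

```python
import numpy as np

rng = np.random.default_rng(2)
X, Z = rng.normal(size=(10, 2)), rng.normal(size=(10, 3))
y = rng.normal(size=10)
D, G = np.eye(10), 0.5 * np.eye(3)          # noise and random effect covariances

# Henderson's mixed model equations (2)
Di, Gi = np.linalg.inv(D), np.linalg.inv(G)
MME = np.block([[X.T @ Di @ X, X.T @ Di @ Z],
                [Z.T @ Di @ X, Z.T @ Di @ Z + Gi]])
rhs = np.concatenate([X.T @ Di @ y, Z.T @ Di @ y])

# Augmented model (3): A x = b + e with block diagonal C = diag(D, G)
A = np.block([[X, Z], [np.zeros((3, 2)), np.eye(3)]])
b = np.concatenate([y, np.zeros(3)])
C = np.block([[D, np.zeros((10, 3))], [np.zeros((3, 10)), G]])
Ci = np.linalg.inv(C)

print(np.allclose(A.T @ Ci @ A, MME))       # True: same coefficient matrix
print(np.allclose(A.T @ Ci @ b, rhs))       # True: same right-hand side
```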
The 'normal equations' for the model (3) have the form

    B x̂ = a,   where B = A^T C^{-1} A,  a = A^T C^{-1} b.            (4)

Here A^T denotes the transposed matrix of A. By solving the normal equations (4), we obtain the best linear unbiased estimate (BLUE) and, for the predictive variables, the best linear unbiased prediction (BLUP) for the vector x, and the noise e = Ax - b is estimated by the residual

    ê = A x̂ - b.                                                     (5)

If the covariance matrix C = C(w) contains unknown parameters w (which we shall call 'dispersion parameters'), these can be estimated by minimizing the 'restricted loglikelihood'

    f(w) = log det C + log det B + ê^T C^{-1} ê,                      (6)

quoted in the following as the 'REML objective function', as a function of the parameters w. (Note that all quantities on the right-hand side of equation (6) depend on C and hence on w.)
More precisely, equation (6) is the logarithm of the restricted likelihood, scaled by a factor of -2 and shifted by a constant depending only on the problem dimension. Under the assumption of Gaussian noise, the restricted likelihood can be derived from the ordinary likelihood restricted to a maximal subspace of independent error contrasts (cf. Harville [22]; our formula (6) is the special case of his formula when there are no random effects). Under the same assumption, another derivation as a limiting form of a parametrized maximum likelihood estimate was given by Laird [31].
When applied to the generalized linear stochastic model (1) in the augmented formulation discussed above, the REML objective function (6) takes the computationally most useful form given by Graser et al. [14].
The following proposition contains formulas for computing derivatives of the REML function. We write Ċ = ∂C/∂w_k for the derivative with respect to a parameter w_k occurring in the covariance matrix.

Proposition [22, 32, 42, 50]. Let

    P = C^{-1} - C^{-1} A B^{-1} A^T C^{-1},

where A and B are as previously defined. Then

    ∂f/∂w_k = tr(P Ċ) - (C^{-1} ê)^T Ċ (C^{-1} ê).

(Note that, since A is nonsquare, the matrix P is generally nonzero although it always satisfies PA = 0.)
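To illustrate the Proposition, the following dense sketch (a toy example of our own; a real implementation exploits sparsity as in section 4) evaluates the REML objective (6) and checks the gradient formula against a finite difference:

```python
import numpy as np

def reml(A, b, C):
    """REML objective f = log det C + log det B + e^T C^{-1} e, cf. (6)."""
    Ci = np.linalg.inv(C)
    B = A.T @ Ci @ A                          # normal equation matrix (4)
    xhat = np.linalg.solve(B, A.T @ Ci @ b)
    e = A @ xhat - b                          # residual (5)
    f = np.linalg.slogdet(C)[1] + np.linalg.slogdet(B)[1] + e @ Ci @ e
    return f, e, Ci, B

def reml_gradient(A, b, C, Cdot):
    """Proposition: df/dw = tr(P Cdot) - (Ci e)^T Cdot (Ci e)."""
    _, e, Ci, B = reml(A, b, C)
    P = Ci - Ci @ A @ np.linalg.solve(B, A.T @ Ci)   # note that P A = 0
    return np.trace(P @ Cdot) - (Ci @ e) @ Cdot @ (Ci @ e)

rng = np.random.default_rng(0)
A, b = rng.normal(size=(8, 3)), rng.normal(size=8)
w, h = 1.3, 1e-6                 # one dispersion parameter: C(w) = w * I
g = reml_gradient(A, b, w * np.eye(8), np.eye(8))
fd = (reml(A, b, (w + h) * np.eye(8))[0]
      - reml(A, b, (w - h) * np.eye(8))[0]) / (2 * h)
print(g, fd)                     # the two values agree to rounding level
```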
3. FULL AND INCOMPLETE ELEMENT FORMULATION
For the practical modeling of linear stochastic systems, it is useful to split model (3) into blocks of uncorrelated model equations which we call 'element equations'. The element equations usually fall into several types, distinguished by their covariance matrices. The model equation for an element ν of type γ has the form

    A_ν x = b_ν + e_ν,   cov(e_ν) = C_γ.                             (13)

Here A_ν is the coefficient matrix of the block of equations for element number ν. Generally, A_ν is very sparse with few rows and many columns, most of them zero, since only a small subset of the variables occurs explicitly in the νth element. Each model equation has only one noise term; correlated noise must be put into one element. All elements of the same type are assumed to have statistically independent noise vectors, realizations of (not necessarily Gaussian) distributions with zero mean and the same covariance matrix. (In our implementation, there are no constraints on the parametrization of the C_γ, but it is not difficult to modify the formulas to handle more restricted cases.) Thus the various elements are assigned to the types according to the covariance matrices of their noise vectors.

3.1. Example: animal breeding applications
In covariance component estimation problems from animal breeding, the vector x splits into small vectors β_k of (in our present implementation constant) size ntrait, called 'effects'. The right-hand side b contains measured data vectors y_ν and zeros. Each index ν corresponds to some animal. The various types of elements are as follows.
Measurement elements: the measurement vectors y_ν ∈ ℝ^ntrait are explained in terms of a linear combination of effects β_i ∈ ℝ^ntrait,

    y_ν = Σ_l μ_νl β_ω(ν,l) + e_ν.

Here the ω(ν,l) form an nrec × neff index matrix and the μ_νl form an nrec × neff coefficient matrix; the index and coefficient entries of the measurements are concatenated so that a single matrix containing the floating point numbers results. If the set of traits splits into groups that are measured on different sets of animals, the measurement elements split accordingly into several types.
Pedigree elements: for some animals, identified by the index T of their additive genetic effect β_T, we may know the parents, with corresponding indices V (father) and M (mother). Their genetic dependence is modeled by an equation

    β_T = ½ (β_V + β_M) + e_ν.

The indices are stored in pedigree records which contain a column of animal indices T(ν) and two further columns for their parents (V(ν), M(ν)).
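As a small illustration (a hypothetical helper of our own naming; in VCE the same information is stored in the address and coefficient matrices described in section 5), the sparse row of a pedigree element can be generated as follows:

```python
def pedigree_element(animal, father=None, mother=None):
    """Addresses and coefficients of the pedigree equation
    beta_T - (beta_V + beta_M)/2 = e_nu; one row per trait in the
    actual package, a single generic row here for brevity."""
    idx, coef = [animal], [1.0]
    for parent in (father, mother):
        if parent is not None:          # one entry less per unknown parent
            idx.append(parent)
            coef.append(-0.5)
    return idx, coef

print(pedigree_element(7, father=1, mother=4))   # ([7, 1, 4], [1.0, -0.5, -0.5])
```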
Random effect elements: certain effects β_R(γ) (γ = 3, 4, ...) are considered as random effects by including trivial model equations

    β_R(γ) = 0 + e_ν.

As part of the model (13), these trivial elements automatically produce the traditional mixed model equations, as explained in section 2.

We now return to the general situation. For elements numbered by ν = 1, ..., N, the full matrix formulation of the model (13) is the model (3) with

    A = (A_1; ...; A_N),   b = (b_1; ...; b_N),   C = Diag(C_γ(1), ..., C_γ(N)),   (14)

where semicolons denote stacking of rows and γ(ν) denotes the type of element ν.

A practical algorithm must be able to account for the situation that some components of b_ν are missing. We allow for incomplete data vectors b by simply deleting from the full model the rows of A and b for which the data in b are missing. This is appropriate whenever the data are missing at random [43]; note that this assumption is also used in the missing data handling by the EM approach [6, 27].
Since dropping rows changes the affected element covariance matrices and their Cholesky factors in a nontrivial way, the derivation of the formulas for incomplete data must be performed carefully in order to obtain correct gradient information. We therefore formalize the incomplete element formulation by introducing projection matrices P_ν coding for the missing data pattern [31]. If we define P_ν as the (0,1) matrix with exactly one 1 per row (one row for each component present in b_ν) and at most one 1 per column (one column for each component of b_ν), then P_ν A_ν is the matrix obtained from A_ν by deleting the rows for which data are missing, and P_ν b_ν is the vector obtained from b_ν by deleting the rows for which data are missing. Multiplication by P_ν^T on the right of a matrix removes the columns corresponding to missing components. Conversely, multiplication by P_ν^T on the left or P_ν on the right restores missing rows or columns, respectively, by filling them with zeros.

This leads to the incomplete element equations

    P_ν A_ν x = P_ν b_ν + ẽ_ν,   cov(ẽ_ν) = C̃_ν = P_ν C_γ(ν) P_ν^T.        (15)

The incomplete element equations can be combined to full matrix form (3), with

    A = (P_1 A_1; ...; P_N A_N),   b = (P_1 b_1; ...; P_N b_N),             (16)

and the inverse covariance matrix takes the form

    C^{-1} = Diag(C̃_1^{-1}, ..., C̃_N^{-1}),                                (17)

where the element contributions are expressed through the matrices

    M_ν = P_ν^T C̃_ν^{-1} P_ν.

Note that C̃_ν, M_ν, and log det C̃_ν (a byproduct of the inversion via a Cholesky factorization, needed for the gradient calculation) depend only on the type γ(ν) and the missing data pattern P_ν, and can be computed in advance, before the calculation of the restricted loglikelihood begins.
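A minimal sketch of this precomputation (boolean indexing standing in for the projection matrices P_ν; names are our own):

```python
import numpy as np

def element_terms(C_gamma, mask):
    """Return M_nu = P^T (P C P^T)^{-1} P and log det(P C P^T) for a
    missing data pattern given as a boolean mask of observed components."""
    Ct = C_gamma[np.ix_(mask, mask)]          # P C P^T: observed rows/columns
    L = np.linalg.cholesky(Ct)                # Cholesky factor of the block
    logdet = 2.0 * np.log(np.diag(L)).sum()   # log det as a byproduct
    M = np.zeros_like(C_gamma)                # scatter the inverse back
    M[np.ix_(mask, mask)] = np.linalg.inv(Ct)
    return M, logdet

C1 = np.array([[2.0, 0.5],                    # a 2 x 2 residual covariance
               [0.5, 1.0]])
mask = np.array([True, False])                # second trait missing
M, ld = element_terms(C1, mask)
```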
4. THE REML FUNCTION AND ITS GRADIENT IN ELEMENT FORM
From the explicit representations (16) and (17), we obtain the following formulas for the coefficients of the normal equations:

    B = Σ_ν A_ν^T M_ν A_ν,   a = Σ_ν A_ν^T M_ν b_ν.

After assembling the contributions of all elements into these sums, the coefficient matrix is factored into a product of triangular matrices,

    B = R^T R   (R upper triangular),

using sparse matrix routines [8, 20]. Prior to the factorization, the matrix is reordered by the multiple minimum degree algorithm in order to reduce the amount of fill-in; this is done once, before the first function evaluation, together with a symbolic factorization to allocate storage.
Without loss of generality, and for the sake of simplicity in the presentation, we may assume that the variables are already in the correct ordering; our programs perform this ordering automatically, using the multiple minimum degree ordering 'genmmd' as used in 'Sparsepak' [4]. Note that R is the transposed Cholesky factor of B. (Alternatively, one can obtain R from a sparse QR factorization of A; see e.g. Matstoms [33].)
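The element-wise assembly of B and a can be sketched as follows (hypothetical toy data structures of our own; VCE itself works from the address and coefficient matrices, cf. section 5):

```python
import numpy as np
import scipy.sparse as sp

def assemble_normal_equations(elements, n):
    """elements: iterable of (idx, A_nu, M_nu, b_nu), where idx lists the
    global columns touched by the element (assumed free of duplicates)."""
    rows, cols, vals = [], [], []
    a = np.zeros(n)
    for idx, Anu, Mnu, bnu in elements:
        Bnu = Anu.T @ Mnu @ Anu               # dense block A_nu^T M_nu A_nu
        a[idx] += Anu.T @ (Mnu @ bnu)         # contribution A_nu^T M_nu b_nu
        for p, gp in enumerate(idx):          # scatter the block as triplets
            for q, gq in enumerate(idx):
                rows.append(gp); cols.append(gq); vals.append(Bnu[p, q])
    B = sp.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsc()
    return B, a                               # duplicates are summed by tocsc()
```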
To take care of dependent (or nearly dependent) linear equations in the model formulation, we replace in the factorization small pivots R_ii^2 < εB_ii by 1. (The choice ε = (macheps)^{2/3}, where macheps is the machine accuracy, proved to be suitable. The exponent is less than 1 to allow for some accumulation of roundoff errors, but still guarantees 2/3 of the maximal accuracy.)
To justify this replacement, note that in the case of consistent equations, an exact linear dependence results in a zero diagonal entry in the corresponding factorization step. In the presence of rounding errors (or in case of near dependence) we obtain entries of order εB_ii in place of the diagonal zero. (This even holds when B_ii is small but nonzero, since the usual bounds on the rounding errors scale naturally when the matrix is scaled symmetrically, and we may choose the scaling such that nonzero diagonal entries receive the value one. Zero diagonal elements in a positive semidefinite matrix occur for zero rows only, and remain zero in the elimination process.)
If we add B_ii to R_ii when R_ii < εB_ii, and set R_ii = 1 when B_ii = 0, the near dependence is correctly resolved in the sense that the extreme sensitivity or arbitrariness in the solution is removed by forcing a small entry into the ith entry of the solution vector, thus avoiding the introduction of large components in null space directions. (It is useful to issue diagnostic warnings giving the column indices i where such near dependence occurred.)
The determinant of B is available as a byproduct of the factorization. The above modifications to cope with near linear dependence are equivalent to adding prior information on the distribution of the parameters with those indices where pivots changed. Hence, provided that the set of indices where pivots are modified does not change with the iteration, they produce a correct behavior for the restricted loglikelihood. If this set of indices changes, the problem is ill-posed and would have to be treated differently; in practice, we have never seen a failure of the algorithm because of the possible discontinuity in the objective function caused by our procedure for handling (near) dependence.
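An illustrative dense sketch of the guarded factorization (our own code, not the paper's; the exact pivot test in VCE may differ in detail):

```python
import numpy as np

def guarded_cholesky(B, eps=np.finfo(float).eps ** (2.0 / 3.0)):
    """Factor B = R^T R, forcing a unit pivot where (near) dependence
    makes the pivot candidate vanish relative to eps * B_ii."""
    n = B.shape[0]
    R = np.zeros_like(B)
    modified = []                              # worth a diagnostic warning
    for i in range(n):
        d = B[i, i] - R[:i, i] @ R[:i, i]      # pivot candidate R_ii^2
        if B[i, i] == 0.0 or d <= eps * abs(B[i, i]):
            R[i, i] = 1.0                      # harmless unit pivot; row left 0
            modified.append(i)
        else:
            R[i, i] = np.sqrt(d)
            R[i, i + 1:] = (B[i, i + 1:] - R[:i, i] @ R[:i, i + 1:]) / R[i, i]
    return R, modified
```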
Once we have the factorization, we can solve the normal equations R^T R x = a for the vector x cheaply by solving the two triangular systems

    R^T y = a,   R x̂ = y.

(In the case of an orthogonal factorization one has instead to solve R x̂ = y, where y = Q^T b.) From the best estimate x̂ for the vector x, we may calculate the residual as ê = A x̂ - b, with the element residuals ê_ν = A_ν x̂ - b_ν. Then we obtain the objective function as

    f(w) = Σ_ν log det C̃_ν + log det B + Σ_ν ê_ν^T M_ν ê_ν.

Although the formula for the gradient involves the dense matrix B^{-1}, the gradient calculation can be performed using only the components of B^{-1} within the sparsity pattern of R^T + R. This part of B^{-1} is called the 'sparse inverse' of B and can be computed cheaply; cf. Appendix 1. The use of the sparse inverse for the calculation of the gradient is discussed in Appendix 2.
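In SciPy notation the two triangular solves read as follows (toy R of our own; VCE uses its own sparse solver):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve_triangular

R = sp.csr_matrix(np.array([[2.0, 1.0, 0.0],
                            [0.0, 1.5, 0.5],
                            [0.0, 0.0, 1.0]]))   # B = R^T R
a = np.array([1.0, 2.0, 3.0])
y = spsolve_triangular(R.T.tocsr(), a, lower=True)   # R^T y = a
xhat = spsolve_triangular(R, y, lower=False)         # R xhat = y
```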
The resulting algorithm for the calculation of a REML function value and its gradient is given in table I, in a form that makes good use of dense matrix algebra in the case of larger covariance matrix blocks C_γ. The symbol ⊕ denotes adding a dense subvector (or submatrix) to the corresponding entries of a large vector (or matrix). In the calculation of the symmetric matrices B', W, M' and K', it suffices to calculate the upper triangle. Symbolic factorization and matrix reordering are not present in table I since these are performed only once, before the first function evaluation.

In large-scale applications, the bulk of the work is in the computation of the Cholesky factorization and the sparse inverse. Using the sparse inverse, the work for a combined function and gradient calculation is about three times the work for a function evaluation alone (where the sparse inverse is not needed). In particular, when the number p of estimated covariance components is large, the analytic gradient takes only a small fraction (about 2/p) of the time needed for finite difference approximations. Note also that for a combined function and gradient evaluation, only two sweeps through the data are needed, an important asset when the amount of data is too large to be kept in main memory.

5. ANIMAL BREEDING APPLICATIONS
In this section we give a small numerical example to demonstrate the setup of the various matrices, and give less detailed results on two large problems. Many other animal breeding problems have been solved, with similar advantages for the new algorithm as in the examples given below [1-3, 19, 38, 49].
5.1. Small numerical example

Table II gives the data used for a numerical example. There are in all eight animals, which are listed with their parent codes in the first block under 'pedigree'. The first five of them have measurements, i.e. dependent variables listed under 'dep var'. Each animal has two traits measured, except for animal 2, for which the second measurement is missing. Structural information for independent variables is listed under 'indep var'. The first column in this block denotes a continuous independent variable, such as weight, for which a regression is to be fitted. The following columns are some fixed effect, such as sex, a random component, such as herd, and the animal identification. Not all effects were fitted for both traits. In fact, weight was only fitted for the first trait, as shown by the model matrix in table III.

The input data are translated into a series of matrices given in table IV. To improve numerical stability, dependent variables are scaled by their standard deviation and mean, while the continuous independent variable is shifted by its mean.

Since there is only one random effect (apart from the correlated animal effect), the full element formulation (13) has three types of model equations, each with an independent covariance structure C_γ.
Measurement elements (type γ = 1): the dependent variables give rise to type γ = 1, as listed in the second column in table IV. The second entry is special in that it denotes the residual covariance matrix for this record with a missing observation. To take care of this, a new mtype is created for each pattern of missing values (with mtype = type if no value is missing) [20]; i.e. the different values of mtype correspond to the different matrices C̃_ν. However, each is still based on C_1 as given in table V, which lists all types in this example.
Pedigree elements (type γ = 2): the next nine rows in table IV are generated from the pedigree information. With both parents known, three entries are generated in both the address and coefficient matrices. With only one parent known, two addresses and coefficients are needed, while only one entry is required if no parent information is available. For all entries the type is γ = 2, with the covariance matrix C_2.
Random effect elements (type γ = 3): the last four rows in table IV are the entries due to random effects, which comprise three herd levels in this example. They have type γ = 3 with the covariance matrix C_3. All covariance matrices are 2 × 2, so that p = 3 + 3 + 3 = 9 dispersion parameters need to be estimated.
The addresses in the following columns in table IV are derived directly from the level codes in the data (table II), allocating one equation for each trait within each effect. For covariables only one equation is created, leading to the address of 0 for all five measurements.
The coefficients corresponding to the above addresses are stored in another matrix as given in table IV. The entries are 1 for class effects, and the (mean-shifted) continuous variable in the case of a regression.
The address matrices and coefficient matrices in table IV form a sparse representation of the matrix A of equation (3) and can thus be used directly to set up the normal equations. Note that only one pass through the model equations is required to handle data, random effects and pedigree information. Also, we would like to point out that this algorithm does not require a separate treatment of the numerator relationship matrix. Indeed, the historic problem of obtaining its inverse is completely avoided with this approach.
As an example of how to set up the normal equations, we look at line 12 of table IV (because it does not generate as many entries as the first five lines). For the animal labelled T in table IV, the variables associated with the two traits have indices T + 1 and T + 2. The contributions generated from line 12 are added to the corresponding entries of B and a.
Starting values for all C_γ for the scaled data were chosen as 3 for all variances and 0.0001 for all covariances, amounting to a point in the middle of the parameter space. With C_γ specified as above, its inverse is obtained directly. Optimization was performed with a BFGS algorithm as implemented by Gay [12].
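A hedged sketch of this optimization setup (SciPy's BFGS standing in for Gay's Algorithm 611 [12], and a one-parameter toy model C = l² I standing in for the Cholesky-parametrized covariance blocks):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 4))
b = A @ rng.normal(size=4) + 0.5 * rng.normal(size=20)

def f_and_g(w):
    l = w[0]                        # Cholesky-style parameter: C = (l I)(l I)^T
    C = l * l * np.eye(20)
    Ci = np.linalg.inv(C)
    B = A.T @ Ci @ A
    x = np.linalg.solve(B, A.T @ Ci @ b)
    e = A @ x - b
    f = np.linalg.slogdet(C)[1] + np.linalg.slogdet(B)[1] + e @ Ci @ e
    P = Ci - Ci @ A @ np.linalg.solve(B, A.T @ Ci)
    g = (np.trace(P) - (Ci @ e) @ (Ci @ e)) * 2.0 * l   # chain rule, dC/dl = 2l I
    return f, np.array([g])

res = minimize(f_and_g, x0=np.array([1.0]), jac=True, method="BFGS")
print(res.x[0] ** 2)                # estimated residual variance
```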
For the first function evaluation we obtain the gradient given in table VI, with a function value of 17.0053530. Convergence was reached after 51 iterations, with the solutions given in table VII at a loglikelihood of 15.47599750.

5.2. A large problem
A large problem from the area of pig breeding has been used to test an implementation of the above algorithm in the VCE package [17]. The data set comprised 26 756 measurement records with six traits. Table IX gives the number of levels for each effect, leading to 233 796 normal equations. The columns headed by 'trait' represent the model matrix (cf. table III), mapping the effects on the traits. As can be seen, the statistical model is different for the various traits.
Because traits 1 through 4 and traits 5 and 6 are measured on different animals, no residual covariances between these two groups can be estimated, resulting in two types 1a and 1b, with 4 × 4 and 2 × 2 covariance matrices C_1a and C_1b. Together with the 6 × 6 covariance matrices C_2 and C_3 for pedigree effect 9 and random effect 8, respectively, a total of 55 covariance components results.
We compared the finite difference implementation of VCE [17] with an analytic gradient implementation based on the techniques of the present paper. An unconstrained minimization algorithm written by Schnabel et al. [44], which approximates the first derivatives by finite differences, was used to estimate all 55 components simultaneously. The run performed 37 021 function evaluations at 111.6 s each on a Hewlett Packard 755 model, amounting to a total CPU time of 47.8 days. To our knowledge, it was the first estimate of more than 50 covariance components simultaneously for such a large data set with a completely general model. Factorization was performed by a block sparse Cholesky algorithm due to Ng and Peyton [41].
Using analytic gradients, convergence was reached after 185 iterations taking 13 min each; the less efficient factorization from Misztal and Perez-Enciso [39] was used here because of the availability of their sparse inverse code. An even slightly better solution was reached, and only 41 h of CPU time were used, amounting to a measured speed-up factor of nearly 28. However, this speed-up underestimates the superiority of analytical gradients, because the factorization used in Misztal and Perez-Enciso's code is less efficient than Ng and Peyton's block sparse Cholesky factorization used for approximating the gradients by finite differences. Therefore, the following comparison will be based on CPU time measurements made on Misztal and Perez-Enciso's factorization code.
For the above data set the CPU usage of the current implementation - which has not yet been tuned for speed (so the sparse inverse takes three to four times the time for the numerical factorization) - is given in table X. As can be seen from this table, computing one approximated gradient by finite differencing takes around 202.6 × 55 = 11 143 s, while one analytical gradient costs only around four times the set-up and solving of the normal equations, i.e. 812 s. Thus, the expected speed-up would be around 14. The 37 021 function evaluations required in the run with approximated gradients (which include some linear searches) would have taken 86.8 days with the Misztal and Perez-Enciso code. Thus, the resultant superiority of our new algorithm is nearly 51 for the model under consideration. This is much larger than the expected speed-up of 14, mainly because, with approximated gradients, 673 optimization steps were performed as compared to the 185 with analytical gradients. Such a high number of iterations with approximated gradients could be observed in many runs with higher numbers of dispersion variables, and can be attributed to the limited accuracy of finite difference gradients. In some cases the optimization process even aborted when using approximated gradients, whereas analytical gradients yielded correct solutions.

5.3. Further evidence
Table XI presents data on a number of different runs that have been performed with our new algorithm. The statistical models used in the data sets vary substantially and cover a large range of problems in animal breeding. The new algorithm showed the same behaviour on a plant breeding data set (beans), which has a quite different structure as compared to the animal data sets. The data sets (details can be obtained from the second author) cover a whole range of problem sizes, both in terms of linear and covariance components. Accordingly, the number of nonzero elements varies substantially, from a few ten thousands up to many millions. Clearly, the number of iterations increases with the number of dispersion variables, with a maximum well below 200. Some of the runs estimated covariance matrices with very high correlations, well above 0.9. Although this is close to the border of the parameter space, it did not seem to slow down convergence.

For the above data sets the ratio of the cost of obtaining the gradient (after, and relative to, the factorization) was between 1.51 and 3.69, substantiating our initial claim that the analytical gradient can be obtained at a small multiple of the CPU time needed to calculate the function value alone. (For the large animal breeding problem described in table X, this ratio was 2.96.) So far, we have not experienced any ratios above the value of 4. From this we can conclude that, with increasing numbers of dispersion variables, our algorithm is inherently superior to approximating gradients by finite differences.

In conclusion, the new version of VCE not only computes analytical gradients much faster than the finite difference approximations (with the superiority increasing with the number of covariance components), but also reduces the number of iterations by a factor of around three, thereby expanding the scope of REML covariance component estimation in animal breeding models considerably. No previous code was able to solve problems of the size that can be handled with this implementation.
ACKNOWLEDGEMENTS

Support by the H. Wilhelm Schaumann Foundation is gratefully acknowledged.
REFERENCES

[1] Brade W., Groeneveld E., Bestimmung genetischer Populationsparameter für die Einsatzleistung von Milchkühen, Arch. Anim. Breeding 2 (1995) 149-154.
[2] Brade W., Groeneveld E., Einfluß des Produktionsniveaus auf genetische Populationsparameter der Milchleistung sowie auf Zuchtwertschätzergebnisse, Arch. Anim. Breeding 38 (1995) 289-298.
[3] Brade W., Groeneveld E., Bedeutung der speziellen Kombinationseignung in der Milchrinderzüchtung, Züchtungskunde 68 (1996) 12-19.
[4] Chu E., George A., Liu J., Ng E., SPARSEPAK: Waterloo sparse matrix package user's guide for SPARSEPAK-A, Technical Report CS-84-36, Department of Computer Science, University of Waterloo, Ontario, Canada, 1984.
[5] Corbeil R., Searle S., Restricted maximum likelihood (REML) estimation of variance components in the mixed model, Technometrics 18 (1976) 31-38.
[6] Dempster A., Laird N., Rubin D., Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Statist. Soc. B 39 (1977) 1-38.
[7] Ducos A., Bidanel J., Ducrocq V., Boichard D., Groeneveld E., Multivariate restricted maximum likelihood estimation of genetic parameters for growth, carcass and meat quality traits in French Large White and French Landrace pigs, Genet. Sel. Evol. 25 (1993) 475-493.
[8] Duff I., Erisman A., Reid J., Direct Methods for Sparse Matrices, Oxford Univ. Press, Oxford, 1986.
[9] Erisman A., Tinney W., On computing certain elements of the inverse of a sparse matrix, Comm. ACM 18 (1975) 177-179.
[11] Fraley C., Burns P., Large-scale estimation of variance and covariance components, SIAM J. Sci. Comput. 16 (1995) 192-209.
[12] Gay D., Algorithm 611 - subroutines for unconstrained minimization using a model/trust-region approach, ACM Trans. Math. Software 9 (1983) 503-524.
[13] Gill J., Biases in balanced experiments with uncontrolled random factors, J. Anim. Breed. Genet. (1991) 69-79.
[14] Graser H., Smith S., Tier B., A derivative-free approach for estimating variance components in animal models by restricted maximum likelihood, J. Anim. Sci. 64 (1987) 1362-1370.
[15] Groeneveld E., Simultaneous REML estimation of 60 covariance components in an animal model with missing values using a Downhill Simplex algorithm, in: 42nd Annual Meeting of the European Association for Animal Production, Berlin, 1991, vol. 1, pp. 108-109.
[16] Groeneveld E., Performance of direct sparse matrix solvers in derivative free REML covariance component estimation, J. Anim. Sci. 70 (1992) 145.
[17] Groeneveld E., REML VCE - a multivariate multimodel restricted maximum likelihood (co)variance component estimation package, in: Proceedings of an EC Symposium on Application of Mixed Linear Models in the Prediction of Genetic Merit in Pigs, Mariensee, 1994.
[18] Groeneveld E., A reparameterization to improve numerical optimization in multivariate REML (co)variance component estimation, Genet. Sel. Evol. 26 (1994) 537-545.
[19] Groeneveld E., Brade W., Rechentechnische Aspekte der multivariaten REML Kovarianzkomponentenschätzung, dargestellt an einem Anwendungsbeispiel aus der Rinderzüchtung, Arch. Anim. Breeding 39 (1996) 81-87.
[20] Groeneveld E., Kovac M., A generalized computing procedure for setting up and solving mixed linear models, J. Dairy Sci. 73 (1990) 513-531.
[21] Groeneveld E., Csato L., Farkas J., Radnoczi L., Joint genetic evaluation of field and station test in the Hungarian Large White and Landrace populations, Arch. Anim. Breeding 39 (1996) 513-531.
[22] Harville D., Maximum likelihood approaches to variance component estimation and to related problems, J. Am. Statist. Assoc. 72 (1977) 320-340.
[23] Hemmerle W., Hartley H., Computing maximum likelihood estimates for the mixed A.O.V. model using the W transformation, Technometrics 15 (1973) 819-831.
[24] Henderson C., Estimation of genetic parameters, Ann. Math. Stat. 21 (1950) 706.
[25] Henderson C., Applications of Linear Models in Animal Breeding, University of Guelph, 1984.
[26] Henderson C., Estimation of variances and covariances under multiple trait models, J. Dairy Sci. 67 (1984) 1581-1589.
[27] Jennrich R., Schluchter M., Unbalanced repeated-measures models with structured covariance matrices, Biometrics 42 (1986) 805-820.
[28] Jensen J., Madsen P., A User's Guide to DMU, National Institute of Animal Science, Research Center Foulum, Box 39, 8830 Tjele, Denmark, 1993.
[29] Kovac M., Derivative free methods in covariance component estimation, Ph.D. thesis, University of Illinois, Urbana-Champaign.
[30] Laird N., Computation of variance components using the EM algorithm, J. Statist. Comput. Simul. 14 (1982) 295-303.
[31] Laird N., Lange N., Stram D., Maximum likelihood computations with repeated measures: application of the EM algorithm, J. Am. Statist. Assoc. 82 (1987) 97-105.
[33] Matstoms P., Sparse QR factorization in MATLAB, ACM Trans. Math. Software 20 (1994) 136-159.
[34] Meyer K., DFREML - a set of programs to estimate variance components under an individual animal model, J. Dairy Sci. 71 (suppl. 2) (1988) 33-34.
[35] Meyer K., Restricted maximum likelihood to estimate variance components for animal models with several random effects using a derivative-free algorithm, Genet. Sel. Evol. 21 (1989) 317-340.
[36] Meyer K., Estimating variances and covariances for multivariate animal models by restricted maximum likelihood, Genet. Sel. Evol. 23 (1991) 67-83.
[37] Meyer K., Derivative-intense restricted maximum likelihood estimation of covariance components for animal models, in: 5th World Congress on Genetics Applied to Livestock Production, University of Guelph, 7-12 August 1994, vol. 18, 1994, pp. 365-369.
[38] Mielenz N., Groeneveld E., Müller J., Spilke J., Simultaneous estimation of covariances with REML and Henderson 3 in a selected chicken population, Br. Poult. Sci. 35 (1994).
[39] Misztal I., Perez-Enciso M., Sparse matrix inversion for restricted maximum likelihood estimation of variance components by expectation-maximization, J. Dairy Sci. (1993) 1479-1483.
[40] Nelder J., Mead R., A simplex method for function minimization, Comput. J. 7 (1965) 308-313.
[41] Ng E., Peyton B., Block sparse Cholesky algorithms on advanced uniprocessor computers, SIAM J. Sci. Comput. 14 (1993) 1034-1056.
[42] Patterson H., Thompson R., Recovery of inter-block information when block sizes are unequal, Biometrika 58 (1971) 545-554.
[43] Rubin D., Inference and missing data, Biometrika 63 (1976) 581-592.
[44] Schnabel R., Koontz J., Weiss B., A modular system of algorithms for unconstrained minimization, Technical Report CU-CS-240-82, Comp. Sci. Dept., University of Colorado, Boulder, 1982.
[45] Spilke J., Groeneveld E., Comparison of four multivariate REML (co)variance component estimation packages, in: 5th World Congress on Genetics Applied to Livestock Production, University of Guelph, 7-12 August 1994, vol. 22, 1994, pp. 11-14.
[46] Spilke J., Groeneveld E., Mielenz N., A Monte-Carlo study of (co)variance component estimation (REML) for traits with different design matrices, Arch. Anim. Breed. 39 (1996) 645-652.
[47] Tholen E., Untersuchungen von Ursachen und Auswirkungen heterogener Varianzen der Indexmerkmale in der Deutschen Schweineherdbuchzucht, Schriftenreihe Landbauforschung Völkenrode, Sonderheft 111 (1990).
[48] Thompson R., Wray N., Crump R., Calculation of prediction error variances using sparse matrix methods, J. Anim. Breed. Genet. 111 (1994) 102-109.
[49] Tixier-Boichard M., Boichard D., Groeneveld E., Bordas A., Restricted maximum likelihood estimates of genetic parameters of adult male and female Rhode Island Red chickens divergently selected for residual feed consumption, Poultry Sci. 74 (1995) 1245-1252.

APPENDIX 1: Computing the sparse inverse
A cheap way to compute the sparse inverse is based on the relation

    R B̄ = R^{-T}                                                     (A1)

for the inverse B̄ = B^{-1} (a consequence of B = R^T R). By comparing coefficients in the upper triangle of this equation, noting that (R^{-1})_ii = (R_ii)^{-1}, we find that

    R_ii B̄_ik + Σ_{j>i} R_ij B̄_jk = δ_ik / R_ii   for i ≤ k,

where δ_ik denotes the Kronecker symbol; hence

    B̄_ik = ( δ_ik / R_ii − Σ_{j>i} R_ij B̄_jk ) / R_ii.              (A2)

To compute B̄_ik from this formula, we need to know the B̄_jk for all j > i with R_ij ≠ 0. Since the factorization process produces a sparsity structure with the property

    R_ij ≠ 0 and R_ik ≠ 0 with i < j ≤ k   ⟹   R_jk ≠ 0

(ignoring accidental zeros from cancellation, which are treated as explicit zeros), one can compute the components of the inverse B̄ within the sparsity pattern of R^T + R by equation (A2) without calculating any of its entries outside this sparsity pattern.

If equation (A2) is used in the ordering i = n, n − 1, ..., 1, the only additional space needed is that for a copy of the R_ij ≠ 0 (j > i), which must be saved before we compute the B̄_ik (R_ik ≠ 0, k > i) and overwrite them over the R_ik. (A similar analysis is performed for the Takahashi inverse by Erisman and Tinney [9], based on an LDL^T factorization.) Thus the number of additional storage locations needed is only the maximal number of nonzeros in a row of R. The cost is a small multiple of the cost for factoring B, excluding the symbolic factorization; the proof of this by Misztal and Perez-Enciso [39] for the sparse inverse of an LDL^T factorization applies almost without change.
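A dense demonstration of recurrence (A2) (small toy matrix of our own; a sparse code loops only over the nonzeros of R and overwrites them in place, as described above):

```python
import numpy as np

def inverse_by_A2(R):
    """Compute Bbar = B^{-1} (B = R^T R) via (A2), in the order i = n, ..., 1."""
    n = R.shape[0]
    Z = np.zeros((n, n))
    for i in range(n - 1, -1, -1):
        for k in range(n - 1, i - 1, -1):      # upper triangle, k >= i
            s = R[i, i + 1:] @ Z[i + 1:, k]    # sum over j > i with R_ij != 0
            Z[i, k] = ((1.0 if i == k else 0.0) / R[i, i] - s) / R[i, i]
            Z[k, i] = Z[i, k]                  # symmetry of the inverse
    return Z

B = np.array([[4.0, 2.0, 0.0],
              [2.0, 5.0, 1.0],
              [0.0, 1.0, 3.0]])
R = np.linalg.cholesky(B).T                    # upper triangular, B = R^T R
print(np.allclose(inverse_by_A2(R), np.linalg.inv(B)))   # True for dense R
```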
APPENDIX 2: Derivation of the algorithm in table I

For the derivative with respect to a variable that occurs in C_γ only, equation (15) implies that Ċ is block diagonal, with nonzero blocks only for the elements ν of type γ. (The computation of the derivative Ċ_γ is addressed below.) Using the notation [...]_ν for the νth element contribution, the Proposition then yields a formula for ∂f/∂w_k as a sum of element contributions; collecting terms, the sum can be evaluated with the symmetric matrices appearing in table I.
Up to this point, the dependence of the covariance matrix C_γ on the parameters was arbitrary. For an implementation, one needs to decide on the independent parameters in which to express the covariance matrices. We made the following choice in our implementation, assuming that there are no constraints on the parametrization of the C_γ; other choices can be handled similarly, with a similar cost resulting for the gradient. Our parameters are, for each type γ, the nonzero entries of the Cholesky factor L_γ of C_γ, defined by the equation

    C_γ = L_γ L_γ^T,   L_γ lower triangular,

together with the conditions

    (L_γ)_ii > 0,

since this automatically guarantees positive definiteness. (In the limiting case where a block of the true covariance matrix is semidefinite only, this will be revealed in the minimization procedure by converging to a singular L_γ, while each computed L_γ is still nonsingular.)
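A minimal sketch of this parametrization (function names are our own): the parameter vector w holds the lower triangle of L_γ row by row, and both C_γ and its derivative with respect to a single parameter follow directly:

```python
import numpy as np

def chol_to_cov(w, n):
    """Map the n(n+1)/2 parameters to C = L L^T; the diagonal of L is
    kept positive by the optimizer (not enforced here)."""
    L = np.zeros((n, n))
    L[np.tril_indices(n)] = w
    return L @ L.T

def cov_derivative(w, n, k):
    """dC/dw_k = E_k L^T + L E_k^T, where E_k marks the k-th triangle slot."""
    L = np.zeros((n, n)); L[np.tril_indices(n)] = w
    E = np.zeros((n, n)); E[np.tril_indices(n)] = np.eye(len(w))[k]
    return E @ L.T + L @ E.T
```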
We now consider derivatives with respect to the parameter, where γ is one of the