Original article

Restricted maximum likelihood estimation for animal models using derivatives of the likelihood

K Meyer, SP Smith

Animal Genetics and Breeding Unit, University of New England, Armidale, NSW 2351, Australia

(Received 21 March 1995; accepted 9 October 1995)
Summary - Restricted maximum likelihood estimation using first and second derivatives of the likelihood is described. It relies on the calculation of derivatives without the need for large matrix inversion, using an automatic differentiation procedure. In essence, this is an extension of the Cholesky factorisation of a matrix. A reparameterisation is used to transform the constrained optimisation problem imposed in estimating covariance components to an unconstrained problem, thus making the use of Newton-Raphson and related algorithms feasible. A numerical example is given to illustrate calculations. Several modified Newton-Raphson and method of scoring algorithms are compared for applications to analyses of beef cattle data, and contrasted to a derivative-free algorithm.
restricted maximum likelihood / derivative / algorithm / variance component estimation

Résumé - Restricted maximum likelihood estimation for animal models by differentiation of the likelihood. This article describes a method of restricted maximum likelihood estimation using the first and second derivatives of the likelihood. The method is based on an automatic differentiation procedure which does not require the inversion of large matrices. It is, in essence, an extension of the Cholesky decomposition applied to a matrix. A reparameterisation is used which transforms the constrained optimisation problem raised by the estimation of variance components into an unconstrained problem, making the use of Newton-Raphson and related algorithms feasible. The calculations are illustrated with a numerical example. Several algorithms, of the Newton-Raphson type or following the method of scoring, are applied to the analysis of beef cattle data; these algorithms are compared with each other and with a derivative-free algorithm.
INTRODUCTION
Maximum likelihood estimation of (co)variance components generally requires the numerical solution of a constrained nonlinear optimisation problem (Harville, 1977). Procedures to locate the minimum or maximum of a function are classified according to the amount of information from derivatives of the function utilised; see, for instance, Gill et al (1981). Methods using both first and second derivatives are fastest to converge, often showing quadratic convergence, while search algorithms not relying on derivatives are generally slow, ie, require many iterations and function evaluations.
Early applications of restricted maximum likelihood (REML) estimation to animal breeding data used a Fisher's method of scoring type algorithm, following the original paper by Patterson and Thompson (1971) and Thompson (1973). This requires expected values of the second derivatives of the likelihood to be evaluated, which proved computationally highly demanding for all but the simplest analyses. Hence expectation-maximization (EM) type algorithms gained popularity and found widespread use for analyses fitting a sire model. Effectively, these use first derivatives of the likelihood function. Except for special cases, however, they required the inverse of a matrix of size equal to the number of random effects fitted, eg, number of sires times number of traits, which severely limited the size of analyses feasible.
For analyses under the animal model, Graser et al (1987) thus proposed a derivative-free algorithm. This only requires factorising the coefficient matrix of the mixed-model equations rather than inverting it, and can be implemented efficiently using sparse matrix techniques. Moreover, it is readily extendable to animal models including additional random effects and to multivariate analyses (Meyer, 1989, 1991).
Multi-trait animal model analyses fitting additional random effects using a derivative-free algorithm have been shown to be feasible. However, they are computationally highly demanding, the number of likelihood evaluations required increasing exponentially with the number of (co)variance components to be estimated simultaneously. Groeneveld et al (1991), for instance, reported that 56 000 evaluations were required to reach a change in likelihood smaller than 10^-7 when estimating 60 covariance components for five traits. While judicious choice of starting values and search strategies (eg, temporary maximisation with respect to a subset of the parameters only), together with exploitation of special features of the data structure, might reduce demands markedly for individual analyses, it remains true that derivative-free maximisation in high dimensions is very slow to converge.
This makes a case for REML algorithms using derivatives of the likelihood for multivariate, multidimensional animal model analyses. Misztal (1994) recently presented a comparison of rates of convergence of derivative-free and derivative algorithms, concluding that the latter had the potential to be faster in almost all cases, and in particular that their convergence rate depended little on the number of traits considered. Large-scale animal model applications using an EM type algorithm (Misztal, 1990) or even a method of scoring algorithm (Ducrocq, 1993) have been reported, obtaining the large matrix inverse (or its trace) required using sparse matrix techniques. This paper describes REML estimation for animal models using first and second derivatives of the likelihood function, computed without inverting large matrices.

DERIVATIVES OF THE LIKELIHOOD
Consider the linear mixed model

[1]  y = Xb + Zu + e

where y, b, u and e denote the vectors of observations, fixed effects, random effects and residual errors, respectively, and X and Z are the incidence matrices pertaining to b and u. Let V(u) = G, V(e) = R and Cov(u, e') = 0, so that V(y) = V = ZGZ' + R. Assuming a multivariate normal distribution, ie, y ~ N(Xb, V), the log of the REML likelihood (log L) is (eg, Harville, 1977)

[2]  log L = -1/2 [const + log |V| + log |X*'V^-1 X*| + y'Py]

where X* denotes a full-rank submatrix of X.
REML algorithms using derivatives have generally been derived by differentiating [2]. However, as outlined previously (Graser et al, 1987; Meyer, 1989), log L can be rewritten as

[3]  log L = -1/2 [const + log |R| + log |G| + log |C| + y'Py]

where C is the coefficient matrix in the mixed-model equations (MME) pertaining to [1] (or a full-rank submatrix thereof), and P is the matrix

P = V^-1 - V^-1 X* (X*'V^-1 X*)^-1 X*'V^-1
Alternative forms of the derivatives of the likelihood can then be obtained by differentiating [3] instead of [2]. Let θ denote the vector of parameters to be estimated, with elements θ_i, i = 1, ..., p. The first and second partial derivatives of the log likelihood are then

[4]  ∂ log L/∂θ_i = -1/2 [∂ log |R|/∂θ_i + ∂ log |G|/∂θ_i + ∂ log |C|/∂θ_i + ∂(y'Py)/∂θ_i]

[5]  ∂² log L/∂θ_i∂θ_j = -1/2 [∂² log |R|/∂θ_i∂θ_j + ∂² log |G|/∂θ_i∂θ_j + ∂² log |C|/∂θ_i∂θ_j + ∂²(y'Py)/∂θ_i∂θ_j]

Graser et al (1987) show how the last two terms in [3], log |C| and y'Py, can be evaluated in a general way for all models of form [1] by carrying out a series of Gaussian elimination steps on the coefficient matrix in the MME augmented by the vector of right-hand sides and a quadratic in the data vector. Depending on the model of analysis and the structure of G and R, the other two terms required in [3], log |R| and log |G|, can usually be evaluated exploiting that structure (Meyer, 1989, 1991), generally requiring only matrix operations proportional to the number of traits considered. Derivatives of these four terms can be evaluated analogously.
Calculating log |C| and y'Py and their derivatives

The mixed-model matrix (MMM) or augmented coefficient matrix pertaining to [1] is

[6]  M = [ C  r ; r'  y'R^-1 y ]

where r is the vector of right-hand sides in the MME. Using general matrix results, the derivatives of log |C| are

[7]  ∂ log |C|/∂θ_i = tr(C^-1 ∂C/∂θ_i)

[8]  ∂² log |C|/∂θ_i∂θ_j = tr(C^-1 ∂²C/∂θ_i∂θ_j) - tr(C^-1 (∂C/∂θ_i) C^-1 (∂C/∂θ_j))

Partitioned matrix results give log |M| = log |C| + log(y'Py), ie,

[9]  y'Py = |M| / |C|

(Smith, 1995). This gives corresponding derivatives [10] and [11] for y'Py. Obviously, these expressions ([7], [8], [10] and [11]), involving the inverses of the large matrices M and C, are computationally intractable for any sizable animal model analysis.
However, the Gaussian elimination procedure with diagonal pivoting advocated by Graser et al (1987) is only one of several ways to 'factor' a matrix. An alternative is a Cholesky decomposition. This lends itself readily to the solution of large positive definite systems of linear equations using sparse matrix storage schemes. Appropriate Fortran routines are given, for instance, by George and Liu (1981), and have been used successfully in derivative-free REML applications. Let L, with elements l_ij (l_ij = 0 for j > i), denote the Cholesky factor of M, ie, M = LL'. The determinant of a triangular matrix is simply the product of its diagonal elements. Hence, with M also denoting the size of M,

[13]  log |C| = 2 Σ_{i=1}^{M-1} log l_ii

and, from [9],

[14]  y'Py = l_MM^2
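As a minimal numerical sketch of [13] and [14] (not the authors' implementation), the following builds a small, dense mixed-model matrix for an invented data set and checks the two identities against the 'large inverse' expressions the text seeks to avoid; a dense factorisation stands in for the sparse routines discussed here.

import numpy as np

rng = np.random.default_rng(1)
n, nf, nr = 8, 2, 4                              # records, fixed effects, random effect levels
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = np.kron(np.eye(nr), np.ones((n // nr, 1)))   # two records per random effect level
y = rng.normal(size=n)
G = 0.4 * np.eye(nr)                             # V(u), made up
R = 0.6 * np.eye(n)                              # V(e), made up

Rinv, Ginv = np.linalg.inv(R), np.linalg.inv(G)
W = np.hstack([X, Z])
C = W.T @ Rinv @ W
C[nf:, nf:] += Ginv                              # MME coefficient matrix
r = W.T @ Rinv @ y                               # right-hand sides

# augmented mixed-model matrix M = [C r; r' y'R^-1 y], as in [6]
M = np.block([[C, r[:, None]], [r[None, :], np.array([[y @ Rinv @ y]])]])

L = np.linalg.cholesky(M)                        # M = L L'
d = np.diag(L)
logC = 2.0 * np.sum(np.log(d[:-1]))              # [13]: log |C|
yPy = d[-1] ** 2                                 # [14]: y'Py

assert np.isclose(logC, np.linalg.slogdet(C)[1])
assert np.isclose(yPy, y @ Rinv @ y - r @ np.linalg.solve(C, r))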
Smith (1995) describes algorithms, outlined below, which allow the derivatives of the Cholesky factor of a matrix to be evaluated while carrying out the factorisation, provided the derivatives of the original matrix are specified. Differentiating [13] and [14] then gives the derivatives of log |C| and y'Py as simple functions of the diagonal elements of the Cholesky factor and its derivatives.

Calculating log |R| and its derivatives
analysis
for q traits and let y be orderedaccording
to traits within animals.Assuming
that error covariances between measurements ondifferent animals are zero, R is
blockdiagonal
foranimals,
where is N the number of animals which have
records,
and E
+
denotes the direct matrix sum(Searle,
1982).
Hencelog
IRI
as well as its derivatives can be determinedby
considering
one animal at a time.combinations of traits recorded
(assuming
single
records pertrait),
eg, W = 3 forq = 2 with combinations trait 1
only,
trait 2only
and both traits. For animal i which has combination of traits w,R
i
isequal
toE
w
,
the submatrix of E obtainedby deleting
rows and columnspertaining
tomissing
records. As outlinedby Meyer
(1991),
thisgives
where
N
w
represents
the number of animalshaving
records for combination of traitsw.
Correspondingly,
Consider the case where the parameters to be estimated are the (co)variance components due to random effects and residual errors (rather than, for example, heritabilities and correlations), so that V is linear in θ, ie, V = Σ_{i=1}^{p} θ_i ∂V/∂θ_i. Defining D_i = ∂E_w/∂θ_i, with elements d_kl = 1 if the klth element of E_w is equal to θ_i and d_kl = 0 otherwise, this then gives

[23]  ∂ log |R|/∂θ_i = Σ_{w=1}^{W} N_w tr(E_w^-1 D_i)

[24]  ∂² log |R|/∂θ_i∂θ_j = -Σ_{w=1}^{W} N_w tr(E_w^-1 D_i E_w^-1 D_j)

Let e^rs denote the rsth element of E_w^-1. For θ_i = e_kl and θ_j = e_mn, [23] and [24] then simplify to [25] and [26], simple functions of the elements of the E_w^-1. For q = 1 and R = σ²_E I, [25] and [26] become N σ_E^-2 and -N σ_E^-4, respectively (for θ_i = θ_j = σ²_E). Extensions for models with repeated records are straightforward.
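As an illustration only (not the paper's code), a short sketch of [16], [23] and [24] for a hypothetical two-trait example; the residual covariance matrix E and the counts N_w are invented for the purpose.

import numpy as np

E = np.array([[1.0, 0.3],
              [0.3, 2.0]])                     # residual covariances, q = 2 (made up)
combos = {1: [0], 2: [1], 3: [0, 1]}           # trait combinations w
N_w = {1: 40, 2: 25, 3: 100}                   # animals per combination (made up)

def D(k, l, idx):
    """Derivative of E_w with respect to e_kl, restricted to the traits in idx."""
    d = np.zeros((2, 2))
    d[k, l] = d[l, k] = 1.0
    return d[np.ix_(idx, idx)]

# log |R| = sum_w N_w log |E_w|                                              [16]
logR = sum(N_w[w] * np.linalg.slogdet(E[np.ix_(i, i)])[1] for w, i in combos.items())

# first and second derivatives with respect to e_11 and (e_11, e_12)        [23], [24]
d1 = d2 = 0.0
for w, idx in combos.items():
    Ew_inv = np.linalg.inv(E[np.ix_(idx, idx)])
    Di, Dj = D(0, 0, idx), D(0, 1, idx)
    d1 += N_w[w] * np.trace(Ew_inv @ Di)                 # d log|R| / d e_11
    d2 -= N_w[w] * np.trace(Ew_inv @ Di @ Ew_inv @ Dj)   # d2 log|R| / d e_11 d e_12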
Hence, once the inverses of the matrices of residual covariances for all combinations of numbers of traits recorded occurring in the data have been obtained (of maximum size equal to the maximum number of traits recorded per animal, and also required to set up the MMM), evaluation of log |R| and its derivatives requires only scalar manipulations in addition.

Calculating log |G| and its derivatives
and its derivativesTerms
arising
from the covariance matrix of randomeffects, G,
can often bedetermined in a similar way,
exploiting
the structure of G. Thisdepends
on therandom effects fitted.
Meyer
(1989, 1991)
describeslog
IGI
for various cases.Define T with elements
t
ij
of size rq x rq as the matrix of covariances betweenrandom effects where r is the number of random factors in the model
(excluding
e).
Forillustration,
let u consist of a vector of animalgenetic
effects a and some uncorrelated additional randomeffect(s)
c withN
c
levels pertrait, ie,
u’ =(a’c’).
In the
simplest
case, a consists of the direct additivegenetic
effects for each animaland
trait, ie,
it haslength
qN
A
whereN
A
denotes the total number of animals in theanalysis, including
parents
without records. In other cases, amight
include asecond
genetic
effect for each animal andtrait,
such as a maternal additivegenetic
effect,
which may be correlated to the directgenetic
effects. Anexample
for c is a common environmental effect such as a litter effect.With a and c
uncorrelated,
T can bepartitioned
intocorresponding diagonal
blocks
T
A
andT
C
,
so thatwhere A is the numerator
relationship
betweenanimals, F,
often assumed to be theidentity
matrix,
describes the correlation structureamongst
the levels of c, andx denotes the direct matrix
product
(Searle, 1982).
Thisgives
(Meyer,
1991)
Noting that all ∂²T/∂θ_i∂θ_j = 0 (for V linear in θ), derivatives are

[29]  ∂ log |G|/∂θ_i = N_A tr(T_A^-1 D_A) + N_C tr(T_C^-1 D_C)

[30]  ∂² log |G|/∂θ_i∂θ_j = -N_A tr(T_A^-1 D_A T_A^-1 D*_A) - N_C tr(T_C^-1 D_C T_C^-1 D*_C)

where D_A = ∂T_A/∂θ_i and D_C = ∂T_C/∂θ_i are again matrices with elements 1 if t_kl = θ_i and zero otherwise (and D*_A, D*_C the corresponding matrices for θ_j). As above, all second derivatives for θ_i and θ_j not pertaining to the same random factor (eg, c) or to two correlated factors (such as direct and maternal genetic effects) are zero. Furthermore, all derivatives of log |G| with respect to residual covariance components are zero.
Further simplifications analogous to [25] and [26] can be derived. For instance, for a model fitting animals as the only random effect (r = 1), T is the matrix of additive genetic covariances a_ij with i, j = 1, ..., q. For θ_i = a_kl and θ_j = a_mn, this gives [31] and [32], with a^rs denoting the rsth element of T^-1. For q = 1 and a_11 = σ²_A, [31] and [32] reduce to N_A σ_A^-2 and -N_A σ_A^-4, respectively.
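A small self-contained check of [29] and [30] against numerical differentiation of N_A log |T| (illustration only; the matrix T and the count N_A below are invented):

import numpy as np

N_A = 500
T = np.array([[1.0, 0.4],
              [0.4, 0.8]])                 # additive genetic covariance matrix (made up)

def D(k, l):
    d = np.zeros_like(T)
    d[k, l] = d[l, k] = 1.0                # dT / d a_kl
    return d

Tinv = np.linalg.inv(T)
d1 = N_A * np.trace(Tinv @ D(0, 1))                      # [29] for theta_i = a_12
d2 = -N_A * np.trace(Tinv @ D(0, 1) @ Tinv @ D(0, 0))    # [30] for a_12, a_11

# finite-difference check of the first derivative
eps = 1e-6
num = N_A * (np.linalg.slogdet(T + eps * D(0, 1))[1] - np.linalg.slogdet(T)[1]) / eps
assert abs(d1 - num) < 1e-2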
Derivatives of the mixed model matrix
As emphasised above, calculation of the derivatives of the Cholesky factor of M requires the corresponding derivatives of M to be evaluated. Fortunately, these have the same structure as M and can be evaluated while setting up M, replacing G and R by their derivatives. For θ_i and θ_j equal to residual (co)variances, the derivatives of M are of the form

[33]  [X Z y]' Q_R [X Z y]

with Q_R standing in turn for

[34]  ∂R^-1/∂θ_i = -R^-1 (∂R/∂θ_i) R^-1

and

[35]  ∂²R^-1/∂θ_i∂θ_j

for first and second derivatives, respectively.
As outlined above, R is blockdiagonal for animals, with submatrices E_w. Hence, the matrices Q_R have the same blockdiagonal structure, with submatrices

[36]  ∂E_w^-1/∂θ_i = -E_w^-1 D_i E_w^-1

and (for V linear in θ, so that ∂²R/∂θ_i∂θ_j = 0)

[37]  ∂²E_w^-1/∂θ_i∂θ_j = E_w^-1 D_i E_w^-1 D_j E_w^-1 + E_w^-1 D_j E_w^-1 D_i E_w^-1

Consequently, the derivatives of M with respect to the residual (co)variances can be set up in the same way as the 'data part' of M. In addition to calculating the matrices E_w^-1 for the W combinations of records per animal occurring in the data, all derivatives of the E_w^-1 for residual components need to be evaluated. The extra calculations required, however, are trivial, requiring matrix operations proportional to the maximum number of records per animal only to obtain the terms in [36] and [37].
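A minimal sketch of [36] and [37] for one block E_w, with a finite-difference check of the first derivative; the matrices are invented for illustration.

import numpy as np

Ew = np.array([[1.0, 0.3],
               [0.3, 2.0]])                       # residual covariances for one combination
Di = np.zeros((2, 2)); Di[0, 1] = Di[1, 0] = 1.0  # dE_w / d e_12
Dj = np.zeros((2, 2)); Dj[1, 1] = 1.0             # dE_w / d e_22

Einv = np.linalg.inv(Ew)
Q1 = -Einv @ Di @ Einv                                             # [36]
Q2 = Einv @ Di @ Einv @ Dj @ Einv + Einv @ Dj @ Einv @ Di @ Einv   # [37]

eps = 1e-6
num = (np.linalg.inv(Ew + eps * Di) - Einv) / eps
assert np.allclose(Q1, num, atol=1e-3)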
Analogously, for θ_i and θ_j equal to elements of T, only the part of M containing G^-1 contributes, and the derivatives of M are of the form [38], with Q_G standing for

[39]  ∂G^-1/∂θ_i = -G^-1 (∂G/∂θ_i) G^-1

for first derivatives, and

[40]  ∂²G^-1/∂θ_i∂θ_j = G^-1 (∂G/∂θ_i) G^-1 (∂G/∂θ_j) G^-1 + G^-1 (∂G/∂θ_j) G^-1 (∂G/∂θ_i) G^-1

for second derivatives. As above, further simplifications are possible depending on the structure of G. For instance, for G as in [27] and ∂²G/∂θ_i∂θ_j = 0, the Q_G reduce to direct products of A^-1 or F^-1 with matrices involving T_A^-1 or T_C^-1 and the corresponding derivative matrices D_A or D_C.
Expected values of second derivatives of log L

Differentiating [2] gives second derivatives of log L, with expected values (Harville, 1977) given by [45]. Again, for V linear in θ, ∂²V/∂θ_i∂θ_j = 0. From [5], and noting that ∂P/∂θ_i = -P (∂V/∂θ_i) P, it follows that the expected values of the second derivatives are essentially (sign ignored) equal to the observed values minus the contribution from the data, and thus can be evaluated analogously. With second derivatives of y'Py not required, computational requirements are reduced somewhat, as only the first M - 1 rows of ∂²M/∂θ_i∂θ_j need to be evaluated and factored.
AUTOMATIC DIFFERENTIATION
Calculation of the derivatives of the likelihood as described above relies on the fact that the derivatives of the Cholesky factor of a matrix can be obtained 'automatically', provided the derivatives of the original matrix can be specified.
Smith (1995) describes a so-called forward differentiation, which is a straightforward expansion of the recursions employed in the Cholesky factorisation of a matrix M. Operations to determine the latter are typically carried out sequentially by rows. Let L, of size N, be initialised to M. First, the pivot (diagonal element, which must be greater than an operational zero) is selected for the current row k. Secondly, the off-diagonal elements for the row ('lead column') are adjusted (L_jk for j = k + 1, ..., N), and thirdly the elements in the remaining part of L (L_ij for j = k + 1, ..., N and i = j, ..., N) are modified ('row operations'). After all N rows have been processed, L contains the Cholesky factor of M.
Pseudo-code given by Smith (1995) for the calculation of the Cholesky factor and its first and second derivatives is summarised in table I. It can be seen that the operations to evaluate a second derivative require the respective elements of the two corresponding first derivatives. This imposes severe constraints on the memory requirements of the algorithm. While it is most efficient to evaluate the Cholesky factor and all its derivatives together, considerable space can be saved by computing the second derivatives one at a time. This can be done by holding all the first derivatives in memory or, if core space is the limiting factor, storing first derivatives on disk (after evaluating them individually as well) and reading in only the two required. Hence, the minimum memory requirement for REML using first and second derivatives is 4 × L, compared to L for a derivative-free algorithm. Smith (1995) stated that, using forward differentiation, each first derivative required not more than twice the work required to evaluate log L only, and that the work needed to determine a second derivative would be at most four times that needed to calculate log L.
In addition, Smith (1995) described a 'backward differentiation' scheme, so named because it reverses the order of steps in the forward differentiation. It is applicable for cases where we want to evaluate a scalar function of L, f(L), in our case log |C| + log y'Py, which is a function of the diagonal elements of L (see [13] and [14]). It requires computing a (lower triangular) matrix W which, on completion of the backward differentiation, contains the derivatives of f(L) with respect to the elements of M. First derivatives of f(L) can then be evaluated one at a time as tr(W ∂M/∂θ_r).
The pseudo-code given by Smith (1995) for the backward differentiation is shown in table II. Calculation of W requires about twice as much work as one likelihood evaluation and, once W is evaluated, calculating individual derivatives (step 3 in table II) requires only somewhat more work than calculation of one derivative by forward differentiation. Smith (1995) also described the calculation of second derivatives by backward differentiation (pseudo-code not shown here). Amongst other calculations, this involves one evaluation of a matrix W as described above for each parameter, and requires another work array of size L in addition to space to store at least one matrix of derivatives of M. Hence the minimum memory requirement for this algorithm is 3 × L + M (M and L differing by the fill-in created during the factorisation). Smith (1995) claimed that the total work required to evaluate all second derivatives for p parameters was no more than 6p times that for a likelihood evaluation.

MAXIMISING THE LIKELIHOOD
Methods to locate the maximum of the likelihood function in the context of variance component estimation are reviewed, for instance, by Harville (1977) and Searle et al (1992; Chapter 8). Most utilise the gradient vector, ie, the vector of first derivatives of the likelihood function, to determine the direction of search.

Using second derivatives

One of the oldest and most widely used methods to optimise a non-linear function is the Newton-Raphson (NR) algorithm. It requires the Hessian matrix of the function, ie, the matrix of second partial derivatives of the (log) likelihood, with iterates

[46]  θ^(t+1) = θ^t - (H^t)^-1 g^t

where H^t = {∂² log L/∂θ_i∂θ_j} and g^t = {∂ log L/∂θ_i} are the Hessian matrix and gradient vector of log L, respectively, both evaluated at θ = θ^t. While the NR algorithm can be quick to converge, in particular for functions resembling a quadratic function, it is known to be sensitive to poor starting values (Powell, 1970). Unlike other algorithms, it is not guaranteed to converge, though global convergence has been shown for some cases using iterative partial maximisation (Jensen et al, 1991).
In practice, so-called extended or modified NR algorithms have been found to be more successful. Jennrich and Sampson (1976) suggested step halving, applied successively until the likelihood is found to increase, to avoid 'overshooting'. More generally, the change in estimates for the tth iterate in [46] is given by

[47]  Δθ^t = B^t g^t

For the extended NR, B^t = -τ^t (H^t)^-1, where τ^t is a step-size scaling factor. The optimum for τ^t can be determined readily as the value which results in the largest increase in likelihood, using a one-dimensional maximisation technique (Powell, 1970). This relies on the direction of search given by H^-1 g generally being a 'good' direction and on the fact that, for -H positive definite, there is always a step size which will increase the likelihood. Alternatively, the use of

[48]  B^t = -(H^t - κ I)^-1

has been suggested (Marquardt, 1963) to improve the performance of the NR algorithm. This results in a step intermediate between a NR step (κ = 0) and a method of steepest ascent step (κ large). Again, κ can be chosen to maximise the increase in log L, though for large values of κ the step size is small, so that there is no need to include a search step in the iteration (Powell, 1970).
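A schematic sketch of one modified NR update in the sense of [46]-[48], with simple step halving as the one-dimensional search; loglik, gradient and hessian are hypothetical placeholders for the REML quantities discussed above, not functions defined in the paper.

import numpy as np

def modified_nr_step(theta, loglik, gradient, hessian, kappa=0.0, max_halvings=10):
    """One Newton-Raphson step with optional Marquardt shift and step halving."""
    g = gradient(theta)
    H = hessian(theta)
    B = -np.linalg.inv(H - kappa * np.eye(len(theta)))   # [48]; kappa = 0 gives plain NR
    direction = B @ g                                     # [47] with tau = 1
    f0, tau = loglik(theta), 1.0
    for _ in range(max_halvings):
        candidate = theta + tau * direction
        if loglik(candidate) > f0:                        # accept the first improving step
            return candidate
        tau *= 0.5                                        # step halving (Jennrich and Sampson, 1976)
    return theta                                          # no improvement found

On the reparameterised scale discussed below, such a step can be taken without checking parameter-space boundaries.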
Often expected values of the second derivatives of log L are easier to calculate than the observed values. Replacing -H by the information matrix results in Fisher's method of scoring (MSC). It can be extended or modified in the same way as the NR algorithm (Harville, 1977). Jennrich and Sampson (1976) and Jennrich and Schluchter (1986) compared NR and MSC, showing that the MSC was generally more robust against a poor choice of starting values than the NR, though it tended to require more iterations. They thus recommended a scheme using the MSC initially and switching to NR after a few rounds of iteration, when the increase in log L between steps was less than one.

Using first derivatives only
Other methods, so-called variable-metric or Quasi-Newton procedures, essentially use the same strategies, but replace B by an approximation of the Hessian matrix.
An interesting variation has recently been presented by Johnson and Thompson (1995). Noting that the trace terms in the observed and expected information are of opposite sign, so that the two differ only by a term involving the second derivatives of y'Py (see [5] and [45]), they suggested using the average of the observed and expected information (AI) to approximate the Hessian matrix. Since it requires only the second derivatives of y'Py to be evaluated, each iterate is computationally considerably less demanding than a 'full' NR or MSC step, and the same modifications as described above for the NR (see [47] and [48]) can be applied. Initial comparisons of the rate of convergence and computer time required showed the AI algorithm to be highly advantageous over both derivative-free and EM-type procedures (Johnson and Thompson, 1995).
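In the notation above, with V_i = ∂V/∂θ_i and V linear in θ, the observed information, the expected information and their average can be written as follows; this is a standard result consistent with Johnson and Thompson (1995), given here for reference only.

\[
-\mathrm{E}\!\left[\frac{\partial^2 \log L}{\partial\theta_i\,\partial\theta_j}\right]
= \tfrac{1}{2}\,\mathrm{tr}\!\left(P V_i P V_j\right),
\qquad
-\frac{\partial^2 \log L}{\partial\theta_i\,\partial\theta_j}
= -\tfrac{1}{2}\,\mathrm{tr}\!\left(P V_i P V_j\right) + \mathbf{y}' P V_i P V_j P \mathbf{y},
\]
\[
\text{so that the average information is}\quad
\mathcal{I}^{AI}_{ij} = \tfrac{1}{2}\,\mathbf{y}' P V_i P V_j P \mathbf{y},
\]

which involves only the data-dependent (y'Py) part.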
Constraining parameters

All the Newton-type algorithms described above perform an unconstrained optimisation. In estimating (co)variance components, however, we require variances to be non-negative, correlations to be in the range from -1 to 1 and, for more than two traits, for them to be consistent with each other; more generally, estimated covariance matrices need to be (semi-)positive definite. As shown by Hill and Thompson (1978), the probability of obtaining parameter estimates out of bounds depends on the magnitude of the correlations (population values) and increases rapidly with the number of traits considered, in particular for genetic covariance matrices.
For univariate analyses, a common approach has been to set negative estimates of variance components to zero and continue iterating. Partial sweeping of B to handle boundary constraints, monitoring and restraining the size of pivots relative to the corresponding (original) diagonal elements, has been recommended (Jennrich and Sampson, 1976; Jennrich and Schluchter, 1986).
Harville (1977; section 6.3) distinguished between three types of techniques to modify unconstrained optimisation procedures to accommodate constraints on the parameter space. Firstly, penalty techniques operate on a function which is close to log L except at the boundaries, where it assumes large negative values which effectively serve as a barrier, deflecting a further search in that direction. Secondly, gradient projection techniques are suitable for linear inequality constraints. Thirdly, it may be feasible to transform the parameters to be estimated so that maximisation on the new scale is unconstrained. Box (1965) demonstrated for several examples that the computational effort to solve constrained problems can be reduced markedly by eliminating constraints, so that one of the more powerful unconstrained methods, preferably with quadratic convergence, can be employed. For univariate analyses, an obvious way to eliminate non-negativity constraints is to maximise log L with respect to standard deviations and to square these on convergence (Harville, 1977). Alternatively, we could estimate logarithmic values of variance components instead of the variances. This seems preferable to taking square roots where, on backtransforming, a largish negative estimate might become a substantial positive estimate of the corresponding variance. Harville (1977) cautioned, however, that such transformations may result in the introduction of additional stationary points on the likelihood surface, and thus should be used only in conjunction with optimisation techniques ensuring an increase in the likelihood at each step.
Further motivation for the use of transformations has been provided by the scope to reduce computational effort or to improve convergence by making the shape of the likelihood function on the new scale more quadratic.
For multivariate analyses with one random effect and equal design matrices, for instance, a canonical transformation allows estimation to be broken down into a series of corresponding univariate analyses (Meyer, 1985); see Jensen and Mao (1988) for a review. Harville and Callanan (1990) considered different forms of the likelihood for both NR and MSC and demonstrated how they affected convergence behaviour. In particular, a 'linearisation' was found to reduce the number of iterates required to reach convergence considerably. Thompson and Meyer (1986) showed how a reparameterisation aimed at making variables less correlated could speed up an expectation-maximisation algorithm dramatically.
For univariate analyses of repeated measures data with several random effects, Lindstrom and Bates (1988) suggested maximisation of log L with respect to the non-zero elements of the Cholesky decomposition of the covariance matrix of random effects, in order to remove constraints on the parameter space and to improve stability of the NR algorithm. In addition, they chose to estimate the error variance directly, operating on the profile likelihood of the remaining parameters. For several examples, the authors found consistent convergence of the NR algorithm when implemented this way, even for an overparameterised model. Recently, Groeneveld (1994) examined the effect of this reparameterisation for large-scale multivariate analyses using derivative-free REML, reporting substantial improvements in speed of convergence for both direct search (downhill Simplex) and Quasi-Newton algorithms.
IMPLEMENTATION
Reparameterisation
For our analyses, the parameters to be estimated are the non-zero elements of the lower triangle of the two covariance matrices E and T, with T potentially consisting of independent diagonal blocks, depending on the random effects fitted. Let U_r, r = 1, 2, ..., denote these matrices (or blocks). The Cholesky decomposition of each U_r, U_r = L_Ur L_Ur', has the same structure. Estimating the non-zero elements of the matrices L_Ur then ensures positive definite matrices U_r on transforming back to the original scale (Lindstrom and Bates, 1988).
However, for the U_r representing covariance matrices, it is suggested to apply a secondary transformation to the diagonal elements of L_Ur, estimating the logarithmic values of these diagonal elements and thus, as discussed above for univariate analyses of variance components, effectively forcing them to be greater than zero.
Let v denote the vector of parameters on the new scale. In order to maximise log L with respect to the elements of v, we need to transform its derivatives accordingly.
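A small sketch of this reparameterisation (illustration only): mapping a covariance matrix to the vector v of Cholesky elements with log-transformed diagonals, and back. The ordering of v below is one convenient choice, not necessarily the one used by the authors.

import numpy as np

def cov_to_v(U):
    """Lower-triangle Cholesky elements of U, with diagonals replaced by their logs."""
    L = np.linalg.cholesky(U)
    v = []
    for i in range(U.shape[0]):
        for j in range(i + 1):
            v.append(np.log(L[i, j]) if i == j else L[i, j])
    return np.array(v)

def v_to_cov(v, q):
    """Inverse map: rebuild L (exponentiating diagonals) and return U = L L'."""
    L = np.zeros((q, q))
    k = 0
    for i in range(q):
        for j in range(i + 1):
            L[i, j] = np.exp(v[k]) if i == j else v[k]
            k += 1
    return L @ L.T

U = np.array([[1.0, 0.5], [0.5, 2.0]])
assert np.allclose(v_to_cov(cov_to_v(U), 2), U)   # round trip
# any real-valued v maps back to a positive definite U, so no constraints are needed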
Lindstrom and Bates (1988) describe briefly how to obtain the gradient vector and Hessian matrix for a Cholesky matrix L_Ur from those for the matrix U_r (see also the corrections in Lindstrom and Bates (1994)). More generally, for a one-to-one transformation (Zacks, 1971), the gradient for the ith element of v is

[52]  ∂ log L/∂v_i = Σ_k (∂θ_k/∂v_i)(∂ log L/∂θ_k)

where J, with elements ∂θ_k/∂v_j, is the Jacobian of θ with respect to v. Similarly,

[53]  ∂² log L/∂v_i∂v_j = Σ_k (∂²θ_k/∂v_i∂v_j)(∂ log L/∂θ_k) + [J' {∂² log L/∂θ_k∂θ_l} J]_ij

with the first part of [53] equal to the ith element of (∂J'/∂v_j)(∂ log L/∂θ) and the second part equal to the ijth element of J' {∂² log L/∂θ_k∂θ_l} J.
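A sketch of [52] and [53] in matrix form. The mapping theta_of_v is a hypothetical callable (for instance, built from the reparameterisation sketch above); here the Jacobian and its derivatives are obtained by finite differences rather than the analytic forms [54]-[56] discussed next.

import numpy as np

def transform_derivatives(theta_of_v, v, grad_theta, hess_theta, eps=1e-4):
    """Gradient and Hessian of log L with respect to v, via [52] and [53].

    theta_of_v: callable mapping the new parameters v to the original parameters theta
    grad_theta, hess_theta: derivatives of log L with respect to theta at theta_of_v(v)
    """
    p = len(v)
    t0 = theta_of_v(v)
    J = np.column_stack([(theta_of_v(v + eps * e) - t0) / eps
                         for e in np.eye(p)])              # J[k, j] = d theta_k / d v_j
    grad_v = J.T @ grad_theta                              # [52]
    hess_v = J.T @ hess_theta @ J                          # second part of [53]
    for i in range(p):
        for j in range(p):
            d2theta = (theta_of_v(v + eps * (np.eye(p)[i] + np.eye(p)[j]))
                       - theta_of_v(v + eps * np.eye(p)[i])
                       - theta_of_v(v + eps * np.eye(p)[j]) + t0) / eps**2
            hess_v[i, j] += d2theta @ grad_theta           # first part of [53]
    return grad_v, hess_v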
Consider one covariance matrix U_r at a time, dropping the subscript r for convenience, and let u_st and l_wx denote the elements of U and L_U, respectively. From U = LL', it follows that

[54]  u_st = Σ_{k=1}^{min(s,t)} l_sk l_tk

where min(s, t) is the smaller value of s and t. Hence, the ijth element of J, for θ_i = u_st and v_j = l_wx, is

[55]  ∂u_st/∂l_wx

For w ≠ x and s, t ≥ x, this is non-zero only if at least one of s and t is equal to w, which leaves three cases to consider: s = t = w gives 2 l_wx; s = w and t ≠ w gives l_tx; and t = w and s ≠ w gives l_sx. For example, for q = 3 traits, the six elements of U are (u11, u12, u13, u22, u23, u33) and v = (log(l11), l21, l31, log(l22), l32, log(l33)).
Similarly, the ijth element of {∂²θ_k/∂v_i∂v_j} (see [53]), for θ_k = u_st, v_i = l_wx and v_j = l_yz, is

[56]  ∂²u_st/∂l_wx ∂l_yz

Allowing for the additional adjustments due to the log transformation of the diagonals, [56] is non-zero in five cases. For the above example, this gives the first two derivatives of J.
component
cannot beestimated,
forinstance,
the error covariance between two traits measured on different sets of animals. On thenew
parameter
may be non-zero. This can be accommodatedby, again,
notestimating
thisparameter
butcalculating
its new value from the estimates of theother
parameters,
similar to the way a correlation is fixed to a certain valueby
only
estimating
therespective
variances and thencalculating
the new covariance’estimate’ from them.
For example, consider three traits with the covariance between traits 2 and 3, u23, not estimable. Setting u23 = 0 gives a non-zero value on the reparameterised scale of l32 = -l21 l31 / l22. Only the other five parameters (l11, l21, l31, l22 and l33) are then estimated (ignoring the log transformation of the diagonals here for simplicity), while an 'estimate' of l32 is calculated from those of l21, l31 and l22 according to the relationship above. On transforming back to the original scale, this ensures that the covariance between traits 2 and 3 remains fixed at zero.
Sparse matrix storage

Calculation of log L for use in derivative-free REML estimation under an animal model has been made feasible for practically sized data sets through the use of sparse matrix storage techniques. These adapt readily to include the calculation of derivatives of L. One possibility is the use of so-called linked lists (see, for instance, Tier and Smith (1989)). George and Liu (1981) describe several other strategies, using a 'compressed' storage scheme when applying a Cholesky decomposition to a large, symmetric, positive definite, sparse matrix.
sparse matrix.Elements of the matrices of derivatives of L are subsets of elements of
L, ie,
existonly
for non-zero elements of L. Thus the samesystem
ofpointers
can be used for Land all its
derivatives, reducing
overheads forstorage
and calculation of addresses of individual elements(though
at the expense ofreserving
space for zero derivativescorresponding
to non-zerol2!).
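A minimal sketch of this idea using a compressed sparse column layout (scipy is used only for illustration; the implementation discussed in the text uses the Fortran routines of George and Liu (1981)): the factor and each derivative matrix share one set of index arrays and differ only in their numerical values.

import numpy as np
from scipy.sparse import csc_matrix

# sparsity pattern of the Cholesky factor L (structure only; values arbitrary here)
indptr  = np.array([0, 2, 4, 5])          # column pointers
indices = np.array([0, 2, 1, 2, 2])       # row indices of non-zeros
L_values = np.array([2.0, 0.5, 1.5, 0.3, 1.2])

# derivative matrices reuse the same indptr/indices; only the value arrays differ
dL_values = [np.zeros_like(L_values) for _ in range(3)]   # eg, three parameters

L  = csc_matrix((L_values, indices, indptr), shape=(3, 3))
dL = [csc_matrix((v, indices, indptr), shape=(3, 3)) for v in dL_values]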
Moreover, schemes aimed at reducing the 'fill-in' during the factorisation of M should also reduce the computational requirements for determining derivatives.
Matrix storage required in evaluating derivatives of log L can be considerable: for p parameters to be estimated, there are p first and p(p+1)/2 second derivatives, ie, up to 1 + p(p+3)/2 times as much space as for calculating log L only can be required.
Even for analyses with one random factor (animals) only, this becomes prohibitive very quickly, amounting to a factor of 28 for q = 2 traits and p = 6 parameters, and 91 for q = 3 and p = 12. However, while the matrices ∂L/∂θ_i are needed to evaluate second derivatives, the matrices ∂²L/∂θ_i∂θ_j are not required after ∂² log |C|/∂θ_i∂θ_j and ∂²y'Py/∂θ_i∂θ_j have been calculated from their diagonal elements. Hence, while it is most efficient to evaluate all derivatives of each l_ij simultaneously, second derivatives can be determined one at a time after L and its first derivatives have been determined. This can be done by setting up and processing each ∂²M/∂θ_i∂θ_j individually, thus reducing the memory required dramatically.
Software
Estimation was carried out on the reparameterised scale, ie, with respect to the elements of the Cholesky decomposition of the covariance matrices (and logarithmic values of their diagonals), as described above, to remove constraints on the parameter space. Prior to estimation, the ordering of rows and columns 1 to M - 1 in M was determined using the minimum degree re-ordering performed by George and Liu's (1981) subroutine GENQMD, and their subroutine SBMFCT was used to establish the symbolic factorisation of M (all M rows and columns) and the associated compressed storage pointers, allocating space for all non-zero elements of L. In addition, the use of the average information matrix was implemented.
However, this was done merely for the comparison of convergence rates, without making use of the fact that only the derivatives of y'Py were required.
For each iterate, the optimal step size or scaling factor (see [47] and [48]) was determined by carrying out a one-dimensional, derivative-free search. This was done using a quadratic approximation of the likelihood surface, allowing for up to five approximation steps per iterate. Other techniques, such as a simple step-halving procedure, would be suitable alternatives.
Both procedures described by Smith (1995) to carry out the automatic differentiation of the Cholesky factor of a matrix were implemented. Forward differentiation was set up to evaluate L and all its first and second derivatives as efficiently as possible, ie, holding all matrices of derivatives in core and evaluating them simultaneously. In addition, it was set up to reduce memory requirements, ie, calculating the Cholesky factor and its first derivatives together and subsequently evaluating second derivatives individually. Backward differentiation was implemented storing all first derivatives of M in core, but setting up and processing one second derivative of M at a time.
EXAMPLES
To illustrate the calculations, consider the test data for a bivariate analysis given by Meyer (1991). Fitting a simple animal model, there are six parameters. Table III summarises intermediate results for the likelihood and its derivatives for the starting values used by Meyer (1991), and gives estimates from the first round of iteration for simple or modified NR or MSC algorithms.
Without reparameterisation, the (unmodified) NR algorithm produced estimates out of bounds of the parameter space, while the MSC performed quite well for this first iterate. Continuing to use expected values of second derivatives, though, estimates failed to converge. Figure 1 shows the corresponding change in log L over rounds of iteration. For this small example, with a starting value for the additive genetic covariance (σ_A12) very different from the eventual estimate, a NR or MSC algorithm optimising the step size (see equation [47]) failed to locate a suitable step size (which increased log L) in the first iterate. Essentially, all algorithms had reached about the same point on the likelihood surface by the fifth iterate, with very little change in log L in subsequent rounds. Convergence was considered achieved when the norm of the gradient vector was less than 10^-4.
This was a rather strict criterion: changes in estimates and likelihood values between iterates were usually very small before it was met. Convergence was reached using the observed information, compared to eight iterates and 34 likelihood evaluations for the average information, while the MSC (expected information) failed in this case. Optimising the step size (see [47]), 7 + 40, 10 + 62 and 10 + 57 iterates + likelihood evaluations were required using the observed, expected and average information, respectively.
As illustrated in Figure 2, use of the 'average information' yielded a very similar convergence pattern to that of the NR algorithm. In comparison, using an EM algorithm for this example, 80 rounds of iteration were required before the change in log L between iterates was less than 10^-4, and 142 iterates were needed before the average change in estimates was less than 0.01%.
Footnotes to table III: a σ_Aij: additive genetic covariances; σ_Eij: residual covariances. b Log likelihood: values given differ from those given by Meyer (1991) by a constant offset of 3.466, due to absorption of single-link parents without records ('pruning'). c Components of first derivatives of log L; see text for definition. d First derivative of log L with respect to θ_i. e Second derivative of log L with respect to θ_i. f Expected value of second derivative of log L with respect to θ_i. Second derivatives of log L with respect to θ_i and θ_j: observed values.
Table IV gives characteristics of the data structure and model of analysis for applied examples of multivariate analyses of beef cattle data. The first is a bivariate analysis fitting an animal model with maternal permanent environmental effects as an additional random effect, ie, estimating nine covariance components. The second shows the analysis of three weight traits in Zebu Cross cattle (analysed previously; see Meyer (1994a)) under three models of analysis, estimating up to 24 parameters.
For each data set and model, the computational requirements to obtain derivatives of the likelihood were determined using forward and backward differentiation.
Footnotes to table IV: a NR: Newton-Raphson; MSC: method of scoring; MIX: starting as MSC, switching to NR when the change in log likelihood between iterates drops below 1.0; and DF: derivative-free algorithm. b CB: cannon bone length; HH: hip height; WW: weaning weight; YW: yearling weight; FW: final weight. c 1: simple animal model; 2: animal model fitting dams' permanent environmental effect; and 5: animal model fitting both genetic and permanent environmental effects. d A: forward differentiation, processing all derivatives simultaneously; B: forward differentiation, processing second derivatives one at a time; C: backward differentiation. 0: no reparameterisation, unmodified; S: reparameterised scale, optimising step size, see [47] in text; T: reparameterised scale, modifying diagonals as described by Smith (1995).
If the memory available allowed, forward differentiation was carried out for all derivatives simultaneously, as well as considering one matrix ∂²M/∂θ_i∂θ_j at a time. Table IV gives computing times in minutes for calculations carried out on a DEC Alpha Chip (DIGITAL) machine running under OSF/1 (rated at about 200 Mflops). Clearly, the backward differentiation is most competitive, with the time required to calculate 24 first and 300 second derivatives being 'only' about 120 times that to obtain log L only.
Starting values used were usually 'good', using estimates of variance components from corresponding univariate analyses throughout and deriving initial guesses for covariance components from these and literature values for the correlations concerned. A maximum of 20 iterates was carried out, applying a very stringent convergence criterion of a change in log L of less than 10^-6 between iterates, or a value for the norm of the gradient vector of less than 10^-4 as above.
The numbers of iterates and additional likelihood evaluations required for each algorithm, given in table IV, show small and inconsistent differences between them. On the whole, starting with a MSC algorithm and switching to NR when the change in log L became small, and applying Marquardt's (1963) modification (see [48]), tended to be more robust (ie, achieve convergence when other algorithm(s) failed) for starting values not too close to the eventual estimates. Conversely, these variants tended to be slower to converge than an NR optimising the step size (see [47]) for starting values very close to the estimates.
However, once the derivatives of log L have been evaluated, it is computationally relatively inexpensive (requiring only simple likelihood evaluations) to compare and switch between algorithms, ie, it should be feasible to select the best procedure to be used for each analysis individually.
Comparisons with EM algorithms were not carried out for these examples. However, Misztal (1994) claimed that each sparse matrix inversion required for an EM step took about three times as long as one likelihood evaluation. This would mean that each iterate using second derivatives for the three-trait analysis estimating 24 highly correlated (co)variance components would require about the same time as 40 EM iterates, or that the second derivative algorithm would be advantageous if the EM algorithm required more than 462 iterates to converge.

DISCUSSION
For univariate analyses, Meyer (1994b) found no advantage in using Newton-type algorithms over derivative-free REML. However, comparing total cpu time per analysis, using derivatives appears to be highly advantageous over a derivative-free algorithm for multivariate analyses, the more so the larger the number of parameters to be estimated. Furthermore, for analyses involving larger numbers of parameters (18 or 24), the derivative-free algorithm converged several times to a local maximum, a problem observed previously by Groeneveld and Kovac (1990), while the Newton-type algorithm appeared not to be affected.
Modifications of the NR algorithm, combined with a local search for the optimum step size, improved its convergence rate. This was achieved primarily by safeguarding against steps in the 'wrong' direction in early iterates, thus making the search for the maximum of the likelihood function more robust