
Original article

Restricted maximum likelihood estimation for animal models using derivatives of the likelihood

K Meyer, SP Smith

Animal Genetics and Breeding Unit, University of New England, Armidale, NSW 2351, Australia

(Received 21 March 1995; accepted 9 October 1995)

Summary - Restricted maximum likelihood estimation using first and second derivatives of the likelihood is described. It relies on the calculation of derivatives without the need for large matrix inversion, using an automatic differentiation procedure. In essence, this is an extension of the Cholesky factorisation of a matrix. A reparameterisation is used to transform the constrained optimisation problem imposed in estimating covariance components to an unconstrained problem, thus making the use of Newton-Raphson and related algorithms feasible. A numerical example is given to illustrate calculations. Several modified Newton-Raphson and method of scoring algorithms are compared for applications to analyses of beef cattle data, and contrasted to a derivative-free algorithm.

restricted maximum likelihood / derivative / algorithm / variance component estimation

Résumé - Restricted maximum likelihood estimation for animal models by differentiation of the likelihood. This article describes a restricted maximum likelihood estimation method using the first and second derivatives of the likelihood. The method is based on an automatic differentiation procedure which does not require the inversion of large matrices. It is, in effect, an extension of the Cholesky decomposition applied to a matrix. A parameterisation is used which transforms the constrained optimisation problem raised by the estimation of variance components into an unconstrained problem, making the use of Newton-Raphson and related algorithms possible. The calculations are illustrated on a numerical example. Several algorithms of the Newton-Raphson or method of scoring type are applied to the analysis of beef cattle data. These algorithms are compared with one another and with a derivative-free algorithm.

restricted maximum likelihood / derivative / algorithm / variance component estimation

INTRODUCTION

Maximum likelihood estimation of (co)variance components generally requires the numerical solution of a constrained nonlinear optimisation problem (Harville, 1977). Procedures to locate the minimum or maximum of a function are classified according to the amount of information from derivatives of the function utilised; see, for instance, Gill et al (1981). Methods using both first and second derivatives are fastest to converge, often showing quadratic convergence, while search algorithms not relying on derivatives are generally slow, ie, require many iterations and function evaluations.

Early applications of restricted maximum likelihood (REML) estimation to animal breeding data used a Fisher's method of scoring type algorithm, following the original paper by Patterson and Thompson (1971) and Thompson (1973). This requires expected values of the second derivatives of the likelihood to be evaluated, which proved computationally highly demanding for all but the simplest analyses. Hence expectation-maximization (EM) type algorithms gained popularity and found widespread use for analyses fitting a sire model. Effectively, these use first derivatives of the likelihood function. Except for special cases, however, they required the inverse of a matrix of size equal to the number of random effects fitted, eg, number of sires times number of traits, which severely limited the size of analyses feasible. For analyses under the animal model, Graser et al (1987) thus proposed a derivative-free algorithm. This only requires factorising the coefficient matrix of the mixed-model equations rather than inverting it, and can be implemented efficiently using sparse matrix techniques. Moreover, it is readily extendable to animal models including additional random effects and multivariate analyses (Meyer, 1989, 1991).

Multi-trait animal model analyses fitting additional random effects using a derivative-free algorithm have been shown to be feasible. However, they are computationally highly demanding, the number of likelihood evaluations required increasing exponentially with the number of (co)variance components to be estimated simultaneously. Groeneveld et al (1991), for instance, reported that 56 000 evaluations were required to reach a change in likelihood smaller than 10^-7 when estimating 60 covariance components for five traits. While judicious choice of starting values and search strategies (eg, temporary maximisation with respect to a subset of the parameters only), together with exploitation of special features of the data structure, might reduce demands markedly for individual analyses, it remains true that derivative-free maximisation in high dimensions is very slow to converge.

This makes a case for REML algorithms using derivatives of the likelihood for multivariate, multidimensional animal model analyses. Misztal (1994) recently presented a comparison of rates of convergence of derivative-free and derivative algorithms, concluding that the latter had the potential to be faster in almost all cases, and in particular that their convergence rate depended little on the number of traits considered. Large-scale animal model applications using an EM type algorithm (Misztal, 1990) or even a method of scoring algorithm (Ducrocq, 1993) have been reported, obtaining the large matrix inverse (or its trace) required. This paper describes REML estimation under an animal model using first and second derivatives of the likelihood function, computed without inverting large matrices.

DERIVATIVES OF THE LIKELIHOOD

Consider the linear mixed model

    y = Xb + Zu + e    [1]

where y, b, u and e denote the vectors of observations, fixed effects, random effects and residual errors, respectively, and X and Z are the incidence matrices pertaining to b and u. Let V(u) = G, V(e) = R and Cov(u, e') = 0, so that V(y) = V = ZGZ' + R. Assuming a multivariate normal distribution, ie, y ~ N(Xb, V), the log of the REML likelihood, log L, is (eg, Harville, 1977)

    log L = -1/2 [const + log|V| + log|X*'V^-1 X*| + y'Py]    [2]

where X* denotes a full-rank submatrix of X.

REML algorithms using derivatives have generally been derived by differentiating [2]. However, as outlined previously (Graser et al, 1987; Meyer, 1989), log L can be rewritten as

    log L = -1/2 [const + log|R| + log|G| + log|C| + y'Py]    [3]

where C is the coefficient matrix in the mixed-model equations (MME) pertaining to [1] (or a full-rank submatrix thereof), and P is the matrix

    P = V^-1 - V^-1 X* (X*'V^-1 X*)^-1 X*'V^-1

Alternative forms of the derivatives of the likelihood can then be obtained by differentiating [3] instead of [2].

Let θ denote the vector of parameters to be estimated, with elements θ_i, i = 1, ..., p. The first and second partial derivatives of the log likelihood are then

    ∂log L/∂θ_i = -1/2 [∂log|R|/∂θ_i + ∂log|G|/∂θ_i + ∂log|C|/∂θ_i + ∂(y'Py)/∂θ_i]    [4]

    ∂²log L/∂θ_i∂θ_j = -1/2 [∂²log|R|/∂θ_i∂θ_j + ∂²log|G|/∂θ_i∂θ_j + ∂²log|C|/∂θ_i∂θ_j + ∂²(y'Py)/∂θ_i∂θ_j]    [5]

Graser et al (1987) show how the last two terms in [3], log|C| and y'Py, can be evaluated in a general way for all models of form [1] by carrying out a series of Gaussian elimination steps on the coefficient matrix in the MME, augmented by the vector of right-hand sides and a quadratic in the data vector. Depending on the model of analysis and the structure of G and R, the other two terms required in [3], log|R| and log|G|, can be evaluated directly (Meyer, 1989, 1991), generally requiring only matrix operations proportional to the number of traits considered. Derivatives of these four terms can be evaluated analogously.
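The equality between the two forms [2] and [3] of the likelihood can be checked numerically. The following is a minimal sketch in Python/NumPy (not part of the software described in this paper), assuming a small dense toy model with made-up dimensions and variance components; it builds the MME coefficient matrix C directly and compares the two forms of -2 log L (up to the constant).

    import numpy as np

    rng = np.random.default_rng(1)
    n, nf, nr = 8, 2, 5                          # records, fixed effects, random effect levels
    X = np.column_stack([np.ones(n), np.arange(n) % 2]).astype(float)
    Z = (rng.random((n, nr)) < 0.5).astype(float)
    y = rng.normal(size=n)

    sig_a, sig_e = 0.4, 0.6                      # assumed variance components
    G = sig_a * np.eye(nr)                       # relationship matrix taken as I for simplicity
    R = sig_e * np.eye(n)
    V = Z @ G @ Z.T + R

    # form [2]: P = V^-1 - V^-1 X (X'V^-1 X)^-1 X'V^-1   (X full rank, so X* = X)
    Vi = np.linalg.inv(V)
    XtViX = X.T @ Vi @ X
    P = Vi - Vi @ X @ np.linalg.inv(XtViX) @ X.T @ Vi
    yPy = y @ P @ y
    form2 = np.linalg.slogdet(V)[1] + np.linalg.slogdet(XtViX)[1] + yPy

    # form [3]: C is the coefficient matrix of the MME for model [1]
    Ri, Gi = np.linalg.inv(R), np.linalg.inv(G)
    W = np.column_stack([X, Z])
    C = W.T @ Ri @ W
    C[nf:, nf:] += Gi
    r = W.T @ Ri @ y
    yPy_mme = y @ Ri @ y - r @ np.linalg.solve(C, r)
    form3 = (np.linalg.slogdet(R)[1] + np.linalg.slogdet(G)[1]
             + np.linalg.slogdet(C)[1] + yPy_mme)

    print(np.allclose(form2, form3), np.allclose(yPy, yPy_mme))   # True True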

Calculating log|C| and y'Py and their derivatives

The mixed-model matrix (MMM) or augmented coefficient matrix pertaining to [1] is, in partitioned form,

    M = [ C  r ; r'  y'R^-1 y ]    [6]

where r is the vector of right-hand sides in the MME. Using general matrix results, the derivatives of log|C| are

    ∂log|C|/∂θ_i = tr(C^-1 ∂C/∂θ_i)    [7]

    ∂²log|C|/∂θ_i∂θ_j = tr(C^-1 ∂²C/∂θ_i∂θ_j) - tr(C^-1 (∂C/∂θ_i) C^-1 (∂C/∂θ_j))    [8]

Partitioned matrix results give log|M| = log|C| + log(y'Py), ie (Smith, 1995), the corresponding derivatives of y'Py ([10] and [11]) can be written in terms of the inverses of M and C and their derivatives.

Obviously, these expressions ([7], [8], [10] and [11]), involving the inverses of the large matrices M and C, are computationally intractable for any sizable animal model analysis. However, the Gaussian elimination procedure with diagonal pivoting advocated by Graser et al (1987) is only one of several ways to 'factor' a matrix. An alternative is a Cholesky decomposition. This lends itself readily to the solution of large positive definite systems of linear equations using sparse matrix storage schemes. Appropriate Fortran routines are given, for instance, by George and Liu (1981) and have been used successfully in derivative-free REML applications. Let L, with elements l_ij (and l_ij = 0 for j > i), denote the Cholesky factor of M, ie, M = LL'. The determinant of a triangular matrix is simply the product of its diagonal elements. Hence, with M denoting the order of M,

    log|C| = 2 Σ_{i=1,...,M-1} log l_ii    [13]

and

    y'Py = l_MM²    [14]

Smith (1995) describes algorithms, outlined below, which allow the derivatives of the Cholesky factor of a matrix to be evaluated while carrying out the factorisation, provided the derivatives of the original matrix are specified. Differentiating [13] and [14] then gives the derivatives of log|C| and y'Py as simple functions of the diagonal elements of the Cholesky matrix and its derivatives.
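As a small illustration of [13] and [14], the sketch below (Python/NumPy, dense matrices and made-up dimensions only; the implementation described in this paper works on the sparse structure of M) builds the mixed-model matrix [6], takes its Cholesky factor and recovers log|C| and y'Py from the diagonal elements.

    import numpy as np

    rng = np.random.default_rng(2)
    n, nf, nr = 10, 2, 6
    X = np.column_stack([np.ones(n), np.arange(n) % 2]).astype(float)
    Z = (rng.random((n, nr)) < 0.5).astype(float)
    y = rng.normal(size=n)
    Ri = np.eye(n) / 0.5                      # R^-1 for an assumed residual variance of 0.5
    Gi = np.eye(nr) / 0.3                     # G^-1 for an assumed variance of 0.3, A = I

    W = np.column_stack([X, Z])
    C = W.T @ Ri @ W
    C[nf:, nf:] += Gi
    r = W.T @ Ri @ y
    M = np.block([[C, r[:, None]], [r[None, :], np.array([[y @ Ri @ y]])]])   # cf [6]

    L = np.linalg.cholesky(M)                 # M = L L'
    d = np.diag(L)
    logdetC_13 = 2.0 * np.log(d[:-1]).sum()   # [13]
    yPy_14 = d[-1] ** 2                       # [14]

    # checks against direct evaluation
    print(np.allclose(logdetC_13, np.linalg.slogdet(C)[1]))
    print(np.allclose(yPy_14, y @ Ri @ y - r @ np.linalg.solve(C, r)))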

Calculating log|R| and its derivatives

Consider a multivariate analysis for q traits and let y be ordered according to traits within animals. Assuming that error covariances between measurements on different animals are zero, R is blockdiagonal for animals,

    R = Σ+_{i=1,...,N} R_i

where N is the number of animals which have records, and Σ+ denotes the direct matrix sum (Searle, 1982). Hence log|R|, as well as its derivatives, can be determined by considering one animal at a time.

Let E, of order q, denote the matrix of residual covariances between traits, and let W be the number of different combinations of traits recorded (assuming single records per trait), eg, W = 3 for q = 2, with combinations trait 1 only, trait 2 only, and both traits. For animal i which has combination of traits w, R_i is equal to E_w, the submatrix of E obtained by deleting rows and columns pertaining to missing records. As outlined by Meyer (1991), this gives

    log|R| = Σ_w N_w log|E_w|

where N_w represents the number of animals having records for combination of traits w. Corresponding expressions hold for the derivatives of log|R|.

Consider the case where the parameters to be estimated are the (co)variance components due to random effects and residual errors (rather than, for example, heritabilities and correlations), so that V is linear in θ, ie, V = Σ_{i=1,...,p} θ_i ∂V/∂θ_i. Defining matrices D_i^w with elements d_kl = 1 if the klth element of E_w is equal to θ_i, and d_kl = 0 otherwise, this then gives [23] and [24]. Let e_rs^w denote the rsth element of E_w^-1. For θ_i = e_kl and θ_j = e_mn, [23] and [24] then simplify to [25] and [26].

For q = 1 and R = σ_E² I, [25] and [26] become N σ_E^-2 and -N σ_E^-4, respectively (for θ_i = θ_j = σ_E²). Extensions for models with repeated records are straightforward. Hence, once the inverses of the matrices of residual covariances for all combinations of traits recorded occurring in the data have been obtained (of maximum size equal to the maximum number of traits recorded per animal, and also required to set up the MMM), evaluation of log|R| and its derivatives requires only scalar manipulations in addition.
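A sketch of these calculations for log|R| is given below, assuming made-up residual covariances E, trait combinations and counts N_w (Python/NumPy). The derivative shown is with respect to a single residual covariance and is checked against a finite difference; it simply accumulates elements of the E_w^-1, as described above.

    import numpy as np

    E = np.array([[1.0, 0.3, 0.2],
                  [0.3, 0.8, 0.1],
                  [0.2, 0.1, 1.2]])                  # residual covariances, q = 3 traits
    combos = {(0,): 40, (1, 2): 25, (0, 1, 2): 60}   # combination of traits : N_w (made up)
    k, l = 0, 1                                      # theta_i = e_kl

    def log_R(Emat):
        total = 0.0
        for t, Nw in combos.items():
            idx = np.array(t)
            total += Nw * np.linalg.slogdet(Emat[np.ix_(idx, idx)])[1]
        return total

    dlogR = 0.0
    for t, Nw in combos.items():
        if k in t and l in t:
            idx = np.array(t)
            Ewi = np.linalg.inv(E[np.ix_(idx, idx)])
            a, b = t.index(k), t.index(l)
            dlogR += Nw * 2.0 * Ewi[a, b]            # tr(E_w^-1 D_i^w), ones at (a,b) and (b,a)

    # finite-difference check of d log|R| / d e_kl
    h = 1e-6
    Eh = E.copy(); Eh[k, l] += h; Eh[l, k] += h
    print(np.allclose(dlogR, (log_R(Eh) - log_R(E)) / h, rtol=1e-4))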

Calculating log|G| and its derivatives

Terms arising from the covariance matrix of random effects, G, can often be determined in a similar way, exploiting the structure of G. This depends on the random effects fitted. Meyer (1989, 1991) describes log|G| for various cases.

Define T, with elements t_ij and of size rq x rq, as the matrix of covariances between random effects, where r is the number of random factors in the model (excluding e). For illustration, let u consist of a vector of animal genetic effects a and some uncorrelated additional random effect(s) c with N_C levels per trait, ie, u' = (a' c'). In the simplest case, a consists of the direct additive genetic effects for each animal and trait, ie, it has length qN_A, where N_A denotes the total number of animals in the analysis, including parents without records. In other cases, a might include a second genetic effect for each animal and trait, such as a maternal additive genetic effect, which may be correlated with the direct genetic effects. An example for c is a common environmental effect such as a litter effect. With a and c uncorrelated, T can be partitioned into corresponding diagonal blocks T_A and T_C, so that G consists of two diagonal blocks formed as the direct products of T_A with A and of T_C with F [27], where A is the numerator relationship matrix between animals, F, often assumed to be the identity matrix, describes the correlation structure amongst the levels of c, and the direct matrix product is as defined by Searle (1982). This gives (Meyer, 1991)

    log|G| = N_A log|T_A| + q log|A| + N_C log|T_C| + q log|F|

Noting that all ∂²T/∂θ_i∂θ_j = 0 (for V linear in θ), the derivatives follow as simple functions of T_A^-1, T_C^-1 and the matrices D_A = ∂T_A/∂θ_i and D_C = ∂T_C/∂θ_i, which are again matrices with elements 1 if t_kl = θ_i and zero otherwise. As above, all second derivatives for θ_i and θ_j not pertaining to the same random factor (eg, c) or to two correlated factors (such as direct and maternal genetic effects) are zero. Furthermore, all derivatives of log|G| with respect to residual covariance components are zero.

Further simplifications analogous to [25] and [26] can be derived. For instance, for a model fitting animals' additive genetic effects as the only random effects (r = 1), T is the matrix of additive genetic covariances a_ij with i, j = 1, ..., q. For θ_i = a_kl and θ_j = a_mn, this gives [31] and [32], with a^rs denoting the rsth element of T^-1. For q = 1 and a_11 = σ_A², [31] and [32] reduce to N_A σ_A^-2 and -N_A σ_A^-4, respectively.
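For illustration, the following sketch evaluates log|G| from the structure described above (two uncorrelated blocks, built from T_A with A and from T_C with F) and checks it against the determinant of the full matrix; all matrices are small made-up examples (Python/NumPy).

    import numpy as np

    q, NA, NC = 2, 5, 3
    TA = np.array([[0.5, 0.1], [0.1, 0.4]])       # additive genetic covariances
    TC = np.array([[0.2, 0.05], [0.05, 0.3]])     # covariances of the additional effect c
    A = np.eye(NA) + 0.1 * (np.ones((NA, NA)) - np.eye(NA))   # stand-in relationship matrix
    F = np.eye(NC)

    logG_struct = (NA * np.linalg.slogdet(TA)[1] + q * np.linalg.slogdet(A)[1]
                   + NC * np.linalg.slogdet(TC)[1] + q * np.linalg.slogdet(F)[1])

    G = np.block([[np.kron(TA, A), np.zeros((q * NA, q * NC))],
                  [np.zeros((q * NC, q * NA)), np.kron(TC, F)]])
    print(np.allclose(logG_struct, np.linalg.slogdet(G)[1]))    # True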

Derivatives of the mixed model matrix

As emphasised above, calculation of the derivatives of the Cholesky factor of M requires the corresponding derivatives of M to be evaluated. Fortunately, these have the same structure as M and can be evaluated while setting up M, replacing G and R by their derivatives.

For θ_i and θ_j equal to residual (co)variances, the derivatives of M have the same form as M, with Q_R standing in turn for the corresponding first and second derivative terms. As outlined above, R is blockdiagonal for animals with submatrices E_w. Hence, the matrices Q_R have the same structure, with submatrices as given in [36] (and the corresponding second derivative terms, for V linear in θ so that ∂²R/∂θ_i∂θ_j = 0). Consequently, the derivatives of M with respect to the residual (co)variances can be set up in the same way as the 'data part' of M. In addition to calculating the matrices E_w^-1 for the W combinations of records per animal occurring in the data, all derivatives of the E_w^-1 with respect to the residual components need to be evaluated. The extra calculations required, however, are trivial, requiring only matrix operations proportional to the maximum number of records per animal to obtain the terms in [36].

Analogously, for θ_i and θ_j equal to elements of T, the derivatives of M have the same form, with Q_G standing for the corresponding first and second derivative terms. As above, further simplifications are possible depending on the structure of G; for instance, for G as in [27] and ∂²G/∂θ_i∂θ_j = 0, the matrices Q_G take correspondingly simple forms.

Expected values of second derivatives of log L

Differentiating [2] gives second derivatives of log L, with expected values as given by Harville (1977) [45]. Again, for V linear in θ, ∂²V/∂θ_i∂θ_j = 0. From [5], and noting that ∂P/∂θ_i = -P(∂V/∂θ_i)P, it follows that the expected values of the second derivatives are essentially (sign ignored) equal to the observed values minus the contribution from the data, and thus can be evaluated analogously. With second derivatives of y'Py not required, computational requirements are reduced somewhat, as only the first M - 1 rows of ∂²M/∂θ_i∂θ_j need to be evaluated and factored.

AUTOMATIC DIFFERENTIATION

Calculation of the derivatives of the likelihood as described above relies on the fact that the derivatives of the Cholesky factor of a matrix can be obtained 'automatically', provided the derivatives of the original matrix can be specified.

Smith (1995) describes a so-called forward differentiation, which is a straightforward expansion of the recursions employed in the Cholesky factorisation of a matrix M. Operations to determine the latter are typically carried out sequentially by rows. Let L, of size N, be initialised to M. First, the pivot (the diagonal element, which must be greater than an operational zero) is selected for the current row k. Secondly, the off-diagonal elements for the row ('lead column') are adjusted (l_jk for j = k + 1, ..., N), and thirdly the elements in the remaining part of L (l_ij for j = k + 1, ..., N and i = j, ..., N) are modified ('row operations'). After all N rows have been processed, L contains the Cholesky factor of M.
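The sketch below illustrates the idea of forward differentiation for a single parameter: the derivatives dL are carried through the same column-by-column recursions that produce L, given the derivative of the matrix being factored. It is a dense, unpivoted Python/NumPy illustration with a made-up matrix M(θ), not the sparse pseudo-code of table I, and the resulting derivatives of log|C| and y'Py (cf [13] and [14]) are checked against finite differences.

    import numpy as np

    def chol_fwd(M, dM):
        # returns L with M = L L', and dL = dL/dtheta given dM = dM/dtheta
        n = M.shape[0]
        L, dL = np.zeros((n, n)), np.zeros((n, n))
        for j in range(n):
            s = M[j, j] - L[j, :j] @ L[j, :j]
            ds = dM[j, j] - 2.0 * L[j, :j] @ dL[j, :j]
            L[j, j] = np.sqrt(s)
            dL[j, j] = ds / (2.0 * L[j, j])
            for i in range(j + 1, n):
                t = M[i, j] - L[i, :j] @ L[j, :j]
                dt = dM[i, j] - dL[i, :j] @ L[j, :j] - L[i, :j] @ dL[j, :j]
                L[i, j] = t / L[j, j]
                dL[i, j] = (dt - L[i, j] * dL[j, j]) / L[j, j]
        return L, dL

    # toy matrix M(theta) = M0 + theta * S, with S the (made-up) pattern of dM/dtheta
    rng = np.random.default_rng(3)
    B = rng.normal(size=(6, 6)); M0 = B @ B.T + 6 * np.eye(6)
    S = np.zeros((6, 6)); S[:3, :3] = 1.0
    theta = 0.7
    L, dL = chol_fwd(M0 + theta * S, S)

    # derivatives of log|C| and y'Py from the diagonals, cf [13] and [14]
    dlogC = 2.0 * (np.diag(dL)[:-1] / np.diag(L)[:-1]).sum()
    dyPy = 2.0 * L[-1, -1] * dL[-1, -1]

    h = 1e-6
    Lh = np.linalg.cholesky(M0 + (theta + h) * S)
    print(np.allclose(dlogC, 2.0 * (np.log(np.diag(Lh)[:-1]).sum()
                                    - np.log(np.diag(L)[:-1]).sum()) / h, rtol=1e-3))
    print(np.allclose(dyPy, (Lh[-1, -1] ** 2 - L[-1, -1] ** 2) / h, rtol=1e-3))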

Pseudo-code given by Smith (1995) for the calculation of the Cholesky factor and its first and second derivatives is summarised in table I. It can be seen that the operations to evaluate a second derivative require the respective elements of the two corresponding first derivatives. This imposes severe constraints on the memory requirements of the algorithm. While it is most efficient to evaluate the Cholesky factor and all its derivatives together, considerable space can be saved by computing the second derivatives one at a time. This can be done by holding all the first derivatives in memory or, if core space is the limiting factor, by storing first derivatives on disk (after evaluating them individually as well) and reading in only the two required. Hence, the minimum memory requirement for REML using first and second derivatives is 4 x L, compared to L for a derivative-free algorithm. Smith (1995) stated that, using forward differentiation, each first derivative required not more than twice the work required to evaluate log L only, and that the work needed to determine a second derivative would be at most four times that to calculate log L.

In addition, Smith (1995) described a 'backward differentiation' scheme, so named because it reverses the order of steps in the forward differentiation. It is applicable for cases where we want to evaluate a scalar function of L, f(L), in our case log|C| + y'Py, which is a function of the diagonal elements of L (see [13] and [14]). It requires computing a (lower triangular) matrix W which, on completion of the backward differentiation, contains the derivatives of f(L) with respect to the elements of M. First derivatives of f(L) can then be evaluated one at a time as tr(W ∂M/∂θ_i). The pseudo-code given by Smith (1995) for the backward differentiation is shown in table II. Calculation of W requires about twice as much work as one likelihood evaluation and, once W is evaluated, calculating the individual derivatives (step 3 in table II) is comparatively inexpensive, so that obtaining all first derivatives by backward differentiation requires only somewhat more work than calculation of one derivative by forward differentiation. Smith (1995) also described the calculation of second derivatives by backward differentiation (pseudo-code not shown here). Amongst other calculations, this involves one evaluation of a matrix W as described above for each parameter, and requires another work array of size L in addition to space to store at least one matrix of derivatives of M. Hence the minimum memory requirement for this algorithm is 3 x L + M (M and L differing by the fill-in created during the factorisation). Smith (1995) claimed that the total work required to evaluate all second derivatives for p parameters was no more than 6p times that for a likelihood evaluation.
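The following sketch illustrates the role of the matrix W in backward differentiation for f(L) = log|C| + y'Py. It uses an equivalent matrix-level formulation of the reverse-mode derivative of a Cholesky factorisation rather than the element-wise recursion of table II, and dense made-up matrices (Python/NumPy); the resulting tr(W ∂M/∂θ) is checked against a finite difference. In practice triangular solves on the sparse factor would replace the explicit inverse.

    import numpy as np

    def phi(X):                       # lower triangle with halved diagonal
        Y = np.tril(X)
        Y[np.diag_indices_from(Y)] *= 0.5
        return Y

    def backward_W(M):
        # W such that df/dtheta = trace(W dM/dtheta) for f(L) = log|C| + y'Py,
        # where M = L L' and f depends on L only through its diagonal ([13], [14])
        L = np.linalg.cholesky(M)
        n = L.shape[0]
        Lbar = np.zeros_like(L)                      # df/dL
        Lbar[np.diag_indices(n)] = 2.0 / np.diag(L)
        Lbar[-1, -1] = 2.0 * L[-1, -1]
        K = phi(L.T @ Lbar)
        Linv = np.linalg.inv(L)                      # small dense example only
        return L, 0.5 * Linv.T @ (K + K.T) @ Linv

    def f(Mmat):
        Lm = np.linalg.cholesky(Mmat)
        return 2.0 * np.log(np.diag(Lm)[:-1]).sum() + Lm[-1, -1] ** 2

    rng = np.random.default_rng(4)
    B = rng.normal(size=(6, 6)); M0 = B @ B.T + 6 * np.eye(6)
    S = np.zeros((6, 6)); S[2:, 2:] = 0.5            # made-up dM/dtheta pattern
    theta, h = 0.3, 1e-6
    L, W = backward_W(M0 + theta * S)
    print(np.allclose(np.trace(W @ S),
                      (f(M0 + (theta + h) * S) - f(M0 + theta * S)) / h, rtol=1e-3))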

MAXIMISING THE LIKELIHOOD

Methods to locate the maximum of the likelihood function in the context of variance component estimation are reviewed, for instance, by Harville (1977) and Searle et al (1992; Chapter 8). Most utilise the gradient vector, ie, the vector of first derivatives of the likelihood function, to determine the direction of search.

Using second derivatives

One of the oldest and most widely used methods to optimise a non-linear function is the Newton-Raphson (NR) algorithm. It requires the Hessian matrix of the function, ie, the matrix of second partial derivatives of the (log) likelihood with respect to the parameters. For the tth iterate,

    θ^(t+1) = θ^t - (H^t)^-1 g^t    [46]

where H^t = {∂²log L/∂θ_i∂θ_j} and g^t = {∂log L/∂θ_i} are the Hessian matrix and gradient vector of log L, respectively, both evaluated at θ = θ^t.

While the NR algorithm can be quick to converge, in particular for functions resembling a quadratic function, it is known to be sensitive to poor starting values (Powell, 1970). Unlike other algorithms, it is not guaranteed to converge, though global convergence has been shown for some cases using iterative partial maximisation (Jensen et al, 1991). In practice, so-called extended or modified NR algorithms have been found to be more successful. Jennrich and Sampson (1976) suggested step halving, applied successively until the likelihood is found to increase, to avoid 'overshooting'. More generally, the change in estimates for the tth iterate in [46] is given by

    Δθ^t = τ^t B^t g^t    [47]

with B^t = -(H^t)^-1 for the extended NR, where τ^t is a step-size scaling factor. The optimum for τ^t can be determined readily as the value which results in the largest increase in likelihood, using a one-dimensional maximisation technique (Powell, 1970). This relies on the direction of search given by H^-1 g generally being a 'good' direction and on the fact that, for -H positive definite, there is always a step size which will increase the likelihood.
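As a generic illustration of [46] and [47], the sketch below applies the extended NR update with step halving to a simple toy log likelihood (a normal sample parameterised by the mean and the log variance, so that the variance stays positive). It is not the REML likelihood of this paper, merely an example of the update and safeguard.

    import numpy as np

    rng = np.random.default_rng(5)
    y = rng.normal(2.0, 1.5, size=50)
    n = len(y)

    def logL(th):                               # constants dropped
        mu, lam = th
        return -0.5 * n * lam - 0.5 * np.sum((y - mu) ** 2) * np.exp(-lam)

    def grad_hess(th):
        mu, lam = th
        r = y - mu
        S = np.sum(r ** 2)
        g = np.array([np.sum(r) * np.exp(-lam), -0.5 * n + 0.5 * S * np.exp(-lam)])
        H = np.array([[-n * np.exp(-lam), -np.sum(r) * np.exp(-lam)],
                      [-np.sum(r) * np.exp(-lam), -0.5 * S * np.exp(-lam)]])
        return g, H

    theta = np.array([1.0, 1.0])                # moderately poor starting values
    for it in range(20):
        g, H = grad_hess(theta)
        if np.linalg.norm(g) < 1e-8:
            break
        step = -np.linalg.solve(H, g)           # B g with B = -H^-1, cf [47]
        tau, f0 = 1.0, logL(theta)
        while logL(theta + tau * step) <= f0 and tau > 1e-8:
            tau *= 0.5                          # step halving (Jennrich and Sampson, 1976)
        theta = theta + tau * step

    # converges to the ML estimates: sample mean and (ML) variance
    print(np.allclose([theta[0], np.exp(theta[1])], [y.mean(), y.var()]))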

Alternatively, the use of a modified Hessian matrix, as suggested by Marquardt (1963) [48], has been proposed to improve the performance of the NR algorithm. This results in a step intermediate between a NR step (κ = 0) and a method of steepest ascent step (κ large). Again, κ can be chosen to maximise the increase in log L, though for large values of κ the step size is small, so that there is no need to include a search step in the iteration (Powell, 1970).

Often expected values of the second derivatives of log L are easier to calculate than the observed values. Replacing -H by the information matrix results in Fisher's method of scoring (MSC). It can be extended or modified in the same way as the NR algorithm (Harville, 1977). Jennrich and Sampson (1976) and Jennrich and Schluchter (1986) compared NR and MSC, showing that the MSC was generally more robust against a poor choice of starting values than the NR, though it tended to require more iterations. They thus recommended a scheme using the MSC initially and switching to NR after a few rounds of iteration, when the increase in log L between steps was less than one.

Using first derivatives only

Other methods, so-called variable-metric or Quasi-Newton procedures, essentially use the same strategies, but replace B by an approximation of the Hessian matrix.

An interesting variation has recently been presented by Johnson and Thompson (1995). Noting that the observed and expected information were of opposite sign and differed only by a term involving the second derivatives of y'Py (see [5] and [45]), they suggested using the average of observed and expected information (AI) to approximate the Hessian matrix. Since it requires only the second derivatives of y'Py to be evaluated, each iterate is computationally considerably less demanding than a 'full' NR or MSC step, and the same modifications as described above for the NR (see [47] and [48]) can be applied. Initial comparisons of the rate of convergence and computer time required showed the AI algorithm to be highly advantageous over both derivative-free and EM-type procedures (Johnson and Thompson, 1995).

Constraining parameters

All the Newton-type algorithms described above perform an unconstrained optimisation. In estimating (co)variance components, however, we require variances to be non-negative, correlations to be in the range from -1 to 1 and, for more than two traits, to be consistent with each other; more generally, estimated covariance matrices need to be positive (semi-)definite. As shown by Hill and Thompson (1978), the probability of obtaining parameter estimates out of bounds depends on the magnitude of the correlations (population values) and increases rapidly with the number of traits considered, in particular for genetic covariance matrices.

For univariate analyses, a common approach has been to set negative estimates of variance components to zero and continue iterating. Partial sweeping of B to handle boundary constraints, monitoring and restraining the size of pivots relative to the corresponding (original) diagonal elements, has been recommended (Jennrich and Sampson, 1976; Jennrich and Schluchter, 1986). Harville (1977; section 6.3) distinguished between three types of techniques to modify unconstrained optimisation procedures to accommodate constraints on the parameter space. Firstly, penalty techniques operate on a function which is close to log L except at the boundaries, where it assumes large negative values which effectively serve as a barrier, deflecting a further search in that direction. Secondly, gradient projection techniques are suitable for linear inequality constraints. Thirdly, it may be feasible to transform the parameters to be estimated so that maximisation on the new scale is unconstrained. Box (1965) demonstrated for several examples that the computational effort to solve constrained problems can be reduced markedly by eliminating constraints, so that one of the more powerful unconstrained methods, preferably with quadratic convergence, can be employed.

For univariate analyses, an obvious way to eliminate non-negativity constraints is to maximise log L with respect to standard deviations and to square these on convergence (Harville, 1977). Alternatively, we could estimate logarithmic values of the variance components instead of the variances. This seems preferable to taking square roots where, on backtransforming, a largish negative estimate might become a substantial positive estimate of the corresponding variance. Harville (1977) cautioned, however, that such transformations may result in the introduction of additional stationary points on the likelihood surface, and thus should be used only in conjunction with optimisation techniques ensuring an increase in the likelihood at each iterate.

Further motivation for the use of transformations has been provided by the scope to reduce computational effort or to improve convergence by making the shape of the likelihood function on the new scale more quadratic. For multivariate analyses with one random effect and equal design matrices, for instance, a canonical transformation allows estimation to be broken down into a series of corresponding univariate analyses (Meyer, 1985); see Jensen and Mao (1988) for a review. Harville and Callanan (1990) considered different forms of the likelihood for both NR and MSC and demonstrated how they affected convergence behaviour. In particular, a 'linearisation' was found to reduce the number of iterates required to reach convergence considerably. Thompson and Meyer (1986) showed how a reparameterisation aimed at making variables less correlated could speed up an expectation-maximisation algorithm dramatically.

For univariate analyses of repeated measures data with several random effects, Lindstrom and Bates (1988) suggested maximisation of log L with respect to the non-zero elements of the Cholesky decomposition of the covariance matrix of random effects, in order to remove constraints on the parameter space and to improve stability of the NR algorithm. In addition, they chose to estimate the error variance directly, operating on the profile likelihood of the remaining parameters. For several examples, the authors found consistent convergence of the NR algorithm when implemented this way, even for an overparameterised model. Recently, Groeneveld (1994) examined the effect of this reparameterisation for large-scale multivariate analyses using derivative-free REML, reporting substantial improvements in speed of convergence for both direct search (downhill Simplex) and Quasi-Newton algorithms.

IMPLEMENTATION

Reparameterisation

For our analyses, the parameters to be estimated are the non-zero elements of the lower triangles of the two covariance matrices E and T, with T potentially consisting of independent diagonal blocks, depending on the random effects fitted. Let U_r denote these covariance matrices, each with lower triangular Cholesky factor L_Ur of the same structure. Estimating the non-zero elements of the matrices L_Ur then ensures positive definite matrices U_r on transforming back to the original scale (Lindstrom and Bates, 1988). However, for the U_r representing covariance matrices, the diagonal elements of L_Ur should be positive. It is thus suggested to apply a secondary transformation, estimating the logarithmic values of these diagonal elements and thus, as discussed above for univariate analyses of variance components, effectively forcing them to be greater than zero.

Let v denote the vector of parameters on the new scale. In order to maximise log L with respect to the elements of v, we need to transform its derivatives accordingly. Lindstrom and Bates (1988) describe briefly how to obtain the gradient vector and Hessian matrix for a Cholesky matrix L_Ur from those for the matrix U_r (see also the corrections in Lindstrom and Bates (1994)). For the ith element of v, the required derivatives follow by the chain rule. More generally, for a one-to-one transformation (Zacks, 1971),

    ∂log L/∂v = J' ∂log L/∂θ

where J, with elements ∂θ_i/∂v_j, is the Jacobian of θ with respect to v. Similarly, the matrix of second derivatives on the new scale is given by [53], with the first part of [53] equal to the ith element of (∂J'/∂v_j)(∂log L/∂θ) and the second part equal to the ijth element of J'{∂²log L/∂θ_i∂θ_j}J.
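The gradient transformation can be illustrated numerically as below (Python/NumPy). Here f is an arbitrary smooth stand-in for log L, θ holds the distinct elements of U, v holds the Cholesky elements with log-transformed diagonals, and the Jacobian J and all gradients are obtained by central differences purely to check that ∂log L/∂v = J' ∂log L/∂θ.

    import numpy as np

    q = 3
    tri = [(s, t) for s in range(q) for t in range(s + 1)]        # (s,t) with t <= s

    def U_from_theta(th):
        U = np.zeros((q, q))
        for p, (s, t) in enumerate(tri):
            U[s, t] = U[t, s] = th[p]
        return U

    def theta_from_v(v):
        L = np.zeros((q, q))
        for p, (s, t) in enumerate(tri):
            L[s, t] = np.exp(v[p]) if s == t else v[p]            # log-transformed diagonals
        U = L @ L.T
        return np.array([U[s, t] for s, t in tri])

    B = np.diag([0.5, 1.0, 1.5])
    f_theta = lambda th: (np.linalg.slogdet(U_from_theta(th))[1]
                          + np.trace(U_from_theta(th) @ B))        # stand-in for log L
    f_v = lambda v: f_theta(theta_from_v(v))

    def num_grad(fun, x, h=1e-6):
        return np.array([(fun(x + h * e) - fun(x - h * e)) / (2 * h) for e in np.eye(len(x))])

    def num_jac(fun, x, h=1e-6):
        return np.column_stack([(fun(x + h * e) - fun(x - h * e)) / (2 * h)
                                for e in np.eye(len(x))])

    v = np.array([0.1, 0.3, -0.2, 0.2, 0.1, 0.0])                 # a point on the new scale
    theta = theta_from_v(v)
    J = num_jac(theta_from_v, v)                                  # d theta / d v
    print(np.allclose(J.T @ num_grad(f_theta, theta), num_grad(f_v, v)))   # True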

Consider one covariance matrix U_r at a time, dropping the subscript r for convenience, and let u_st and l_wx denote the elements of U and L_U, respectively. From U = LL', it follows that

    u_st = Σ_{k=1,...,min(s,t)} l_sk l_tk

where min(s, t) is the smaller value of s and t. Hence, the ijth element of J, for θ_i = u_st and v_j = l_wx, is obtained by differentiating this expression. For w ≠ x, it is non-zero only if at least one of s and t is equal to w, which leaves only a small number of cases to consider. For example, for q = 3 traits, the six elements of U are (u_11, u_12, u_13, u_22, u_23, u_33) and v = (log(l_11), l_21, l_31, log(l_22), l_32, log(l_33)).

Similarly, the ijth element of ∂²θ_k/∂v_i∂v_j (see [53]), for θ_k = u_st, v_i = l_wx and v_j = l_yz, is given by [56]. Allowing for the additional adjustments due to the log transformation of the diagonals, [56] is non-zero in five cases only. For the above example, this gives the first two derivatives of J.

In some cases, a covariance component cannot be estimated, for instance, the error covariance between two traits measured on different sets of animals. On the reparameterised scale, however, the corresponding new parameter may be non-zero. This can be accommodated by, again, not estimating this parameter but calculating its value from the estimates of the other parameters, similar to the way a correlation is fixed to a certain value by only estimating the respective variances and then calculating the corresponding covariance 'estimate' from them. For example, consider three traits with the covariance between traits 2 and 3, u_23, not estimable. Setting u_23 = 0 gives a non-zero value on the reparameterised scale of l_32 = -l_21 l_31 / l_22. Only the other five parameters (l_11, l_21, l_31, l_22 and l_33) are then estimated (ignoring the log transformation of the diagonals here for simplicity), while an 'estimate' of l_32 is calculated from those of l_21, l_31 and l_22 according to the relationship above. On transforming back to the original scale, this ensures that the covariance between traits 2 and 3 remains fixed at zero.
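A quick numerical check of this constraint (Python/NumPy, with arbitrary values for the five estimated parameters):

    import numpy as np

    l11, l21, l31, l22, l33 = 1.2, 0.4, -0.3, 0.9, 0.8     # the five estimated parameters
    l32 = -l21 * l31 / l22                                  # derived, not estimated
    L = np.array([[l11, 0.0, 0.0],
                  [l21, l22, 0.0],
                  [l31, l32, l33]])
    U = L @ L.T
    print(np.isclose(U[1, 2], 0.0))                          # covariance of traits 2 and 3 is 0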

Sparse matrix storage

Calculation of log L for use in derivative-free REML estimation under an animal model has been made feasible for practically sized data sets through the use of sparse matrix storage techniques. These adapt readily to include the calculation of the derivatives of L. One possibility is the use of so-called linked lists (see, for instance, Tier and Smith (1989)). George and Liu (1981) describe several other strategies, using a 'compressed' storage scheme when applying a Cholesky decomposition to a large symmetric, positive definite, sparse matrix.

Elements of the matrices of derivatives of L are subsets of the elements of L, ie, they exist only for non-zero elements of L. Thus the same system of pointers can be used for L and all its derivatives, reducing overheads for storage and calculation of addresses of individual elements (though at the expense of reserving space for zero derivatives corresponding to non-zero l_ij). Moreover, schemes aimed at reducing the 'fill-in' during the factorisation of M should also reduce the computational requirements for determining derivatives.

Matrix storage required in evaluating the derivatives of log L can be considerable: for p parameters to be estimated, there are p first and p(p+1)/2 second derivatives, ie, up to 1 + p(p+3)/2 times as much space as for calculating log L only can be required. Even for analyses with one random factor (animals) only, this becomes prohibitive very quickly, amounting to a factor of 28 for q = 2 traits and p = 6 parameters, and 91 for q = 3 and p = 12. However, while the matrices ∂L/∂θ_i are needed to evaluate second derivatives, the matrices ∂²L/∂θ_i∂θ_j are not required after ∂²log|C|/∂θ_i∂θ_j and ∂²(y'Py)/∂θ_i∂θ_j have been calculated from their diagonal elements. Hence, while it is most efficient to evaluate all derivatives of each l_ij simultaneously, second derivatives can be determined one at a time after L and its first derivatives have been determined. This can be done by setting up and processing each ∂²M/∂θ_i∂θ_j individually, thus reducing the memory required dramatically.
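For reference, the storage factor quoted above (L itself plus p first and p(p+1)/2 second derivative matrices of the same sparsity pattern) is easily verified:

    p_values = [6, 12]
    print([1 + p * (p + 3) // 2 for p in p_values])   # [28, 91], as in the text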

Software

Estimation was carried out on the reparameterised scale, ie, maximising log L with respect to the elements of the Cholesky decomposition of the covariance matrices (and the logarithmic values of their diagonals), as described above, to remove constraints on the parameter space. Prior to estimation, the ordering of rows and columns 1 to M - 1 in M was determined using the minimum degree re-ordering performed by George and Liu's (1981) subroutine GENQMD, and their subroutine SBMFCT was used to establish the symbolic factorisation of M (all M rows and columns) and the associated compressed storage pointers, allocating space for all non-zero elements of L. In addition, the use of the average information matrix was implemented. However, this was done merely for the comparison of convergence rates, without making use of the fact that only the derivatives of y'Py were required.

For each iterate, the optimal step size or scaling factor (see [47] and [48]) was determined by carrying out a one-dimensional, derivative-free search. This was done using a quadratic approximation of the likelihood surface, allowing for up to five approximation steps per iterate. Other techniques, such as a simple step-halving procedure, would be suitable alternatives.

Both procedures described by Smith (1995) to carry out the automatic differentiation of the Cholesky factor of a matrix were implemented. Forward differentiation was set up to evaluate L and all its first and second derivatives as efficiently as possible, ie, holding all matrices of derivatives in core and evaluating them simultaneously. In addition, it was set up to reduce memory requirements, ie, calculating the Cholesky factor and its first derivatives together and subsequently evaluating the second derivatives individually. Backward differentiation was implemented storing all first derivatives of M in core but setting up and processing one second derivative of M at a time.

EXAMPLES

To illustrate the calculations, consider the test data for a bivariate analysis given by Meyer (1991). Fitting a simple animal model, there are six parameters. Table III summarises intermediate results for the likelihood and its derivatives for the starting values used by Meyer (1991), and gives estimates from the first round of iteration for simple or modified NR or MSC algorithms. Without reparameterisation, the (unmodified) NR algorithm produced estimates out of bounds of the parameter space, while the MSC performed quite well for this first iterate. Continuing to use expected values of second derivatives, though, estimates failed to converge.

Figure 1 shows the corresponding change in log L over rounds of iteration. For this small example, with a starting value for the additive genetic covariance (σ_A12) very different from the eventual estimate, a NR or MSC algorithm optimising the step size (see equation [47]) failed to locate a suitable step size (which increased log L) in the first iterate. Essentially, all algorithms had reached about the same point on the likelihood surface by the fifth iterate, with very little change in log L in subsequent rounds.

Convergence was considered achieved when the norm of the gradient vector was less than 10^-4. This was a rather strict criterion: changes in estimates and likelihood values between iterates were usually very small before this point was reached. [...] likelihood evaluations were required to reach convergence using the observed information, compared to eight iterates and 34 likelihood evaluations for the average information, while the MSC (expected information) failed in this case. Optimising the step size (see [47]), 7 + 40, 10 + 62 and 10 + 57 iterates + likelihood evaluations were required using the observed, expected and average information, respectively. As illustrated in Figure 2, use of the 'average information' yielded a very similar convergence pattern to that of the NR algorithm. In comparison, using an EM algorithm for this example, 80 rounds of iteration were required before the change in log L between iterates was less than 10^-4, and 142 iterates were needed before the average change in estimates was less than 0.01%.

Footnotes to table III: (a) σ_Aij: additive genetic covariances; σ_Eij: residual covariances. (b) Log likelihood: values given differ from those given by Meyer (1991) by a constant offset of 3.466, due to absorption of single-link parents without records ('pruning'). (c) Components of first derivatives of log L; see text for definition. (d) First derivative of log L with respect to θ_i. (e) Second derivative of log L with respect to θ_i. (f) Expected value of second derivative of log L with respect to θ_i. (g) Second derivatives of log L with respect to θ_i and θ_j: observed values.

Table IV gives characteristics of the data structure and model of analysis for applied examples of multivariate analyses of beef cattle data. The first is a bivariate analysis fitting an animal model with maternal permanent environmental effects as an additional random effect, ie, estimating nine covariance components. The second shows the analysis of three weight traits in Zebu Cross cattle (analysed previously; see Meyer (1994a)) under three models of analysis, estimating up to 24 parameters.

For each data set and model, the computational requirements to obtain derivatives of the likelihood were determined using forward and backward differentiation as described by Smith (1995).

Footnotes to table IV: (a) NR: Newton-Raphson; MSC: method of scoring; MIX: starting as MSC, switching to NR when the change in log likelihood between iterates drops below 1.0; DF: derivative-free algorithm. (b) CB: cannon bone length; HH: hip height; WW: weaning weight; YW: yearling weight; FW: final weight. (c) 1: simple animal model; 2: animal model fitting dams' permanent environmental effect; and 5: animal model fitting both genetic and permanent environmental effects. (d) A: forward differentiation, processing all derivatives simultaneously; B: forward differentiation, processing second derivatives one at a time; C: backward differentiation. 0: no reparameterisation, unmodified; S: reparameterised scale, optimising step size, see [47] in text; T: reparameterised scale, modifying diagonals.

If the memory available allowed, forward differentiation was carried out evaluating all derivatives simultaneously; otherwise, second derivatives were processed considering one matrix ∂²M/∂θ_i∂θ_j at a time. Table IV gives computing times in minutes for calculations carried out on a DEC Alpha Chip (DIGITAL) machine running under OSF/1 (rated at about 200 Mflops). Clearly, the backward differentiation is the most competitive, with the time required to calculate 24 first and 300 second derivatives being 'only' about 120 times that to obtain log L only.

Starting values used were usually 'good', using estimates of variance components from corresponding univariate analyses throughout and deriving initial guesses for covariance components from these and literature values for the correlations concerned. A maximum of 20 iterates was carried out, applying a very stringent convergence criterion of a change in log L of less than 10^-6 between iterates or a value for the norm of the gradient vector of less than 10^-4, as above.

The numbers of iterates and additional likelihood evaluations required for each algorithm, given in table IV, show small and inconsistent differences between them. On the whole, starting with a MSC algorithm and switching to NR when the change in log L became small, and applying Marquardt's (1963) modification (see [48]), tended to be more robust (ie, to achieve convergence when other algorithm(s) failed) for starting values not too close to the eventual estimates. Conversely, these algorithms tended to be slower to converge than an NR optimising the step size (see [47]) for starting values very close to the estimates. However, once the derivatives of log L have been evaluated, it is computationally relatively inexpensive (requiring only simple likelihood evaluations) to compare and switch between algorithms, ie, it should be feasible to select the best procedure to be used for each analysis individually.

Comparisons with EM algorithms were not carried out for these examples. However, Misztal (1994) claimed that each sparse matrix inversion required for an EM step took about three times as long as one likelihood evaluation. This would mean that each iterate using second derivatives for the three-trait analysis estimating 24 highly correlated (co)variance components would require about the same time as 40 EM iterates, or that the second derivative algorithm would be advantageous if the EM algorithm required more than 462 iterates to converge.

DISCUSSION

For univariate analyses, Meyer (1994b) found no advantage in using Newton-type algorithms over derivative-free REML. However, comparing total cpu time per analysis, using derivatives appears to be highly advantageous over a derivative-free algorithm for multivariate analyses, the more so the larger the number of parameters to be estimated. Furthermore, for analyses involving larger numbers of parameters (18 or 24), the derivative-free algorithm converged several times to a local maximum, a problem observed previously by Groeneveld and Kovac (1990), while the Newton-type algorithms appeared not to be affected.

Modifications of the NR algorithm, combined with a local search for the optimum step size, improved its convergence rate. This was achieved primarily by safeguarding against steps in the 'wrong' direction in early iterates, thus making the search for the maximum of the likelihood function more robust against bad starting values.
