Restricted maximum likelihood to estimate variance components for animal models with several random effects using a derivative-free algorithm


HAL Id: hal-00893806

https://hal.archives-ouvertes.fr/hal-00893806

Submitted on 1 Jan 1989


K. Meyer


Original article

Edinburgh University, Institute of Animal Genetics, West Mains Road, Edinburgh EH9 3JN, Scotland, UK

(received 21 March 1988, accepted 11 January 1989)

Summary - A method is described for the simultaneous estimation of variance components due to several genetic and environmental effects from unbalanced data by restricted maximum likelihood (REML). Estimates are obtained by evaluating the likelihood explicitly and using standard, derivative-free optimization procedures to locate its maximum. The model of analysis considered is the so-called Animal Model, which includes the additive genetic merit of animals as a random effect and incorporates all information on relationships between animals. Furthermore, random effects in addition to animals' additive genetic effects, such as maternal genetic, dominance or permanent environmental effects, are taken into account. Emphasis is placed entirely upon univariate analyses. Simulation is employed to investigate the efficacy of three different maximization techniques and the scope for approximation of sampling errors. Computations are illustrated with a numerical example.

variance components - restricted maximum likelihood - animal model - additional random effects - derivative-free approach

Résumé - Use of restricted maximum likelihood and a derivative-free algorithm to estimate the variance components of a trait under an animal model with several random effects. A method is described for the simultaneous estimation of the variance components of a single trait that are due to the environment or to several genetic effects. The method accommodates unbalanced data and is based on restricted maximum likelihood (REML). The estimated components are obtained by evaluating the likelihood function explicitly and locating its maximum with general optimization techniques that do not require the calculation of derivatives. The model of analysis is an "animal model", in which the individual genetic value of animals is treated as a random effect and all available pedigree information is taken into account. Additional random effects (maternal genetic effects, dominance effects, permanent environmental effects) are also taken into account. Simulation is used to evaluate the efficacy of three maximization techniques and to determine approximately the distributions of the estimators. Computations are illustrated with a numerical example.


INTRODUCTION

Over the last decade, restricted maximum likelihood (REML) has become the method of choice for estimating variance components in animal breeding and related disciplines trying to partition the phenotypic variation into genetic and other components. This has been facilitated not only by an increase in the general level of computational resources available, but also by the development of numerous specialized algorithms, exploiting specific features of the data structure or model of analysis as well as utilizing a variety of numerical techniques.

So far, REML has found most practical use in the analysis of dairy cattle data under a "sire model". For this model, records of progeny are used only to obtain information on half of their sires' breeding value, while dams and relationships between females are ignored. Recently, interest has increased in more detailed models, in particular the conceptually simplest breeding value or "Animal Model" (AM), where each record is taken to provide information on the additive genetic merit of the animal measured. By including animals which do not have records themselves but are parents, this allows for all information on relationships to be taken into account.

A large proportion of REML applications have been restricted to models with one random factor (e.g. sires) apart from random residual errors, estimating two variance components only in a univariate analysis, or $p(p+1)$ for a multivariate analysis of $p$ traits. While algorithms for more complicated models have been described, they are by and large computationally demanding. Often they involve inversion of a matrix of size equal to the total number of levels of all random effects fitted. This can be prohibitive for practically sized data sets. Thus REML has found comparatively little use so far for models fitting several random effects.

Maximum likelihood estimation involves, by definition, location of the maximum of the likelihood function for a given set of data, model of analysis and parameters to be estimated. Estimating variance components for unbalanced data generally requires iterative schemes. Standard textbooks on numerical analysis classify procedures to find the optimum (minimum or maximum) of a function according to the amount of information required from derivatives of the function. The so-called Newton methods utilize both first and second derivatives, i.e. geometrically speaking slope and curvature, and are thus quickest to converge. Methods relying on first derivatives only include steepest descent, conjugate gradient and Quasi-Newton procedures approximating second derivatives. Finally, there are derivative-free methods involving direct search strategies or numerical approximation of derivatives (see for example Gill et al., 1981).

In the main, REML algorithms currently employed in animal breeding fall into the first two categories. Fisher's Method of Scoring is a special case of the Newton procedures, requiring expected values of second derivatives of the log likelihood function ($\log\mathcal{L}$) to be evaluated. As these are often difficult to obtain, Expectation-Maximization (EM) type algorithms (Dempster et al., 1977), exploiting first derivative information, are used more widely.

A derivative-free REML algorithm has been suggested by Graser et al. (1987) for univariate analyses to estimate the additive genetic and error variance under


procedure was suitable for data from large selection experiments involving several thousand animals.

This paper describes the use of a derivative-free approach to estimate variance components by REML for AMs which include not only animals' additive genetic merit but also additional random effects, and thus cover a wide range of models suitable for the analysis of animal breeding data. Only univariate analyses are considered at present; extensions to multivariate situations will be discussed elsewhere.

CALCULATING THE LIKELIHOOD

The Model

Let:

$$ y = Xb + Zu + e \qquad [1] $$

denote the linear model of analysis with:

y the vector of N observations,
b the vector of NF fixed effects (including any linear or higher order covariables),
X the N×NF incidence or design matrix for fixed effects with column rank NF*,
u the vector of all NR random effects fitted,
Z the N×NR incidence matrix for random effects, and
e the vector of N random residual errors.

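Assume:

$$ V(u) = G, \quad V(e) = R \quad \text{and} \quad \mathrm{Cov}(u, e') = 0 $$

which gives:

$$ V(y) = V = ZGZ' + R $$

The mixed model equations (MME) pertaining to [1] are then (Henderson, 1973):

$$ \begin{bmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1} \end{bmatrix} \begin{bmatrix} \hat{b} \\ \hat{u} \end{bmatrix} = \begin{bmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{bmatrix} \qquad [2] $$

or $C\Gamma = r$. If C is not of full rank, as is often the case, estimates for b are not unique.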

The Likelihood

REML operates on the likelihood of linear functions of the data vector with expectation zero, so-called error contrasts, or, equivalently, on the part of the likelihood (of the data vector) which is independent of fixed effects. As a result, the loss in degrees of freedom due to fitting of fixed effects is taken into account (Patterson & Thompson, 1971). For $y \sim N(Xb, V)$, the log likelihood is (e.g. Harville, 1977):

$$ \log\mathcal{L} = -\tfrac{1}{2}\left[\text{const} + \log|V| + \log|X^{*\prime}V^{-1}X^{*}| + (y - X\hat{b})'V^{-1}(y - X\hat{b})\right] \qquad [3] $$

where X* (of order N×NF*) is a full rank submatrix of X. Using matrix equalities, this can be written as:

$$ \log\mathcal{L} = -\tfrac{1}{2}\left[\text{const} + \log|R| + \log|G| + \log|C^{*}| + y'Py\right] \qquad [4] $$

where C* is the coefficient matrix in [2] with X replaced by X*, and P is a matrix:

$$ P = V^{-1} - V^{-1}X^{*}(X^{*\prime}V^{-1}X^{*})^{-1}X^{*\prime}V^{-1} \qquad [5] $$

Calculation of the first two terms required in [4] depends on the specific structure of R and G in a given analysis. The latter two, however, can be determined in a general fashion, as suggested by Graser et al. (1987), by Gaussian Elimination (as described in most Numerical Analysis textbooks, or by Smith & Graser (1986)) applied to the mixed model array: the coefficient matrix in [2] augmented by the right hand side and a quadratic in the data vector.

Calculation of $y'Py$ and $\log|C^{*}|$

The mixed model array for [1] is:

$$ M = \begin{bmatrix} y'R^{-1}y & y'R^{-1}X & y'R^{-1}Z \\ X'R^{-1}y & X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}y & Z'R^{-1}X & Z'R^{-1}Z + G^{-1} \end{bmatrix} \qquad [6] $$

"Absorbing" rows and columns pertaining to random effects into the rest of the matrix then gives:

$$ \begin{bmatrix} y'V^{-1}y & y'V^{-1}X \\ X'V^{-1}y & X'V^{-1}X \end{bmatrix} \qquad [7] $$

and eliminating rows and columns for fixed effects correspondingly yields $y'Py$, the weighted sum of squared residuals required to evaluate $\log\mathcal{L}$. Absorption is most easily carried out by Gaussian elimination: repeated absorption of one row and column at a time. This will also allow $\log|C^{*}|$ to be determined simultaneously.

Subdivide M of size K×K (K = NF + NR + 1) with elements $m_{ij}$ and column vectors $m_i$ into rows 1 to K-1, and row K:

$$ M = \begin{bmatrix} M_{K-1} & m_K \\ m_K' & m_{KK} \end{bmatrix} $$

Partitioned matrix results then give

$$ \log|M| = \log m_{KK} + \log|M^{*}_{K-1}| $$

with $M^{*}_{K-1} = M_{K-1} - m_K m_K'/m_{KK} = \{m_{ij} - m_{iK}m_{jK}/m_{KK}\} = \{m^{*}_{ij}\}$ the matrix resulting when "absorbing" row and column K, or "pivoting" on $m_{KK}$.

Repeated use of this result shows that the required log determinant is then simply the sum of the log of pivots ($\log m^{*}_{ii}$, i = 2, ..., K) arising when absorbing all rows and columns of M into the first row, as required to evaluate $y'Py$.
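The absorption steps can be sketched in a few lines of Python. This is a minimal dense-matrix illustration, not the sparse implementation discussed later in the paper; the function name `absorb` and the small numerical array are assumptions made for the example. Rows K, K-1, ..., 2 are absorbed in turn, the logs of the pivots are accumulated, and the element left in position (1,1) is $y'Py$.

```python
import math

def absorb(M):
    """Absorb rows/columns K, K-1, ..., 2 of a symmetric array M into row 1.

    M is a list of lists whose first row/column holds the data quadratic and
    right hand sides, and whose remaining block is the coefficient matrix.
    Returns (log_det, ypy): the sum of log pivots, i.e. the log determinant
    of the coefficient matrix, and the element remaining in position (1,1).
    """
    A = [row[:] for row in M]            # work on a copy
    K = len(A)
    log_det = 0.0
    for k in range(K - 1, 0, -1):        # 0-based index of the pivot row
        pivot = A[k][k]
        if pivot <= 0.0:
            raise ValueError("non-positive pivot: matrix not positive definite")
        log_det += math.log(pivot)
        for i in range(k):
            factor = A[i][k] / pivot
            if factor == 0.0:
                continue                 # nothing to absorb for this row
            for j in range(k):
                # m*_ij = m_ij - m_iK m_jK / m_KK (A[k][j] = m_jK by symmetry)
                A[i][j] -= factor * A[k][j]
    return log_det, A[0][0]

# Hypothetical example: y = (1, 2, 3), an overall mean, and one random level
# observed on the third record with variance ratio 1 (R = I, G^{-1} = 1).
M = [[14.0, 6.0, 3.0],
     [6.0, 3.0, 1.0],
     [3.0, 1.0, 2.0]]
log_det_c, ypy = absorb(M)
```

For this array the two pivots are 2 and 2.5, so the log determinant is log 5, and the remaining quadratic is 1.4.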

If X is not of full rank, M has to be set up replacing X by X* or, equivalently, absorptions have to be carried out


Univariate analyses

Results presented so far hold for any model of form [1]. Consider now a univariate analysis with identically and independently distributed errors, i.e.

$$ R = \sigma_E^2 I \qquad [8] $$

For given values of the other variance components, the error variance can be estimated directly in this case, from the residual sums of squares as (see Harville, 1977; or Graser et al., 1987)

$$ \hat{\sigma}_E^2 = y'Py/(N - NF^{*}) \qquad [9] $$

where $y'Py$ is evaluated with $\sigma_E^2$ factored out of V.

Let the other parameters to be estimated, i.e. (co)variances of the random effects fitted, be denoted by $\sigma_i$ with i = 1, ..., p-1, and p the total number of components with $\sigma_p = \sigma_E^2$.

As discussed by Harville & Callanan (1988), a function of REML estimates of a set of parameters is also the REML estimate of this function. Hence, instead of maximizing $\log\mathcal{L}$ with respect to the p components $\sigma_i$, we can reparameterize to $\sigma_E^2$ and p-1 functions $f_i(\sigma_i, \sigma_E^2)$ of the other components and the error variance. An obvious choice is to express the $\sigma_i$ as a proportion ($\lambda_i = \sigma_i/\sigma_E^2$) of the latter, so that having found REML estimates of $\sigma_E^2$ and the $\lambda_i$ we can estimate $\hat{\sigma}_i = \hat{\lambda}_i\hat{\sigma}_E^2$.

Furthermore, for fixed values of $\lambda_i$, $\log\mathcal{L}$ attains its maximum with respect to $\sigma_E^2$ at the REML estimate of $\sigma_E^2$. This allows estimation to be conducted in two steps. Firstly, a "concentrated" likelihood is maximized with respect to the $\lambda_i$ only, which yields REML estimates $\hat{\lambda}_i$. Secondly, $\hat{\sigma}_E^2$ is obtained (from [9]) for the $\hat{\lambda}_i$ (Harville & Callanan, 1988). The advantage of this approach is that it reduces the dimension of the numerical search for the maximum of $\log\mathcal{L}$ by one. As the number of iterates and likelihoods to be evaluated to find the maximum of $\log\mathcal{L}$ usually increases substantially with the number of parameters to be estimated, this can lead to a considerable saving in computational resources required.

From [8] it follows immediately that:

$$ \log|R| = N\log\sigma_E^2 \qquad [10] $$

$\log|G|$ depends on the random effects fitted. For the simplest model with animals as the only random effect, as considered by Graser et al. (1987):

$$ u = a, \quad G = \sigma_A^2 A \quad \text{and} \quad \log|G| = NA\log\sigma_A^2 + \log|A| \qquad [11] $$

where $\sigma_A^2$ is the additive genetic variance, A the numerator relationship matrix between animals, a the vector of (direct) genetic effects for animals, and NA denotes the number of animals. Since $\log|A|$ does not depend on the parameters to be estimated, it is a constant and does not need to be calculated in order to maximize $\log\mathcal{L}$. The inverse of A is required in [6] (for $G^{-1}$) though, but this can be set up efficiently from a list of pedigree information, following rules described, for instance, by Quaas (1976).
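As an illustration of building $A^{-1}$ directly from a pedigree list, the sketch below applies the simplified rules usually attributed to Henderson, which ignore inbreeding; the rules of Quaas (1976) cited above additionally account for inbreeding and are not reproduced here. The function name `a_inverse` and the pedigree encoding are assumptions made for this example.

```python
def a_inverse(pedigree):
    """Inverse numerator relationship matrix, ignoring inbreeding.

    pedigree: list with one (sire, dam) tuple per animal, using 0-based
    animal indices and None for an unknown parent. Parents are assumed to
    appear before their offspring.
    """
    n = len(pedigree)
    ainv = [[0.0] * n for _ in range(n)]
    for i, (sire, dam) in enumerate(pedigree):
        parents = [p for p in (sire, dam) if p is not None]
        # Inverse Mendelian sampling variance: 2, 4/3 or 1 for
        # two, one or no known parents (no inbreeding assumed).
        alpha = 1.0 / (1.0 - 0.25 * len(parents))
        ainv[i][i] += alpha
        for p in parents:
            ainv[i][p] -= 0.5 * alpha
            ainv[p][i] -= 0.5 * alpha
            for q in parents:
                ainv[p][q] += 0.25 * alpha
    return ainv

# Two unrelated founders (animals 0 and 1) and their offspring (animal 2).
ainv = a_inverse([(None, None), (None, None), (0, 1)])
```

Each animal contributes to at most nine elements, so the cost is linear in the number of animals and the result is sparse, which is what makes $A^{-1}$ cheap to set up compared with inverting A.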


Let c of length NC denote a vector of such effects to be included in the model of analysis, with $V(c) = \sigma_C^2 I$ and $\mathrm{Cov}(a, c') = 0$. This gives:

$$ \log|G| = NA\log\sigma_A^2 + NC\log\sigma_C^2 + \log|A| \qquad [12] $$

In other cases, the model of analysis may involve two random effects for each animal. Let m, of length NA, denote the second animal effect and assume each element has variance $\sigma_M^2$. If there are repeated records per animal for a trait, m represents the permanent effects due to animals, excluding additive genetic effects. These are usually assumed to be uncorrelated with any other effects in the model, so that

$$ \log|G| = NA(\log\sigma_A^2 + \log\sigma_M^2) + \log|A| \qquad [13] $$

If m had variance $\sigma_M^2 D$, [13] would be augmented by $\log|D|$. As with $\log|A|$, this term is constant and does not need to be evaluated. Note though that $G^{-1}$ and consequently $D^{-1}$ is required in [6]. A typical example for this kind of structure is a model where m stands for dominance effects, $\sigma_M^2$ for the respective variance and D for the dominance covariance matrix among animals.

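For other traits, for example measures of reproductive performance, we distinguish between a direct and a maternal (or paternal) additive-genetic component, allowing for a covariance between the two. In that situation, there may not be a record supplying information on m for each animal, but information is acquired indirectly via links arising from the genetic covariance and relationships. With $\sigma_{AM}$ denoting the covariance between a and m and $r_{AM}$ the corresponding correlation,

$$ G = \begin{bmatrix} \sigma_A^2 A & \sigma_{AM} A \\ \sigma_{AM} A & \sigma_M^2 A \end{bmatrix} $$

and partitioned matrix results give

$$ \log|G| = NA\log(\sigma_A^2\sigma_M^2 - \sigma_{AM}^2) + 2\log|A| = NA\left[\log\sigma_A^2 + \log\sigma_M^2 + \log(1 - r_{AM}^2)\right] + 2\log|A| \qquad [14] $$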

For all models discussed so far, computational requirements to determine the part of $\log|G|$ which depends on the parameters to be estimated are trivial. This results from random effects being either uncorrelated, so that G is blockdiagonal ([12] and [13]), or G being the direct product of a matrix of parameters and a matrix describing correlations amongst levels of random effects as in [14]. Extensions to other models are straightforward, as long as G can be partitioned into blocks of such structure. For example, fitting permanent environmental effects (c) as well as direct and maternal additive genetic effects, [14] would be augmented simply by ($NC\log\sigma_C^2$), provided c was uncorrelated to a and m. Table I summarizes $\log\mathcal{L}$ for 10 models which may arise in the analysis of animal breeding data, with up to 3 random effects and involving up to 5 (co)variance components.
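The parameter-dependent part of $\log|G|$ for the model structures just described is cheap to compute. The following Python sketch is a hypothetical helper (the name `log_det_g_param_part` and its argument names are assumptions); it drops the constant terms $\log|A|$ and $\log|D|$, which do not affect the location of the maximum.

```python
import math

def log_det_g_param_part(na, var_a, var_m=None, cov_am=0.0, nc=0, var_c=None):
    """Part of log|G| that depends on the (co)variance components.

    na     : number of animals
    var_a  : additive genetic variance
    var_m  : variance of a second animal effect (None if not fitted)
    cov_am : covariance between the two animal effects (0 if uncorrelated)
    nc     : number of levels of an additional uncorrelated effect c
    var_c  : variance of c (required if nc > 0)
    """
    if var_m is None:
        total = na * math.log(var_a)
    else:
        # Determinant of the 2x2 parameter matrix of the direct product.
        det = var_a * var_m - cov_am ** 2
        if det <= 0.0:
            raise ValueError("parameters outside the parameter space")
        total = na * math.log(det)
    if nc > 0:
        total += nc * math.log(var_c)
    return total

# Hypothetical values: 10 animals, direct and maternal effects, 4 levels of c.
value = log_det_g_param_part(10, 50.0, var_m=20.0, cov_am=-5.0, nc=4, var_c=8.0)
```

With `cov_am = 0` the first branch of the two-effect case reduces to the sum of the two log variances, so the same function covers both the uncorrelated and the correlated structure.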

Otherwise, G (or


using techniques as described above for $\log|C^{*}|$. For instance, if G contained a block of form

the contribution to $\log|G|$ would be

Assume $V(a) = \sigma_A^2 A$, $\mathrm{Cov}(a, c') = \mathrm{Cov}(m, c') = 0$ and $V(c) = \sigma_C^2 I$ for all models. Terms are assumed to be the result of Gaussian Eliminations performed for M with $\sigma_E^2$ factored out.


Computational Considerations

Typically, the augmented coefficient matrix M is very large but also very sparse. Hence use of sparse matrix techniques, storing the non-zero elements of M only, is advantageous and allows matrices of order of thousands to be handled. Since M is symmetric, only the lower (or upper) triangle is required. One form of sparse matrix storage, described in standard text books such as Knuth (1973), is a so-called "linked list". Such linked lists, one list for each row of M in conjunction with a vector pointing to the first element in each row, are well suited, and allow the Gaussian Elimination steps required to evaluate $y'Py$ and $\log|C^{*}|$ to be carried out efficiently.

In setting up M, the order of equations can be of vital importance as it affects the "fill-in" during the absorption process, i.e. the number of additional non-zero off-diagonal elements arising. For computational efficiency this should be kept as small as possible. There is extensive literature concerned with numerical operations on sparse matrices. Tewarson (1973), for example, discusses techniques for the choice of pivot in each Gaussian Elimination step which yields the least local fill-in, and also considers the scope of a priori column permutations. A number of strategies for re-ordering matrices exist, often utilizing graph theory; see for instance Duff et al. (1986). Such general techniques, making few or no assumptions about the matrix structure, can be computationally expensive. This may be prohibitive for situations where the direct solution of a large sparse system of equations is required a few times only, but may be worthwhile for our application where numerous likelihood evaluations are to be performed. Future research should consider this topic. In the meantime, critical inspection of the data and relationship structure with their implications for the pattern of off-diagonal elements in the mixed model array, and judicious ordering of effects may achieve a large proportion of the potential benefits from general reordering algorithms.
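The effect of equation order on fill-in can be illustrated on the sparsity pattern alone. The Python sketch below is a hypothetical helper (`count_fill_in` is not from the paper) that performs symbolic elimination on the graph of off-diagonal non-zeros and counts the new non-zero pairs created, assuming no accidental numerical cancellation.

```python
def count_fill_in(adjacency, order):
    """Count structural fill-in for a given elimination order.

    adjacency: dict mapping each equation to the set of equations with which
               it shares a non-zero off-diagonal element (symmetric pattern).
    order:     sequence of equations in the order they are absorbed.
    """
    adj = {v: set(nb) for v, nb in adjacency.items()}
    fill = 0
    for v in order:
        neighbours = sorted(adj.pop(v))
        for u in neighbours:
            adj[u].discard(v)
        # Absorbing v links all of its remaining neighbours pairwise.
        for idx, a in enumerate(neighbours):
            for b in neighbours[idx + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
    return fill

# Hypothetical pattern: equation 0 is linked to equations 1..4,
# which are not linked to each other.
pattern = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
fill_hub_first = count_fill_in(pattern, [0, 1, 2, 3, 4])
fill_hub_last = count_fill_in(pattern, [1, 2, 3, 4, 0])
```

Absorbing the densely connected equation first creates six new off-diagonal pairs, while absorbing it last creates none, which is the motivation for processing rows with the fewest off-diagonals first.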

A standard strategy in attempting to minimize fill-in is to process rows with the fewest off-diagonals first. Graser et al. (1987) therefore suggested selection of pivots corresponding to the youngest animals first. For the models with several random effects for each animal, these should be assigned to successive rows. In other cases, it may be possible to exploit additional features of the data structure. For data from a multi-generation selection experiment with selection within families, for example, grouping of animals according to female "founders" appears preferable to a grouping according to generation. On the other hand, if animals are nested within contemporary (fixed) groups, it may be advantageous to order equations so that animals directly follow their group effects.

For R of form [8], $\sigma_E^{-2}$ is usually factored from [6]. In this case, calculations to determine $y'Py$ and $\log|C^{*}|$ as described above do not yield the terms required in [4] directly, but ($y'Py\,\sigma_E^2$) and ($\log|C^{*}| + (NF^{*} + NR)\log\sigma_E^2$), which has to be borne in mind when assembling the likelihood.
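To make the bookkeeping concrete, the sketch below assembles $-2\log\mathcal{L}$ (up to a constant) for the simplest animal model from the two quantities obtained with $\sigma_E^2$ factored out. It is a minimal illustration under stated assumptions: animals are the only random effect (NR = NA), $\lambda = \sigma_A^2/\sigma_E^2$ as in the text, and the constants and $\log|A|$ are dropped. The function and argument names are hypothetical.

```python
import math

def minus_two_log_l(ypy_scaled, log_det_c_scaled, n, nf_star, na, lam):
    """Assemble -2 log L (constants omitted) for the simplest animal model.

    ypy_scaled       : y'Py * sigma_E^2, from absorption of the scaled array
    log_det_c_scaled : log|C*| + (NF* + NR) log sigma_E^2, sum of log pivots
    n, nf_star, na   : numbers of records, independent fixed effects, animals
    lam              : sigma_A^2 / sigma_E^2
    Returns (-2 log L, estimate of sigma_E^2).
    """
    nr = na                                   # animals are the only random effect
    sigma2_e = ypy_scaled / (n - nf_star)     # equation [9]
    sigma2_a = lam * sigma2_e
    log_r = n * math.log(sigma2_e)            # equation [10]
    log_g = na * math.log(sigma2_a)           # equation [11] without log|A|
    log_c = log_det_c_scaled - (nf_star + nr) * math.log(sigma2_e)
    ypy = ypy_scaled / sigma2_e               # equals n - nf_star at the estimate
    return log_r + log_g + log_c + ypy, sigma2_e

# Hypothetical values: 3 records, 1 fixed effect, 1 animal, lambda = 1.
value, sigma2_e = minus_two_log_l(1.4, math.log(5.0), 3, 1, 1, 1.0)
```

Collecting terms shows that, for this model, the result equals $(N - NF^{*})\log\hat{\sigma}_E^2 + NA\log\lambda + \log|C^{*}_s| + (N - NF^{*})$, where $\log|C^{*}_s|$ is the scaled log determinant; this is the concentrated function of $\lambda$ that the search procedures then minimize.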

MAXIMIZING THE LIKELIHOOD

Choice of a strategy to locate the maximum of the likelihood function, or equivalently the minimum of $-2\log\mathcal{L}$, is determined by several considerations. Firstly,


more demanding than any calculations required by the optimization procedure as such. Hence, a method which requires a small number of function evaluations in each iterate is desirable. Secondly, the procedure should be robust, i.e. cope with starting values for parameters considerably different from the eventual estimates, and should be little affected by problems of numerical accuracy, yielding sufficiently precise estimates of the minimum even for very flat functions. Thirdly, constraints on the parameter space should be accommodated and, preferably, not require extra function values or reduce the speed of convergence.

The suitability of three different approaches was examined using simulated data for Models 1, 2, 4 and 8 as specified in Table I. Records were sampled according to the model of analysis for one or several generations (up to four), each comprising a given number of full-sib families (ranging from 25 to 800) of variable size (2 to 10), with dams nested within sires and each sire mated to a specified number of dams (1 to 5). Error variances were estimated directly, while all other components were expressed as a proportion of the phenotypic variance, i.e. $\theta_A$, $\theta_M$, $\theta_C$ and $\theta_{AM}$ for $\sigma_A^2$, $\sigma_M^2$, $\sigma_C^2$ and $\sigma_{AM}$, respectively. Obviously, $\theta_A$ is the heritability and $\theta_C$ what is commonly referred to as the "$c^2$ effect". As described above, this reduced the dimension of search to 1, 2, 3 and 4 for Models 1, 2, 4 and 8, respectively. This parameterization, rather than expressing components as a proportion of the error variance ($\lambda_i$), was chosen since it allowed checks for parameter estimates out of bounds more readily and, for the limited cases examined, appeared to be more robust against bad starting values.

Quadratic approximation

For a model with animals as the only random effect, Graser et al. (1987) fitted a quadratic function in $r = \sigma_A^2/\sigma_E^2$ to the log likelihood, predicting the maximum of $\log\mathcal{L}$ as the maximum of this function. For one parameter, this required function values for 3 different r values per approximation. Having calculated $\log\mathcal{L}$ for 3 initial points, each iterate then involved one function evaluation, for r* which maximized the quadratic function of the previous step. This value and those pertaining to the two r values either side closest to r* were then utilized in determining the next quadratic approximation to $\log\mathcal{L}$. As reported by Graser et al. (1987), simulations for this model showed rapid convergence. A bad initial guess for r generally did not affect the estimation procedure greatly, as long as the three points in the initial approximation spanned a sufficiently large range. Though the number of iterates and likelihood evaluations required tended to increase, the same maximum of $\log\mathcal{L}$ as for "good" starting values was attained without increasing computational demands excessively.
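A single step of the one-parameter search can be sketched as follows. This is an illustrative Python helper (the name `quadratic_maximum` is an assumption, and the test function stands in for an actual likelihood evaluation): it fits the parabola through three (r, log L) pairs using divided differences and returns the r value at which that parabola is maximal.

```python
def quadratic_maximum(points):
    """Return r* maximizing the parabola through three (r, log_l) pairs.

    Returns None if the three points do not define a parabola with a
    maximum (non-negative curvature). The r values must be distinct.
    """
    (x0, f0), (x1, f1), (x2, f2) = points
    d01 = (f1 - f0) / (x1 - x0)           # first divided differences
    d12 = (f2 - f1) / (x2 - x1)
    curvature = (d12 - d01) / (x2 - x0)   # coefficient of r^2
    if curvature >= 0.0:
        return None
    # p(r) = f0 + d01 (r - x0) + curvature (r - x0)(r - x1); solve p'(r) = 0.
    return 0.5 * (x0 + x1) - d01 / (2.0 * curvature)

# Stand-in for a log likelihood that is exactly quadratic with maximum at r = 2.
def toy_log_l(r):
    return -(r - 2.0) ** 2 + 5.0

r_star = quadratic_maximum([(r, toy_log_l(r)) for r in (0.0, 1.0, 4.0)])
```

In an actual search the likelihood would then be evaluated at r*, and r* together with its two closest neighbours on either side would form the next set of three points.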

This approach extends to the case of multiple parameters. For t, with elements $\theta_i$, denoting the vector of parameters with respect to which $\log\mathcal{L}$ is to be maximized, and $\log\mathcal{L}(t)$ the corresponding log likelihood, the quadratic approximation is:

$$ \log\mathcal{L}(t) \approx q_0 + q't + t'Qt $$


For p parameters, a total of $z = 1 + p(p+3)/2$ different values of t and $\log\mathcal{L}(t)$ are required in each iterate to set up and solve a system of z equations for the intercept $q_0$, the vector of linear coefficients q and the symmetric matrix of quadratic coefficients Q. This number increases rapidly with the number of parameters, e.g. z = 6, 10, 15 and 21 for p = 2, 3, 4 and 5, respectively.

For one parameter, choice of the point to be replaced in each iterate was straightforward. In the multi-dimensional case, however, it was less obvious. Two strategies were explored. After z initial points had been obtained, the first involved, as for p = 1, in the regular case one function evaluation per iterate, i.e. calculation of $\log\mathcal{L}(t^{*})$ for t* from the last iterate. This new point was added to the set of z points which formed the basis to predict t* in the previous step. The worst of the resulting set of z + 1 points was then eliminated, and a new vector t* determined. If the quadratic approximation failed, i.e. if $\log\mathcal{L}(t^{*})$ was lower than all z function values in the set, t* was replaced by $(t^{*} + t_m)/2$, where $t_m$ was the parameter vector with highest function value in the set. If necessary, this was repeated until the replacement was successful. Hence, each iterate increased the average likelihood of the z current points.

The second strategy comprised z function evaluations per iterate. Given a vector of starting values $t_0$ (t* from the previous iterate), p vectors $t_i$ were derived by multiplying the i-th element of $t_0$ by a factor reflecting a chosen step size, 1.10 for steps of 10% in this case. Following a scheme described by Nelder & Mead (1965), further parameter vectors were then determined as $(t_i + t_j)/2$ for i < j = 0, ..., p. This yielded the required total of z grid points and subsequent estimate t*. For both strategies, all vectors t were checked for elements out of the parameter space, and if necessary these were set to their respective bounds.
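The grid of the second strategy, and the count z it must produce, can be checked with a short Python sketch. The function names are assumptions for illustration; the construction follows the description above: the starting vector, p vectors with one element scaled by the step factor, and the midpoints of all pairs of these p + 1 vectors.

```python
def points_needed(p):
    """Number of points z = 1 + p(p + 3)/2 for a full quadratic in p parameters."""
    return 1 + p * (p + 3) // 2

def quadratic_grid(t0, step=1.10):
    """Build the grid of parameter vectors for one iterate of the second strategy."""
    p = len(t0)
    base = [list(t0)]
    for i in range(p):
        ti = list(t0)
        ti[i] *= step                     # scale the i-th element only
        base.append(ti)
    grid = [list(v) for v in base]
    for i in range(p + 1):
        for j in range(i + 1, p + 1):     # midpoints for i < j = 0, ..., p
            grid.append([(a + b) / 2.0 for a, b in zip(base[i], base[j])])
    return grid

# Hypothetical starting values for a three-parameter search.
grid = quadratic_grid([0.30, 0.15, 0.05])
```

The count works out because (p + 1) base vectors plus p(p + 1)/2 midpoints equals 1 + p(p + 3)/2. Note that a multiplicative step leaves an element unchanged when its starting value is zero, so an additive step would be needed in that case.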

The quadratic approximation performed well for Model 2, though, for the limited number of examples considered, it was not consistently better than the two alternative procedures studied, in terms of the number of likelihood evaluations required. For Models 4 and 8, however, where the data structure was such that only a small proportion of animals had direct information on the second genetic effect, problems of numerical accuracy occurred. Often the system of z equations to be solved was indeterminate or almost so. Typically this yielded non-positive definite estimates of Q and useless predictions of t*. For the second strategy, an alternative approach, slightly more robust, was tried. This consisted of estimating elements of q and Q by numerical differentiation, i.e. as forward-difference approximations to the first and second derivatives of $\log\mathcal{L}$, respectively.

On the whole, quadratic approximation of the likelihood function involving multiple parameters appeared to be unsuitable as a general search procedure. For a one-dimensional search, however, it performed consistently best among the 3 strategies examined.

Quasi-Newton

Procedures which do not require second derivatives of the function to be minimized, but approximate the Hessian matrix (= matrix of second derivatives), are referred to as Quasi-Newton methods. This approximation is usually performed iteratively,


vectors of changes in gradients (= first derivatives) and estimates between iterates. While most Quasi-Newton procedures require first derivatives, some are derivative-free, approximating the vector of gradients using finite differences. These have been found to show quadratic convergence and are recommended as the derivative-free methods to be used for smooth functions with continuous derivatives. Further details are beyond the scope of this paper and can be found in standard textbooks, for instance Gill et al. (1981).
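The finite-difference step that makes such a procedure derivative-free can be sketched as below. This is a generic forward-difference gradient in Python, not the internals of any particular library routine; the function name, the relative step size and the toy objective are assumptions. When the function value at the current point is already known, the approximation costs one extra function evaluation per parameter.

```python
def forward_difference_gradient(f, t, f_t=None, rel_step=1e-4):
    """Approximate the gradient of f at t by forward differences.

    f        : function of a list of parameters, e.g. -2 log L
    t        : current parameter vector
    f_t      : f(t) if already available (saves one evaluation)
    rel_step : step size relative to max(|t_i|, 1)
    """
    if f_t is None:
        f_t = f(t)
    gradient = []
    for i, ti in enumerate(t):
        h = rel_step * max(abs(ti), 1.0)
        shifted = list(t)
        shifted[i] = ti + h               # perturb one parameter at a time
        gradient.append((f(shifted) - f_t) / h)
    return gradient

# Toy objective standing in for -2 log L; its exact gradient at (2, 0) is (2, 2).
calls = []
def toy_objective(t):
    calls.append(1)
    return (t[0] - 1.0) ** 2 + 2.0 * (t[1] + 0.5) ** 2

f0 = toy_objective([2.0, 0.0])
calls.clear()
grad = forward_difference_gradient(toy_objective, [2.0, 0.0], f_t=f0)
```

The truncation error of a forward difference is proportional to the step size, which is one reason why very flat likelihood surfaces are difficult for this class of methods.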

Statistical library subroutines to find the minimum of a function using a Quasi-Newton algorithm, namely NAG routine E04JBF and IMSL routine ZXMIN, have been applied to $-2\log\mathcal{L}$. These routines require the user to supply a subroutine to evaluate the function to be minimized, passing the number and vector of parameters as arguments and returning the function value. In addition, starting values for the parameters, the maximum number of iterates or function calls allowed and some criteria of accuracy of evaluation required and rounding errors tolerated have to be specified.

E04JBF provided the facility to constrain parameters between fixed upper and lower bounds (the IMSL equivalent ZXMWD was not tested), i.e. 0 to 1 for $\theta_A$, $\theta_M$ and $\theta_C$, and -1 to 1 for $\theta_{AM}$. However, to impose these constraints, the routine required function values, setting all parameters simultaneously to their upper or lower limits. Obviously, this violated other restrictions on the parameter space in the genetic context, i.e. that the sum of components is bounded correspondingly ($0 < \theta_A + \theta_M + \theta_C + \theta_{AM} \le 1$), and that the absolute value of the genetic correlation has a maximum of unity ($|\theta_{AM}| \le \sqrt{\theta_A\theta_M}$). Consequently, $\log\mathcal{L}$ could often not be determined for the required constellation since M became negative definite or $y'Py$ assumed a negative value. Hence minimization was carried out unconstrained. Techniques to implement more complicated constraints exist, and further research should investigate their suitability for the kind of models which are of interest in animal breeding.
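The restrictions on the parameter space listed above are easy to test before a likelihood evaluation is attempted. The Python sketch below is a hypothetical helper (the name `in_parameter_space` is an assumption); it shows why setting all parameters to their upper limits at once falls outside the admissible region even though each simple bound is respected.

```python
import math

def in_parameter_space(theta_a, theta_m=0.0, theta_c=0.0, theta_am=0.0):
    """Check proportions of the phenotypic variance against the genetic constraints."""
    # Simple bounds on the individual parameters.
    if not (0.0 <= theta_a <= 1.0 and 0.0 <= theta_m <= 1.0 and 0.0 <= theta_c <= 1.0):
        return False
    if not -1.0 <= theta_am <= 1.0:
        return False
    # The components must sum to at most the phenotypic variance.
    total = theta_a + theta_m + theta_c + theta_am
    if not 0.0 < total <= 1.0:
        return False
    # The direct-maternal genetic correlation must not exceed unity in absolute value.
    return abs(theta_am) <= math.sqrt(theta_a * theta_m)

population_values_ok = in_parameter_space(0.50, 0.20, 0.0, -0.05)
all_upper_limits_ok = in_parameter_space(1.0, 1.0, 1.0, 1.0)
```

A check of this kind could be used to return a large penalty value to an unconstrained minimizer instead of attempting to evaluate the likelihood; whether that is adequate in practice is not examined here.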

Unless a parameter vector was encountered for which $-2\log\mathcal{L}$ could not be evaluated, the Quasi-Newton algorithms performed well for all models examined. Each iterate required p function evaluations to approximate the vector of first derivatives. The number of iterates performed depended on user-specified criteria of accuracy and maximum number of function evaluations allowed. If likelihood functions were very flat, routines would stop before the minimum of $-2\log\mathcal{L}$ was determined as accurately as desired, flagging problems of numerical accuracy.

Figure 1 illustrates the typical pattern of changes in likelihood and estimates observed for an analysis under Model 4 for a "good" and a "bad" initial guess of parameter values. The simulated data for this example comprised 2 generations with 100 full-sib families of size 4 to 8 and 25 half-sib families each. Records were sampled for population values of $\sigma_P^2 = 100$ (phenotypic variance), $\theta_A = 0.50$, $\theta_M = 0.20$ and $\theta_{AM} = -0.05$. Starting values used were the population values (Set I) and $\theta_A = 0.30$, $\theta_M = 0.15$ and $\theta_{AM} = 0.05$ (Set II), respectively. For Set I, ZXMIN required 93 likelihood evaluations, for a specified accuracy of 6 significant digits. For Set II, however, the routine used 204 function calls before it considered the minimum of $-2\log\mathcal{L}$ to be found, although Figure 1 suggests that likelihood


Simplex

The Simplex method of Nelder & Mead (1965) is generally advocated as the derivative-free procedure to use if the multivariate function to be minimized is discontinuous, though it was initially developed with the maximization of a likelihood function in mind. It relies on a comparison of function values without attempting to utilize any statistics related to derivatives of the function. Such optimization techniques are generally referred to as direct search procedures. While they have often been developed by heuristic approaches without proof of convergence, they have been found to be effective in practice (Swann, 1972).

The Simplex or Polytope method, as some authors prefer to call it to avoid confusion with the Simplex technique used in Linear Programming, was initially suggested by Spendley et al. (1962). It operates on a set of parameter vectors and their pertaining function values, which form the vertices of a simplex in the parameter space, hence its name. As reviewed by Swann (1972), it is based on the concept of "evolutionary operations", developed to optimize productivity of industrial plants, in which the parameter space is searched following some geometric configuration.

The design which requires the least number of points, and hence makes most efficient use of the function values calculated, is a regular simplex. This is defined simply as a set of mutually equidistant points, $n + 1$ for a simplex of dimension $n$. For two dimensions, for example, the regular simplex is an equilateral triangle. A useful property of a regular simplex is that a new simplex can be formed from the existing simplex by addition of a single new point.
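To make the definition concrete, the sketch below builds $n + 1$ mutually equidistant points in $n$ dimensions and returns their pairwise distances. The particular construction (one vertex at the origin, the others with one coordinate `p_coord` and the rest `q_coord`) is a commonly used one and is an assumption here; the text does not say how the regular simplex was set up, and the function names are hypothetical.

```python
import math
from itertools import combinations


def regular_simplex(n, edge=1.0):
    """Return n + 1 mutually equidistant points in n dimensions."""
    s = math.sqrt(n + 1.0)
    p_coord = edge * (s + n - 1.0) / (n * math.sqrt(2.0))
    q_coord = edge * (s - 1.0) / (n * math.sqrt(2.0))
    points = [[0.0] * n]  # first vertex at the origin
    for i in range(n):
        # Vertex i + 1: p_coord in position i, q_coord elsewhere.
        points.append([p_coord if j == i else q_coord for j in range(n)])
    return points


def edge_lengths(points):
    """Euclidean distances between all pairs of vertices."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]
```

For `n = 2` and unit edge length the three points form an equilateral triangle, so all values returned by `edge_lengths` equal 1 up to rounding.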

The search proceeds as follows. To begin, a simplex of specified size is set up, including the point representing an initial guess for the optimum, and corresponding function values are obtained. The aim in each iterate then is to replace the worst point, i.e. for a minimization problem the point with the highest function value. The new point, defining the next simplex, is chosen so as to preserve the geometric shape, in a direction away from the discarded point but passing through the center of the remaining points. This is repeated until the simplex straddles the optimum. Reducing the size of the simplex, search then recommences with the new, smaller design, or terminates if the simplex becomes smaller than a specified limit (Swann, 1972).

Subsequently, the Simplex method was modified by Nelder & Mead (1965). Abandoning the regularity of design, their procedure allows the simplex to rescale itself automatically in each iterate, changing shape and size according to the landscape of the surface searched. This adaptability is achieved by a combination of so-called reflection, expansion and contraction steps.

Consider a function $F = f(t)$ to be minimized with respect to $p$ independent variables, and let $t_0$ denote the guess for the optimum to start with. The initial simplex then comprises $t_0$ and $p$ other points $t_i$ ($i = 1, \ldots, p$) obtained by modifying one co-ordinate of $t_0$ at a time by a chosen step size. Each iterate begins by ordering and renumbering the points in the simplex according to the pertaining function values. The following three points are identified: $t_0$ with function value $F_0$ is now the currently best point in the simplex ($F_0 < F_i$, $i = 1, \ldots, p$), $t_p$ with function value $F_p$ is the worst point ($F_p > F_i$, $i = 0, \ldots, p-1$) which is to be replaced in this iterate, and $t_{p-1}$ is the next to worst point ($F_{p-1} < F_p$ and $F_{p-1} > F_i$, $i = 0, \ldots, p-2$).

Next, the center of the points to remain in the simplex is obtained as:

$$\bar{t} = \frac{1}{p} \sum_{i=0}^{p-1} t_i$$

i.e. by averaging each coordinate over the set of points. A trial point is then generated by "reflecting" $t_p$ towards the center:

$$t_r = (1 + \alpha)\,\bar{t} - \alpha\,t_p$$

where the reflection coefficient $\alpha$ is a positive constant. The corresponding function value $F_r$ is compared with $F_0$ to $F_p$ to determine further steps in the iterate. Three possible outcomes are distinguished.

Firstly, $t_r$ can be better than the worst point but not better than the best point ($F_0 < F_r < F_p$). In this case, $t_r$ replaces $t_p$ and a new iterate is started.

Secondly, if $t_r$ is the new best point ($F_r < F_0$), it is assumed that the direction of search has been a good one and search is "expanded" in this direction by examining a point

$$t_e = \gamma\,t_r + (1 - \gamma)\,\bar{t}$$

with corresponding function value $F_e$. The expansion coefficient $\gamma$ is a positive constant. If $F_e$ is less than $F_r$, the expansion has been successful and $t_e$ replaces $t_p$. Otherwise $t_e$ is discarded and $t_r$ is substituted for $t_p$ to complete the iterate.

Thirdly, $t_r$ may be worse than any of the remaining $p$ points ($F_r > F_{p-1}$). This leads to "contraction" of the simplex, obtaining a new point $t_c$ with pertaining function value $F_c$ as:

$$t_c = \beta\,t_r + (1 - \beta)\,\bar{t}$$

if $F_r$ is less than $F_p$, and as

$$t_c = \beta\,t_p + (1 - \beta)\,\bar{t}$$

otherwise. The contraction coefficient $\beta$ is a constant in the range from 0 to 1. If $F_c$ is less than the smaller of $F_r$ and $F_p$, the contraction has been successful and $t_c$ replaces $t_p$ to end the iterate. If not, the complete simplex is shrunk by moving

$$t_i \leftarrow (t_i + t_0)/2$$

for $i = 1, \ldots, p$, before a new iterate is started.

Suggested values for $\alpha$, $\beta$ and $\gamma$ are 1.0, 0.5 and 2.0, respectively (Nelder & Mead, 1965). Full computational details together with a flow diagram can be found in Nelder & Mead (1965), and a FORTRAN implementation for example in O'Neill (1971).
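The reflection, expansion, contraction and shrinkage steps can be put together in a few lines. The Python sketch below is an illustration only and is not the FORTRAN code referred to above. It assumes the standard Nelder & Mead update formulas, an additive step for the initial simplex, the variance of function values (with divisor $p + 1$) as stopping rule, and a cap on function evaluations; the name `nelder_mead` and all argument names are hypothetical.

```python
def nelder_mead(f, t0, step=0.2, alpha=1.0, beta=0.5, gamma=2.0,
                tol=1e-8, max_evals=2000):
    """Minimize f by a basic Nelder & Mead simplex search (sketch)."""
    p = len(t0)
    # Initial simplex: t0 plus p points, each shifting one coordinate of t0.
    simplex = [list(t0)]
    for i in range(p):
        t = list(t0)
        t[i] += step
        simplex.append(t)
    values = [f(t) for t in simplex]
    evals = p + 1

    def combine(a, x, b, y):
        # Linear combination a*x + b*y of two points.
        return [a * xi + b * yi for xi, yi in zip(x, y)]

    while evals < max_evals:
        # Order points from best (index 0) to worst (index p).
        order = sorted(range(p + 1), key=lambda i: values[i])
        simplex = [simplex[i] for i in order]
        values = [values[i] for i in order]
        # Stop when the variance of function values falls below tol.
        mean = sum(values) / (p + 1)
        if sum((v - mean) ** 2 for v in values) / (p + 1) < tol:
            break
        # Center of all points except the worst.
        centre = [sum(t[j] for t in simplex[:p]) / p for j in range(p)]
        t_r = combine(1.0 + alpha, centre, -alpha, simplex[p])  # reflection
        f_r = f(t_r)
        evals += 1
        if f_r < values[0]:
            # New best point: try expanding further in the same direction.
            t_e = combine(gamma, t_r, 1.0 - gamma, centre)
            f_e = f(t_e)
            evals += 1
            if f_e < f_r:
                simplex[p], values[p] = t_e, f_e
            else:
                simplex[p], values[p] = t_r, f_r
        elif f_r <= values[p - 1]:
            # Neither best nor worse than the next-to-worst: accept it.
            simplex[p], values[p] = t_r, f_r
        else:
            # Contract from the better of the reflected and the worst point.
            if f_r < values[p]:
                base, f_base = t_r, f_r
            else:
                base, f_base = simplex[p], values[p]
            t_c = combine(beta, base, 1.0 - beta, centre)
            f_c = f(t_c)
            evals += 1
            if f_c < f_base:
                simplex[p], values[p] = t_c, f_c
            else:
                # Failed contraction: move all points halfway to the best.
                for i in range(1, p + 1):
                    simplex[i] = combine(0.5, simplex[i], 0.5, simplex[0])
                    values[i] = f(simplex[i])
                evals += p
    best = min(range(p + 1), key=lambda i: values[i])
    return simplex[best], values[best], evals
```

In an application along the lines of the text, `f` would return $-2 \log \mathcal{L}$ for a given vector of variance ratios; the default coefficients are the values suggested by Nelder & Mead (1965).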

Since differences in function values are not utilized in establishing the direction of search, restrictions on the parameter space can be imposed easily by assigning a very large positive function value to parameter vectors out of bounds. As pointed out by Nelder & Mead (1965), this will be followed automatically by a contraction step which will eventually keep estimates within their bounds. The convergence criterion suggested by Nelder & Mead (1965) is the variance, or standard deviation, of function values in the simplex, rather than the more conventionally used change in estimates between iterates. The rationale for this was that in statistical problems such as maximum likelihood estimation, the curvature of the surface near the minimum reflects the amount of information available on the parameters. If the surface is flat, sampling errors are large and it is not worthwhile to determine the minimum with great accuracy.
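Both devices described in this paragraph, the penalty for out-of-bounds vectors and the variance-based stopping rule, are simple to express in code. The Python sketch below is illustrative: the names `with_bounds`, `variance_of_values` and `converged`, the penalty value `BIG` and the divisor used for the variance are assumptions rather than details given in the text.

```python
BIG = 1.0e30  # assumed "very large positive function value"


def with_bounds(neg2_log_lik, in_bounds, penalty=BIG):
    """Wrap a function so that out-of-bounds vectors return a penalty."""
    def wrapped(t):
        if not in_bounds(t):
            return penalty  # the likelihood is not evaluated at all
        return neg2_log_lik(t)
    return wrapped


def variance_of_values(values):
    """Variance of the function values held at the simplex vertices."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def converged(values, limit=1e-8):
    """Stopping rule: variance of function values below a set limit."""
    return variance_of_values(values) < limit
```

Because the penalized vertex is by construction the worst point, the search replaces it in the next iterate, so no special handling is required in the search itself.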

For REML estimation of multiple variance components, the Simplex method proved robust and easy to use. For a given vector of starting values, $t_0$, the initial simplex was made up of $t_0$ and $p$ vectors $t_i$, obtained by multiplying the i-th element of $t_0$ by 1.20. A factor this large, corresponding to a step size of 20%, was chosen to ensure quick convergence even for bad starting values. This implied, however, that for starting values close to the estimates, the procedure would search unnecessarily in the wrong direction. A smaller step size may be sufficient and perhaps more efficient, i.e. yield estimates with fewer likelihood evaluations. For cases where parameter vectors out of bounds were not a problem, the Quasi-Newton algorithm generally performed somewhat better than the Simplex method. The variance of function values in the Simplex had to be less than $10^{-8}$ or even $10^{-9}$ for the Simplex to find the minimum of $-2 \log \mathcal{L}$ with the same accuracy as ZXMIN (for an accuracy level of 5 or 6 significant digits). In terms of changes in parameter estimates, however, a limit of $10^{-5}$ to $10^{-6}$ generally appeared sufficient.
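The construction of the initial simplex described here amounts to scaling one starting value at a time. A minimal Python sketch follows; the function name `initial_simplex` is hypothetical, and the example uses the Set II starting values quoted for Figure 1.

```python
def initial_simplex(t0, factor=1.20):
    """Return t0 plus p vectors, each with one element scaled by factor."""
    simplex = [list(t0)]
    for i in range(len(t0)):
        t = list(t0)
        t[i] *= factor  # a 20% step when factor is 1.20
        simplex.append(t)
    return simplex


# Set II starting values: theta_A, theta_M, theta_AM.
vertices = initial_simplex([0.30, 0.15, 0.05])
```

A multiplicative step leaves an element unchanged when its starting value is exactly zero, so with this construction starting values of zero would need separate treatment; how such a case was handled is not stated in the text.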

Figure 2 illustrates the convergence behaviour of the Simplex method for the same analyses as depicted in Figure 1. Oscillations in likelihood and estimates clearly reflect the "trial and error" mechanism of this approach. For both sets of starting values, 88 likelihood evaluations were required before the variance of function values in the Simplex dropped below the specified limit of $10^{-9}$.

SAMPLING VARIANCES

The matrix of approximate, large sample covariances among variance component estimates is given by the inverse of the information matrix. This in turn can be approximated by the inverse of the Hessian matrix, i.e. the matrix of second derivatives of the log likelihood function with respect to the parameters to be estimated. A quadratic approximation of $\log \mathcal{L}$ and Quasi-Newton procedures provide an approximate Hessian matrix (Q) as a by-product. Using the Simplex method this has to be obtained explicitly. Nelder & Mead (1965) described an


using forward differences (see e.g. Gill et al., 1981). While the mechanics for approximating second derivatives are straightforward, it can be problematic in practice.

As noted when examining the quadratic approximation of $\log \mathcal{L}$ for multiple parameters, Q was often not positive definite, whether derived by least-squares as described by Smith & Graser (1986) or using finite differences, and consequently yielded inappropriate predictions of the parameters maximizing $\log \mathcal{L}$. Similarly, the approximate Hessian produced by Quasi-Newton routines did not necessarily give meaningful estimates of sampling variances. Indeed, a large proportion of the literature on Quasi-Newton algorithms is concerned with procedures to deal with bad or non positive definite approximations to the matrix of second derivatives.

Simulation was employed to examine the sampling distribution of variance component estimates and their predicted variances for models 1, 2 and 4. Data were sampled for one to four generations consisting of 25 to 800 full-sib families of size 2 to 10 each. Dams were nested within sires with each sire mated to 1 to 5 dams. Variance component estimates were obtained using the Simplex method with population parameters as starting values and a convergence criterion of $V(-2 \log \mathcal{L}) < 10^{-8}$. Second derivatives were approximated as described by Nelder & Mead (1965) using a step size of 0.1% of the estimates.
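As an illustration of approximating second derivatives from function values alone, the Python sketch below uses plain forward differences with a step of 0.1% of each estimate, followed by a positive-definiteness check. This is not the Nelder & Mead (1965) procedure used in the study, which is not reproduced here; the names `forward_hessian` and `is_positive_definite` are hypothetical, and non-zero estimates are assumed so that every step is non-zero.

```python
def forward_hessian(f, t, rel_step=0.001):
    """Approximate the matrix of second derivatives of f at t.

    Forward differences with step h_i = rel_step * |t_i|; all t_i are
    assumed to be non-zero.
    """
    p = len(t)
    h = [rel_step * abs(ti) for ti in t]
    f0 = f(t)

    def shifted(*idx):
        # Copy of t with h added once for every index listed.
        x = list(t)
        for i in idx:
            x[i] += h[i]
        return x

    f1 = [f(shifted(i)) for i in range(p)]
    hess = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i, p):
            fij = f(shifted(i, j))
            hess[i][j] = hess[j][i] = (fij - f1[i] - f1[j] + f0) / (h[i] * h[j])
    return hess


def is_positive_definite(mat):
    """Return True if a Cholesky factorization of mat succeeds."""
    n = len(mat)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = mat[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                low[i][i] = s ** 0.5
            else:
                low[i][j] = s / low[j][j]
    return True
```

Applied to $-2 \log \mathcal{L}$ at its minimum, a matrix that fails the check cannot be inverted into a valid set of sampling variances, which is the difficulty described above.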

Approximation of sampling variances worked well for model 1, i.e. estimating the additive genetic and error variance only. It was satisfactory for model 2, which included an additional environmental component due to full-sib families (litters), but it failed for analyses under model 4, including a maternal genetic effect as an additional random effect for each animal.

Table II summarizes empirical and predicted sampling variances and empirical correlations between estimates for a variety of designs for model 2. Results clearly emphasize the need to ensure that the data provide sufficient information to


high negative sampling correlation between $\sigma_A^2$ and $\sigma_C^2$ or, equivalently, between heritability and the $c^2$ effect. In terms of the likelihood surface, this implied a maximum along a flat ridge, i.e. an area where for a constant sum of the two parameters the value of the likelihood changed very little with changes in the parameter values. If each sire was mated to only one dam, i.e. there were no half-sib families, there was little scope to partition the within family variance into its genetic and environmental components. This yielded a likelihood surface of a shape which did not allow sampling variances to be approximated.

Including data for a second generation provided the covariance between parents and their offspring as an additional source of information, thus allowing a considerably better discrimination between $\sigma_A^2$ and $\sigma_C^2$, as evidenced by the decreased (absolute value) sampling correlation between the two components. While for one generation sampling variances decreased with increasing number of dams per sire, the reverse generally held for data spanning several generations.

These relationships are also illustrated in Figure 3 for analyses under Model 4. Considering one generation only (top row), no animal had records expressing both its direct and maternal genetic effects. Hence the sampling correlation between $\sigma_A^2$ and $\sigma_M^2$ was large and negative. This was accompanied by little variation in estimates of $\sigma_{AM}$, indicating that this was determined by the genetic covariance structure among animals only. For data including 3 generations (bottom row), sufficient comparisons between and within generations were available to yield estimates of $\sigma_A^2$ and $\sigma_M^2$ virtually independent of each other, while estimates of $\sigma_{AM}$ were highly variable and showed a strong negative association with $\sigma_M^2$.

NUMERICAL EXAMPLE

Table III contains simulated data for 282 animals in two generations. Each generation consists of 18 full-sib families of size 6 to 10, where each sire is mated to 3 dams. Parents of generation one did not have records, which introduced 24 base animals, yielding 306 animals in total. Records were sampled according to Model 8 (see Table I) for population values of $\sigma_A^2 = 40$, $\sigma_M^2 = 15$, $\sigma_{AM} = -5$, $\sigma_C^2 = 10$ and $\sigma_E^2 = 50$, a phenotypic mean of 200 and a fixed effect of 20 for generation 2.

Data were analysed under Models 1, 2, 3, 4, 7 and 8, as described in Table I.

Analyses were carried out using the Simplex method to find the maximum of the likelihood. Two sets of starting values were chosen, namely the population values as a good (Set I: $\theta_A = 0.40$, $\theta_M = 0.15$, $\theta_{AM} = -0.05$ and $\theta_C = 0.10$) and $\theta_A = 0.10$, $\theta_M = 0.30$, $\theta_{AM} = 0.10$, $\theta_C = 0.20$ (Set II) as a bad initial guess. For models other than 8, the appropriate subset of parameters was used. Estimates for two values for the minimum variance of function values in the Simplex are summarized in Table IV. Characteristics of the augmented coefficient matrix M and intermediate results for the first likelihood evaluated in each run are given to help with the validation of computer programs. Gaussian elimination steps are not dependent on the model of analysis, hence the example given by Graser et al. (1987) should suffice as an illustration. Numerical examples to be used in checking the building of the mixed model array for the various models can be found elsewhere in the literature. In
