Goodness-of-fit for generalized linear latent variables models


Thesis


CONNE, David

Abstract

Generalized Linear Latent Variables Models (GLLVM) enable the modelling of relationships between manifest and latent variables. These models are widely used in the social sciences.

In a latent variable framework, one works with several unobservable quantities (latent scores, parameters) and it is therefore essential to choose a model as close as possible to the data.

To test the appropriateness of a particular model, one needs to define a Goodness-of-fit test statistic (GFI). Available GFI can be separated into two groups: first, GFI based on a comparison between the sample covariance and the model covariance of the manifest variables, which implies reducing the information that is contained in the data to their covariance structure, and secondly Pearson-type statistics when manifest variables are binary. In this work, we propose an alternative Goodness-of-fit statistic based on some distance comparison between the latent scores and the original data. This GFI takes into account the nature of each manifest variable and can in principle be applied in various situations and in particular with models with both discrete and [...]

CONNE, David. Goodness-of-fit for generalized linear latent variables models. Doctoral thesis: Univ. Genève, 2008, no. SES 681

URN : urn:nbn:ch:unige-67331

DOI : 10.13097/archive-ouverte/unige:6733

Available at:

http://archive-ouverte.unige.ch/unige:6733

Goodness-of-fit for Generalized Linear Latent Variables Models

David Conne

Submitted for the degree of Ph.D in Econometrics and Statistics

Department of Econometrics

University of Geneva, Switzerland

Accepted on the recommendation of:

Dr. Eva Cantoni, University of Geneva

Prof. Sophia Rabe-Hesketh, University of California, Berkeley

Prof. Elvezio Ronchetti, co-advisor, University of Geneva

Prof. Maria-Pia Victoria-Feser, co-advisor, University of Geneva

Thesis no. 681

[...] has authorized the printing of the present thesis, without thereby expressing any opinion on the propositions stated in it, which are the sole responsibility of their author.

Geneva, 9 October 2008

The Dean

Bernard MORARD

Generalized Linear Latent Variables Models (GLLVM) enable the modelling of relationships between manifest and latent variables. These models are widely used in the social sciences. In a latent variable framework, one works with several unobservable quantities (latent scores, parameters) and it is therefore essential to choose a model as close as possible to the data. To test the appropriateness of a particular model, one needs to define a Goodness-of-fit test statistic (GFI). Available GFI can be separated into two groups: first, GFI based on a comparison between the sample covariance and the model covariance of the manifest variables, which implies reducing the information that is contained in the data to their covariance structure, and secondly Pearson-type statistics when manifest variables are binary. In this work, we propose an alternative Goodness-of-fit statistic based on some distance comparison between the latent scores and the original data. This GFI takes into account the nature of each manifest variable and can in principle be applied in various situations and in particular with models with both discrete and continuous manifest variables. We propose two procedures to compute the p-values of our GFI. The first one is based on the asymptotic distribution of a U-statistic and appears to be quite difficult to implement numerically. The second one is based on resampling techniques and requires a consistent estimator of the loadings, the scores, and a corresponding asymptotic covariance [...] good performance in terms of empirical level and empirical power, especially compared to the one proposed by Satorra and Bentler (2001).

Finally, a real dataset is analyzed to highlight the application of the methodology. In most health surveys the state of health of individuals is measured through several qualitative, discrete quantitative or dichotomous variables. From these variables, one aims at building univariate indicators of health that summarize the information. To do so, we propose to use a GLLVM, in which the latent variables are the health indicators. We consider the data from the 1997 Swiss Health Survey and we define a new model with two health indicators. The first one describes the health status induced merely by the age of the subject, while the second one captures another dimension of the health status. This latter model is not rejected by our GFI and gives another insight into the understanding of the health status of the population.

Generalized linear latent variable models make it possible to define a link between manifest variables and latent variables. This type of model is widely used in the human sciences. In a latent variable model, the number of unknown quantities (parameters, scores) is very large, and it is therefore essential to choose a model as close as possible to the original data. To test whether a particular model is appropriate, one needs to define a goodness-of-fit test (GFI). A large number of goodness-of-fit tests for latent variable models are available in the literature. They can be separated into two groups: first, GFI based on the comparison between the variance-covariance matrix under the latent variable model and under the saturated model, which amounts to reducing the information contained in the data to a covariance matrix, and second, Pearson-type statistics when all manifest variables are binary. In this work, we propose a new goodness-of-fit test based on the comparison of the distances between observations in the raw sample and those given by the model.

This test can in principle be applied with manifest variables of different types, in particular with discrete and continuous manifest variables. We propose two techniques to evaluate the p-values of this GFI. The first is based on the asymptotic distribution of a U-statistic and [...] consistent estimator of the loadings, of the scores, and a corresponding asymptotic covariance matrix.

A study of the behaviour of this statistic by means of simulations reveals that our statistic performs well in terms of empirical level and power, in particular in comparison with the one proposed by Satorra and Bentler (2001). Finally, an application to a real dataset is presented to highlight the potential application of this procedure. In health surveys, the health of individuals is measured through different types of variables, such as ordinal or dichotomous variables. From these variables, one seeks to build one or several health indices. We propose here to use generalized linear latent variable models, which make it possible to estimate one or several continuous latent variables from a group of observable variables. We consider here the data from the 1997 Swiss Health Survey. We propose a new model with two health indices: the first describes the state of health related to the age of the subject and the second captures a dimension of health independent of age. This model is not rejected by our goodness-of-fit test and makes it possible to assess health from a new angle.

1 Introduction

2 Generalized Linear Latent Variable Models (GLLVM)

3 Goodness-of-fit for Generalized Linear Latent Variable Models (GLLVM)

3.1 Test Statistic

3.2 Derivation of the p-value of the Test

3.3 Computing the p-value

4 Other GFI for GLLVM

4.1 Likelihood Ratio Test

4.2 Satorra and Bentler (S&B) GFI

5 Simulation study

5.1 Discussion of the Results

Appendix A: Simulation Parameters

6 Asymptotic distribution

6.1 Asymptotics of $\omega_1$

6.2 Asymptotic distribution of $S$

6.2.1 Calculation of $E[h(X_{i_1}, X_{i_2})]$

6.2.2 Estimation of $E[h(X_{i_1}, X_{i_2})]$

6.2.3 Estimation of $\zeta_1$

6.3 Discussion of the Results

Appendix B: U-statistic of order 2

Appendix C: Expectation estimators

7 Application: A Latent Variable Approach for the Construction of Continuous Health Indicators

7.1 Introduction

7.2 Generalized Linear Latent Variable Models

7.3 Health Indicators for the 1997 Swiss Health Survey

7.4 Conclusion

Appendix D: Model and Estimated Parameters

References

Introduction

In many scientific fields, theoretical concepts that cannot be measured directly are defined by means of observable indicators. Some examples include intelligence in psychology or welfare in economics. In these situations, one selects observable variables supposed to be linked to the unobservable variables. The objective is to link the unobservable variables, called latent variables, to the observed ones, called manifest variables. Jöreskog (1969) proposed a model based on a linear link between the latent variables, assumed to be normal, and the manifest variables, assumed to be conditionally normal given the latent ones. This condition allows one to compute exact maximum likelihood estimators using the covariance matrix of the manifest variables. This model is implemented in well-known and widely used software such as LISREL (Jöreskog, 1990) or Mplus (Muthen and Muthen, 2001). However, in economics and social sciences, researchers very often work with surveys which contain responses measured on ordinal or binary scales. When the manifest variables are not normally distributed, the methods implemented in this software suppose that the manifest variables are indirect observations

[...] variable approach, does not take directly into account the actual distribution of the variables, except when all manifest variables are binary with the probit link.

Bartholomew (1984) and Moustaki and Knott (2000) proposed to drop the assumption of multivariate normal distribution with the specification of the Generalized Linear Latent Variables Model (GLLVM). This model allows one to consider all distributions belonging to the exponential family, which includes both continuous and discrete distributions, such as normal, binomial, Poisson, etc. With GLLVM, the likelihood function depends on multivariate integrals which cannot be calculated analytically. To overcome this problem, Moustaki and Knott (2000) define estimators based on a Gauss-Hermite approximation of these integrals. Unfortunately, this method is often infeasible when the number of latent variables is larger than 2. Huber, Ronchetti, and Victoria-Feser (2004) propose instead to use a Laplace approximation and define new estimators which can be viewed as M-estimators (Huber, 1981). This allows consistent estimation and inference even in the presence of a large number of latent and manifest variables. Furthermore, the Laplace approximation allows one to define the h-likelihood scores on each latent variable, the asymptotic properties of which can be found in Lee and Nelder (1996). These estimated scores can also be viewed as penalized quasi-likelihood estimators (Green, 1987). Moreover, in the Bayesian approach, the estimated scores are the mode of the posterior distribution of the scores with estimated parameters. These estimators are called the Empirical Bayes modal (EBM) by Skrondal and Rabe-Hesketh (2004), or modal a posteriori (MAP) by Bock (1985). For a general overview on latent variable modeling, see Skrondal and

[...]

In a latent variable framework, one works with several unobservable quantities (latent scores, parameters) and it is therefore essential to choose a model as close as possible to the data. More specifically, the number of latent variables is clearly unknown, as well as the fact that a particular manifest variable is linked or not to a particular latent variable. To compare models with a different number of latent variables, one could use the Akaike (1973) criterion, which is a powerful measure of relative fit; see Conne (2003) for some numerical comparison in this context. However, this criterion suffers from important drawbacks when applied directly to latent variable models, including the definition of the likelihood and the role of the scores as parameters. For a discussion and a proposition of a corrected Akaike criterion for mixed-effects models, see e.g. Vaida and Blanchard (2005). Even if one could extend this criterion to latent variable models, it would not describe absolute fit and would not test the appropriateness of a particular model. To do so, a Goodness-of-fit index (GFI) is necessary.

Several GFI are available for latent variable models in the literature. In the case of multivariate normal manifest variables, a natural choice is the likelihood ratio test (LRT) on the covariance matrix, see e.g. Bartholomew and Knott (1999). More precisely, the LRT is defined as the difference between the log-likelihood evaluated with the covariance matrix constrained by the latent model and that evaluated at the unconstrained covariance matrix (saturated model). When manifest variables are not multivariate normal, one can use a function of the covariance matrix of the underlying normally distributed manifest variables to define a Goodness-of-fit. Indeed, the unconstrained covariance matrix can be estimated with polychoric correlations

[...] method suffers from two drawbacks: firstly, using polychoric or polyserial correlations implies reducing the information that is contained in the data to their covariance structure. Since the covariance matrix is not a sufficient statistic when the variables are ordinal or binary, this implies a loss of information. Moreover, the estimation of polychoric or polyserial correlations implies the estimation of a number of parameters that increases rapidly with the number of manifest variables. Secondly, this method is strongly dependent on the normality assumption of the underlying manifest variables. Therefore, a GFI based on a function of the covariance matrix of the underlying normally distributed manifest variables in a non-normal setting tests simultaneously the normality assumption of the underlying variable and the fit of the GLLVM structure, except when all manifest variables are binary, see e.g. Muthen (1993).

In the binary case, the data can be represented by a contingency table and a Pearson-type statistic can be derived. However, for a moderate sample size and a large number of variables, one is faced with the curse of dimensionality and this method becomes infeasible. In the ordinal case, the number of cells can be large and the problem is even worse. To overcome this problem, Glas (1988), Reiser (1996) or Bartholomew and Leung (2002) propose to use only information on lower order margins (usually first and second order). This method can only be used if all variables are binary or ordinal, and it is unclear how to extend it to mixtures of different types of manifest variables.

Other types of goodness-of-fit statistics available in software like LISREL and MPLUS are the RNI, RMSEA or SRMR. All of them are functions of

[...] Moreover, if the manifest variables are not multivariate normal, their null asymptotic distributions are not available. They are used with empirical cut-off criteria and their p-values are not uniformly distributed under the null hypothesis, see Marsh, Hau, and Wen (2004).

We propose an alternative goodness-of-fit statistic which is not based on the comparison between the correlation matrices computed respectively directly from the data and through the estimated GLLVM, but on some distance comparison between the latent scores and the original data. The concept of distance is widely used in the context of cluster analysis to find subjects that are similar, see e.g. Kaufman and Rousseeuw (1990). Our GFI is based on the idea that two subjects that are very dissimilar in the data space should also be very dissimilar in the latent scores space, if the chosen model (under the null hypothesis) is correct.

Our GFI can in principle be applied in various situations and in particular with models with both discrete and continuous manifest variables. For simplicity we develop here our GFI in the case of independent latent variables. However, the procedure can be extended to the case of correlated latent variables by using the estimator of the parameters and the scores which was proposed by Huber, Ronchetti, and Victoria-Feser (2004), p. 900. The latter is an extension of the independent case presented in Chapter 2 to the correlated case. The asymptotic distribution of our GFI under the null hypothesis is derived but it appears to be quite difficult to implement numerically. We propose instead to compute p-values using different resampling techniques that we optimize for computational speed. Our simulations show that the p-values obtained by this procedure have a uniform distribution under the

[...] alternatives that are near the null model. Finally, we compare our procedure in terms of empirical levels and powers to a GFI implemented in the software MPLUS and show that our approach considerably improves goodness-of-fit testing within GLLVM.

The thesis is organized as follows: in Chapter 2, we briefly review the GLLVM and the estimation technique based on the Laplace approximation developed by Huber, Ronchetti, and Victoria-Feser (2004). In Chapter 3, we propose a new GFI based on distance comparison and a procedure to compute associated p-values. Different available GFI are reviewed and compared in Chapter 4. A simulation study comparing our GFI and the one proposed by Satorra and Bentler (2001) is presented in Chapter 5. This shows that our GFI has better performance in terms of empirical level and empirical power. In Chapter 6, we present a procedure to approximate the asymptotic distribution of this Goodness-of-fit test statistic. This method is based on the asymptotic distribution of a U-statistic which requires the calculation and the estimation of the first and second moments of the test statistic. Finally, a small simulation study is presented in Section 6.3. From a practical point of view, the asymptotic approximation is not accurate enough to be used for typical sample sizes encountered in real data sets and so this approximation should not be used in practice. In Chapter 7, as an application, we consider the data from the 1997 Swiss Health Survey and build two health indicators. A latent model is defined and tested with our GFI. This latter model gives another insight into the understanding of the health status of the Swiss population.

Generalized Linear Latent Variable Models (GLLVM)

The relationship between $p$ manifest variables $x^{(j)}$, $j = 1, \ldots, p$, and $q$ latent variables $z^{(k)}$, $k = 1, \ldots, q$, $q < p$, is formalized in the same manner as in generalized linear models (McCullagh and Nelder, 1989) by means of conditional distributions $g_j(x^{(j)} \mid z)$ belonging to the exponential family, i.e.

$$g_j(x^{(j)} \mid z) = \exp\left\{ \frac{x^{(j)} \alpha^{(j)T} z - b(\alpha^{(j)T} z)}{\phi_j} + c(x^{(j)}, \phi_j) \right\},$$

where $\alpha^{(j)} \in \Re^{q+1}$ are the loadings, $\phi_j$ is a scale parameter and the functions $b(\alpha^{(j)T} z)$ and $c(x^{(j)}, \phi_j)$ depend on the specific distribution $g_j(x^{(j)} \mid z)$. We define a linear link between a function of the conditional expectation of $x^{(j)} \mid z$ and $z$ as $\gamma^{(j)}\left(E(x^{(j)} \mid z)\right) = \alpha^{(j)T} z$, where

$$\alpha^{(j)T} = (\alpha_l^{(j)})_{l = 0, 1, \ldots, q}, \tag{2.1}$$

$$z = (1, z_{(2)})^T, \quad z_{(2)} = (z^{(1)}, \ldots, z^{(q)})^T. \tag{2.2}$$

We give here the specific functions $b$, $c$ and $\gamma$ and the scale parameter $\phi$ for normal and ordinal conditional distributions $g_j(x^{(j)} \mid z)$.

• Normal manifest variables

Let $x^{(j)} \mid z$ have a normal distribution with mean $\mu$ and variance $\sigma^2$. The link function $\gamma(\cdot)$ is the identity function

$$\gamma^{(j)}\left(E(x^{(j)} \mid z)\right) = E(x^{(j)} \mid z) = \mu = \alpha^{(j)T} z,$$

$$b(\alpha^{(j)T} z) = \frac{(\alpha^{(j)T} z)^2}{2},$$

$$c(x^{(j)}, \phi_j) = -\frac{1}{2}\left\{ \frac{x^2}{\phi} + \log(2\pi\phi) \right\}$$

and $\phi = \sigma^2$.

• Ordinal manifest variables

Let $x^{(j)} \mid z$ follow an ordered multinomial distribution with categories going from 1 to $M^{(j)}$. The link function can be chosen as a logit function

$$\gamma^{(j)}\left( \frac{p_s}{1 - p_s} \right) = \alpha_s^{(j)T} z,$$

where $p_s$ is defined as the cumulative probability that a response $x^{(j)} \mid z$ is $s$ or less, where $s = 1, \ldots, M^{(j)}$. The $s$ index on $\alpha_s^{(j)}$ indicates that the first component $\alpha_{0,s}^{(j)}$ of the vector $\alpha^{(j)}$ is a threshold that is related to each category.

$$b(\alpha^{(j)T} z) = \log\frac{p_{s+1}}{p_{s+1} - p_s},$$

$$c(x^{(j)}, \phi_j) = 0$$
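As a quick sanity check of the normal case, the following sketch evaluates the exponential-family form with $b(\eta) = \eta^2/2$, $c(x, \phi) = -\frac{1}{2}\{x^2/\phi + \log(2\pi\phi)\}$ and $\phi = \sigma^2$, and compares it with the textbook normal density. The function names and the numerical values of the loadings and score are hypothetical and chosen only for illustration.

```python
import math


def b_normal(eta: float) -> float:
    """Function b(eta) = eta^2 / 2 for a normal manifest variable."""
    return eta * eta / 2.0


def c_normal(x: float, phi: float) -> float:
    """Function c(x, phi) = -(x^2 / phi + log(2 * pi * phi)) / 2."""
    return -0.5 * (x * x / phi + math.log(2.0 * math.pi * phi))


def expfam_density(x: float, eta: float, phi: float) -> float:
    """Exponential-family density exp{(x * eta - b(eta)) / phi + c(x, phi)}."""
    return math.exp((x * eta - b_normal(eta)) / phi + c_normal(x, phi))


def normal_pdf(x: float, mu: float, sigma2: float) -> float:
    """Textbook N(mu, sigma2) density, used only as a reference value."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


# Hypothetical loadings alpha = (alpha_0, alpha_1) and augmented score z = (1, z_1).
alpha = (0.5, 1.2)
z = (1.0, -0.3)
eta = sum(a * zk for a, zk in zip(alpha, z))  # linear predictor alpha^T z

# Both values should agree up to floating-point rounding.
print(expfam_density(0.7, eta, 2.0), normal_pdf(0.7, eta, 2.0))
```

With the identity link, the linear predictor $\alpha^{(j)T} z$ plays the role of the conditional mean $\mu$, which is why the two densities coincide.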

The main assumption in the GLLVM is the conditional independence of the manifest variables given the latent ones. Hence, the joint conditional distribution is $\prod_{j=1}^{p} g_j(x^{(j)} \mid z)$. Assuming further that the latent variables are independent and distributed as standard normal variables, the joint marginal distribution of the manifest variables is given by

$$f_{\alpha,\phi}(x) = \int \left[ \prod_{j=1}^{p} g_j(x^{(j)} \mid z) \right] h(z_{(2)}) \, dz_{(2)}, \tag{2.3}$$

where $h(\cdot)$ is the density of a $q$-dimensional standard normal variable. Given a sample of $n$ observations $x_i = (x_i^{(1)}, \ldots, x_i^{(p)})^T$, $i = 1, \ldots, n$, the log-likelihood of the $(q + 1) \times p$ loadings matrix $\alpha$ and the $p$-vector of scale parameters $\phi$ is $\ell(\alpha, \phi \mid x) = \sum_{i=1}^{n} \log f_{\alpha,\phi}(x_i)$. This expression contains a multidimensional integral which cannot be computed explicitly, except when $x \mid z$ is distributed according to a multivariate normal. Huber, Ronchetti, and Victoria-Feser (2004) propose to use a Laplace approximation of (2.3) and define an estimator called LAMLE. This idea has been used in several fields, including the Bayesian approach to approximate the posterior distribution (see e.g. Tierney and Kadane, 1986) and in a simplified form in generalized linear mixed models (GLMM) (see Breslow and Clayton, 1993).

Since our GFI will be based on the LAMLE, we give here the definition and the properties of the LAMLE. For more details, see Huber, Ronchetti, and Victoria-Feser (2004). By rewriting the density of $x_i$ in (2.3) as

$$f_{\alpha,\phi}(x_i) = \int e^{p \cdot Q(\alpha, z, x_i)} \, dz_{(2)}, \tag{2.4}$$

where

$$Q(\alpha, z, x_i) = \frac{1}{p} \left[ \sum_{j=1}^{p} \left\{ \frac{x_i^{(j)} \alpha^{(j)T} z - b_j(\alpha^{(j)T} z)}{\phi_j} + c_j(x_i^{(j)}, \phi_j) \right\} - \frac{z_{(2)}^T z_{(2)}}{2} - \frac{q}{2} \log(2\pi) \right], \tag{2.5}$$

and by applying the Laplace approximation to (2.4), we obtain

$$f_{\alpha,\phi}(x_i) = \left( \frac{2\pi}{p} \right)^{q/2} \det\{ -W(\hat z_{i(2)}) \}^{-1/2} \, e^{p Q(\alpha, \hat z_i, x_i)} \left( 1 + O(p^{-1}) \right), \tag{2.6}$$

where

$$W(z) = \frac{\partial^2 Q(\alpha, z, x_i)}{\partial z_{(2)}^T \partial z_{(2)}} = -\frac{1}{p} \Gamma(\alpha, \phi, z), \qquad \Gamma(\alpha, \phi, z) = \sum_{j=1}^{p} \frac{1}{\phi_j} \frac{\partial^2 b_j(\alpha^{(j)T} z)}{\partial z^T \partial z} + I_q.$$

(2.6) depends on the unknown quantity $\hat z_i$, the maximum of the function $Q$ defined through

$$\frac{\partial Q(\alpha, \hat z_i, x_i)}{\partial \hat z_i} = 0, \tag{2.7}$$

and can be estimated iteratively by means of

$$\hat z_{i(2)} = \hat z_{i(2)}(\alpha, \phi, x_i) = \sum_{j=1}^{p} \frac{1}{\phi_j} \left\{ x_i^{(j)} - \frac{\partial b_j(\alpha^{(j)T} \hat z_i)}{\partial \alpha^{(j)T} \hat z_i} \right\} \alpha_{(2)}^{(j)}, \tag{2.8}$$

where $\alpha^{(j)} = (\alpha_0^{(j)}, \alpha_{(2)}^{(j)T})^T$. It should be noted that $\hat z_{i(2)}$ can be interpreted as the maximum likelihood estimators of the latent scores. Indeed, if the $z_{i(2)}$ were considered as parameters, the first derivative of the likelihood with respect to $z_{i(2)}$ for fixed $\alpha$ and $\phi$ leads exactly to the expression (2.7).
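To make the score equation (2.8) concrete, here is a minimal sketch for the special case where all manifest variables are normal and $q = 1$. In that case $\partial b_j(\eta)/\partial \eta = \eta$, so (2.8) is linear in the score and can be solved in closed form rather than iteratively. The function names and the numerical values are hypothetical; the sketch only illustrates the fixed-point structure of (2.8) and is not the estimation routine used in the thesis.

```python
from typing import Sequence


def score_normal_q1(x: Sequence[float], alpha0: Sequence[float],
                    alpha1: Sequence[float], phi: Sequence[float]) -> float:
    """Closed-form solution of (2.8) for normal responses and a single latent variable."""
    num = sum(a1 * (xj - a0) / ph for xj, a0, a1, ph in zip(x, alpha0, alpha1, phi))
    # For q = 1 and normal responses, Gamma reduces to 1 + sum_j alpha1_j^2 / phi_j.
    gamma = 1.0 + sum(a1 * a1 / ph for a1, ph in zip(alpha1, phi))
    return num / gamma


def rhs_2_8(z1: float, x: Sequence[float], alpha0: Sequence[float],
            alpha1: Sequence[float], phi: Sequence[float]) -> float:
    """Right-hand side of (2.8) at a candidate score z1, using b_j'(eta) = eta."""
    return sum((xj - (a0 + a1 * z1)) * a1 / ph
               for xj, a0, a1, ph in zip(x, alpha0, alpha1, phi))


# Hypothetical observation with p = 3 normal manifest variables.
x = [1.0, 0.2, -0.5]
alpha0 = [0.1, 0.0, -0.2]   # intercepts alpha_0^(j)
alpha1 = [0.8, 0.5, 1.1]    # loadings alpha_(2)^(j) on the single latent variable
phi = [1.0, 0.5, 2.0]       # scale parameters (variances)

z_hat = score_normal_q1(x, alpha0, alpha1, phi)
# z_hat is a fixed point of (2.8): plugging it into the right-hand side returns z_hat.
print(z_hat, rhs_2_8(z_hat, x, alpha0, alpha1, phi))
```

For non-normal responses the derivative of $b_j$ is nonlinear and no closed form is available, which is why the text describes an iterative solution.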

[...] who show that as $n \to \infty$

$$n^{1/2} (\hat z_{i(2)} - z_{i(2)}) \xrightarrow{D} N(0, \Gamma(\alpha, \phi, z_i)^{-1}). \tag{2.9}$$

From (2.5) and (2.6), we obtain the Laplace approximated log-likelihood function

$$\tilde l(\alpha, \phi \mid x) = \sum_{i=1}^{n} \left( -\frac{1}{2} \log\det\{ \Gamma(\alpha, \phi, \hat z_i) \} - \frac{\hat z_{i(2)}^T \hat z_{i(2)}}{2} + \sum_{j=1}^{p} \left\{ \frac{x_i^{(j)} \alpha^{(j)T} \hat z_i - b_j(\alpha^{(j)T} \hat z_i)}{\phi_j} + c_j(x_i^{(j)}, \phi_j) \right\} \right). \tag{2.10}$$

The resulting estimator of the loadings $\alpha$, called the LAMLE, is the solution of

$$\frac{\partial \tilde l(\alpha, \phi \mid x)}{\partial \alpha_{kl}} = \sum_{i=1}^{n} \psi(x_i; \alpha, \hat z_i) = 0, \quad k = 1, \ldots, p, \tag{2.11}$$

where

$$\psi(x_i; \alpha, \hat z_i) = -\frac{1}{2} \mathrm{tr}\left\{ \Gamma(\alpha, \phi, \hat z_i)^{-1} \frac{\partial \Gamma(\alpha, \phi, \hat z_i)}{\partial \alpha_{kl}} \right\} + \frac{1}{\phi_k} \left\{ x_i^{(k)} - \frac{\partial b_k(\alpha_k^T \hat z_i)}{\partial \alpha_k^T \hat z_i} \right\} \hat z_{il}, \tag{2.12}$$

with $\hat z_i$ given implicitly by (2.8). Equation (2.11), which defines the LAMLE, may have multiple solutions. If $q > 1$, it is necessary to impose $q(q-1)/2$ constraints on the parameters $\alpha$ to obtain a unique solution.

The LAMLE $\hat\alpha$ belongs to the class of M-estimators (Huber, 1981), and under the conditions given in Huber (1981), as $n \to \infty$

$$n^{1/2} (\hat\alpha - \alpha) \xrightarrow{D} N(0, V(\alpha)), \tag{2.13}$$

where

$$V(\alpha) = B(\alpha)^{-1} A(\alpha) B(\alpha)^{-T}, \tag{2.14}$$

$$A(\alpha) = E\left[ \psi(x; \alpha, \hat z) \psi^T(x; \alpha, \hat z) \right],$$

$$B(\alpha) = -E\left[ \frac{\partial \psi(x; \alpha, \hat z)}{\partial \alpha} \right],$$

and the expectations are taken under the GLLVM model. For more details, in particular for the specific estimating equations in the normal and ordinal cases, see Huber, Ronchetti, and Victoria-Feser (2004) and Huber (2004).

Goodness-of-fit for Generalized Linear Latent Variable Models (GLLVM)

3.1 Test Statistic

The objective of a GFI is to measure the distance between a suitable quantity computed from the sample and its estimated counterpart using the assumed model. The basic idea of our GFI is to compare the distance among the latent scores and the corresponding distance among the original observations. The latent scores represent in a way the mapping of the observations on the latent variable space. Hence, if the model is adequate, a distance between two observations in the original data space should correspond to a similar distance in the latent space. Clearly, we need to define a general distance measure on the latent space and on the data space while taking into account the nature of the different variables. We propose here to use the distances developed

[...] according to the GLLVM, each observation $x_i$ has a corresponding (unknown) latent score $z_i$ estimated by $\hat z_i$ by means of (2.7). Let $d_q(\hat z_{i_1}, \hat z_{i_2})$ be a distance function on the scores space and $\tilde d_p(x_{i_1}, x_{i_2})$ a distance function on the data space. Since $\hat z$ is continuous, a natural choice for $d_q(\cdot, \cdot)$ is the Euclidean distance standardized by the standard deviation of $\hat z_i$, i.e.

$$d_q(\hat z_{i_1}, \hat z_{i_2}) = \frac{1}{q} \sqrt{ \sum_{j=1}^{q} \left( \frac{\hat z_{i_1}^{(j)} - \hat z_{i_2}^{(j)}}{\tilde\sigma_z^{(j)}} \right)^2 }, \tag{3.1}$$

where $\tilde\sigma_z^{(j)} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (\hat z_i^{(j)} - \bar{\hat z}^{(j)})^2 }$ is the empirical standard deviation of the $\hat z_i^{(j)}$, the $j$th component of the vector $\hat z_i$. In the sample space, if $x^{(j)}$ is normally distributed, the Euclidean distance function is also suitable for $\tilde d_p(\cdot, \cdot)$. When $x^{(j)}$ is ordinal, a standard choice is the Manhattan distance ($L_1$ distance) on the ranks $r_i^{(j)}$ of the observations. Hence for a model with $p_1$ normal manifest variables and $p_2$ ordinal manifest variables, we have

$$\tilde d_p(x_{i_1}, x_{i_2}) = \frac{1}{p_1} \sqrt{ \sum_{j=1}^{p_1} \left( \frac{x_{i_1}^{(j)} - x_{i_2}^{(j)}}{\tilde\sigma_x^{(j)}} \right)^2 } + \frac{1}{p_2} \sum_{j=1}^{p_2} \frac{ \left| r_{i_1}^{(j)} - r_{i_2}^{(j)} \right| }{ n/2 }, \tag{3.2}$$

where $\tilde\sigma_x^{(j)}$ is the empirical standard deviation of $x^{(j)}$ and $n/2$ is a scale factor for the ranks corresponding to the maximum of the differences $r_{i_1}^{(j)} - r_{i_2}^{(j)}$, which has the same order as the variance of $r_i^{(j)}$. Consequently, a natural GFI is defined by

$$S(x, \hat z \mid \hat\alpha) = \frac{1}{\frac{n^2 - n}{2}} \sum_{i_1=1}^{n} \sum_{\substack{i_2=1 \\ i_1 > i_2}}^{n} \left[ d_q(\hat z_{i_1}, \hat z_{i_2}) - \tilde d_p(x_{i_1}, x_{i_2}) \right]^2$$

$$= \frac{1}{\frac{n^2 - n}{2}} \sum_{i_1=1}^{n} \sum_{\substack{i_2=1 \\ i_1 > i_2}}^{n} \left[ \frac{1}{q} \sqrt{ \sum_{j=1}^{q} \left( \frac{\hat z_{i_1}^{(j)} - \hat z_{i_2}^{(j)}}{\tilde\sigma_z^{(j)}} \right)^2 } - \frac{1}{p_1} \sqrt{ \sum_{j=1}^{p_1} \left( \frac{x_{i_1}^{(j)} - x_{i_2}^{(j)}}{\tilde\sigma_x^{(j)}} \right)^2 } - \frac{1}{p_2} \sum_{j=1}^{p_2} \frac{ \left| r_{i_1}^{(j)} - r_{i_2}^{(j)} \right| }{ n/2 } \right]^2. \tag{3.3}$$

Basically, this GFI is an average squared difference between a general distance on the sample space and its estimated counterpart on the latent space. It implies that only the latent scores matrix is used in the model part of our GFI. Other distances can be specified for $d_q(\cdot, \cdot)$ and $\tilde d_p(\cdot, \cdot)$. However, Conne (2005) shows in a simulation study that $S$ has a good performance in terms of empirical power compared to the GFI based on other distances.

Since the distribution of $S$ depends on $\hat\alpha$ through $\hat z$, in order to obtain correct inference, $\hat\alpha$ is integrated out using its asymptotic distribution given by (2.13). It turns out that this corresponds to making a correction on $S$ based on a distance between two estimated asymptotic covariance matrices of $\hat\alpha$, namely $V(\tilde\alpha)$ and $V(\hat\alpha)$. This leads to the following GFI

$$\Omega = 2 \left\{ \frac{\det \hat V(\tilde\alpha)}{\det \hat V(\hat\alpha)} \right\}^{1/2} \cdot S(x, \hat z \mid \hat\alpha). \tag{3.4}$$

This correction factor will be derived in the next section.
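Once $S$ and the two covariance estimates are available, the corrected statistic in (3.4) is a one-line computation. The sketch below takes the log-determinants of the two matrices as inputs rather than the determinants themselves; working on the log scale is an assumption made here for numerical convenience (determinants of covariance matrices with many loadings can be extremely small), and the function name is hypothetical.

```python
import math


def omega(s: float, logdet_v_tilde: float, logdet_v_hat: float) -> float:
    """Corrected GFI of (3.4): 2 * sqrt(det V(alpha_tilde) / det V(alpha_hat)) * S.

    The determinant ratio is formed from log-determinants to limit underflow.
    """
    return 2.0 * math.exp(0.5 * (logdet_v_tilde - logdet_v_hat)) * s


# When both covariance estimates coincide, the correction factor is exactly two.
print(omega(0.5, -3.2, -3.2))
```

A bagged estimate with a smaller determinant than $\hat V(\hat\alpha)$ shrinks the factor below two, which is the mechanism by which the correction acts on $S$.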

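3.2 Derivation of the p-value of the Test

[...] p-values computed using $\nu_{S \mid \hat\alpha}(s \mid \hat\alpha)$, the conditional density of $S$ given $\hat\alpha$, will depend on $\hat\alpha$. In order to obtain correct unconditional inference, we consider $\hat\alpha$ as a nuisance parameter, and we integrate out $\hat\alpha$ using its asymptotic normal distribution given by (2.13), i.e.

$$f_S(s) = \int \nu_{S \mid \hat\alpha}(s \mid \hat\alpha) \cdot h\left( V^{-1/2}(\alpha) (\mathrm{vec}(\hat\alpha) - \mathrm{vec}(\alpha)) \right) |\det(V(\alpha))|^{-1/2} \, d\hat\alpha$$

$$= \frac{1}{(2\pi)^{\tilde p/2}} \frac{1}{|\det(V(\alpha))|^{1/2}} \cdot \int \nu_{S \mid \hat\alpha}(s \mid \hat\alpha) \cdot \exp\left( \tilde p \cdot \kappa(\hat\alpha) \right) d\hat\alpha, \tag{3.5}$$

where

$$\kappa(\hat\alpha) = -\frac{1}{2\tilde p} \cdot (\mathrm{vec}(\hat\alpha) - \mathrm{vec}(\alpha)) V^{-1}(\alpha) (\mathrm{vec}(\hat\alpha) - \mathrm{vec}(\alpha))^T,$$

$h(\cdot)$ is the density function of the standard normal and $\tilde p = \dim(\mathrm{vec}(\hat\alpha))$.

The term outside the integral depends on an unknown matrix $V(\alpha)$, which will be estimated by $\hat V(\tilde\alpha)$ with $\tilde\alpha$ defined below.

Moreover, the maximum of the function $\kappa(\hat\alpha)$ is achieved at $\hat\alpha = \alpha$ with $\kappa(\alpha) = 0$. Applying the $\tilde p$-dimensional Laplace approximation to the integral in (3.5), we obtain

$$\int \nu_{S \mid \hat\alpha}(s \mid \hat\alpha) \cdot \exp\left( \tilde p \cdot \kappa(\hat\alpha) \right) d\hat\alpha = \frac{1}{2 \sqrt{ \left| \det\left( -\frac{1}{\tilde p} V^{-1}(\alpha) \right) \right| }} \cdot \nu_{S \mid \hat\alpha}(s \mid \alpha) \cdot \left( \frac{2\pi}{\tilde p} \right)^{\tilde p/2} \left\{ 1 + O(\tilde p^{-1}) \right\} \tag{3.6}$$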

[...]

$$f_S(s) = \frac{1}{(2\pi)^{\tilde p/2}} \frac{1}{|\det(V(\tilde\alpha))|^{1/2}} \cdot \frac{1}{2 \sqrt{ \left| \det\left( -\frac{1}{\tilde p} V^{-1}(\alpha) \right) \right| }} \, \nu_{S \mid \hat\alpha}(s \mid \alpha) \left( \frac{2\pi}{\tilde p} \right)^{\tilde p/2} \left\{ 1 + O(\tilde p^{-1}) \right\}$$

$$= \frac{1}{2} \left\{ \frac{|\det(V(\alpha))|}{|\det(V(\tilde\alpha))|} \right\}^{1/2} \cdot \nu_{S \mid \hat\alpha}(s \mid \alpha) \left\{ 1 + O(\tilde p^{-1}) \right\}. \tag{3.7}$$

$\alpha$ is unknown and will be estimated by $\hat\alpha$, see (2.11). Finally, we obtain

$$f_S(s) = \frac{1}{2} \left\{ \frac{|\det(V(\hat\alpha))|}{|\det(V(\tilde\alpha))|} \right\}^{1/2} \cdot \nu_{S \mid \hat\alpha}(s \mid \hat\alpha) \left\{ 1 + O(\tilde p^{-1}) \right\} \tag{3.8}$$

which defines the correction factor of the GFI $S$, leading to $\Omega$ in (3.4).

For reasons of numerical stability and since the log-likelihood function is approximated by $\tilde l$ in (2.10), we use

$$\hat V(\alpha) = \left( \frac{1}{n^2} \sum_{i=1}^{n} \left[ \left( \frac{\partial \tilde l(\alpha, \phi \mid x_i)}{\partial \alpha} \right)^T \cdot \frac{\partial \tilde l(\alpha, \phi \mid x_i)}{\partial \alpha} \right] \right)^{-1} \tag{3.9}$$

instead of $\frac{1}{n} V(\alpha)$, the asymptotic covariance matrix given by (2.14). To get $\hat V(\hat\alpha)$, $\alpha$ is replaced by $\hat\alpha$ in (3.9). (3.9) is preferred to $\frac{1}{n} V(\alpha)$ because the derivative of $\psi$ appears to be very unstable in simulations.

Note that if $\hat\alpha$ and $\tilde\alpha$ are the same estimator, the correction factor becomes simply two. Since our empirical experience shows that a correction is decisive in having a correct inference, we propose to consider two different estimators $\hat\alpha$ and $\tilde\alpha$ in $\hat V(\alpha)$, where $\tilde\alpha$ has a smaller variance. For that, several techniques could in principle be considered, but we propose to use the bagging procedure (Breiman, 1996). Our simulation study shows that this choice is adequate at least for the model we have investigated. The complete algorithm is presented in the next section.

[...] the distribution of $\nu_{S \mid \hat\alpha}(s \mid \hat\alpha)$. In Chapter 6, the null asymptotic distribution of $\Omega$ is derived, which is shown to be normal with rather complicated expressions for the mean and variance. To compute the latter, one needs numerical approximations, which makes inference quite unstable if not inappropriate. We prefer instead to approximate $\nu_{S \mid \hat\alpha}(s \mid \hat\alpha)$ by means of resampling methods. Parametric bootstrap has been widely used in goodness-of-fit testing, as for example by Romano (1988). However, a direct parametric bootstrap would be too computer intensive, because $\hat\alpha$ and $\hat z_i$ need to be computed at each bootstrapped sample. Therefore, following a similar idea as in Salibian-Barrera and Zamar (2002), we propose a fast parametric bootstrap that avoids the computation of $\hat\alpha$ at each bootstrapped sample.

3.3 Computing the p-value

First, we need to estimate $\hat V(\tilde\alpha)$ using the bagging procedure, which can be summarized in the following way. Let $y = (x, \hat z)$ be a data set with corresponding estimated scores.

Repeat for $b = 1, \ldots, B$:

1. Generate a random sample $y_b^\star = (x_b^\star, \hat z_b^\star)$ of size $n$ from $y$ with replacement.

2. Estimate the loadings $\tilde\alpha_b^\star$ corresponding to the sample $y_b^\star$ using (2.11) with $\hat z_b^\star$ fixed.

3. Evaluate $\hat V_b^\star(\tilde\alpha_b^\star \mid y_b^\star)$ with (3.9).

$$\hat V(\tilde\alpha) = \frac{1}{B} \sum_{b=1}^{B} \hat V_b^\star$$


Note that in step 2 we compute the loadings $\tilde{\alpha}^\star_b$ conditionally on the original $\hat{z}^\star_b$. One could re-estimate both $\tilde{\alpha}^\star_b$ and $\hat{z}^\star_i$, but our procedure is much faster and more stable.
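The three bagging steps can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the author's implementation: `fit_loadings` stands in for the estimator (2.11) with the scores held fixed, `eval_cov` stands in for (3.9), and both toy stand-ins below are hypothetical placeholders chosen only so that the example runs.

```python
import numpy as np

def bagged_covariance(x, z_hat, fit_loadings, eval_cov, B=30, seed=0):
    """Average the covariance estimates over B resampled data sets."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    total = None
    for _ in range(B):
        # Step 1: resample rows of y = (x, z_hat) with replacement
        idx = rng.integers(0, n, size=n)
        x_b, z_b = x[idx], z_hat[idx]
        # Step 2: re-estimate the loadings with the resampled scores fixed
        alpha_b = fit_loadings(x_b, z_b)
        # Step 3: evaluate the covariance estimate on the resampled data
        V_b = eval_cov(alpha_b, x_b, z_b)
        total = V_b if total is None else total + V_b
    return total / B

# Hypothetical stand-ins: least squares for (2.11), (Z^T Z)^{-1} for (3.9)
def toy_fit(x_b, z_b):
    return np.linalg.lstsq(z_b, x_b, rcond=None)[0]

def toy_cov(alpha_b, x_b, z_b):
    return np.linalg.inv(z_b.T @ z_b)

rng = np.random.default_rng(1)
z_toy = rng.normal(size=(100, 2))
x_toy = z_toy @ rng.normal(size=(2, 5)) + rng.normal(size=(100, 5))
V_bag = bagged_covariance(x_toy, z_toy, toy_fit, toy_cov, B=30)
```

Because the scores are resampled together with the observations rather than re-estimated, each iteration costs one loadings fit only.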

The fast parametric bootstrap we propose can be summarized in the following way. Let $x$ be the data set supposed to be generated by a GLLVM model. Let also $\hat{z}$, $\hat{\alpha}$ and $\hat{\phi}$ be the corresponding estimated scores, loadings and scale parameters respectively, and $\hat{V}(\hat{\alpha})$ the covariance matrix evaluated with (3.9), using $\hat{\alpha}$, $\hat{\phi}$ and $\hat{z}_i$, $i = 1, \ldots, n$.

Repeat for $b = 1, \ldots, B$:

1. Generate one $\alpha^\star_b$ from its estimated asymptotic distribution $\alpha^\star_b \sim N\left(\hat{\alpha}, \hat{V}(\hat{\alpha})\right)$.

2. Generate $q$ independent standard normal vectors $z$ of size $n$.

3. Generate a vector $\mu = E[x | z]$ of conditional means of all responses defined by

$$
\gamma(\mu) = \alpha^{\star T}_b z. \qquad (3.10)
$$

4. Generate the bootstrapped sample of manifest variables $x^\star_b$ based upon the means that were calculated in (3.10) as well as the scale parameters $\hat{\phi}$ for the normal responses.

5. Given the bootstrapped sample, estimate $\hat{z}^\star_b$ conditionally on $\alpha^\star_b$ with (2.7).

6. Evaluate $\hat{V}(\alpha^\star_b | x^\star_b, \hat{z}^\star_b)$ with (3.9).

$$
\Omega^\star_b = \frac{2}{\det\left(\hat{V}(\alpha^\star_b)\right)^{1/2}} \cdot S(x^\star_b, \hat{z}^\star_b | \alpha^\star_b)
$$


$$
p\text{-value} = \frac{1}{B} \, \# \{ \Omega^\star_b > w \},
$$

where $w$ is the observed value of $\Omega$ computed on the original sample. Note that in step 5 both $\hat{z}^\star_b$ and $\alpha^\star_b$ could be re-estimated, but this would increase the computational time without improving the performance in terms of p-values. The variability of $\hat{\alpha}$ is taken into account in the first step. One could also use $\hat{V}(\tilde{\alpha})$ as a covariance matrix estimate to simulate $\alpha^\star_b$ in step 1. However, under the null hypothesis, the p-values associated with this procedure do not seem to be as close to uniform as those of the procedure presented above.
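The outer loop of the fast parametric bootstrap and the resulting p-value can be sketched as follows. This is a minimal sketch under stated assumptions: `draw_omega` is a hypothetical callable that would carry out steps 2 to 6 and return $\Omega^\star_b$ for the drawn loadings; here it is replaced by a toy random draw so that the example runs, and the toy loadings and covariance are made up.

```python
import numpy as np

def fast_bootstrap_pvalue(w, alpha_hat, V_hat, draw_omega, B=100, seed=0):
    """Return (p-value, bootstrapped statistics) for an observed value w."""
    rng = np.random.default_rng(seed)
    omega_star = np.empty(B)
    for b in range(B):
        # Step 1: draw loadings from the estimated asymptotic normal law
        draw = rng.multivariate_normal(alpha_hat.ravel(), V_hat)
        alpha_b = draw.reshape(alpha_hat.shape)
        # Steps 2-6 and the statistic are delegated to the supplied callable
        omega_star[b] = draw_omega(alpha_b, rng)
    # p-value = (1/B) * #{Omega*_b > w}
    return float(np.mean(omega_star > w)), omega_star

# Hypothetical stand-in for steps 2-6 (a real one would simulate x, fit z, ...)
def toy_omega(alpha_b, rng):
    return rng.chisquare(3)

alpha_hat_toy = np.zeros((3, 2))      # made-up loadings
V_toy = 0.01 * np.eye(6)              # made-up covariance of vec(alpha)
p_val, omegas = fast_bootstrap_pvalue(2.5, alpha_hat_toy, V_toy, toy_omega)
```

The loadings are never re-estimated inside the loop; their variability enters only through the draw in step 1.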

Finally, it should be stressed that this statistic and the procedure to evaluate its p-value are widely applicable. Indeed, if one uses another consistent estimator of $\alpha$, $z$ and a corresponding asymptotic covariance matrix, one could apply the same procedure to define a goodness-of-fit index and its corresponding p-value.


Other GFI for GLLVM

In this chapter, we present the LRT in the GLLVM framework and the Satorra and Bentler (S&B) goodness-of-fit (Satorra and Bentler, 2001). Cut-off criteria such as RMSEA or SRMR are not presented because they are not comparable to our GFI outside the case of normal manifest variables.

In the binary case, Pearson-type statistics are used. They are based on a comparison between the empirical frequencies and the estimated frequencies under the model. Pearson-type statistics require a large number of observations in each cell of the contingency table for their asymptotic distribution to hold. To avoid this problem of sparsity, statistics similar to Pearson's but using only information from lower margins, usually bivariate margins, are available; see e.g. Bartholomew and Leung (2002) or Maydeu-Olivares and Joe (2005). However, if there are significant interactions among higher order margins, the GFI may not reject false models. Moreover, in the ordinal case, this method is only applicable when the number of observations is large. Since the models we use in our simulation are too complex for GFIs based on Pearson-type statistics, they will not be compared to our GFI in


Suppose first that the manifest variables are conditionally normal given the latent ones, i.e.

$$
x | z \sim N_p(\alpha^T z, \zeta), \qquad (4.1)
$$

where $\zeta$ is a diagonal matrix. Let $\mathrm{var}(x) = \Sigma$, then $\Sigma = \alpha^T \alpha + \zeta$ and the likelihood function

$$
l(\Sigma^{-1}) \propto \frac{n}{2} \left\{ \log |\Sigma^{-1}| - \mathrm{trace}[\Sigma^{-1} C] \right\},
$$

where $C$ is the empirical covariance or correlation matrix. Suppose we have estimators $\hat{\alpha}$ and $\hat{\zeta}$ for $\alpha$ and $\zeta$. Then if the number of latent variables $q$ is known or fixed, the likelihood ratio statistic for the null hypothesis $H_0 : \Sigma = \alpha_0^T \alpha_0 + \zeta_0$, or equivalently $H_0 : x \sim N_p\left(\mu, \alpha_0^T \alpha_0 + \zeta_0\right)$, against the alternative that $\Sigma$ is unconstrained is

$$
n \left[ -\log |\hat{\Sigma}^{-1} C| + \mathrm{trace}[\hat{\Sigma}^{-1} C] - p \right], \qquad (4.2)
$$

where $\hat{\Sigma} = \hat{\alpha}^T \hat{\alpha} + \hat{\zeta}$. If the manifest variables are conditionally normal, this statistic is asymptotically distributed as a $\chi^2$ with $\frac{1}{2}[(p - q)^2 - (p + q)]$ degrees of freedom.
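As a numerical illustration, (4.2) and its degrees of freedom can be computed directly from $C$ and $\hat{\Sigma}$. The sketch below assumes both matrices are symmetric positive definite; the function names and the toy matrix are hypothetical and not taken from the thesis.

```python
import numpy as np

def lrt_statistic(C, Sigma_hat, n):
    """Likelihood ratio statistic (4.2) for a fitted covariance Sigma_hat."""
    p = C.shape[0]
    M = np.linalg.solve(Sigma_hat, C)      # Sigma_hat^{-1} C without explicit inverse
    _, logdet = np.linalg.slogdet(M)       # log |Sigma_hat^{-1} C|
    return n * (-logdet + np.trace(M) - p)

def lrt_df(p, q):
    """Degrees of freedom (1/2)[(p - q)^2 - (p + q)]."""
    return ((p - q) ** 2 - (p + q)) / 2

# Toy covariance matrix (made up for illustration)
rng = np.random.default_rng(0)
G = rng.normal(size=(5, 5))
C_toy = G @ G.T + 5 * np.eye(5)

stat_exact = lrt_statistic(C_toy, C_toy, n=100)        # fitted covariance equals C
stat_off = lrt_statistic(C_toy, 2 * C_toy, n=100)      # fitted covariance is off
df_toy = lrt_df(p=5, q=2)
```

The statistic vanishes when $\hat{\Sigma} = C$ and grows as the fitted covariance moves away from the sample covariance.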

If the manifest variables are not (conditionally) normal, the problem of estimating the likelihood under the alternative arises. In the binary case, the multivariate density has $2^p - 1$ parameters which can be estimated using moment estimators of $E[x^{(j_1)} \cdot x^{(j_2)}]$, $j_1 \neq j_2$; see e.g. Teugels (1990). Even


suffers from two drawbacks. First, this procedure cannot be applied to a mixture of continuous and discrete manifest variables, because the likelihood under the alternative would be almost impossible to derive. Secondly, in the ordinal case, the number of observations needs to be very large if the number of manifest variables is moderate to high. Indeed, the number of parameters increases very quickly with the number of manifest variables.

4.2 Satorra and Bentler (S&B) GFI

This statistic is based on a comparison between the sample covariance $C$ and the model covariance $\Sigma$ of the manifest variables $x = (x^{(1)}, \ldots, x^{(p)})$. If the manifest variables are a mixture of ordinal and normal variables, the usual estimator of the sample covariance matrix $C$ is replaced by the polychoric or polyserial covariance matrix, see e.g. Qu, Piedmonte, and Medendorp (1995). Let $s$ and $\sigma$ be the vectors containing all the distinct values of $C$ and $\Sigma$ respectively, and $\Upsilon$ the asymptotic covariance matrix of $\sqrt{n}(s - \sigma)$. Let $\theta$ be the vector containing all the parameters of the GLLVM. Consider $\sigma(\theta)$ and the function

$$
F(\theta) := (s - \sigma(\theta))^T \, \hat{\Upsilon}^{-1} (s - \sigma(\theta)), \qquad (4.3)
$$

where $\hat{\Upsilon}^{-1}$ is a consistent estimator of $\Upsilon^{-1}$. Then the S&B statistic is

$$
T = \frac{n F(s, \sigma(\hat{\theta}))}{c}, \qquad (4.4)
$$


where $\hat{\theta}$ is the weighted least squares (WLS) estimator which minimizes (4.3) and

$$
c = \frac{\dim(\theta)}{r_0},
$$

where $r_0 = \dim(\sigma) - \dim(\theta)$. Then

$$
T \xrightarrow{D} \sum_{j=1}^{r_0} \lambda_j (\chi^2_1)_j, \qquad (4.5)
$$

as $n \to \infty$, where the $(\chi^2_1)_j$ are independent chi-square variables with one df and the $\lambda_j$ are the nonnull eigenvalues of the matrix $U_0 \Upsilon$, with

$$
U_0 = \hat{\Upsilon}^{-1} - \hat{\Upsilon}^{-1} \Delta (\Delta^T \hat{\Upsilon}^{-1} \Delta)^{-1} \Delta^T \hat{\Upsilon}^{-1},
$$

and $\Delta = \frac{\partial}{\partial \theta^T} \sigma(\theta)$.
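The ingredients of (4.3) and (4.5) are plain matrix computations, sketched below. The function names and all numerical inputs are hypothetical and made up for illustration; $\hat{\Upsilon}^{-1}$ is assumed symmetric, and the toy example takes $\hat{\Upsilon} = \Upsilon$ exactly, in which case $U_0 \Upsilon$ is a projection with $r_0$ unit eigenvalues.

```python
import numpy as np

def wls_discrepancy(s, sigma, Upsilon_hat_inv):
    """Quadratic form (4.3) evaluated at a given model vector sigma."""
    r = s - sigma
    return float(r @ Upsilon_hat_inv @ r)

def u0_matrix(Delta, Upsilon_hat_inv):
    """U_0 = W - W D (D^T W D)^{-1} D^T W with W = Upsilon_hat^{-1} (symmetric)."""
    A = Upsilon_hat_inv @ Delta
    return Upsilon_hat_inv - A @ np.linalg.solve(Delta.T @ A, A.T)

# Toy inputs: dim(sigma) = 6, dim(theta) = 2, so r_0 = 4
rng = np.random.default_rng(0)
G = rng.normal(size=(6, 6))
Upsilon = G @ G.T + 6 * np.eye(6)          # made-up asymptotic covariance
W = np.linalg.inv(Upsilon)                 # here Upsilon_hat = Upsilon exactly
Delta = rng.normal(size=(6, 2))            # made-up Jacobian of sigma(theta)

U0 = u0_matrix(Delta, W)
lam = np.sort(np.linalg.eigvals(U0 @ Upsilon).real)   # weights in (4.5)
F_zero = wls_discrepancy(np.ones(6), np.ones(6), W)   # s equal to sigma
```

With an estimated $\hat{\Upsilon}$ that differs from $\Upsilon$, the nonnull eigenvalues generally depart from one, which is what the weights $\lambda_j$ in (4.5) account for.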

In the case of discrete manifest variables, the problem that the information drawn from the sample is reduced to the estimated covariance matrix still remains. If we compare our statistic with (4.2) or (4.4), both measure a distance consisting of a model part and a sample part. The model part of $\Omega$ in (3.4) depends on the $(n \times q)$ matrix of $\hat{z}$, while the model part of $T$ in (4.4) or the LRT based on (4.2) depends only on the $(q \times p)$ matrix of the estimated loadings. Since the S&B statistic is widely used in practice, we will compare its performance to $\Omega$ in the simulation presented in Chapter 5.


Simulation study

In this chapter, we study the behavior in terms of level and power of our GFI $\Omega$ and compare it to the behavior of the S&B GFI $T$. $T$ and its associated p-value, obtained from the asymptotic distribution given in (4.5), are computed with Mplus. We consider models containing 2 and 3 latent variables. 10000 samples of size 100 (for the model with 2 latent variables) and 5000 samples of size 200 (for the model with 3 latent variables) were simulated. They contain respectively 5 ordinal manifest variables and a mixture of 5 ordinal manifest variables and 5 normal manifest variables. 30 random samples were generated in the bagging procedure and 100 bootstrap samples in the fast parametric bootstrap.

The samples of size $n$ are generated in S-Plus using the following procedure:

1. Initialize all the parameters:

• $p(q + 1)$ elements of the matrix $\alpha$,

• $p_1$ variances defining the vector $\phi$,

• $(s - 1)p_2$ $\alpha_s$


2. Generate $q$ independent standard normal vectors $z_j$ of size $n$.

3. Generate a vector $\mu = E[x | z]$ of conditional means of all responses given by

$$
\mu = \gamma^{-1}(\alpha^T z).
$$

4. Generate all responses $x$ based upon the means $\mu$ that were calculated in 3 as well as the scale parameters $\phi$ for the normal responses.
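The generation steps can be sketched in a few lines. This is a simplified illustration, not the S-Plus code used in the thesis: it assumes that $\alpha$ is stored as a $(q + 1) \times p$ array whose first row holds the intercepts, that normal responses use the identity link, and it replaces the ordinal responses by binary ones with a logit link to keep the example short. All names and parameter values are hypothetical.

```python
import numpy as np

def simulate_gllvm(alpha, phi, n, normal_cols, binary_cols, seed=0):
    """Simulate manifest variables x and latent scores z (simplified sketch)."""
    rng = np.random.default_rng(seed)
    q = alpha.shape[0] - 1                     # first row of alpha: intercepts
    # Step 2: q independent standard normal latent vectors of size n
    z = rng.standard_normal((n, q))
    # Linear predictor alpha^T z (with a leading 1 for the intercept), (n, p)
    eta = np.column_stack([np.ones(n), z]) @ alpha
    x = np.empty_like(eta)
    # Steps 3-4, normal responses: mu = eta, variance given by phi
    noise = rng.standard_normal((n, len(normal_cols)))
    x[:, normal_cols] = eta[:, normal_cols] + np.sqrt(phi) * noise
    # Steps 3-4, binary responses: mu = inverse logit of eta
    mu = 1.0 / (1.0 + np.exp(-eta[:, binary_cols]))
    x[:, binary_cols] = rng.random(mu.shape) < mu
    return x, z

# Made-up parameters: q = 2 latent variables, p = 4 manifest variables
alpha_toy = np.array([[0.0, 0.5, -0.2, 0.1],
                      [1.0, 0.8, 0.6, 0.0],
                      [0.0, 0.3, 0.7, 1.2]])
x_sim, z_sim = simulate_gllvm(alpha_toy, phi=np.array([1.0, 0.5]), n=100,
                              normal_cols=[0, 1], binary_cols=[2, 3])
```

An ordinal version would replace the Bernoulli draw by a draw from the category probabilities implied by the thresholds.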

We tried to use our simulated data sets directly to compute $T$ with Mplus, but even under $H_0$ (the correct model is estimated) only 60% of the simulated samples could provide an estimated value for $T$. The situation was even worse under the alternative hypothesis (where an incorrect model is estimated), where in some cases no $T$ is estimated. The situation is not much better if the sample size becomes large ($n = 10000$). Indeed, only 77% of the simulated samples could provide an estimated value for $T$ in that case. This might be due to the fact that the estimation in Mplus is based on the underlying variable approach. This approach assumes that the manifest variables are indirect observations of normal underlying variables. When this is not the case, the model might not be identified. To avoid this problem, new data were simulated directly from Mplus with the same parameters, and residual variance of the underlying variables defined as $\zeta = I_{pq} - \tilde{\alpha}^T \tilde{\alpha}$, where $\tilde{\alpha}$ are the loadings standardized by their Euclidean norm. These data are constrained to follow a less general model than the one generated in S-Plus. Indeed, in the underlying approach, new parameters, the residual variances, have to be fixed and then the model is constrained to have a particular form. When one works with ordinal data and uses $T$ as a goodness-of-fit, one has to test first if the underlying approach is adequate, see Muthén (1993).


alternatives in the two latent variables models and two in the three latent variables models. Under $H_A$, the models are larger (more general) than under $H_0$. More precisely, models under $H_A$ are defined so that each has one more nonnull loading than the previous model. For example, in models containing two latent variables, M3 has the same loadings as M2 except $\alpha^{(5)}_1$, which is 0. In the two latent variables models, M1 has three latent variables, but only one more nonnull loading than M2, which is equivalent to a model with the same parameters but one additional latent variable with corresponding loadings equal to 0. We have the following results in terms of empirical level and power.

Hypothesis    Level    Empirical level      Empirical power
                       Ω        T           Ω        T
H_0           0.10     0.099    0.16
H_0           0.05     0.050    0.11
H_0           0.01     0.014    0.04
H_A (M1)      0.05                          0.75     0.57
H_A (M2)      0.05                          0.42     0.25
H_A (M3)      0.05                          0.12     0.22

Table 5.1: Empirical level and power for the model with 2 latent variables

Hypothesis    Level    Empirical level      Empirical power
                       Ω        T           Ω        T
H_0           0.10     0.103    0.045
H_0           0.05     0.052    0.021
H_0           0.01     0.018    0.002
H_A (M1)      0.05                          0.21     0.09
H_A (M2)      0.05                          0.10     0.06

Table 5.2: Empirical level and power for the model with 3 latent variables
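The entries of Tables 5.1 and 5.2 are rejection frequencies: the empirical level is the proportion of samples simulated under $H_0$ whose p-value falls at or below the nominal level, and the empirical power is the same proportion under $H_A$. A minimal sketch, with a hypothetical function name and uniform toy p-values standing in for the simulated ones:

```python
import numpy as np

def empirical_rejection_rate(p_values, level):
    """Share of simulated samples whose p-value is at or below the level."""
    return float(np.mean(np.asarray(p_values) <= level))

# Toy p-values: uniform draws mimic a test with exact level under H_0
rng = np.random.default_rng(0)
toy_p = rng.random(10000)
rate_05 = empirical_rejection_rate(toy_p, 0.05)
rate_small = empirical_rejection_rate([0.01, 0.2, 0.04, 0.5], 0.05)
```

A well-calibrated test gives a rate close to the nominal level under $H_0$; values above it indicate a liberal test and values below it a conservative one.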


In the two latent variables model, $\Omega$ outperforms the S&B statistic in terms of empirical level and power. The only exception is the empirical power with the M2 model, but this is difficult to compare because of the liberal behavior of the empirical level of $T$. $\Omega$ also has the expected behavior in terms of power when the tested model is further away from the null. The 3 latent variables model simulations confirm these results. Notice that in this case the power is lower because the alternative is closer to the null. Indeed, in the 3 latent variables model, we add a nonnull loading in a loading matrix of size $(10 \times 3)$, while the size of the loading matrix in the 2 latent variables model is $(5 \times 2)$. Moreover, the level of $T$ is much smaller than the prescribed nominal level 0.05 and this makes this test too conservative. The result might be due to the small sample size used in this simulation study. Indeed, the p-value of $T$ is estimated using its asymptotic distribution. But, even when the sample size is larger, one has to be sure that the assumptions of the underlying variables approach are valid to be able to compute $T$ in practice.

Clearly, the new statistic improves goodness-of-fit testing within GLLVM. At present, $\Omega$ and its associated p-values are estimated using R code that calls C code to compute the loadings and the scores using the algorithm presented in Huber, Ronchetti, and Victoria-Feser (2004). Both use a quasi-Newton procedure (Dennis and Schnabel, 1983) where compatible stopping rules have to be defined and can be computationally intensive. One potential improvement would be to develop a stand-alone R library to avoid this two-step procedure, and more efficient algorithms to reduce the computing time.

This procedure can be naturally extended to latent variables with covari-


between the distance among the latent scores minus the estimated function of the covariates and the corresponding distance among the original observations. Further research directions include the definition of new distance functions for nominal discrete variables, and non-normally distributed variables. Finally, our GFI and the procedure to evaluate its p-value could be extended to multilevel models, structural equations models or combined measurement models. Indeed, the GFI can be naturally extended to that case using the generalized factor formulation (GF) of Skrondal and Rabe-Hesketh (2004), but much work needs to be done to extend the estimation procedure and the resampling techniques to these cases.

Appendix A: Simulation Parameters

The loadings with the value 0 for the model under the null hypothesis are actually not estimated. This constraint ensures a unique solution.

A.1: Model with 2 latent variables

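Using (2.1), we have the following model

$$
\gamma = \begin{pmatrix} \gamma^{(1)}\big(E(x^{(1)} | z)\big) \\ \vdots \\ \gamma^{(5)}\big(E(x^{(5)} | z)\big) \end{pmatrix} = \begin{pmatrix} \log \dfrac{p^{(1)}_s}{1 - p^{(1)}_s} \\ \vdots \end{pmatrix}
$$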
