Thesis
Reference
Goodness-of-fit for generalized linear latent variables models
CONNE, David
Abstract
Generalized Linear Latent Variables Models (GLLVM) enable the modelling of relationships between manifest and latent variables. These models are widely used in the social sciences.
In a latent variable framework, one works with several unobservable quantities (latent scores, parameters) and it is therefore essential to choose a model as close as possible to the data.
To test the appropriateness of a particular model, one needs to define a Goodness-of-fit test statistic (GFI). Available GFI can be separated into two groups: first, GFI based on a comparison between the sample covariance and the model covariance of the manifest variables, which implies reducing the information that is contained in the data to their covariance structure, and secondly Pearson-type statistics when manifest variables are binary. In this work, we propose an alternative Goodness-of-fit statistic based on some distance comparison between the latent scores and the original data. This GFI takes into account the nature of each manifest variable and can in principle be applied in various situations and in particular with models with both discrete and [...]
CONNE, David. Goodness-of-fit for generalized linear latent variables models . Thèse de doctorat : Univ. Genève, 2008, no. SES 681
URN : urn:nbn:ch:unige-67331
DOI : 10.13097/archive-ouverte/unige:6733
Available at:
http://archive-ouverte.unige.ch/unige:6733
Goodness-of-fit for Generalized Linear Latent Variables Models
David Conne
Submitted for the degree of Ph.D in Econometrics and Statistics
Department of Econometrics
University of Geneva, Switzerland
Accepted on the recommendation of:
Dr. Eva Cantoni, University of Geneva
Prof. Sophia Rabe-Hesketh, University of California, Berkeley
Prof. Elvezio Ronchetti, co-advisor, University of Geneva
Prof. Maria-Pia Victoria-Feser, co-advisor, University of Geneva
Thesis no. 681
[...] authorized the printing of the present thesis, without thereby intending to express any opinion on the propositions stated in it, which are the sole responsibility of their author.
Geneva, 9 October 2008
The Dean
Bernard MORARD
Generalized Linear Latent Variables Models (GLLVM) enable the modelling of relationships between manifest and latent variables. These models are widely used in the social sciences. In a latent variable framework, one works with several unobservable quantities (latent scores, parameters) and it is therefore essential to choose a model as close as possible to the data. To test the appropriateness of a particular model, one needs to define a Goodness-of-fit test statistic (GFI). Available GFI can be separated into two groups: first, GFI based on a comparison between the sample covariance and the model covariance of the manifest variables, which implies reducing the information that is contained in the data to their covariance structure, and secondly Pearson-type statistics when manifest variables are binary. In this work, we propose an alternative Goodness-of-fit statistic based on some distance comparison between the latent scores and the original data. This GFI takes into account the nature of each manifest variable and can in principle be applied in various situations and in particular with models with both discrete and continuous manifest variables. We propose two procedures to compute the p-values of our GFI. The first one is based on the asymptotic distribution of a U-statistic and appears to be quite difficult to implement numerically. The second one is based on resampling techniques and requires a consistent estimator of the loadings, the scores, and a corresponding asymptotic covariance [...] good performance in terms of empirical level and empirical power, especially compared to the one proposed by Satorra and Bentler (2001).

Finally, a real dataset is analyzed to highlight the application of the methodology. In most health surveys the state of health of individuals is measured through several qualitative, discrete quantitative or dichotomous variables. From these variables, one aims at building univariate indicators of health that summarize the information. To do so, we propose to use a GLLVM, in which the latent variables are the health indicators. We consider the data from the 1997 Swiss Health Survey and we define a new model with two health indicators. The first one describes the health status induced merely by the age of the subject, while the second one captures another dimension of the health status. This latter model is not rejected by our GFI and gives another insight into the understanding of the health status of the population.
Generalized linear latent variable models make it possible to define a link between manifest variables and latent variables. This type of model is widely used in the human sciences. In a latent variable model, the number of unknown quantities (parameters, scores) is very large, and it is therefore essential to choose a model as close as possible to the original data. To test whether a particular model is appropriate, one must define a goodness-of-fit test (GFI). A large number of goodness-of-fit tests for latent variable models are available in the literature. They can be separated into two groups: first, GFI based on the comparison between the variance-covariance matrix under the latent variable model and under the saturated model, which amounts to reducing the information contained in the data to a covariance matrix, and second, Pearson-type statistics when all manifest variables are binary. In this work, we propose a new goodness-of-fit test based on the comparison of the distances between observations in the raw sample and those given by the model.

This test can in principle be applied with manifest variables of different types, in particular with discrete and continuous manifest variables. We propose two techniques to evaluate the p-values of this GFI. The first is based on the asymptotic distribution of a U-statistic and [...] consistent estimator of the loadings and the scores, as well as a corresponding asymptotic covariance matrix.

A simulation study of the behaviour of this statistic reveals that its performance is good in terms of empirical level and power, in particular in comparison with the one proposed by Satorra and Bentler (2001). Finally, an application to a real data set is presented to highlight the potential application of this procedure. In health surveys, the health of individuals is measured through different types of variables, such as ordinal or dichotomous variables. From these variables, one seeks to construct one or several health indices. We propose here to use generalized linear latent variable models, which make it possible to estimate one or several continuous latent variables from a group of observable variables. We consider here the data from the 1997 Swiss Health Survey. We propose a new model with two health indices: the first describes the state of health related to the age of the subject and the second captures a dimension of health independent of age. This model is not rejected by our goodness-of-fit test and makes it possible to assess health from a new angle.
1 Introduction
2 Generalized Linear Latent Variable Models (GLLVM)
3 Goodness-of-fit for Generalized Linear Latent Variable Models (GLLVM)
    3.1 Test Statistic
    3.2 Derivation of the p-value of the Test
    3.3 Computing the p-value
4 Other GFI for GLLVM
    4.1 Likelihood Ratio Test
    4.2 Satorra and Bentler (S&B) GFI
5 Simulation study
    5.1 Discussion of the Results
    Appendix A: Simulation Parameters
6 Asymptotic distribution
    6.1 Asymptotics of $\omega_1$
    6.2 Asymptotic distribution of $S$
        6.2.1 Calculation of $E[h(X_{i_1}, X_{i_2})]$
        6.2.2 Estimation of $E[h(X_{i_1}, X_{i_2})]$
        6.2.3 Estimation of $\zeta_1$
    6.3 Discussion of the Results
    Appendix B: U-statistic of order 2
    Appendix C: Expectation estimators
7 Application: A Latent Variable Approach for the Construction of Continuous Health Indicators
    7.1 Introduction
    7.2 Generalized Linear Latent Variable Models
    7.3 Health Indicators for the 1997 Swiss Health Survey
    7.4 Conclusion
    Appendix D: Model and Estimated Parameters
References
Introduction

In many scientific fields, theoretical concepts that cannot be measured directly are defined by means of observable indicators. Some examples include intelligence in psychology or welfare in economics. In these situations, one selects observable variables supposed to be linked to the unobservable variables. The objective is to link the unobservable variables, called latent variables, to the observed ones, called manifest variables. Jöreskog (1969) proposed a model based on a linear link between the latent variables, assumed to be normal, and the manifest variables, assumed to be conditionally normal given the latent ones. This condition makes it possible to compute exact maximum likelihood estimators using the covariance matrix of the manifest variables. This model is implemented in well-known and widely used software such as LISREL (Jöreskog, 1990) or Mplus (Muthén and Muthén, 2001). However, in economics and social sciences, researchers very often work with surveys which contain responses measured on ordinal or binary scales. When the manifest variables are not normally distributed, the methods implemented in this software suppose that the manifest variables are indirect observations [...] variable approach, does not take directly into account the actual distribution of the variables, except when all manifest variables are binary with the probit link.
Bartholomew (1984) and Moustaki and Knott (2000) proposed to drop the assumption of multivariate normal distribution with the specification of the Generalized Linear Latent Variables Model (GLLVM). This model makes it possible to consider all distributions belonging to the exponential family, which includes both continuous and discrete distributions, such as normal, binomial, Poisson, etc. With GLLVM, the likelihood function depends on multivariate integrals which cannot be calculated analytically. To overcome this problem, Moustaki and Knott (2000) define estimators based on a Gauss-Hermite approximation of these integrals. Unfortunately, this method is often infeasible when the number of latent variables is larger than 2. Huber, Ronchetti, and Victoria-Feser (2004) propose instead to use a Laplace approximation and define new estimators which can be viewed as M-estimators (Huber, 1981). This allows consistent estimation and inference even in the presence of a large number of latent and manifest variables. Furthermore, the Laplace approximation makes it possible to define the h-likelihood scores on each latent variable, the asymptotic properties of which can be found in Lee and Nelder (1996). These estimated scores can also be viewed as penalized quasi-likelihood estimators (Green, 1987). Moreover, in the Bayesian approach, the estimated scores are the mode of the posterior distribution of the scores with estimated parameters. These estimators are called the Empirical Bayes modal (EBM) by Skrondal and Rabe-Hesketh (2004), or modal a posteriori (MAP) by Bock (1985). For a general overview on latent variable modeling, see Skrondal and [...]
In a latent variable framework, one works with several unobservable quantities (latent scores, parameters) and it is therefore essential to choose a model as close as possible to the data. More specifically, the number of latent variables is clearly unknown, as is whether a particular manifest variable is linked to a particular latent variable. To compare models with a different number of latent variables, one could use the Akaike (1973) criterion, which is a powerful measure of relative fit; see Conne (2003) for some numerical comparison in this context. However, this criterion suffers from important drawbacks when applied directly to latent variable models, including the definition of the likelihood and the role of the scores as parameters. For a discussion and a proposition of a corrected Akaike criterion for mixed-effects models, see e.g. Vaida and Blanchard (2005). Even if one could extend this criterion to latent variable models, it would not describe absolute fit and would not test the appropriateness of a particular model. To do so, a Goodness-of-fit index (GFI) is necessary.
Several GFI are available for latent variable models in the literature. In the case of multivariate normal manifest variables, a natural choice is the likelihood ratio test (LRT) on the covariance matrix, see e.g. Bartholomew and Knott (1999). More precisely, the LRT is defined as the difference between the log-likelihood evaluated with the covariance matrix constrained by the latent model and that evaluated at the unconstrained covariance matrix (saturated model). When manifest variables are not multivariate normal, one can use a function of the covariance matrix of the underlying normally distributed manifest variables to define a Goodness-of-fit. Indeed, the unconstrained covariance matrix can be estimated with polychoric correlations [...] method suffers from two drawbacks: firstly, using polychoric or polyserial correlations implies reducing the information that is contained in the data to their covariance structure. Since the covariance matrix is not a sufficient statistic when the variables are ordinal or binary, this implies a loss of information. Moreover, the estimation of polychoric or polyserial correlations implies the estimation of a number of parameters that increases rapidly with the number of manifest variables. Secondly, this method is strongly dependent on the normality assumption of the underlying manifest variables. Therefore, a GFI based on a function of the covariance matrix of the underlying normally distributed manifest variables in a non-normal setting tests simultaneously the normality assumption of the underlying variable and the fit of the GLLVM structure, except when all manifest variables are binary, see e.g. Muthén (1993).
In the binary case, the data can be represented by a contingency table and a Pearson-type statistic can be derived. However, for a moderate sample size and a large number of variables, one is faced with the curse of dimensionality and this method becomes infeasible. In the ordinal case, the number of cells can be large and the problem is even worse. To overcome this problem, Glas (1988), Reiser (1996) or Bartholomew and Leung (2002) propose to use only information on lower order margins (usually first and second order). This method can only be used if all variables are binary or ordinal, and it is unclear how to extend it to mixtures of different types of manifest variables.
Other types of goodness-of-fit statistics available in software like LISREL and MPLUS are the RNI, RMSEA or SRMR. All of them are functions of [...] Moreover, if the manifest variables are not multivariate normal, their null asymptotic distributions are not available. They are used with empirical cut-off criteria and their p-values are not uniformly distributed under the null hypothesis, see Marsh, Hau, and Wen (2004).
We propose an alternative goodness-of-fit statistic which is not based on the comparison between the correlation matrices computed respectively directly from the data and through the estimated GLLVM, but on some distance comparison between the latent scores and the original data. The concept of distance is widely used in the context of cluster analysis to find subjects that are similar, see e.g. Kaufman and Rousseeuw (1990). Our GFI is based on the idea that two subjects that are very dissimilar in the data space should also be very dissimilar in the latent scores space, if the chosen model (under the null hypothesis) is correct.
Our GFI can in principle be applied in various situations and in particular with models with both discrete and continuous manifest variables. For simplicity we develop here our GFI in the case of independent latent variables. However, the procedure can be extended to the case of correlated latent variables by using the estimator of the parameters and the scores which was proposed by Huber, Ronchetti, and Victoria-Feser (2004), p. 900. The latter is an extension of the independent case presented in Chapter 2 to the correlated case. The asymptotic distribution of our GFI under the null hypothesis is derived but it appears to be quite difficult to implement numerically. We propose instead to compute p-values using different resampling techniques that we optimize for computational speed. Our simulations show that the p-values obtained by this procedure have a uniform distribution under the [...] alternatives that are near the null model. Finally, we compare our procedure in terms of empirical levels and powers to a GFI implemented in the software MPLUS and show that our approach improves considerably goodness-of-fit testing within GLLVM.
The thesis is organized as follows: in Chapter 2, we briefly review the GLLVM and the estimation technique based on the Laplace approximation developed by Huber, Ronchetti, and Victoria-Feser (2004). In Chapter 3, we propose a new GFI based on distance comparison and a procedure to compute associated p-values. Different available GFI are reviewed and compared in Chapter 4. A simulation study comparing our GFI and the one proposed by Satorra and Bentler (2001) is presented in Chapter 5. This shows that our GFI has better performance in terms of empirical level and empirical power. In Chapter 6, we present a procedure to approximate the asymptotic distribution of this Goodness-of-fit test statistic. This method is based on the asymptotic distribution of a U-statistic which requires the calculation and the estimation of the first and second moments of the test statistic. Finally, a small simulation study is presented in Section 6.3. From a practical point of view, the asymptotic approximation is not accurate enough to be used for typical sample sizes encountered in real data sets and so this approximation should not be used in practice. In Chapter 7, as an application, we consider the data from the 1997 Swiss Health Survey and build two health indicators. A latent model is defined and tested with our GFI. This latter model gives another insight into the understanding of the health status of the Swiss population.
Generalized Linear Latent Variable Models (GLLVM)

The relationship between $p$ manifest variables $x^{(j)}$, $j = 1, \ldots, p$, and $q$ latent variables $z^{(k)}$, $k = 1, \ldots, q$, $q < p$, is formalized in the same manner as in generalized linear models (McCullagh and Nelder, 1989) by means of conditional distributions $g_j(x^{(j)} \mid z)$ belonging to the exponential family, i.e.

$$g_j(x^{(j)} \mid z) = \exp\left\{ \frac{x^{(j)} \alpha^{(j)T} z - b_j(\alpha^{(j)T} z)}{\phi_j} + c_j(x^{(j)}, \phi_j) \right\},$$

where $\alpha^{(j)} \in \Re^{q+1}$ are the loadings, $\phi_j$ is a scale parameter and the functions $b_j(\alpha^{(j)T} z)$ and $c_j(x^{(j)}, \phi_j)$ depend on the specific distribution $g_j(x^{(j)} \mid z)$. We define a linear link between a function of the conditional expectation of $x^{(j)} \mid z$ and $z$ as

$$\gamma^{(j)}\left( E(x^{(j)} \mid z) \right) = \alpha^{(j)T} z, \quad \text{where } \alpha^{(j)T} = (\alpha^{(j)}_l)_{l = 0, 1, \ldots, q}, \tag{2.1}$$

$$z = (1, z_{(2)}^T)^T, \qquad z_{(2)} = (z^{(1)}, \ldots, z^{(q)})^T. \tag{2.2}$$

We give here the specific functions $b_j$, $c_j$ and $\gamma^{(j)}$ and the scale parameter $\phi_j$ for normal and ordinal conditional distributions $g_j(x^{(j)} \mid z)$.

• Normal manifest variables
Let $x^{(j)} \mid z$ have a normal distribution with mean $\mu$ and variance $\sigma^2$. The link function $\gamma^{(j)}(\cdot)$ is the identity function

$$\gamma^{(j)}\left( E(x^{(j)} \mid z) \right) = E(x^{(j)} \mid z) = \mu = \alpha^{(j)T} z,$$

$$b_j(\alpha^{(j)T} z) = \frac{(\alpha^{(j)T} z)^2}{2},$$

$$c_j(x^{(j)}, \phi_j) = -\frac{1}{2} \left[ \frac{(x^{(j)})^2}{\phi_j} + \log(2\pi\phi_j) \right]$$

and $\phi_j = \sigma^2$.

• Ordinal manifest variables
Let $x^{(j)} \mid z$ follow an ordered multinomial distribution with categories going from 1 to $M^{(j)}$. The link function can be chosen as a logit function

$$\gamma^{(j)}\left( \frac{p_s}{1 - p_s} \right) = \alpha_s^{(j)T} z,$$

where $p_s$ is defined as the cumulative probability of a response $x^{(j)} \mid z$ to be $s$ or less, where $s = 1, \ldots, M^{(j)}$. The $s$ index on $\alpha_s^{(j)}$ indicates that the first component $\alpha^{(j)}_{0,s}$ of the vector $\alpha^{(j)}$ is a threshold that is related to each category.

$$b_j(\alpha^{(j)T} z) = \log \frac{p_{s+1}}{p_{s+1} - p_s},$$

$$c_j(x^{(j)}, \phi_j) = 0$$
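
As a quick numerical check of the normal case above, the following Python sketch evaluates the exponential-family form with $b(\eta) = \eta^2/2$ and $c(x, \phi) = -\frac{1}{2}\{x^2/\phi + \log(2\pi\phi)\}$ and compares it with the usual normal density with mean $\eta$ and variance $\phi$. The function names are illustrative and do not come from the thesis.

```python
import math


def glm_normal_density(x, eta, phi):
    """Exponential-family form exp{(x*eta - b(eta))/phi + c(x, phi)}, normal case."""
    b = eta ** 2 / 2.0
    c = -0.5 * (x ** 2 / phi + math.log(2.0 * math.pi * phi))
    return math.exp((x * eta - b) / phi + c)


def normal_pdf(x, mu, sigma2):
    """Textbook normal density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


# Both expressions should agree up to rounding error.
print(glm_normal_density(1.3, 0.4, 2.0), normal_pdf(1.3, 0.4, 2.0))
```

Completing the square in the exponent shows why the two agree: $(x\eta - \eta^2/2)/\phi - x^2/(2\phi) = -(x - \eta)^2/(2\phi)$.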
The main assumption in the GLLVM is the conditional independence of the manifest variables given the latent ones. Hence, the joint conditional distribution is $\prod_{j=1}^{p} g_j(x^{(j)} \mid z)$. Assuming further that the latent variables are independent and distributed as standard normal variables, the joint marginal distribution of the manifest variables is given by

$$f_{\alpha,\phi}(x) = \int \left[ \prod_{j=1}^{p} g_j(x^{(j)} \mid z) \right] h(z_{(2)}) \, dz_{(2)}, \tag{2.3}$$

where $h(\cdot)$ is the density of a $q$-dimensional standard normal variable. Given a sample of $n$ observations $x_i = (x_i^{(1)}, \ldots, x_i^{(p)})^T$, $i = 1, \ldots, n$, the log-likelihood of the $(q + 1) \times p$ loadings matrix $\alpha$ and the $p$-vector of scale parameters $\phi$ is $\ell(\alpha, \phi \mid x) = \sum_{i=1}^{n} \log f_{\alpha,\phi}(x_i)$. This expression contains a multidimensional integral which cannot be computed explicitly, except when $x \mid z$ is distributed according to a multivariate normal. Huber, Ronchetti, and Victoria-Feser (2004) propose to use a Laplace approximation of (2.3) and define an estimator called LAMLE. This idea has been used in several fields, including the Bayesian approach to approximate the posterior distribution (see e.g. Tierney and Kadane, 1986) and in a simplified form in generalized linear mixed models (GLMM) (see Breslow and Clayton, 1993).
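Since our GFI will be based on the LAMLE, we give here the definition and the properties of the LAMLE. For more details, see Huber, Ronchetti, and Victoria-Feser (2004). By rewriting the density of $x_i$ in (2.3) as

$$f_{\alpha,\phi}(x_i) = \int e^{p \cdot Q(\alpha, z, x_i)} \, dz_{(2)}, \tag{2.4}$$

$$Q(\alpha, z, x_i) = \frac{1}{p} \left[ \sum_{j=1}^{p} \left\{ \frac{x_i^{(j)} \alpha^{(j)T} z - b_j(\alpha^{(j)T} z)}{\phi_j} + c_j(x_i^{(j)}, \phi_j) \right\} - \frac{z_{(2)}^T z_{(2)}}{2} - \frac{q}{2} \log(2\pi) \right], \tag{2.5}$$

and by applying the Laplace approximation to (2.4), we obtain

$$f_{\alpha,\phi}(x_i) = \left( \frac{2\pi}{p} \right)^{q/2} \det\left( -W(\hat z_{i(2)}) \right)^{-1/2} e^{p \, Q(\alpha, \hat z_i, x_i)} \left( 1 + O(p^{-1}) \right), \tag{2.6}$$

where

$$W(z) = \frac{\partial^2 Q(\alpha, z, x_i)}{\partial z_{(2)}^T \, \partial z_{(2)}} = -\frac{1}{p} \Gamma(\alpha, \phi, z), \qquad \Gamma(\alpha, \phi, z) = \sum_{j=1}^{p} \frac{1}{\phi_j} \frac{\partial^2 b_j(\alpha^{(j)T} z)}{\partial z_{(2)}^T \, \partial z_{(2)}} + I_q.$$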
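(2.6) depends on the unknown quantity $\hat z_i$, the maximum of the function $Q$ defined through

$$\frac{\partial Q(\alpha, \hat z_i, x_i)}{\partial \hat z_i} = 0, \tag{2.7}$$

and can be estimated iteratively by means of

$$\hat z_{i(2)} = \hat z_{i(2)}(\alpha, \phi, x_i) = \sum_{j=1}^{p} \frac{1}{\phi_j} \left( x_i^{(j)} - \frac{\partial b_j(\alpha^{(j)T} \hat z_i)}{\partial \alpha^{(j)T} \hat z_i} \right) \alpha_{(2)}^{(j)}, \tag{2.8}$$

where $\alpha^{(j)} = (\alpha_0^{(j)}, \alpha_{(2)}^{(j)T})^T$. It should be noted that $\hat z_{i(2)}$ can be interpreted as the maximum likelihood estimators of the latent scores. Indeed, if the $z_{i(2)}$ were considered as parameters, the first derivative of the likelihood with respect to $z_{i(2)}$ for fixed $\alpha$ and $\phi$ leads exactly to the expression (2.7). [...] who show that as $n \to \infty$

$$n^{1/2} (\hat z_{i(2)} - z_{i(2)}) \xrightarrow{D} N\left( 0, \Gamma(\alpha, \phi, z_i)^{-1} \right). \tag{2.9}$$

From (2.5) and (2.6), we obtain the Laplace approximated log-likelihood function

$$\tilde l(\alpha, \phi \mid x) = \sum_{i=1}^{n} \left( -\frac{1}{2} \log \det\{ \Gamma(\alpha, \phi, \hat z_i) \} - \frac{\hat z_{i(2)}^T \hat z_{i(2)}}{2} + \sum_{j=1}^{p} \left\{ \frac{x_i^{(j)} \alpha^{(j)T} \hat z_i - b_j(\alpha^{(j)T} \hat z_i)}{\phi_j} + c_j(x_i^{(j)}, \phi_j) \right\} \right). \tag{2.10}$$

The resulting loadings estimator of $\alpha$, called the LAMLE, is the solution of

$$\frac{\partial \tilde l(\alpha, \phi \mid x)}{\partial \alpha_{kl}} = \sum_{i=1}^{n} \psi(x_i; \alpha, \hat z_i) = 0, \qquad k = 1, \ldots, p, \tag{2.11}$$

where

$$\psi(x_i; \alpha, \hat z_i) = -\frac{1}{2} \operatorname{tr}\left\{ \Gamma(\alpha, \phi, \hat z_i)^{-1} \frac{\partial \Gamma(\alpha, \phi, \hat z_i)}{\partial \alpha_{kl}} \right\} + \frac{1}{\phi_k} \left( x_i^{(k)} - \frac{\partial b_k(\alpha^{(k)T} \hat z_i)}{\partial \alpha^{(k)T} \hat z_i} \right) \hat z_{il}, \tag{2.12}$$

with $\hat z_i$ given implicitly by (2.8). Equation (2.11), which defines the LAMLE, may have multiple solutions. If $q > 1$, it is necessary to impose $q(q - 1)/2$ constraints on the parameters $\alpha$ to obtain a unique solution.

The LAMLE $\hat\alpha$ belongs to the class of M-estimators (Huber, 1981), and under the conditions given in Huber (1981), as $n \to \infty$

$$n^{1/2} (\hat\alpha - \alpha) \xrightarrow{D} N(0, V(\alpha)), \tag{2.13}$$

where

$$V(\alpha) = B(\alpha)^{-1} A(\alpha) B(\alpha)^{-T}, \tag{2.14}$$

$$A(\alpha) = E\left[ \psi(x; \alpha, \hat z) \, \psi^T(x; \alpha, \hat z) \right], \qquad B(\alpha) = -E\left[ \frac{\partial \psi(x; \alpha, \hat z)}{\partial \alpha} \right],$$

and the expectations are taken under the GLLVM model. For more details, specifically for the estimating equations in the normal and ordinal cases, see Huber, Ronchetti, and Victoria-Feser (2004) and Huber (2004).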
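
The text notes that the integral in (2.3) can be computed explicitly when $x \mid z$ is multivariate normal. The sketch below uses that special case as a sanity check of (2.8) and (2.10): with $b_j(\eta) = \eta^2/2$ the fixed point (2.8) becomes a linear system, $\Gamma = \sum_j \alpha_{(2)}^{(j)} \alpha_{(2)}^{(j)T} / \phi_j + I_q$ does not depend on $z$, and the Laplace-approximated log-likelihood should coincide with the exact one, since marginally $x \sim N(\alpha_0, A A^T + \mathrm{diag}(\phi))$. The function names, the split of the loadings into an intercept vector `alpha0` and a slope matrix `A`, and the simulated data are assumptions made for this illustration only; this is not the estimation code of Huber, Ronchetti, and Victoria-Feser (2004).

```python
import numpy as np


def scores_normal(x, alpha0, A, phi):
    """Solve (2.8) in closed form when every manifest variable is normal.

    x: (n, p) data, alpha0: (p,) intercepts, A: (p, q) slopes, phi: (p,) variances.
    """
    q = A.shape[1]
    gamma = A.T @ (A / phi[:, None]) + np.eye(q)   # Gamma, constant in z here
    rhs = ((x - alpha0) / phi) @ A                 # shape (n, q)
    z_hat = np.linalg.solve(gamma, rhs.T).T
    return z_hat, gamma


def laplace_loglik_normal(x, alpha0, A, phi):
    """Laplace-approximated log-likelihood (2.10), normal manifest variables."""
    z_hat, gamma = scores_normal(x, alpha0, A, phi)
    eta = alpha0 + z_hat @ A.T                     # linear predictors
    _, logdet = np.linalg.slogdet(gamma)
    fit = ((x * eta - eta ** 2 / 2.0) / phi
           - 0.5 * (x ** 2 / phi + np.log(2.0 * np.pi * phi))).sum(axis=1)
    return float(np.sum(-0.5 * logdet - 0.5 * (z_hat ** 2).sum(axis=1) + fit))


def exact_loglik_normal(x, alpha0, A, phi):
    """Exact log-likelihood: marginally x ~ N(alpha0, A A^T + diag(phi))."""
    n, p = x.shape
    sigma = A @ A.T + np.diag(phi)
    d = x - alpha0
    _, logdet = np.linalg.slogdet(sigma)
    quad = np.einsum("ij,ij->i", d, np.linalg.solve(sigma, d.T).T).sum()
    return float(-0.5 * (n * p * np.log(2.0 * np.pi) + n * logdet + quad))


# Simulated toy data (hypothetical parameter values).
rng = np.random.default_rng(0)
n, p, q = 50, 5, 2
A = rng.normal(size=(p, q))
alpha0 = rng.normal(size=p)
phi = rng.uniform(0.5, 1.5, size=p)
x = alpha0 + rng.normal(size=(n, q)) @ A.T + rng.normal(size=(n, p)) * np.sqrt(phi)
print(laplace_loglik_normal(x, alpha0, A, phi), exact_loglik_normal(x, alpha0, A, phi))
```

The agreement follows from $\det(A A^T + D) = \det(D)\det(I_q + A^T D^{-1} A)$ with $D = \mathrm{diag}(\phi)$; for non-normal manifest variables the two quantities differ by the $O(p^{-1})$ term in (2.6).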
Goodness-of-fit for Generalized Linear Latent Variable Models (GLLVM)

3.1 Test Statistic

The objective of a GFI is to measure the distance between a suitable quantity computed from the sample and its estimated counterpart using the assumed model. The basic idea of our GFI is to compare the distance among the latent scores and the corresponding distance among the original observations. The latent scores represent in a way the mapping of the observations on the latent variable space. Hence, if the model is adequate, a distance between two observations in the original data space should correspond to a similar distance on the latent space. Clearly, we need to define a general distance measure on the latent space and on the data space while taking into account the nature of the different variables. We propose here to use the distances developed
[...] according to the GLLVM, each observation $x_i$ has a corresponding (unknown) latent score $z_i$ estimated by $\hat z_i$ by means of (2.7). Let $d_q(\hat z_{i_1}, \hat z_{i_2})$ be a distance function on the scores space and $\tilde d_p(x_{i_1}, x_{i_2})$ a distance function on the data space. Since $\hat z$ is continuous, a natural choice for $d_q(\cdot, \cdot)$ is the Euclidean distance standardized by the standard deviation of $\hat z_i$, i.e.

$$d_q(\hat z_{i_1}, \hat z_{i_2}) = \frac{1}{q} \sqrt{ \sum_{j=1}^{q} \left( \frac{\hat z_{i_1}^{(j)} - \hat z_{i_2}^{(j)}}{\tilde\sigma_z^{(j)}} \right)^2 }, \tag{3.1}$$

where $\tilde\sigma_z^{(j)} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (\hat z_i^{(j)} - \bar{\hat z}^{(j)})^2 }$ is the empirical standard deviation of the $\hat z_i^{(j)}$, the $j$-th component of the vector $\hat z_i$. In the sample space, if $x^{(j)}$ is normally distributed, the Euclidean distance function is also suitable for $\tilde d_p(\cdot, \cdot)$. When $x^{(j)}$ is ordinal, a standard choice is the Manhattan distance ($L_1$ distance) on the ranks $r_i^{(j)}$ of the observations. Hence for a model with $p_1$ normal manifest variables and $p_2$ ordinal manifest variables, we have

$$\tilde d_p(x_{i_1}, x_{i_2}) = \frac{1}{p_1} \sqrt{ \sum_{j=1}^{p_1} \left( \frac{x_{i_1}^{(j)} - x_{i_2}^{(j)}}{\tilde\sigma_x^{(j)}} \right)^2 } + \frac{1}{p_2} \sum_{j=1}^{p_2} \frac{ \left| r_{i_1}^{(j)} - r_{i_2}^{(j)} \right| }{n^2}, \tag{3.2}$$

where $\tilde\sigma_x^{(j)}$ is the empirical standard deviation of $x^{(j)}$ and $n^2$ is a scale factor for the ranks corresponding to the maximum of the differences $r_{i_1}^{(j)} - r_{i_2}^{(j)}$, which has the same order as the variance of $r_i^{(j)}$. Consequently, a natural GFI is defined by

$$\begin{aligned} S(x, \hat z \mid \hat\alpha) &= \frac{1}{\frac{n^2 - n}{2}} \sum_{i_1=1}^{n} \sum_{\substack{i_2=1 \\ i_1 > i_2}}^{n} \left[ d_q(\hat z_{i_1}, \hat z_{i_2}) - \tilde d_p(x_{i_1}, x_{i_2}) \right]^2 \\ &= \frac{1}{\frac{n^2 - n}{2}} \sum_{i_1=1}^{n} \sum_{\substack{i_2=1 \\ i_1 > i_2}}^{n} \left[ \frac{1}{q} \sqrt{ \sum_{j=1}^{q} \left( \frac{\hat z_{i_1}^{(j)} - \hat z_{i_2}^{(j)}}{\tilde\sigma_z^{(j)}} \right)^2 } - \frac{1}{p_1} \sqrt{ \sum_{j=1}^{p_1} \left( \frac{x_{i_1}^{(j)} - x_{i_2}^{(j)}}{\tilde\sigma_x^{(j)}} \right)^2 } - \frac{1}{p_2} \sum_{j=1}^{p_2} \frac{ \left| r_{i_1}^{(j)} - r_{i_2}^{(j)} \right| }{n^2} \right]^2. \end{aligned} \tag{3.3}$$

Basically, this GFI is an average squared difference between a general distance on the sample space and its estimated counterpart on the latent space. It implies that only the latent scores matrix is used in the model part of our GFI. Other distances can be specified for $d_q(\cdot, \cdot)$ and $\tilde d_p(\cdot, \cdot)$. However, Conne (2005) shows in a simulation study that $S$ has a good performance in terms of empirical power compared to the GFI based on other distances.
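
The statistic in (3.1) to (3.3) can be sketched in a few lines of Python. This is an illustration only: the function name `gfi_S` and its argument layout are hypothetical, the ranks of the ordinal variables are assumed to be supplied by the caller, and the model is assumed to contain at least one normal and one ordinal manifest variable. Standard deviations use the $1/n$ convention of (3.1).

```python
import numpy as np


def gfi_S(z_hat, x_normal, ranks_ordinal):
    """Average squared difference between score-space and data-space distances.

    z_hat: (n, q) estimated scores, x_normal: (n, p1) normal manifest variables,
    ranks_ordinal: (n, p2) ranks of the ordinal manifest variables.
    """
    n, q = z_hat.shape
    p1 = x_normal.shape[1]
    p2 = ranks_ordinal.shape[1]

    zs = z_hat / z_hat.std(axis=0)          # np.std uses the 1/n convention
    xs = x_normal / x_normal.std(axis=0)

    i1, i2 = np.tril_indices(n, k=-1)       # all pairs with i1 > i2

    d_q = np.sqrt(((zs[i1] - zs[i2]) ** 2).sum(axis=1)) / q                    # (3.1)
    d_p = np.sqrt(((xs[i1] - xs[i2]) ** 2).sum(axis=1)) / p1                   # (3.2), normal part
    d_p = d_p + np.abs(ranks_ordinal[i1] - ranks_ordinal[i2]).sum(axis=1) / (p2 * n ** 2)

    return float(np.mean((d_q - d_p) ** 2))  # mean over the n(n-1)/2 pairs, (3.3)


# Tiny hand-checkable example with two observations.
z = np.array([[0.0], [2.0]])
x = np.array([[1.0], [3.0]])
r = np.array([[1], [2]])
print(gfi_S(z, x, r))
```

With two observations there is a single pair: both standardized Euclidean distances equal 2 and the rank term equals $1/4$, so the squared difference is $0.25^2$.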
Since the distribution of $S$ depends on $\hat\alpha$ through $\hat z$, in order to obtain correct inference, $\hat\alpha$ is integrated out using its asymptotic distribution given by (2.13). It turns out that this corresponds to making a correction on $S$ based on a distance between two estimated asymptotic covariance matrices of $\hat\alpha$, namely $V(\tilde\alpha)$ and $V(\hat\alpha)$. This leads to the following GFI

$$\Omega = 2 \left( \frac{\det \hat V(\tilde\alpha)}{\det \hat V(\hat\alpha)} \right)^{1/2} \cdot S(x, \hat z \mid \hat\alpha). \tag{3.4}$$

This correction factor will be derived in the next section.
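
A minimal sketch of the correction in (3.4), assuming the value of $S$ and the two estimated covariance matrices are already available as NumPy arrays. The function name is hypothetical. Working with log-determinants avoids overflow or underflow when the number of loadings is large, which is a numerical choice made here and not something prescribed by the text.

```python
import numpy as np


def corrected_gfi(S, V_tilde, V_hat):
    """Omega = 2 * (det V_tilde / det V_hat)^(1/2) * S, computed via log-determinants."""
    _, logdet_tilde = np.linalg.slogdet(V_tilde)
    _, logdet_hat = np.linalg.slogdet(V_hat)
    return 2.0 * np.exp(0.5 * (logdet_tilde - logdet_hat)) * S


# If both covariance estimates coincide, the factor reduces to 2.
print(corrected_gfi(1.5, np.eye(3), np.eye(3)))
```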
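3.2 Derivation of the p-value of the Test

p-values computed using $\nu_{S \mid \hat\alpha}(s \mid \hat\alpha)$, the conditional density of $S$ given $\hat\alpha$, will depend on $\hat\alpha$. In order to obtain correct unconditional inference, we consider $\hat\alpha$ as a nuisance parameter, and we integrate out $\hat\alpha$ using its asymptotic normal distribution given by (2.13), i.e.

$$\begin{aligned} f_S(s) &= \int \nu_{S \mid \hat\alpha}(s \mid \hat\alpha) \cdot h\left( V^{-1/2}(\alpha) (\mathrm{vec}(\hat\alpha) - \mathrm{vec}(\alpha)) \right) |\det(V(\alpha))|^{-1/2} \, d\hat\alpha \\ &= \frac{1}{(2\pi)^{\tilde p / 2}} \frac{1}{|\det(V(\alpha))|^{1/2}} \cdot \int \nu_{S \mid \hat\alpha}(s \mid \hat\alpha) \cdot \exp\left( \tilde p \cdot \kappa(\hat\alpha) \right) d\hat\alpha, \end{aligned} \tag{3.5}$$

where

$$\kappa(\hat\alpha) = -\frac{1}{2 \tilde p} \cdot (\mathrm{vec}(\hat\alpha) - \mathrm{vec}(\alpha))^T V^{-1}(\alpha) (\mathrm{vec}(\hat\alpha) - \mathrm{vec}(\alpha)),$$

$h(\cdot)$ is the density function of the standard normal and $\tilde p = \dim(\mathrm{vec}(\hat\alpha))$.

The term outside the integral depends on an unknown matrix $V(\alpha)$, which will be estimated by $\hat V(\tilde\alpha)$ with $\tilde\alpha$ defined below.

Moreover, the maximum of the function $\kappa(\hat\alpha)$ is achieved at $\hat\alpha = \alpha$ with $\kappa(\alpha) = 0$. Applying the $\tilde p$-dimensional Laplace approximation to the integral in (3.5), we obtain

$$\int \nu_{S \mid \hat\alpha}(s \mid \hat\alpha) \cdot \exp\left( \tilde p \cdot \kappa(\hat\alpha) \right) d\hat\alpha = \frac{1}{2 \sqrt{ \left| \det\left( -\frac{1}{\tilde p} V^{-1}(\alpha) \right) \right| }} \cdot \nu_{S \mid \hat\alpha}(s \mid \alpha) \cdot \left( \frac{2\pi}{\tilde p} \right)^{\tilde p / 2} \left\{ 1 + O(\tilde p^{-1}) \right\} \tag{3.6}$$

and therefore

$$\begin{aligned} f_S(s) &= \frac{1}{(2\pi)^{\tilde p / 2}} \frac{1}{|\det(V(\tilde\alpha))|^{1/2}} \cdot \frac{1}{2 \sqrt{ \left| \det\left( -\frac{1}{\tilde p} V^{-1}(\alpha) \right) \right| }} \, \nu_{S \mid \hat\alpha}(s \mid \alpha) \left( \frac{2\pi}{\tilde p} \right)^{\tilde p / 2} \left\{ 1 + O(\tilde p^{-1}) \right\} \\ &= \frac{1}{2} \left( \frac{|\det(V(\alpha))|}{|\det(V(\tilde\alpha))|} \right)^{1/2} \cdot \nu_{S \mid \hat\alpha}(s \mid \alpha) \left\{ 1 + O(\tilde p^{-1}) \right\}. \end{aligned} \tag{3.7}$$

$\alpha$ is unknown and will be estimated by $\hat\alpha$, see (2.11). Finally, we obtain

$$f_S(s) = \frac{1}{2} \left( \frac{|\det(V(\hat\alpha))|}{|\det(V(\tilde\alpha))|} \right)^{1/2} \cdot \nu_{S \mid \hat\alpha}(s \mid \hat\alpha) \left\{ 1 + O(\tilde p^{-1}) \right\} \tag{3.8}$$

which defines the correction factor of the GFI $S$, leading to $\Omega$ in (3.4).

For reasons of numerical stability and since the log-likelihood function is approximated by $\tilde l$ in (2.10), we use

$$\hat V(\alpha) = \left( \frac{1}{n^2} \sum_{i=1}^{n} \left[ \left( \frac{\partial \tilde l(\alpha, \phi \mid x_i)}{\partial \alpha} \right)^T \cdot \frac{\partial \tilde l(\alpha, \phi \mid x_i)}{\partial \alpha} \right] \right)^{-1} \tag{3.9}$$

instead of $\frac{1}{n} V(\alpha)$, the asymptotic covariance matrix given by (2.14). To get $\hat V(\hat\alpha)$, $\alpha$ is replaced by $\hat\alpha$ in (3.9). (3.9) is preferred to $n^{-1} V(\alpha)$ because the derivative of $\psi$ appears to be very unstable in simulations.

Note that if $\hat\alpha$ and $\tilde\alpha$ are the same estimator, the correction factor becomes simply two. Since our empirical experience shows that a correction is decisive in having a correct inference, we propose to consider two different estimators $\hat\alpha$ and $\tilde\alpha$ in $\hat V(\alpha)$, where $\tilde\alpha$ has a smaller variance. For that, several techniques could in principle be considered, but we propose to use the bagging procedure (Breiman, 1996). Our simulation study shows that this choice is adequate at least for the model we have investigated. The complete algorithm is presented in the next section.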
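
The simplification from the first to the second line of (3.7) rests on a determinant identity that is not written out in the text. As a sketch, using only that $V(\alpha)$ is a $\tilde p \times \tilde p$ matrix:

$$\left| \det\left( -\tfrac{1}{\tilde p} V^{-1}(\alpha) \right) \right| = \tilde p^{-\tilde p} \, |\det V(\alpha)|^{-1} \quad \Longrightarrow \quad \frac{1}{\sqrt{ \left| \det\left( -\tfrac{1}{\tilde p} V^{-1}(\alpha) \right) \right| }} = \tilde p^{\tilde p / 2} \, |\det V(\alpha)|^{1/2},$$

$$\frac{1}{(2\pi)^{\tilde p / 2}} \cdot \tilde p^{\tilde p / 2} \cdot \left( \frac{2\pi}{\tilde p} \right)^{\tilde p / 2} = 1.$$

All powers of $2\pi$ and $\tilde p$ therefore cancel, and only the factor $\frac{1}{2} \left( |\det V(\alpha)| / |\det V(\tilde\alpha)| \right)^{1/2}$ multiplying the conditional density remains. Dividing $S$ by the estimated version of this factor gives the multiplier $2 \left( \det \hat V(\tilde\alpha) / \det \hat V(\hat\alpha) \right)^{1/2}$ that appears in (3.4).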
[...] the distribution of $\nu_{S \mid \hat\alpha}(s \mid \hat\alpha)$. In Chapter 6, the null asymptotic distribution of $\Omega$ is derived, which is shown to be normal with rather complicated expressions for the mean and variance. To compute the latter, one needs numerical approximations, which makes inference quite unstable if not inappropriate. We prefer instead to approximate $\nu_{S \mid \hat\alpha}(s \mid \hat\alpha)$ by means of resampling methods. The parametric bootstrap has been widely used in goodness-of-fit testing, as for example by Romano (1988). However, a direct parametric bootstrap would be too computer intensive, because $\hat\alpha$ and $\hat z_i$ need to be computed at each bootstrapped sample. Therefore, following a similar idea as in Salibian-Barrera and Zamar (2002), we propose a fast parametric bootstrap that avoids the computation of $\hat\alpha$ at each bootstrapped sample.

3.3 Computing the p-value
First, we need to estimate $\hat V(\tilde\alpha)$ using the bagging procedure, which can be summarized in the following way. Let $y = (x, \hat z)$ be a data set with corresponding estimated scores.

Repeat for $b = 1, \ldots, B$:
1. Generate a random sample $y_b^\star = (x_b^\star, \hat z_b^\star)$ of size $n$ from $y$ with replacement.
2. Estimate the loadings $\tilde\alpha_b^\star$ corresponding to the sample $y_b^\star$ using (2.11) with $\hat z_b^\star$ fixed.
3. Evaluate $\hat V_b^\star(\tilde\alpha_b^\star \mid y_b^\star)$ with (3.9).

$$\hat V(\tilde\alpha) = \frac{1}{B} \sum_{b=1}^{B} \hat V_b^\star(\tilde\alpha_b^\star \mid y_b^\star)$$

Note that in step 2 we compute the loadings $\tilde\alpha_b^\star$ conditionally on the original $\hat z_b^\star$. One could reestimate both $\tilde\alpha_b^\star$ and $\hat z_i^\star$, but our procedure is much faster and more stable.
The fast parametric bootstrap we propose can be summarized in the following way. Let $x$ be the data set supposed to be generated by a GLLVM model. Let also $\hat z$, $\hat\alpha$ and $\hat\phi$ be the corresponding estimated scores, loadings and scale parameters respectively, and $\hat V(\hat\alpha)$ the covariance matrix evaluated with (3.9), using $\hat\alpha$, $\hat\phi$ and $\hat z_i$, $i = 1, \ldots, n$.

Repeat for $b = 1, \ldots, B$:
1. Generate one $\alpha_b^\star$ from its estimated asymptotic distribution $\alpha_b^\star \sim N\left( \hat\alpha, \hat V(\hat\alpha) \right)$.
2. Generate $q$ independent standard normal vectors $z$ of size $n$.
3. Generate a vector $\mu = E[x \mid z]$ of conditional means of all responses defined by
$$\gamma(\mu) = \alpha_b^{\star T} z. \tag{3.10}$$
4. Generate the bootstrapped sample of manifest variables $x_b^\star$ based upon the means that were calculated in (3.10) as well as the scale parameters $\hat\phi$ for the normal responses.
5. Given the bootstrapped sample, estimate $\hat z_b^\star$ conditionally on $\alpha_b^\star$ with (2.7).
6. Evaluate $\hat V(\alpha_b^\star \mid x_b^\star, \hat z_b^\star)$ with (3.9).

$$\Omega_b^\star = 2 \left( \frac{\det \hat V(\tilde\alpha)}{\det \hat V(\alpha_b^\star)} \right)^{1/2} \cdot S(x_b^\star, \hat z_b^\star \mid \alpha_b^\star)$$

$$p\text{-value} = \frac{1}{B} \# \{ \Omega_b^\star > w \},$$

where $w$ is the observed value of $\Omega$ computed on the original sample. Note that in step 5 both $\hat z_b^\star$ and $\alpha_b^\star$ could be reestimated, but this would increase the computational time without improvement on the performance in terms of p-values. The variability of $\hat\alpha$ is taken into account in the first step. One could also use $\hat V(\tilde\alpha)$ as a covariance matrix estimate to simulate $\alpha_b^\star$ in step 1. However, under the null hypothesis, the p-values associated with this procedure do not seem to be as close to uniform as the one presented above.
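
The six bootstrap steps and the resulting p-value can be outlined as follows. This is a skeleton under stated assumptions, not the implementation used in the thesis: data generation (steps 3 and 4), score estimation (step 5), the covariance estimate (3.9) and the statistic $S$ are passed in as hypothetical callables, the loadings are flattened row by row (so `V_hat` must use the same ordering), and each bootstrap statistic is assumed to carry the same form of correction as (3.4), with $\hat V(\tilde\alpha)$ in the numerator.

```python
import numpy as np


def fast_bootstrap_pvalue(omega_obs, alpha_hat, V_hat, V_tilde, simulate,
                          estimate_scores, cov_estimate, statistic,
                          n, q, B=200, seed=0):
    """Share of bootstrap statistics exceeding the observed value omega_obs.

    simulate(alpha_b, z, rng) -> bootstrapped sample x_b       (steps 3-4)
    estimate_scores(alpha_b, x_b) -> scores given alpha_b      (step 5)
    cov_estimate(alpha_b, x_b, z_b) -> covariance as in (3.9)  (step 6)
    statistic(x_b, z_b) -> value of S
    All four callables are placeholders supplied by the user.
    """
    rng = np.random.default_rng(seed)
    _, logdet_tilde = np.linalg.slogdet(V_tilde)
    omegas = np.empty(B)
    for b in range(B):
        # Step 1: draw loadings from N(alpha_hat, V_hat).
        alpha_b = rng.multivariate_normal(alpha_hat.ravel(), V_hat).reshape(alpha_hat.shape)
        # Step 2: q independent standard normal latent variables for n subjects.
        z = rng.standard_normal((n, q))
        x_b = simulate(alpha_b, z, rng)
        z_b = estimate_scores(alpha_b, x_b)
        V_b = cov_estimate(alpha_b, x_b, z_b)
        _, logdet_b = np.linalg.slogdet(V_b)
        omegas[b] = 2.0 * np.exp(0.5 * (logdet_tilde - logdet_b)) * statistic(x_b, z_b)
    return float(np.mean(omegas > omega_obs))
```

The loadings are never re-estimated inside the loop, which is where the computational saving over a direct parametric bootstrap comes from.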
Finally,itshouldbestressed thatthisstatistiandtheproedure toeval-
uate its p-value is widely appliable. Indeed, if one uses another onsistent
estimator of
α
,z
and a orresponding asymptoti ovariane matrix, one ould apply the same proedure to dene a goodness-of-t index and itsorresponding p-value.
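The resampling loop and the p-value formula above can be sketched in Python as follows. This is a toy illustration under several assumptions, not the implementation used in this work: the model has one binary response (logit link) and normal responses (identity link) without intercepts, the callable `statistic` is a hypothetical stand-in for steps 5 and 6 and for the computation of $\Omega^\star_b$, and it receives the generated scores rather than scores re-estimated with (2.7).

```python
import numpy as np

def simulate_responses(alpha, phi, n, rng):
    """Steps 2-4 for a toy model; alpha has shape (q, p), phi holds p - 1 variances."""
    q, p = alpha.shape
    z = rng.standard_normal((n, q))                 # step 2: latent scores
    eta = z @ alpha                                 # linear predictor alpha^T z
    x = np.empty((n, p))
    mu_bin = 1.0 / (1.0 + np.exp(-eta[:, 0]))       # step 3: inverse logit link
    x[:, 0] = rng.binomial(1, mu_bin)               # step 4: binary response
    x[:, 1:] = eta[:, 1:] + np.sqrt(phi) * rng.standard_normal((n, p - 1))
    return x, z

def bootstrap_p_value(omega_star, w):
    """Proportion of bootstrapped statistics strictly larger than the observed value w."""
    return float(np.mean(np.asarray(omega_star) > w))

def fast_parametric_bootstrap(alpha_hat, V_hat, phi_hat, n, statistic, w, B=100, seed=0):
    rng = np.random.default_rng(seed)
    q, p = alpha_hat.shape
    omega_star = np.empty(B)
    for b in range(B):
        # step 1: draw loadings from N(alpha_hat, V_hat)
        alpha_b = rng.multivariate_normal(alpha_hat.ravel(), V_hat).reshape(q, p)
        x_b, z_b = simulate_responses(alpha_b, phi_hat, n, rng)
        omega_star[b] = statistic(x_b, z_b, alpha_b)  # stand-in for steps 5-6
    return bootstrap_p_value(omega_star, w), omega_star

# Hypothetical parameter values and a toy statistic, for illustration only.
alpha_hat = np.array([[1.0, 0.8, 0.5], [0.0, 0.6, -0.4]])   # q = 2, p = 3
V_hat = 0.01 * np.eye(alpha_hat.size)
phi_hat = np.array([1.0, 0.5])
toy_stat = lambda x, z, a: float(np.mean((x[:, 1:] - (z @ a)[:, 1:]) ** 2))
p_value, omega_star = fast_parametric_bootstrap(
    alpha_hat, V_hat, phi_hat, n=100, statistic=toy_stat, w=0.8)
```

Because the loadings are drawn once per replicate and never re-estimated, each iteration costs one simulation plus one evaluation of the statistic, which is the source of the speed gain described above.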
Other GFI for GLLVM
In this chapter, we present the LRT in the GLLVM framework and the Satorra and Bentler (S&B) goodness-of-fit (Satorra and Bentler, 2001). Cut-off criteria such as RMSEA or SRMR are not presented because they are not comparable to our GFI outside the case of normal manifest variables.
In the binary case, Pearson-type statistics are used. They are based on a comparison between the empirical frequencies and the estimated frequencies under the model. Pearson-type statistics require a large number of observations in each cell of the contingency table for their asymptotic distribution to hold. To avoid this problem of sparsity, statistics similar to Pearson's but using only information from lower margins, usually bivariate margins, are available, see e.g. Bartholomew and Leung (2002) or Maydeu-Olivares and Joe (2005). However, if there are significant interactions among higher order margins, the GFI may not reject false models. Moreover, in the ordinal case, this method is only applicable when the number of observations is large. Since the models we use in our simulation are too complex for GFIs based on Pearson-type statistics, they will not be compared to our GFI in [...]
Suppose first that the manifest variables are conditionally normal given the latent ones, i.e.
$$x \mid z \sim N_p(\alpha^T z, \zeta), \tag{4.1}$$
where $\zeta$ is a diagonal matrix. Let $\mathrm{var}(x) = \Sigma$, then $\Sigma = \alpha^T \alpha + \zeta$ and the log-likelihood function is
$$l(\Sigma^{-1}) \propto \frac{n}{2} \left\{ \log|\Sigma^{-1}| - \mathrm{trace}[\Sigma^{-1} C] \right\},$$
where $C$ is the empirical covariance or correlation matrix. Suppose we have estimators $\hat\alpha$ and $\hat\zeta$ for $\alpha$ and $\zeta$. Then if the number of latent variables $q$ is known or fixed, the likelihood ratio statistic for the null hypothesis $H_0 : \Sigma = \alpha_0^T \alpha_0 + \zeta_0$, or equivalently $H_0 : x \sim N_p(\mu, \alpha_0^T \alpha_0 + \zeta_0)$, against the alternative that $\Sigma$ is unconstrained is
$$n \left\{ -\log|\hat\Sigma^{-1} C| + \mathrm{trace}[\hat\Sigma^{-1} C] - p \right\}, \tag{4.2}$$
where $\hat\Sigma = \hat\alpha^T \hat\alpha + \hat\zeta$. If the manifest variables are conditionally normal, this statistic is asymptotically distributed as a $\chi^2$ with $\frac{1}{2}[(p - q)^2 - (p + q)]$ degrees of freedom.
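As an illustration, the statistic (4.2) and its degrees of freedom can be computed as in the following Python sketch. The function names and the numerical values of the loadings and residual variances are assumptions made for the example; `alpha_hat` is stored as a $(q \times p)$ array and `zeta_hat` as the vector of diagonal entries of $\hat\zeta$.

```python
import numpy as np

def lrt_statistic(C, alpha_hat, zeta_hat, n):
    """n * ( -log|Sigma^{-1} C| + trace(Sigma^{-1} C) - p ), following (4.2)."""
    p = C.shape[0]
    sigma_hat = alpha_hat.T @ alpha_hat + np.diag(zeta_hat)
    m = np.linalg.solve(sigma_hat, C)          # Sigma_hat^{-1} C without explicit inversion
    _, logdet = np.linalg.slogdet(m)           # log|Sigma_hat^{-1} C|
    return n * (-logdet + np.trace(m) - p)

def lrt_degrees_of_freedom(p, q):
    """Degrees of freedom ((p - q)^2 - (p + q)) / 2 of the asymptotic chi-square."""
    return ((p - q) ** 2 - (p + q)) / 2

# Hypothetical estimates with q = 2 latent and p = 5 manifest variables.
alpha_hat = np.array([[0.8, 0.7, 0.6, 0.5, 0.4],
                      [0.0, 0.3, -0.2, 0.4, 0.1]])
zeta_hat = np.full(5, 0.5)
sigma_hat = alpha_hat.T @ alpha_hat + np.diag(zeta_hat)

stat_at_fit = lrt_statistic(sigma_hat, alpha_hat, zeta_hat, n=100)            # C equals the model covariance
stat_perturbed = lrt_statistic(sigma_hat + 0.1 * np.eye(5), alpha_hat, zeta_hat, n=100)
df = lrt_degrees_of_freedom(p=5, q=2)
```

When $C$ coincides with $\hat\Sigma$ the statistic vanishes, and it grows as the sample covariance moves away from the model covariance.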
If the manifest variables are not (conditionally) normal, the problem of estimating the likelihood under the alternative arises. In the binary case, the multivariate density has $2^p - 1$ parameters which can be estimated using moment estimators of $E[x^{(j_1)} \cdot x^{(j_2)}]$, $j_1 \neq j_2$; see e.g. Teugels (1990). Even [...] suffers from two drawbacks. First, this procedure cannot be applied to a mixture of continuous and discrete manifest variables, because the likelihood under the alternative would be almost impossible to derive. Secondly, in the ordinal case, the number of observations needs to be very large if the number of manifest variables is moderate to high. Indeed, the number of parameters increases very quickly with the number of manifest variables.
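To give an idea of this growth in the binary case, the count $2^p - 1$ can be evaluated for a few values of $p$. The values of $p$ below are chosen only for illustration.

```latex
\begin{align*}
p = 5:  &\quad 2^{5} - 1 = 31,\\
p = 10: &\quad 2^{10} - 1 = 1023,\\
p = 20: &\quad 2^{20} - 1 = 1\,048\,575.
\end{align*}
```

With a sample of a few hundred observations, most cells of the corresponding contingency table would therefore be empty already for $p = 10$.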
4.2 Satorra and Bentler (S&B) GFI
This statistic is based on a comparison between the sample covariance $C$ and the model covariance $\Sigma$ of the manifest variables $x = (x^{(1)}, \ldots, x^{(p)})$. If the manifest variables are a mixture of ordinal and normal variables, the usual estimator of the sample covariance matrix $C$ is replaced by the polychoric or polyserial covariance matrix, see e.g. Qu, Piedmonte, and Medendorp (1995).
Let $s$ and $\sigma$ be the vectors containing all the distinct values of $C$ and $\Sigma$ respectively, and $\Upsilon$ the asymptotic covariance matrix of $\sqrt{n}(s - \sigma)$. Let $\theta$ be the vector containing all the parameters of the GLLVM. Consider $\sigma(\theta)$ and the function
$$F(\theta) := (s - \sigma(\theta))^T \hat\Upsilon^{-1} (s - \sigma(\theta)), \tag{4.3}$$
where $\hat\Upsilon^{-1}$ is a consistent estimator of $\Upsilon^{-1}$. Then the S&B statistic is
$$T = \frac{n F(s, \sigma(\hat\theta))}{c}, \tag{4.4}$$
where $\hat\theta$ is the weighted least squares (WLS) estimator which minimizes (4.3) and
$$c = \frac{\dim(\theta)}{r_0},$$
where $r_0 = \dim(\sigma) - \dim(\theta)$. Then
$$T \xrightarrow{D} \sum_{j=1}^{r_0} \lambda_j (\chi^2_1)_j, \tag{4.5}$$
as $n \to \infty$, where the $(\chi^2_1)_j$ are independent chi-square variables with one df and the $\lambda_j$ are the non-null eigenvalues of the matrix $U_0 \Upsilon$, with
$$U_0 = \hat\Upsilon^{-1} - \hat\Upsilon^{-1} \Delta (\Delta^T \hat\Upsilon^{-1} \Delta)^{-1} \Delta^T \hat\Upsilon^{-1},$$
and $\Delta = \frac{\partial}{\partial \theta^T} \sigma(\theta)$.
In the case of discrete manifest variables, the problem that the information drawn from the sample is reduced to the estimated covariance matrix still remains. If we compare our statistic with (4.2) or (4.4), both measure a distance consisting of a model part and a sample part. The model part of $\Omega$ in (3.4) depends on the $(n \times q)$ matrix of $\hat z$, while the model part of $T$ in (4.4) or the LRT based on (4.2) depends only on the $(q \times p)$ matrix of the estimated loadings. Since the S&B statistic is widely used in practice, we will compare its performance to $\Omega$ in the simulation presented in Chapter 5.
Simulation study
In this chapter, we study the behavior in terms of level and power of our GFI $\Omega$ and compare it to the behavior of the S&B GFI $T$. $T$ and its associated p-value, obtained by means of the asymptotic distribution given in (4.5), are computed with Mplus. We consider models containing 2 and 3 latent variables. 10000 samples of size 100 (for the model with 2 latent variables) and 5000 samples of size 200 (for the model with 3 latent variables) were simulated. They contain respectively 5 ordinal manifest variables and a mixture of 5 ordinal manifest variables and 5 normal manifest variables. 30 random samples were generated in the bagging procedure and 100 bootstrap samples in the fast parametric bootstrap.
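For readers who want to see how a p-value can be obtained from a limiting distribution of the form (4.5), the following Python sketch approximates the tail probability of a weighted sum of independent $\chi^2_1$ variables by Monte Carlo. It is not the computation performed by Mplus; the function names are hypothetical, and for simplicity the toy example uses the same matrix for $\Upsilon$ and its estimate, in which case $U_0 \Upsilon$ is a projection and all non-null eigenvalues equal one.

```python
import numpy as np

def mixture_weights(upsilon, delta, tol=1e-8):
    """Non-null eigenvalues of U0 @ upsilon, with U0 built from upsilon and delta."""
    ups_inv = np.linalg.inv(upsilon)
    middle = np.linalg.inv(delta.T @ ups_inv @ delta)
    u0 = ups_inv - ups_inv @ delta @ middle @ delta.T @ ups_inv
    eig = np.linalg.eigvals(u0 @ upsilon).real
    return eig[np.abs(eig) > tol]

def mixture_p_value(t_obs, weights, draws=100_000, seed=0):
    """Monte Carlo estimate of P( sum_j weights_j * chi2_1 > t_obs )."""
    rng = np.random.default_rng(seed)
    chi2 = rng.chisquare(df=1, size=(draws, len(weights)))
    sims = chi2 @ np.asarray(weights, dtype=float)
    return float(np.mean(sims > t_obs))

# Toy inputs: dim(sigma) = 6 and dim(theta) = 2, so r0 = 4 (hypothetical values).
rng = np.random.default_rng(2)
a = rng.standard_normal((6, 6))
upsilon = a @ a.T + 6.0 * np.eye(6)      # symmetric positive definite
delta = rng.standard_normal((6, 2))      # stand-in for the Jacobian of sigma(theta)

weights = mixture_weights(upsilon, delta)
p = mixture_p_value(9.488, weights)
```

In this toy setting the mixture reduces to a $\chi^2$ with $r_0 = 4$ degrees of freedom, which gives a simple way to check the simulation against a known distribution.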
The samples of size $n$ are generated in S-Plus using the following procedure:
1. Initialize all the parameters:
• $p(q + 1)$ elements of the matrix $\alpha$,
• $p_1$ variances defining the vector $\phi$,
• $(s - 1)p_2$ $\alpha_s$ [...]
2. Generate $q$ independent standard normal vectors $z_j$ of size $n$.
3. Generate a vector $\mu = E[x \mid z]$ of conditional means of all responses given by
$$\mu = \gamma^{-1}(\alpha^T z).$$
4. Generate all responses $x$ based upon the means $\mu$ that were calculated in step 3 as well as the scale parameters $\phi$ for the normal responses.
We tried to use our simulated data set directly to compute $T$ with Mplus, but even under $H_0$ (the correct model is estimated) only 60% of the simulated samples could provide an estimated value for $T$. The situation was even worse under the alternative hypothesis (where an incorrect model is estimated), where in some cases no $T$ is estimated. The situation is not much better if the sample size becomes large ($n = 10000$). Indeed, only 77% of the simulated samples could provide an estimated value for $T$ in that case. This might be due to the fact that the estimation in Mplus is based on the underlying variable approach. This approach assumes that the manifest variables are indirect observations of normal underlying variables. When this is not the case, the model might not be identified. To avoid this problem, new data were simulated directly from Mplus with the same parameters, and the residual variance of the underlying variables defined as $\zeta = I_{pq} - \tilde\alpha^T \tilde\alpha$, where $\tilde\alpha$ are the loadings standardized by their Euclidean norm. These data are constrained to follow a less general model than the one generated in S-Plus. Indeed, in the underlying approach, new parameters, the residual variances, have to be fixed and then the model is constrained to have a particular form. When one works with ordinal data and uses $T$ as a goodness-of-fit, one has to test first if the underlying approach is adequate, see Muthén (1993).
[...] alternatives in the two latent variables models and two in the three latent
variables models. Under $H_A$, the models are larger (more general) than under $H_0$. More precisely, models under $H_A$ are defined so that each has one more non-null loading than the previous model. For example, in models containing two latent variables, $M_3$ has the same loadings as $M_2$ except $\alpha^{(5)}_1$, which is 0. In the two latent variables models, $M_1$ has three latent variables, but only one more non-null loading than $M_2$, which is equivalent to a model with the same parameters but one additional latent variable with corresponding loadings equal to 0. We have the following results in terms of empirical level and power.
Hypothesis    Level    Empirical level      Empirical power
                       Ω         T          Ω         T
H_0           0.1      0.099     0.16
H_0           0.05     0.050     0.11
H_0           0.01     0.014     0.04
H_A (M1)      0.05                          0.75      0.57
H_A (M2)      0.05                          0.42      0.25
H_A (M3)      0.05                          0.12      0.22
Table 5.1: Empirical level and power for the model with 2 latent variables
Hypothesis    Level    Empirical level      Empirical power
                       Ω         T          Ω         T
H_0           0.1      0.103     0.045
H_0           0.05     0.052     0.021
H_0           0.01     0.018     0.002
H_A (M1)      0.05                          0.21      0.09
H_A (M2)      0.05                          0.10      0.06
Table 5.2: Empirical level and power for the model with 3 latent variables
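The entries of Tables 5.1 and 5.2 are rejection rates over the simulated samples. The Python sketch below shows how such a rate and its Monte Carlo standard error could be computed from a vector of simulated p-values. The function names are hypothetical, the rule "reject when the p-value is below the level" is an assumption about the convention used, and the p-values in the example are made up for illustration.

```python
import numpy as np

def rejection_rate(p_values, level):
    """Share of p-values below `level`, with its binomial standard error."""
    p_values = np.asarray(p_values, dtype=float)
    rate = float(np.mean(p_values < level))
    se = float(np.sqrt(rate * (1.0 - rate) / p_values.size))
    return rate, se

def nominal_standard_error(level, n_replications):
    """Standard error of the empirical level if the test holds its nominal level exactly."""
    return float(np.sqrt(level * (1.0 - level) / n_replications))

# Made-up p-values, for illustration only.
rate, se = rejection_rate([0.01, 0.20, 0.04, 0.50], level=0.05)

# Replication counts taken from the simulation design described above.
se_two_latent = nominal_standard_error(0.05, 10000)
se_three_latent = nominal_standard_error(0.05, 5000)
```

With 10000 and 5000 replications, the standard error at the 0.05 level is roughly 0.002 and 0.003 respectively, which gives a scale for judging how far an empirical level is from its nominal value.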
In the two latent variables model, $\Omega$ outperforms the S&B statistic in terms of empirical level and power. The only exception is the empirical power with the M2 model, but this is difficult to compare because of the liberal behavior of the empirical level of $T$. $\Omega$ also has the expected behavior in terms of power when the tested model is further away from the null. The 3 latent variables model simulations confirm these results. Notice that in this case the power is lower because the alternative is closer to the null. Indeed, in the 3 latent variables model, we add a non-null loading in a loading matrix of size $(10 \times 3)$ while the size of the loading matrix in the 2 latent variables model is $(5 \times 2)$. Moreover, the level of $T$ is much smaller than the prescribed nominal level 0.05 and this makes this test too conservative. The result might be due to the small sample size used in this simulation study. Indeed, the p-value of $T$ is estimated using its asymptotic distribution. But, even when the sample size is larger, one has to be sure that the assumptions of the underlying variables approach are valid to be able to compute $T$ in practice.

Clearly, the new statistic improves goodness-of-fit testing within GLLVM.
At present, $\Omega$ and its associated p-values are estimated using R code by calling C code to compute the loadings and the scores using the algorithm presented in Huber, Ronchetti, and Victoria-Feser (2004). Both use a quasi-Newton procedure (Dennis and Schnabel, 1983) where compatible stopping rules have to be defined and can be computationally intensive. One potential improvement would be to develop a stand-alone R library to avoid this two-step procedure and more efficient algorithms to reduce the computing time.
This procedure can be naturally extended to latent variables with covariates [...] between the distance among the latent scores minus the estimated function of the covariates and the corresponding distance among the original observations. Further research directions include the definition of new distance functions for nominal discrete variables and non-normally distributed variables. Finally, our GFI and the procedure to evaluate its p-value could be extended to multilevel models, structural equation models or combined measurement models. Indeed, the GFI can be naturally extended to that case using the generalized factor formulation (GF) of Skrondal and Rabe-Hesketh (2004), but much work needs to be done to extend the estimation procedure and the resampling techniques to these cases.
Appendix A: Simulation Parameters
The loadings with the value of 0 for the model under the null hypothesis are actually not estimated. This constraint ensures a unique solution.
A.1: Model with 2 latent variables
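Using (2.1), we have the following model
$$
\gamma =
\begin{pmatrix}
\gamma^{(1)}\big(E(x^{(1)} \mid z)\big) \\
\vdots \\
\gamma^{(5)}\big(E(x^{(5)} \mid z)\big)
\end{pmatrix}
=
\begin{pmatrix}
\log \dfrac{p^{(1)}_s}{1 - p^{(1)}_s} \\
\vdots \\
\log \dfrac{p^{(5)}_s}{1 - p^{(5)}_s}
\end{pmatrix}
=
\begin{pmatrix}
\alpha^{(1)}_{0,s} & \alpha^{(2)}_{0,s} & \alpha^{(3)}_{0,s} & \alpha^{(4)}_{0,s} & \alpha^{(5)}_{0,s} \\
\alpha^{(1)}_{1} & \alpha^{(2)}_{1} & \alpha^{(3)}_{1} & \alpha^{(4)}_{1} & \alpha^{(5)}_{1} \\
\alpha^{(1)}_{2} & \alpha^{(2)}_{2} & \alpha^{(3)}_{2} & \alpha^{(4)}_{2} & \alpha^{(5)}_{2}
\end{pmatrix}^{T} z
$$
with the null hypothesis,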