Small area estimation with calibration methods
Risto Lehtonen (University of Helsinki), Ari Veijanen (Statistics Finland)
Outline
Preliminaries
Calibration weighting systems
Monte Carlo experiments
Discussion
Literature
This is joint work with my colleague Dr Ari Veijanen of Statistics Finland
Design-based calibration methods for domain estimation to be discussed
Traditional model-free calibration MFC
Deville J.-C. and Särndal C.-E. (1992), Särndal C.-E. (2007)
Deville J.-C., Särndal C.-E. and Sautory (1993) (CALMAR I,II,...)
Estevao & Särndal (1999, 2004), Lehtonen & Veijanen (2009)
Model-assisted calibration MC
Lehtonen & Veijanen (2012, 2016)
Wu & Sitter (2001), Montanari & Ranalli (2005) (Model calibration)
Hybrid calibration HC
Lehtonen & Veijanen (2015)
Montanari & Ranalli (2009) (Multiple model calibration)
Two-level hybrid calibration HC2
Lehtonen and Veijanen (2017)
Some key properties
(under complete response)
Questions of interest
Relative design-based properties of MFC, MC, HC and HC2
Accuracy properties
Distributional properties of calibrated weights
Comparison with model-based SAE
Design bias and accuracy of model-assisted calibration vs. model-based EB method
Main interest: What happens in minor domains (with small domain sample size)?
Empirical framework
Design-based simulation experiments
Real population data
Mixed models
Target parameters
Domain proportions
$$p_d = \frac{t_d}{N_d} = \frac{1}{N_d}\sum_{k \in U_d} y_k, \quad d = 1,\dots,D$$
where our study variable $y$ is binary (poverty indicator): 1 = in poverty, 0 = otherwise
Populations and sub-populations
$U = \{1,\dots,k,\dots,N\}$: unit-level population
$U_d \subset U$, $d = 1,\dots,D$: domains of interest
$N_d$: known domain size
$R_d \supset U_d$, $d = 1,\dots,D$: higher-level regions
Sample data
$s \subset U$: sample from $U$ with sampling design $p(s)$
$\pi_k$: inclusion probability of unit $k \in U$; $a_k = 1/\pi_k$: design weight
$s_d = s \cap U_d$: part of sample falling in domain $U_d$ (unplanned domains)
$r_d = s \cap R_d$: part of sample falling in higher-level area $R_d$, $d = 1,\dots,D$
Values $y_k$ of study variable $y$ are measured for $k \in s$
We assume complete response
Auxiliary data
We assume access to unit-level auxiliary data for $J$ variables $x_j$, $j = 1,\dots,J$, with known vector value $\mathbf{x}_k = (x_{1k},\dots,x_{Jk})'$ for every $k \in U$
We usually add a value 1 in the vector: $\mathbf{x}_k = (1, x_{1k},\dots,x_{Jk})'$
For estimation purposes the sample data and auxiliary data are merged at the unit level by using unique identifiers that are assumed available for both data sources
This option is available in an increasing number of statistical infrastructures
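The unit-level merge described above can be sketched as follows. This is only an illustration: the identifiers, field names and record layout are hypothetical and do not come from the register data used in the study.

```python
# Hypothetical sample records keyed by a unique person identifier.
sample = {
    101: {"y": 1, "design_weight": 300.0},
    205: {"y": 0, "design_weight": 300.0},
}

# Hypothetical register (auxiliary) records for every population unit.
register = {
    101: {"domain": 1, "x": (1, 0, 1)},
    150: {"domain": 1, "x": (1, 1, 0)},
    205: {"domain": 2, "x": (1, 0, 0)},
}


def merge_on_id(sample, register):
    """Attach register variables to each sample record by identifier."""
    merged = {}
    for uid, rec in sample.items():
        if uid not in register:
            raise KeyError(f"sample unit {uid} has no register record")
        merged[uid] = {**rec, **register[uid]}
    return merged


merged = merge_on_id(sample, register)
```

Raising an error for an unmatched identifier is a design choice of this sketch; it makes a failed linkage visible instead of silently dropping the unit.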
Assisting mixed models
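Logistic mixed model for binary study variable
$$E_m(y_k \mid u_d) = \frac{\exp(\mathbf{x}_k'\boldsymbol{\beta} + u_d)}{1 + \exp(\mathbf{x}_k'\boldsymbol{\beta} + u_d)}, \quad k \in U_d, \ d = 1,\dots,D \qquad (1)$$
where $\mathbf{x}_k = (1, x_{1k},\dots,x_{Jk})'$, known for all $k \in U$
$\boldsymbol{\beta} = (\beta_0, \beta_1,\dots,\beta_J)'$: vector of fixed effects
$u_d$ are domain-level random intercepts, $u_d \sim N(0, \sigma_u^2)$
Estimate $\boldsymbol{\beta}$ and $\sigma_u^2$ from sample data set (lme4, GLIMMIX)
Calculate estimates $\hat{u}_d$, $d = 1,\dots,D$, and predicted values
$$\hat{y}_k = \frac{\exp(\mathbf{x}_k'\hat{\boldsymbol{\beta}} + \hat{u}_d)}{1 + \exp(\mathbf{x}_k'\hat{\boldsymbol{\beta}} + \hat{u}_d)} \quad \text{for all } k \in U_d, \ d = 1,\dots,D$$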
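A minimal sketch of the prediction step of model (1), assuming the fixed effects and random intercepts have already been estimated elsewhere (for example with lme4 or GLIMMIX). The coefficient values and the tiny population below are invented for illustration.

```python
import math


def expit(eta):
    """Inverse logit: exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


def predict(x_k, beta_hat, u_hat_d):
    """Predicted value for unit k in domain d: expit(x_k' beta_hat + u_hat_d)."""
    eta = sum(b * x for b, x in zip(beta_hat, x_k)) + u_hat_d
    return expit(eta)


# Hypothetical estimates (not taken from the study).
beta_hat = [-1.5, 0.8, -0.3]   # intercept and two indicator variables
u_hat = {1: 0.2, 2: -0.1}      # estimated random intercept per domain

# Hypothetical population units as (domain, x-vector with leading 1).
population = [(1, (1, 0, 1)), (1, (1, 1, 0)), (2, (1, 0, 0))]
y_hat = [predict(x, beta_hat, u_hat[d]) for d, x in population]
```

Predictions are computed for every population unit, not only for sampled ones, because the calibration totals that use them are sums over the whole domain.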
Calibration weighting system - 1
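$w_{di}$: calibration weight for element $i$ in domain $d$; $\mathbf{z}_i$: generic calibration vector
Calibration equations for single-level calibration methods (MFC, MC and HC) for domains
$$\sum_{i \in s_d} w_{di}\,\mathbf{z}_i = \sum_{i \in U_d} \mathbf{z}_i, \quad d = 1,\dots,D \qquad (2)$$
Distance measure approach with a chi-square distance
We minimize
$$\sum_{k \in s_d} \frac{(w_{dk} - a_k)^2}{a_k} - \boldsymbol{\lambda}_d'\Big(\sum_{i \in s_d} w_{di}\,\mathbf{z}_i - \sum_{i \in U_d} \mathbf{z}_i\Big) \qquad (3)$$
subject to the calibration equations (2)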
NOTE: Distance measure in (3) corresponds to GREG weighting
Calibration weighting system - 2
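Equation (3) is minimized by weights
$$w_{dk} = a_k\,(1 + \boldsymbol{\lambda}_d'\mathbf{z}_k), \quad k \in s_d \qquad (4)$$
where the Lagrange coefficient vectors are
$$\boldsymbol{\lambda}_d' = \Big(\sum_{i \in U_d} \mathbf{z}_i - \sum_{i \in s_d} a_i\,\mathbf{z}_i\Big)'\Big(\sum_{i \in s_d} a_i\,\mathbf{z}_i\mathbf{z}_i'\Big)^{-1}, \quad d = 1,\dots,D \qquad (5)$$
The resulting calibration estimator of domain total is of the form:
$$\hat{t}_{dCAL} = \sum_{k \in s_d} w_{dk}\,y_k, \quad d = 1,\dots,D \qquad (6)$$
where $w_{dk}$ is method-specific calibration weight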
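The weights (4) with coefficients (5) can be sketched in plain Python for a single domain. The solver, the function names and the toy data are assumptions made for illustration; the calibration vector here is of the model-calibration type, z_k = (1, predicted value), with invented domain totals.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            raise ValueError("singular calibration system")
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def calibrate(a, z, totals):
    """Return weights w_k = a_k (1 + lambda' z_k) with sum_k w_k z_k = totals."""
    p = len(totals)
    # T = sum_i a_i z_i z_i' (symmetric p x p matrix)
    T = [[sum(ai * zi[r] * zi[c] for ai, zi in zip(a, z)) for c in range(p)]
         for r in range(p)]
    # Gap between known totals and design-weighted sample totals
    gap = [totals[r] - sum(ai * zi[r] for ai, zi in zip(a, z)) for r in range(p)]
    lam = solve(T, gap)  # T is symmetric, so T lam = gap gives lambda of (5)
    return [ai * (1.0 + sum(l * v for l, v in zip(lam, zi)))
            for ai, zi in zip(a, z)]


# Hypothetical domain sample: design weights, calibration vectors, y-values.
a = [10.0, 10.0, 10.0, 10.0]
z = [(1.0, 0.1), (1.0, 0.3), (1.0, 0.2), (1.0, 0.6)]
y = [0, 1, 0, 1]
totals = (42.0, 11.0)  # assumed domain size and sum of predictions over U_d

w = calibrate(a, z, totals)
t_hat = sum(wk * yk for wk, yk in zip(w, y))  # estimator (6)
```

Solving the linear system directly avoids forming the matrix inverse in (5); a singular system signals that the chosen calibration vector is too rich for the domain sample.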
Calibration vectors for single-level methods
Two-level hybrid calibration - 1
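Calibration equations for HC2
MC part (lower level):
$$\sum_{i \in r_d} w_{ri}\,\mathbf{z}_i^{(1)} = \sum_{i \in U_d} \mathbf{z}_i^{(1)} = \Big(\sum_{i \in U_d} x_{0i}^{(1)}, \ \sum_{i \in U_d} \hat{y}_i^{(1)}\Big)' \qquad (7)$$
$\mathbf{z}_i^{(1)} = (x_{0i}^{(1)}, \hat{y}_i^{(1)})'$, $r_d = s \cap R_d$, $R_d \supset U_d$, $d = 1,\dots,D$
$x_{0i}^{(1)} = 1$ for $i \in U_d$, 0 otherwise (extended variable)
$\hat{y}_i^{(1)} = \hat{y}_i$ for $i \in U_d$, 0 otherwise (extended predictions)
$\hat{y}_i = f(\mathbf{x}_i'\hat{\boldsymbol{\beta}} + \hat{u}_d)$: predictions calculated for chosen GLMM, $\mathbf{x}_i = (1, x_{1i},\dots,x_{Ji})'$, $i \in U_d$
MFC part (higher level):
$$\sum_{i \in r_d} w_{ri}\,\mathbf{z}_i^{(2)} = \sum_{i \in R_d} \mathbf{z}_i^{(2)} = \Big(\sum_{i \in R_d} x_{1i},\dots,\sum_{i \in R_d} x_{ji}\Big)' \qquad (8)$$
$\mathbf{z}_i^{(2)} = (x_{1i},\dots,x_{ji})'$, $i \in R_d$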
Two-level hybrid calibration - 2
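Minimize function
$$\sum_{k \in r_d} \frac{(w_{rk} - a_k)^2}{a_k} - \boldsymbol{\lambda}_r'\Big(\sum_{i \in r_d} w_{ri}\,\mathbf{z}_i - \sum_{i \in R_d} \mathbf{z}_i\Big), \quad \mathbf{z}_i = \begin{pmatrix}\mathbf{z}_i^{(1)} \\ \mathbf{z}_i^{(2)}\end{pmatrix} \qquad (9)$$
subject to calibration constraints (7) and (8)
Equation (9) is minimized by weights
$$w_{rk} = a_k\,(1 + \boldsymbol{\lambda}_r'\mathbf{z}_k), \quad \boldsymbol{\lambda}_r' = \Big(\sum_{i \in R_d} \mathbf{z}_i - \sum_{i \in r_d} a_i\,\mathbf{z}_i\Big)'\Big(\sum_{i \in r_d} a_i\,\mathbf{z}_i\mathbf{z}_i'\Big)^{-1}$$
The resulting two-level HC estimator of domain total is given by
$$\hat{t}_{dHC} = \sum_{k \in r_d} w_{rk}\,y_k, \quad d = 1,\dots,D \qquad (10)$$
NOTE: Weights for $k \in r_d \setminus s_d$ (sample units of the region outside the domain) tend to be small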
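What distinguishes HC2 from the single-level methods is how the calibration vector is built: the lower-level components are set to zero outside the domain, while the higher-level x-variables are kept for the whole region. The sketch below only constructs these stacked vectors and the right-hand sides of (7) and (8); the region, predictions and x-values are invented, and the function name is hypothetical.

```python
def extended_vector(in_domain, y_hat, x):
    """Stacked z_i = (x0_i^(1), yhat_i^(1), x_i^(2)) for a unit of region R_d.

    The first two components are zeroed for units outside the domain U_d
    (extended variable and extended prediction); x is kept for all units.
    """
    x0 = 1.0 if in_domain else 0.0
    y1 = y_hat if in_domain else 0.0
    return (x0, y1) + tuple(float(v) for v in x)


# Hypothetical region R_d: (unit id, unit belongs to U_d?, prediction, x-variables)
region = [
    (1, True, 0.20, (1, 0)),
    (2, True, 0.50, (0, 1)),
    (3, True, 0.10, (1, 1)),
    (4, False, 0.30, (0, 1)),
    (5, False, 0.40, (1, 0)),
]
sample_ids = {2, 3, 5}  # r_d: sample units falling in the region

z = {uid: extended_vector(ind, yh, x) for uid, ind, yh, x in region}

# Calibration targets: totals of the stacked vector over the region R_d.
# Because of the zero extension, the first two totals are sums over U_d only.
targets = tuple(sum(z[uid][j] for uid in z) for j in range(4))

# Calibration vectors of the sampled units, to be used in the weight formula.
z_sample = [z[uid] for uid in sorted(sample_ids)]
```

With these vectors and targets, the weights follow from the same linear calibration formula as in the single-level case, applied to the region sample instead of the domain sample.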
Estimators of domain proportions
Calibration estimators of domain proportions
$$p_d = \sum_{k \in U_d} y_k / N_d, \quad d = 1,\dots,D$$
a) Horvitz-Thompson type estimators (using known $N_d$)
$$\hat{p}_{dCAL\_HT} = \frac{\hat{t}_{dCAL}}{N_d} = \frac{\sum_{k \in s_d} w_{dk}\,y_k}{N_d}, \quad d = 1,\dots,D \qquad (11)$$
b) Hájek type estimators (using estimate $\hat{N}_d$)
$$\hat{p}_{dCAL\_HA} = \frac{\hat{t}_{dCAL}}{\hat{N}_d} = \frac{\sum_{k \in s_d} w_{dk}\,y_k}{\sum_{k \in s_d} w_{dk}}, \quad d = 1,\dots,D \qquad (12)$$
where $w_{dk}$ are method-specific calibration weights as in (6) or as in (10)
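The difference between the two proportion estimators is only the denominator, which a short sketch makes concrete. The weights, y-values and domain size below are invented for illustration.

```python
def proportion_ht(w, y, N_d):
    """HT type estimator (11): weighted total divided by the known N_d."""
    return sum(wk * yk for wk, yk in zip(w, y)) / N_d


def proportion_hajek(w, y):
    """Hajek type estimator (12): weighted total divided by the sum of weights."""
    return sum(wk * yk for wk, yk in zip(w, y)) / sum(w)


# Hypothetical calibrated weights and binary study variable for one domain.
w = [9.5, 11.0, 10.5, 9.0]
y = [0, 1, 0, 1]
N_d = 42

p_ht = proportion_ht(w, y, N_d)
p_ha = proportion_hajek(w, y)
```

When the calibration vector contains the domain intercept, the weights sum to N_d and the two estimators coincide; they differ only when the domain size is not among the calibration constraints.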
Model-based EB predictor
EB estimator of domain totals:
$$\hat{t}_{dEB} = \sum_{k \in U_d} \hat{y}_k, \quad d = 1,\dots,D \qquad (13)$$
where fitted values are $\hat{y}_k = f(\mathbf{x}_k'\hat{\boldsymbol{\beta}} + \hat{u}_d)$ with $\mathbf{x}_k = (1, x_{1k},\dots,x_{Jk})'$, $k \in U_d$, and $f$ refers to the chosen member of the GLMM family
EB estimator of domain proportions is:
$$\hat{p}_{dEB\_HT} = \frac{\hat{t}_{dEB}}{N_d}, \quad d = 1,\dots,D \qquad (14)$$
Model-based SAE: see Rao & Molina (2015) Small Area Estimation. 2nd Ed. Wiley.
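As a small illustration of (13) and (14), the EB type estimate is a plain sum of fitted values over all population units of a domain, with no sampling weights involved. The domain codes, fitted values and sizes below are hypothetical.

```python
from collections import defaultdict


def eb_proportions(domain, y_hat, N):
    """Sum fitted values within each domain (13) and divide by N_d (14)."""
    totals = defaultdict(float)
    for d, yh in zip(domain, y_hat):
        totals[d] += yh
    return {d: totals[d] / N[d] for d in totals}


# Hypothetical population: domain code and fitted value for every unit.
domain = [1, 1, 1, 2, 2]
y_hat = [0.2, 0.4, 0.3, 0.1, 0.5]
N = {1: 3, 2: 2}

p_eb = eb_proportions(domain, y_hat, N)
```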
Some known differences
EXAMPLE: Poverty rate for regions
Design-based simulation experiment with real data
Fixed finite population of about 600,000 persons
Western Finland
Register databases of Statistics Finland
Regional hierarchy: NUTS4 (LAU1) regions within NUTS3 regions
Domains of interest: 36 NUTS4 regions
Higher level regions: 7 NUTS3 regions
SRSWOR sampling of n = 2000 persons
Limited simulation experiments
Calibration methods: K=1000 simulated samples
Weight distributions: K=100 simulated samples
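The repeated-sampling design can be sketched as a loop that draws K SRSWOR samples from a fixed population and stores one estimate per domain and sample. To keep the sketch short it uses the direct design-weighted estimator with known domain sizes instead of a calibration estimator, and a small synthetic population; both are assumptions, not the setup of the study.

```python
import random


def srswor_domain_estimates(pop_domain, pop_y, n, K, seed=1):
    """Draw K SRSWOR samples of size n and return, for each domain,
    the list of K direct estimates of the domain proportion."""
    rng = random.Random(seed)
    N = len(pop_y)
    N_d = {}
    for d in pop_domain:
        N_d[d] = N_d.get(d, 0) + 1
    a = N / n  # design weight under SRSWOR: 1 / (n / N)
    results = {d: [] for d in N_d}
    for _ in range(K):
        s = rng.sample(range(N), n)  # sampling without replacement
        t_hat = {d: 0.0 for d in N_d}
        for k in s:
            t_hat[pop_domain[k]] += a * pop_y[k]
        for d in N_d:
            results[d].append(t_hat[d] / N_d[d])
    return results


# Synthetic population: 1000 units in 4 domains, binary y (hypothetical data).
gen = random.Random(0)
pop_domain = [k % 4 for k in range(1000)]
pop_y = [1 if gen.random() < 0.15 else 0 for _ in range(1000)]

estimates = srswor_domain_estimates(pop_domain, pop_y, n=200, K=50)
```

Because domains are unplanned, the realized domain sample size varies from sample to sample, which is what drives the instability of estimators in small domains.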
Variables
Study variable y:
Binary indicator with values 1=in poverty, 0=otherwise
European Union definition, one of the AROPE indicators: The poverty indicator shows when a person’s equivalized income is smaller than or equal to the poverty threshold, 60% of the median equivalized income in the population
Equivalized income variable was taken from taxation registers
Overall poverty rate in population: 14.3%
lowest(NUTS4): 9.9%, highest(NUTS4): 22.4%
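The definition of the binary study variable can be written out as a short sketch: the threshold is 60% of the median equivalized income, and a person at or below it is coded 1. The income values are invented, and the function name is hypothetical.

```python
import statistics


def poverty_indicator(incomes, share=0.6):
    """Return (indicators, threshold): y = 1 if income <= share * median."""
    threshold = share * statistics.median(incomes)
    y = [1 if inc <= threshold else 0 for inc in incomes]
    return y, threshold


# Hypothetical equivalized incomes for five persons.
incomes = [8000, 11000, 20000, 25000, 40000]
y, threshold = poverty_indicator(incomes)
poverty_rate = sum(y) / len(y)
```

In the study the median is taken over the whole population, so the threshold is a fixed population quantity and not re-estimated from each sample.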
Auxiliary x-variables
Labour force status (3 classes)
Gender (2 classes)
Age group (3 classes)
We generated five indicator variables for the x-vector $\mathbf{x} = (1, x_1, x_2, x_3, x_4, x_5)'$
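Five indicators arise because each classification with c classes contributes c - 1 indicators (2 + 1 + 2), the remaining class being absorbed by the leading 1. A sketch of this coding follows; the class codes and the choice of the last class as reference are assumptions, since the slides do not state them.

```python
def x_vector(labour, gender, age):
    """Build x = (1, x1, ..., x5) from class codes.

    Assumed coding: labour in {1, 2, 3}, gender in {1, 2}, age in {1, 2, 3};
    the last class of each variable is the reference (all indicators zero).
    """
    return (
        1,
        int(labour == 1), int(labour == 2),  # labour force status: 2 indicators
        int(gender == 1),                    # gender: 1 indicator
        int(age == 1), int(age == 2),        # age group: 2 indicators
    )


x_ref = x_vector(labour=3, gender=2, age=3)  # all reference classes
x_one = x_vector(labour=1, gender=1, age=2)
```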
Estimators
Target parameters: At-risk-of-poverty rate in domains
$p_d = t_d / N_d$, where $t_d = \sum_{k \in U_d} y_k$ and $y_k$ is binary poverty indicator
Estimators:
$\hat{p}_{dCAL\_HT} = \hat{t}_{dCAL} / N_d$, where $\hat{t}_{dCAL} = \sum_{k \in s_d} w_{dk}\,y_k$, $d = 1,\dots,36$
Weights $w_{dk}$ are method specific as in (6) or in (10)
Model-assisted estimators use logistic mixed model
Estimators are of HT type (11)
$$E_m(y_k \mid u_d) = \frac{\exp(\mathbf{x}_k'\boldsymbol{\beta} + u_d)}{1 + \exp(\mathbf{x}_k'\boldsymbol{\beta} + u_d)}, \quad k \in U_d$$
Quality measures of estimators
Absolute relative bias (ARB) of poverty rate estimate: Table 1
$$\mathrm{ARB}(\hat{p}_d) = \Big|\frac{1}{1000}\sum_{i=1}^{1000} \hat{p}_d(s_i) - p_d\Big| \Big/ p_d$$
Relative root mean squared error (RRMSE): Table 2
$$\mathrm{RRMSE}(\hat{p}_d) = \sqrt{\frac{1}{1000}\sum_{i=1}^{1000} \big(\hat{p}_d(s_i) - p_d\big)^2} \Big/ p_d$$
where
$\hat{p}_d(s_i)$ is an estimate from sample $s_i$ for domain $d$
$p_d$ is known parameter value in domain $d$, $d = 1,\dots,36$
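Both quality measures are simple functions of the simulated estimates for one domain and the known domain value. A minimal sketch, with invented estimates and an invented true value:

```python
import math


def arb(estimates, p_true):
    """Absolute relative bias: |mean of estimates - p_true| / p_true."""
    mean_est = sum(estimates) / len(estimates)
    return abs(mean_est - p_true) / p_true


def rrmse(estimates, p_true):
    """Relative root mean squared error: sqrt(mean squared error) / p_true."""
    mse = sum((e - p_true) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / p_true


# Hypothetical estimates from four simulated samples for one domain.
estimates = [0.10, 0.20, 0.15, 0.15]
p_true = 0.15

bias = arb(estimates, p_true)
error = rrmse(estimates, p_true)
```

In this toy example the estimates average to the true value, so the bias is essentially zero while the RRMSE is not; the two measures capture different properties.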
Summary of results on calibration
Design bias: All estimators are nearly design unbiased
Design accuracy
Major domains: All estimators show similar accuracy
Minor and medium-sized domains:
Model-assisted methods outperform direct model-free calibration
Model-assisted calibration shows best accuracy and is preferred
Hybrid calibration offers coherence property for selected x-variables but can suffer from instability in small areas
Two-level hybrid calibration accounts for the instability and can provide a good compromise if coherence is required for some x-variables
NOTE: Preliminary results on Hájek type estimators (12) indicate that accuracy differences from HT type methods are small, occur in small domains, and systematically favour Hájek methods
Distributional properties of calibrated weights
Problems of practical concern in model-free calibration:
Possible large variation of weights
Weights smaller than one
Positive but extremely small weights
Negative weights
To what extent can model-assisted calibration methods help?
Small simulation experiment:
100 SRSWOR samples of size 2,000 elements from U
Reporting:
Distribution of weights by domain size (log scale) - Figure 1
Medians of maximum interdecile range of calibrated weights in domain sample size classes - Table 3
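The weight diagnostics listed above can be computed with a few lines of standard-library Python. The sketch reports the interdecile range (90th minus 10th percentile) and counts of problematic weights for one set of calibrated weights; the weights are invented, and the aggregation over samples and domain size classes used for Table 3 is not reproduced here.

```python
import statistics


def weight_summary(w):
    """Interdecile range and counts of negative and below-one weights."""
    deciles = statistics.quantiles(w, n=10)  # nine cut points: 10%, ..., 90%
    return {
        "interdecile_range": deciles[-1] - deciles[0],
        "n_negative": sum(1 for v in w if v < 0),
        "n_below_one": sum(1 for v in w if v < 1),  # includes negative weights
    }


# Hypothetical calibrated weights for one domain sample.
w = [-0.5, 0.4, 2.0, 5.0, 8.0, 10.0, 12.0, 15.0, 20.0, 30.0, 55.0]
summary = weight_summary(w)
```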
Fig. 1. Distribution of weights by domain size
100 simulated SRSWOR samples, n=2000
Summary of distributional properties
Model-free calibration MFC shows the worst performance
Model-assisted calibration MC substantially stabilizes the distribution of weights, particularly in small domains
Model-assisted calibration MC and two-level hybrid calibration HC2 show the best weight performance
But: negative weights still remain
Can we live with that?
Rather use other solutions?
--
Wu C. and Lu W.W. (2016) Calibration weighting methods for complex surveys. International Statistical Review 84, 79-98.
Gelman A. (2007) Struggles with survey weighting and regression modeling. Statistical Science 22, 153-164.
Gelman: “Survey weighting is a mess.”
Relative error of MC and EB estimators in a certain large domain
Distribution of relative error of design-based model-assisted calibration MC estimator of poverty rate in large domain 64
NOTE: Nearly design unbiased. Outperforms EB in design bias
Distribution of relative error of model-based EB estimator of poverty rate in large domain 64
NOTE: Design biased. Outperforms MC in design accuracy
$$RE(\hat{p}_{dMC}) = \big(\hat{p}_{dMC}(s_i) - p_d\big) / p_d, \qquad RE(\hat{p}_{dEB}) = \big(\hat{p}_{dEB}(s_i) - p_d\big) / p_d$$
Literature
Deville J.-C. and Särndal C.-E. (1992) Calibration estimators in survey sampling. JASA 87, 376-382.
Deville J.-C., Särndal C.-E. and Sautory O. (1993) Generalized raking procedures in survey sampling. JASA 88, 1013–1020.
Estevao V.M. and Särndal C.-E. (1999) The use of auxiliary information in design-based estimation for domains. Survey Methodology 25, 213–221.
Estevao V.M. and Särndal C.-E. (2004) Borrowing strength is not the best technique within a wide class of design-consistent domain estimators. JOS 20, 645-669.
Gelman A. (2007) Struggles with survey weighting and regression modeling. Statistical Science 22, 153-164.
Guggemos F. and Tillé Y. (2010) Penalized calibration in survey sampling: Design-based estimation assisted by mixed models. Journal of Statistical Planning and Inference 140, 3199–3212.
Hidiroglou M.A. and Estevao V.M. (2014) A comparison of small area and calibration estimators via simulation. Joint Issue of Statistics in Transition and Survey Methodology 17, 133-154.
Lehtonen R. and Veijanen A. (2009) Design-based methods of estimation for domains and small areas. In Rao C.R. and Pfeffermann D. (Eds.) Handbook of Statistics 29B. Elsevier, 219-249.
Lehtonen R. and Veijanen A. (2012) Small area poverty estimation by model calibration. Journal
Literature (contd.)
Lehtonen R. and Veijanen A. (2015) Small area estimation by calibration methods. WSC 2015 of the ISI, Rio de Janeiro, August 2015.
Lehtonen R. and Veijanen A. (2016) Design-based methods to small area estimation and calibration approach. In: Pratesi M. (Ed.) Analysis of Poverty Data by Small Area Estimation. Chichester: Wiley.
Lehtonen R. and Veijanen A. (2017) A two-level hybrid calibration technique for small area estimation. SAE2017 Conference, Paris, June 2017.
Montanari G.E. and Ranalli M. G. (2005) Nonparametric model calibration estimation in survey sampling. JASA 100, 1429–1442.
Montanari G.E. and Ranalli M.G. (2009) Multiple and ridge model calibration. Proceedings of Workshop on Calibration and Estimation in Surveys 2009. Statistics Canada.
Rao J.N.K. and Molina I. (2015) Small Area Estimation. 2nd Ed. New York: Wiley.
Särndal C.-E. (2007) The calibration approach in survey theory and practice. SMJ 33, 99–119.
Wu C. and Lu W.W. (2016) Calibration weighting methods for complex surveys. International Statistical Review 84, 79-98.
Wu C. and Sitter R.R. (2001) A model-calibration approach to using complete auxiliary information from survey data. JASA 96, 185–193. (with corrigenda)