• Aucun résultat trouvé

Valeurs aberrantes et manquantes multivariées dans les sondages

N/A
N/A
Protected

Academic year: 2022

Partager "Valeurs aberrantes et manquantes multivariées dans les sondages"

Copied!
36
0
0

Texte intégral

(1)

Valeurs aberrantes et manquantes multivariées dans les sondages

Beat Hulliger, Tobias Schoch Fachhochschule Nordwestschweiz FHNW

CFS2012, Rennes

(2)

T HE AMELI P ROJECT

The goal to turn the EU into the most competitive and dyna- mic economy by 2010 demands a full benchmarking system to monitor policy performance and their impact on progress.

For this reason, the European Commission has engaged in selecting, collecting and analysing a set of indicators that are published each year. The Stockholm European Council has further emphasised the need for effective, timely and relia- ble statistics and indicators. A main challenge is to develop indicators for the main characteristics and key drivers. An ut- most important and challenging area to be measured is soci- al cohesion. Based on a clear definition of social cohesion, a universally-accepted high-quality and robust statistics to ade- quately measure social cohesion is required. Further, tools for measuring temporal developments and regional breakdowns to sub-populations of relevance will be of great importance. In order to measure social cohesion with Laeken indicators ade- quately while regarding national characteristics and practical peculiarities from the newly created EU-SILC, an improved methodology will be elaborated within AMELI. This will ensu- re that future political decision in the area of quality of life can be based on more adequate and high-quality data and a pro- per understanding of the Laeken indicators by the users. The study will include research on data quality including its mea- surement, treatment of outliers and nonresponse, small area estimation and the measurement of development over time.

A large simulation study based on EU-SILC data will allow a simultaneous elaboration of the methodology focusing on practical issues aiming at support for policy. Due to the fact that the Laeken indicators are based on a highly sophistica- ted methodology the project’s outcome may also serve as a methodological complement for other FP7 projects in the area of indicators.

P ROJECT I NFO

Project period: 1 April 2008 - 31 March 2011 Project No.: 217322

Workshops: Vienna, Spring 2010 Trier, Spring/Summer 2011

C OORDINATION Ralf Münnich

University of Trier Faculty IV

Dept. of Economic and Social Statistics Universitätsring 15, D-54286 Trier, Germany muennich@uni-trier.de

P ARTNERS University of Trier

Federal Statistical Office of Germany University of Applied Sciences NW Switzerland Swiss Federal Statistical Office

Statistics Austria Statistics Finland University of Helsinki Vienna University of Technology Statistical Office of the Republic of Slovenia Statistics Estonia

©2012 FHNW/Beat Hulliger 2/ 36

(3)

Contents

Introduction

Methods

Simulation

Results

Conclusion

(4)

Introduction

Introduction

(5)

Introduction

AMELI Project

I Advanced Methodology for European Laeken Indicators

I EU Framework Program 7 research project

I 10 Partners (4 Universities, 6 NSI), April 2008 - March 2011

I Development of new methods:

I Small Area Estimation

I Robustness

I Variance Estimation

I Evaluation with a large scale simulation study with controlled universes, sample designs and simulation environments.

http://ameli.surveystatistics.net

(6)

Introduction

EU-SILC Data

I European Statistics on Income and Living Conditions (EU-SILC)

I Coordinated Survey in all European countries

I Complex survey samples (stratified 2-stage cluster samples)

I Households and Persons

I 18 income components at the household level, 14 at the individual level

I non-monetary variables like employment status

(7)

Introduction

Income

Definition

An income concept defines the set of income component/

sources (e.g., employee cash income, capital income, unemployment benefits, etc.) and defines the

aggregation/composition rule to generate the income.

household

2

8

equivalized disposable household income

(8)

Introduction

EU-SILC Indicators of Social Cohesion

I Equivalized disposable income: HH-income divided by OECD revised equivalence scale attributed to each person of a household.

I Key poverty and income-inequality measures of EU

I Poverty threshold (PT): 60% of median income.

I At risk of poverty rate (ARPR): Proportion of persons below PT

I Quintile Share Ratio (QSR, S80/S20): Income share of

top quintile divided by income share of bottom quintile.

(9)

Introduction

Robustness

I How do multivariate outliers affect (univariate) indicators or, generally, a particular statistic?

I How do multivariate outliers in the income components affect the estimates of poverty and income-inequality measures?

I How can we successfully invoke Multivariate Outlier Detection and Imputation (MODI) methods?

I What is the impact of MODI on the poverty and

income-inequality measures of SILC?

(10)

Introduction

Multivariate Outliers in the Income Components

I Aggregation of income components may mask outliers.

I Aggregation of income components may build up to outliers.

I Marginal distributions of income components are skewed and zero inflated

I Structural zeros in multiple dimensions (an overwhelming majority of the observations lies in subspaces)

I Joint distribution is not elliptically contoured

(11)

Introduction

Multivariate outlier detection

I Robust methods based on Mahalanobis Distance with robust center and robust covariance: M-estimators, Stahel-Donoho estimator, Minimum Volume Ellipsoid (MVE) and Minimum Covariance Determinant (MCD), BACON...

I Methods based on data depths: Halfspace depth, simplicial depth, convex hull peeling...

I Computationally demanding.

I Need adaptations to cope with missing values.

I Need adaptations for survey data.

(12)

Methods

Methods

(13)

Methods

Parametric Detection: BACON-EEM

I BACON algorithm for complete, iid-data (Billor et al., 2000)

I BACON-EEM extended to work with missing values (EM-algorithm) and adapted for finite population sampling (Béguin and Hulliger, 2008)

I Model: elliptically contoured distribution, EC(µ ,Σ)

I Algorithm:

I Small subset G 1 of outlier-free data: center µ G 1 and scatter Σ G 1

I Mahalanobis distance (MD) of all points

I Choose a new (larger) subset, G 2 , of observations with MD below a threshold

I Iterate until convergence

I Output: µ ˆ R and Σ ˆ R and outlier indicator o i , i = 1 . . . ,n

(14)

Methods

Parametric Detection: GIMCD

I Use non-robust EM-Algorithm to impute: outliers may influence the estimates and outliers may be imputed.

I Use highly robust MCD to detect outliers (high breakdownpoint) to cope with original and imputed outliers.

I Note: neither of the steps takes sampling into account.

(15)

Methods

Parametric Imputation

Treat Outliers as if they were missing values (TOaM)

I set outliers to missing

I impute for missing observations from EC(µ R , Σ R )

Winsorized Gaussian Imputation (WGI)

I impute y ˆ i = µ R + (y i − µ R ) · (c/d i ), with c a tuning constant and d i the MD of observation i under model EC(µ RR ) (Hulliger, 2007)

I preserves direction

(16)

Methods

Zero Observations

I With BACON-EEM and GIMCD, set 0 to missing for detection? ⇒ Yes!

I Recover 0 before or after impuation? ⇒ Before!

(17)

Methods

Non-parametric Detection: Epidemic Algorithm (EA)

I Non-parametric data depth (Béguin and Hulliger 2004)

I Algorithm:

I Start an epidemic at a central observation (e.g., spatial median) and let it spread through the point cloud.

I Late infected observations are outliers.

I Transmission/ infection probability decreases with distance (Tuning!).

I (Euclidean) distance with missing values.

I Infection probabilities estimated in population (w.r.t.

sampling design)

I Output: outlier indicator o i = 1 if infection time large or not

infected.

(18)

Methods

Non-parametric Imputation Reverse EA

I Requirement: outlier indicator o i , i = 1, . . . ,n (declared outliers)

I For each declared outlier and obs. with missing values:

1. start an epidemic at the outlier/obs with missings 2. stop when potential donor(s) infected

3. impute values of donor

I Tuning of REA requires some experience

I Preserves neighbourhood

(19)

Simulation

Simulation

(20)

Simulation

Simulation set up

I AAT-SILC universe from AMELI project: 8 Mio persons with 27 variables.

I ST SI sample, 6000 households (approx. 14000 persons); 1000 Monte Carlo replications;

I 1% outliers ≈ 127.9 outlier-persons. Outlier completely at random (OCAR), contamination not at random (CNAR).

I 0 and 2 % missing values completely at random (MCAR).

I R-packages simFrame and DBsim.

(21)

Simulation

Data Processing

Simulation 1

sampling

Prepare 1

aggregate

Prepare 2 treatment of zeros

transform (symmetry) Prepare 3

contaminate (outliers) Simulation 2

missing values Simulation 3

detect MODI 1

impute MODI 2

back- transform Prepare 4

equivalized disposable income Prepare 5

estimation finally A-AT-SILC

population of Alfons et al.

(2010) ST SIC

work inc capital inc transfers pers transfers hh

remember which obs is zero exclude/ set to NA

component- wise transform by e.g., log

10

Replacement of a scaled point cloud shift outliers

MCAR MAR

BACON-EEM Epidemic Algorithm

TOaM WGI REA

mean

at-risk-of-poverty rate relative-median poverty gap work inc

capital inc transfers pers transfers hh

©2012 FHNW/Beat Hulliger 21/ 36

(22)

Simulation

Data Processing (ctd.)

Simulation 1

sampling

Prepare 1

aggregate

Prepare 2

treatment of zeros

transform (symmetry) Prepare 3

contaminate (outliers) Simulation 2

missing values Simulation 3

detect MODI 1

impute MODI 2

back- transform Prepare 4

equivalized disposable income Prepare 5

estimation finally A-AT-SILC

population of Alfons et al.

(2010) ST SIC

work inc capital inc transfers pers transfers hh

remember which obs is zero exclude/ set to NA

component- wise transform by e.g., log 10

Replacement of a scaled point cloud shift outliers

MCAR MAR

BACON-EEM Epidemic Algorithm

TOaM WGI REA

mean

at-risk-of-poverty rate relative-median poverty gap quintile share ratio Gini coefficient work inc

capital inc

transfers pers

transfers hh

Références

Documents relatifs

We first investigate the problem of source decomposition of inequality measures, the so-called additive income sources inequality games, baed on the Shapley Value, introduced

Table 3.7 provides the structure of wealth for the development regions in 1995 and 1996. The wealth is estimated for the households of development regions in the same way for

Interestingly, the unobservables in the equation for the wife’s paid work time correlate positively and significantly with the unobservables in husband’s childcare time, suggest-

Keywords: Copula, Dependence, Measure of association, Extreme values theory, Tail index, Kernel estimator, Heavy-tailed, Risk premium, Bias reduction, Asymp- totic

It is, in fact, found possible to develop an elegant and powerful formalism of quantum mechanics based on the above concepts of 'I' as a vector and the dynamical

To account for the large phenotypic diversity in viral populations, we consider a continuous spectrum of virulences and present a continuum limit of this multiple viral strains

weighted household post- government income, individual net income, corrected monthly household income, and household net income from wages) on subjective health were compared in

Since