
4.2 Observations

4.2.1 Practical issues in data weighting

Social science research relies strongly on survey data, which form the empirical basis of most quantitative studies. Survey data generally come from questionnaires distributed to interviewees in order to measure some aspects of their lives.

As interviewing the whole target population is generally too costly, only a subset, the sample, is interviewed. The findings from this subset are then generalized to the entire population using statistical methods. To be able to generalize from sample to population, the sample has to correctly represent the population studied. A representative sample is a sample which, for a specified set of variables, makes certain specified analyses yield results within acceptable limits set about the corresponding population values (Stephan and McCarthy, 1958). Assuming each population member has a known, non-zero chance of inclusion, such a representative sample can be achieved by running a probability sampling (Blair, 2009). In a probability sampling, sample members are drawn with a random selection mechanism according to their respective inclusion probability. In practice, however, setting up a list of people to interview that correctly represents the target population is hard to achieve. And even supposing that such a list was successfully set up, there is no guarantee that it will be possible to reach all of them, or that they will all accept to answer the survey and to answer all of its questions. These reasons lead to survey non-response and the occurrence of missing values.

Literally, a missing value is an unknown value of a given observation on a given variable. This definition remains simple as long as one does not question why the value is unknown. Indeed, the issue with missing values is not so much their absence as the cause of their absence. Under the assumption of a completely random distribution of the missing values (a case called Missing Completely at Random), both moment analyses (average, standard deviation, etc.) and multivariate association analyses can be conducted without worrying about missing values. However, this hypothesis is difficult to defend in practice. A weaker hypothesis is that the reason why a particular value is missing does not depend on the value itself, but only on variables that are all observed (a case called Missing at Random).

Such a case occurs, for example, by assuming that men are more likely to communicate their weight than women, and that once sex is controlled for, the missing values are distributed completely at random. In this case, it is possible to carry out multivariate association analyses without worrying about the missing values, but not the moment analyses (which require modelling the non-response mechanism, for example by using an adequate weighting). In practice, this assumption is often made. However, it is not always acceptable. Refuting this assumption means considering that the set of observed variables is not able to explain the missing values on the other variable. The reason might be that the right variables are not observed in the sample. Another reason can be that the reason why values are missing depends on the values themselves. Such a case might be observed, for example, for income variables: some people with extreme incomes (very high or very low) may prefer not to report them. Similar phenomena also take place in longitudinal studies, for example when a teenager drops out of a study on obesity after noticing that he has gained weight. This type of non-response (a case called Missing Not At Random) cannot be ignored if one wants to obtain precise, or even simply valid, estimates.

Unfortunately, this is also the most challenging case to deal with. Since the model that generates the data is most often unknown, the usual strategy is to resort to imputation methods.
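To make the Missing at Random example above concrete, here is a minimal sketch in R on simulated data (all values are hypothetical). Women report their body weight less often than men, but conditional on sex the values are missing completely at random: the naive complete-case mean is biased, while weighting the observed cases by the inverse of their sex-specific response probability recovers the population mean.

    set.seed(1)
    n   <- 10000
    sex <- rbinom(n, 1, 0.5)                       # 0 = woman, 1 = man
    bw  <- rnorm(n, mean = 62 + 15 * sex, sd = 8)  # body weight in kg
    p   <- ifelse(sex == 1, 0.9, 0.5)              # men report more often
    obs <- rbinom(n, 1, p) == 1                    # TRUE = value reported

    mean(bw)                            # population mean (all values known)
    mean(bw[obs])                       # naive complete-case mean: biased upwards
    weighted.mean(bw[obs], 1 / p[obs])  # inverse-probability weighted mean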

A common distinction is between unit non-response and item non-response.

Unit non-response occurs when someone did not accept to answer the survey, while item non-response occurs when someone answered the survey but did not answer some questions. Unit non-response often makes the sample fail to correctly represent the target population of a particular study, and data processing operations successively performed on data suffering from item non-response may degrade data representativeness even further. To address unit non-response, database managers compute sampling weights intended to correctly account for specific features of the survey design, such as stratification and under- or over-representation of a particular group of individuals, and for potential bias. Basically, the use of non-response weights consists in representing absent individuals by present individuals. In the remainder of the chapter, I often use the term “weights” to refer to a variable providing sampling weights. However, it is worth mentioning that, for a particular sample, sampling weights are not unique: many different sampling weights can be computed using different techniques, involving different assumptions and targeting different aims. This situation is even more prominent in longitudinal studies, where several weighting variables are most often available to allow practitioners to target a particular population. Cross-sectional weights are available to make the sample representative of the population of a particular year, independently of the attrition, while longitudinal weights are available to make the sample representative of the population of the first wave of the survey.

As pointed out above, survey non-response is usually considered a significant source of bias. Indeed, in most cases survey non-response does not occur randomly: people who experience difficulties or changes in their lives, such as health, family or unemployment issues, are less inclined to answer a survey than people enjoying a stable and comfortable life (Marieke Voorpostel and Lipps, 2011). In this context, it would be a strong assumption to consider missing values due to survey non-response as missing at random. For these reasons, the use of sampling weights is stated as mandatory in the documentation of several leading surveys, for instance the Swiss Household Panel (M. Voorpostel et al., 2012a), the Swiss Labour Force Survey (Swiss Federal Statistical Office, 2012) and the British Household Panel Survey (Bentley, 2013). It is important to note that this issue is even more prominent when studying vulnerable populations, which are typically hard to reach and often require specific attention when designing the survey (Oris et al., 2016).

However, I observed that in R, the possibilities offered for weighting data are not explicit. Actually, many functions, including basic functions aiming to compute descriptive statistics such as the average or the quartiles, do not offer the possibility to use weights. As another example, the standard logistic regression method (glm) does not handle sampling weights either. As I found this result surprising, I compared with two other statistical software packages. Here are the results:

• R: Not native, although several procedures are able to handle sampling weights. For a unified solution for handling sampling weights, users have to rely on a contributed package, such as the survey package (Lumley, 2004, 2011); a short sketch of this approach follows the list.

• SPSS: Native for all procedures, but not used by default. Using sampling weights in estimations requires the complex samples module, and users are not prompted if they do not use the survey estimation procedures although they should.

• Stata: Native, but not used by default. Using sampling weights in estimations requires the use of specific commands (accessible by using the survey prefix command svy, see (StataCorp, 2013a) for more information), but again users are not prompted if they do not use the survey estimation procedures although they should.
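To illustrate the first point, here is a minimal sketch of the unified approach offered by the survey package; the data frame df, its variable income and its weighting variable w are hypothetical placeholders.

    library(survey)  # contributed package (Lumley, 2004, 2011)

    # Declare the design once, attaching the sampling weights to the data
    des <- svydesign(ids = ~1, weights = ~w, data = df)

    # Weighted descriptive statistics, with design-based standard errors
    svymean(~income, des)
    svyquantile(~income, des, quantiles = c(0.25, 0.5, 0.75))
    svytotal(~income, des)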

The fact that major survey data institutions state that the use of weights is mandatory to achieve representative results, and the fact that the use of weights is not immediate in statistical software, seem conflicting. I discuss and attempt to address this issue in Section 4.3.1.

The term “weights” may take different meanings depending on the software or software functions the user is working with. Indeed, among some other types of weights, such as importance weights, imputation weights or prior weights, an analyst generally has to deal with four types of weights:

Sampling weights, also called survey weights, are the weights we are talking about in this section. They balance the sample to make it representative of the target population. Sampling weights refer to the inverse of the probability of a particular observation to be selected from the population into the sample (Hilbe and Robinson, 2013). For example, a weight of 3 is given to a case whose probability to be sampled was 1/3 (UCLA Statistical Consulting Group, 2015). This ensures the case is correctly represented among the other cases of the sample.

Population weights are used to make the sample representative of the target population, with a total number of weighted cases equal to the total number of individuals in the population. Such weights are not intended to be used for statistical analysis but for making projections on the whole population (M. Voorpostel et al., 2012b).

Replication weights, also called case weights or frequency weights, are integer weights used by statistical software to save storage space and/or reduce computation times. For instance, a weight of 3 means that there were actually three identical observations in the primary data, which were collapsed to a single observation in the database (UCLA Statistical Consulting Group, 2015). Shrinking data that way avoids replicating the same information and reduces the number of operations needed when running a computation; a brief illustration follows this list.

Precision weights, also called analytic weights, are used to represent the accuracy of the measure of an outcome variable. They aim to model the differential precision with which the outcome variable was estimated, using averages based on different numbers of observations. For instance, a precision weight of 3 means that the case is actually the average of 3 observations. As an average becomes more precise when it is based on more measurements, the associated weight will be proportional to the inverse of the variance of the mean used (R Core Team, 2015, glm function manual page).
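As a brief illustration of replication weights, the following toy example (hypothetical values) collapses three identical observations into a single row carrying a frequency weight of 3; the weighted computation on the collapsed data recovers the result obtained on the primary data.

    x_raw <- c(2, 2, 2, 5)   # primary data, with duplicated observations
    x     <- c(2, 5)         # collapsed data stored in the database
    f     <- c(3, 1)         # frequency (replication) weights

    mean(x_raw)              # 2.75
    weighted.mean(x, f)      # 2.75, recovered from the collapsed data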

I observed during the empirical work conducted with Andrés Guarin and Myriam Girardin that these various weight types tend to confuse practitioners.

In addition, statistical software are not always clear about the type of weights they are able to deal with. For instance in R, the function glm from the package stats (R Core Team, 2015) implements precision weights and the function survfit from the package survival (T. Therneau, 2012) implements replication weights, but the parameters associated with these weights are both just called “weights” in the respective manual pages. Note that although both functions are extensively used by researchers working with survey data, neither implements sampling weights. SPSS is also able to weight data thanks to the function call WEIGHT BY [...]. Literally, this function indicates that we want to weight data by a certain variable, but does not specify what type of weights is intended to be used. Actually, this function call is intended to weight data with frequency weights only, not sampling weights. Indeed, SPSS Base and SPSS Advanced Models only support frequency weights (IBM Corporation, 2013b, p. 2025).

To be able to use sampling weights in SPSS, users have to buy the complex samples module. I attempt to address this issue in Section 4.3.1.
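Coming back to the ambiguity of the weights parameters in R, the following minimal sketch contrasts the two readings of a “weights” argument; df, y, x and w are hypothetical placeholders.

    library(survey)

    # stats::glm reads `weights` as precision/frequency weights; with a
    # binomial family and non-integer sampling weights it even warns about
    # non-integer numbers of successes.
    m_glm <- glm(y ~ x, family = binomial, data = df, weights = w)

    # survey::svyglm reads the weights declared in the design as sampling weights
    des   <- svydesign(ids = ~1, weights = ~w, data = df)
    m_svy <- svyglm(y ~ x, design = des, family = quasibinomial())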

In the study conducted in collaboration with Stéphane Cullati, to account for the fact that some survey databases were involved in multiple studies we reviewed, such as the British Household Panel Survey or the Panel Study of Income Dynamics, we calculated weighted proportions, using the Stata svy command, to cluster the 45 studies within all the survey databases involved. This stage of the study led Stéphane Cullati and me to discuss the use of weights in general and the use of sampling weights in particular. The point of view of Stéphane Cullati was that the use of sampling weights is not always the best choice, as weighted analyses may achieve less significant results than unweighted ones. This point of view, apparently shared by several of his colleagues, was in obvious contradiction with what I read in the documentation of the leading surveys I studied. I investigate the question in Section 4.3.1.