Practical issues in handling life course data

4.2 Observations

4.2.2 Practical issues in handling life course data

During the immersion with Andrés Guarin and Myriam Girardin, data preparation took up a large part of the time devoted to the study. We had to deal with life course data, more precisely panel data and network data, and many data processing operations were required to make the data ready for analysis. This section reviews the methods and practices defined for managing life course data in statistical software.

4.2.2.1 Life course data

The term life course data is used in a number of life course studies and mainly refers to longitudinal data. Beyond that, the term does not have a clear definition. I clarify here the scope of the term on the basis of the review of the life course perspective reported in Section 2.1.1.

The life course perspective is a holistic approach that studies individuals as embedded in the flow of their own lives, the flow of their relatives' lives, and several temporal and environmental contexts. Regarding data about individuals, the life course perspective focuses on three aspects: biological, psychological, and social. Data can include complex measurements such as pulmonary radiographic images when studying lung cancer, or brain images when studying the evolution of a cerebrovascular accident or a psychiatric pathology (Buckner et al., 2006). Such complex data can be processed through qualitative analyses or transformed into several categorical or quantitative indicators to be included in quantitative analyses. However, in their simplest form, individual-related aspects directly refer to standard categorical or quantitative variables, each one modeling a one-dimensional measure. Biological aspects include, for example, age, sex, weight, or blood pressure. Psychological aspects include, for example, well-being, mood, or irritability. Social aspects include, for example, income, socio-economic position, or political positioning.

Such data are most often collected through a survey. Survey data form the empirical base of much of the research. This is notably the case for research into vulnerability across the life course carried out by the Swiss National Centre of Competence in Research LIVES (Roberts et al., 2016). In its simplest form, a survey collects information about some individuals without regard to differences in time. In this setting, a survey is a snapshot of the sample at a specific time. The resulting database does not make it possible to identify the time at which each respondent was surveyed. Such a database is called cross-sectional (Lavrakas, 2008). The cross-sectional nature of a database comes from the survey design or the data source used, but it is also an assumption made about the data.

Indeed, in most cross-sectional surveys, individuals are not surveyed at exactly the same time. The tacit assumption is that it is not necessary to account for time differences. In other words, it is assumed that interviews are conducted close enough together that time does not matter. Basically, cross-sectional survey data consist of a sample of cases (individuals, houses, companies, etc.) about which some information has been collected and represented by several variables. The resulting database is usually stored in a two-dimensional array.

While the analysis of cross-sectional data usually consists of comparing the differences among subjects, the life course perspective goes further by taking time into account. From a general perspective, taking time into account when studying individuals means having several measures of individual characteristics at different times and studying the variations of these measures over time (Hsiao, 2014). Longitudinal data are often divided into panel data and biographical calendar data. Panel data store repeated measures called waves (Hsiao, 2014).

Panel data differ from pooled cross-section data in that they measure the same subjects at different times, whereas the latter measure different subjects at different times. By contrast, in retrospective designs, respondents are interviewed once to collect life calendar data on their past life courses. With such a design, life calendars focus on events and avoid the attrition often observed in panel data. Life calendars have recently been shown to be useful when studying vulnerable populations (Morselli et al., 2016). Each life course can be seen as a sequence of life events: birth, a serious illness, recovery from an illness, starting school, ending school, first job, first union, leaving home, first child, the death of the father, marriage, etc. (Ritschard and Oris, 2005). Sequences of the various family, education, work, health, emotional, and other personal events thus define a life course (Ritschard and Oris, 2005).

The life course perspective also considers the social environment of individuals. This social environment is multilayered. In Section 2.1.1.3, I focus on the Bronfenbrenner model to illustrate this layering. In terms of data types, the distinction to be made is between the micro-social environment and the macro-social environment. For the micro-social environment, data can be stored as a matrix representing the associations among network members (E. D. Widmer et al., 2013; Kogovšek et al., 2014). For the macro-social environment, data can be stored as standard one-dimensional variables, each storing a distinct aggregated measure. Common macro-social variables include the average fertility rate, the average unemployment rate, and the average life expectancy (Billari, 2015).

To sum up, life course data involve a variety of data types, including panel data (population time, individual follow-up), biographical calendar data (individual time, individual follow-up), successive cross-sectional data (population time, no individual follow-up), administrative/monitoring data (population or individual time, individual follow-up), network data, egocentric network data, and qualitative interviews.

4.2.2.2 Data handling methods and practices

By making the structure of the collected data more complex, life course data bring difficulties in terms of data management. This section reviews the software functionalities available for handling the specific characteristics of life course data.

Biographical and panel data

Statistical software was first designed to store cross-sectional data and, as we saw above, the standard way of storing data is a two-dimensional table with individuals in rows and variables in columns. When longitudinal data emerged, continuing to use a two-dimensional table was natural. For biographical data, variables are not different from cross-sectional variables. Variables store dates instead of characteristics, the characteristic being defined by the variable itself, for example first marriage. For each event, we have a single measure per individual: the date.
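As an illustration of this storage convention, the following minimal Python sketch keeps one row per individual and one date-valued variable per event, then derives the ordered sequence of events experienced by each individual. The records, the variable names (`first_job`, `first_marriage`, `first_child`) and the helper `event_sequence` are hypothetical and serve only as an example; they are not taken from the databases used in the study.

```python
from datetime import date

# Hypothetical biographical records: one row per individual,
# one date-valued variable per event (None = event not experienced).
biographies = [
    {"id": 1, "birth": date(1960, 3, 2), "first_job": date(1979, 9, 1),
     "first_marriage": date(1985, 6, 15), "first_child": date(1987, 1, 20)},
    {"id": 2, "birth": date(1972, 11, 30), "first_job": date(1995, 2, 1),
     "first_marriage": None, "first_child": date(1994, 5, 9)},
]


def event_sequence(record, id_key="id"):
    """Return the events experienced by one individual, ordered by date."""
    dated_events = [
        (when, name)
        for name, when in record.items()
        if name != id_key and when is not None
    ]
    # Sorting the (date, name) pairs orders the events chronologically.
    return [name for when, name in sorted(dated_events)]


for person in biographies:
    print(person["id"], event_sequence(person))
```

Because the characteristic is carried by the variable name and the value is only a date, the table keeps the same shape as a cross-sectional table; the sequence view is derived on demand.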

Panel data are different. They store repeated measures and therefore require dealing with several measures of the same variable. To store such data in a two-dimensional table, a convention for organizing data has to be defined. The two common ways of storing panel data are the person-level format and the person-period format (Singer and Willett, 2003). The person-level format stores one record per case, with multiple variables holding the measurements taken at the different measurement occasions. In a person-level database, there are as many rows as there are cases. In contrast, the person-period format stores multiple records per case and one column per variable. An additional variable stores the date of the record. In a person-period database, there are as many rows as there are combinations of cases and waves.
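The two formats can be contrasted with a short Python sketch that reshapes a person-level table into a person-period table. The sample data and the column naming convention `<variable>_<year>` are assumptions made for the example, and `to_person_period` is a hypothetical helper rather than a function of any statistical package.

```python
# Hypothetical person-level (wide) table: one row per case,
# one column per variable and wave, named "<variable>_<year>".
person_level = [
    {"id": 1, "health_2010": 4, "health_2011": 3,
     "income_2010": 52000, "income_2011": 54000},
    {"id": 2, "health_2010": 2, "health_2011": 2,
     "income_2010": 38000, "income_2011": None},
]


def to_person_period(rows, id_key="id"):
    """Reshape person-level rows into person-period rows (one per case and wave)."""
    long_rows = []
    for row in rows:
        by_wave = {}
        for column, value in row.items():
            if column == id_key:
                continue
            # Split "health_2010" into the variable name and the wave.
            variable, wave = column.rsplit("_", 1)
            by_wave.setdefault(int(wave), {})[variable] = value
        for wave in sorted(by_wave):
            long_rows.append({id_key: row[id_key], "wave": wave, **by_wave[wave]})
    return long_rows


person_period = to_person_period(person_level)
for record in person_period:
    print(record)
```

With two cases and two waves, the person-level table has two rows and the person-period table has four, which matches the row counts described above.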

While this review suggests using the person-period format, we observe in practice that databases (Taylor et al., 2010; M. Voorpostel et al., 2012a) are released in person-level format. My analysis is that (1) statistical software is generally designed for storing data in a person-level format and (2) the label of a measure can change over time. For instance, the variable health status could be asked for several years as “How do you assess your health?” and then become “How do you assess your health compared to your peers?”. This change in the phrasing of the question could make the variable inconsistent across time. It is left to the user to decide whether this variable has to be considered as two different measurements or not.
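A change of wording of this kind can be detected mechanically before deciding how to treat the variable. The sketch below is only an illustration: the wave years and the dictionary `health_labels` are hypothetical, and `wording_changes` is a helper introduced for the example.

```python
# Hypothetical question wording of the "health status" variable, by wave.
health_labels = {
    2008: "How do you assess your health?",
    2009: "How do you assess your health?",
    2010: "How do you assess your health compared to your peers?",
}


def wording_changes(labels):
    """Return the pairs of consecutive waves between which the wording changed."""
    waves = sorted(labels)
    return [
        (previous, current)
        for previous, current in zip(waves, waves[1:])
        if labels[previous] != labels[current]
    ]


print(wording_changes(health_labels))
```

The check only flags where the phrasing differs; whether the two phrasings measure the same thing remains a substantive decision for the user.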

In the immersion conducted in collaboration with Andrés Guarin, the task of manipulating data and making them ready for analysis consumed a large part of the time we devoted to the study. In particular, because of time limitations, we were unable to complete the second part of our study, which discussed occupational attainment. More precisely, the analyses were run but we lacked time to finish interpreting the results. Clearly, by saving time on data preparation we could have done more on result interpretation and perhaps obtained more outcomes than in the current version of the study. Even without new results for the article, having more time to focus on result interpretation would have given us a better understanding of the situation experienced by second-generation immigrants, which could serve as basic knowledge for our future research.

Network data management

Many data preparation steps are required before one can start analyzing data and performing descriptive analyses. They are also required to compute network-specific measures such as centrality, density, and in- and out-degrees. However, the immersion showed that practitioners need to perform even more complex data recoding operations to match the needs of their research. As an example, a hypothesis could be that if people in the network have a low educational level, then the density of the network is higher. Another question could concern demographic information about the second most central person in the network. Other examples are the management of missing values when computing some network measures, or removing from the networks all individuals having specific demographic values, such as removing all men and computing the density again. For example, to test the centrality of the brother in a network, a possible strategy is to remove all the brothers and compute centrality again. Such questions require the use of both network data and demographic data about network members. Collecting that information from a 2D tabular database takes several operations and requires the practitioner to leave the study to focus on data management operations. Depending on the user's familiarity with such operations, they may also require the user to spend some time training herself. For example, for the Geneva VLV database: (5 × 5) × 4 + 5 × 8
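The strategy of removing members with a given characteristic and recomputing a network measure can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the network, the member attributes, and the helpers `density` and `drop_members` are hypothetical, ties are assumed to be directed and binary, and density is taken as the share of possible ties that are present. It does not reproduce the structure of the VLV database.

```python
# Hypothetical ego network: members[i] describes row/column i of the matrix.
members = [
    {"name": "ego", "role": "ego", "sex": "F"},
    {"name": "Paul", "role": "brother", "sex": "M"},
    {"name": "Anna", "role": "mother", "sex": "F"},
    {"name": "Lea", "role": "friend", "sex": "F"},
]
# ties[i][j] == 1 means that member i has a tie towards member j (directed).
ties = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]


def density(matrix):
    """Share of possible directed ties that are present (diagonal ignored)."""
    n = len(matrix)
    if n < 2:
        return 0.0
    present = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)
    return present / (n * (n - 1))


def drop_members(members, matrix, predicate):
    """Remove the members matching `predicate` from both tables at once."""
    keep = [i for i, member in enumerate(members) if not predicate(member)]
    kept_members = [members[i] for i in keep]
    kept_matrix = [[matrix[i][j] for j in keep] for i in keep]
    return kept_members, kept_matrix


full_density = density(ties)
remaining, ties_without_brothers = drop_members(
    members, ties, lambda member: member["role"] == "brother"
)
reduced_density = density(ties_without_brothers)
print(full_density, reduced_density)
```

The point of the sketch is that the demographic table and the tie matrix must be filtered together and kept aligned, which is precisely the kind of bookkeeping that takes several operations in a 2D tabular database.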

From the document: Modelisation and Information System Tools to Support the Discovery of Interactive Factors of Vulnerabilities in Life Courses (pages 124-128)