1.2.2.2 Identifying interaction effects with infrequent outcomes

In data analysis or statistics, when analyzing the effect of some explanatory factors, also called predictors or independent variables, on a variable of interest, also called dependent variable, response, or predicted variable, the term interaction refers to a situation in which the impact on the dependent variable of one independent variable is different depending on the values of another independent variable. Another possible formulation is that the effect of one independent variable on the dependent variable is not the same at all levels of the other independent variable.
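As a brief illustration, the definition above can be written with a product term in a linear model. This is the standard textbook formulation, not a model taken from this thesis; the symbols $y$, $x_1$, $x_2$ and the coefficients $\beta_0, \dots, \beta_3$ are notation assumed here for the example only.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon
\qquad\Longrightarrow\qquad
\frac{\partial y}{\partial x_1} = \beta_1 + \beta_3 x_2
```

When $\beta_3 \neq 0$, the effect of $x_1$ on $y$ changes with the value of $x_2$, which is exactly the situation described as an interaction.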

3 I formalize these dimensions in Section 3.1.1.

At the beginning of this chapter, I implicitly introduced an interaction effect when discussing the example of the leg fracture caused by a domestic accident.

For young people, it can be assumed that such an event will not have significant consequences for the individual: a few weeks of rest are often sufficient to recover.

But for elderly people, recovering from such an accident is often harder. In the worst cases, this domestic accident can precipitate the individual’s transition to a loss of independence. This means that there is an interaction between the variable “experiencing a leg fracture” and age.

From a more general point of view, there are two reasons for which I expect the study of vulnerability in life courses to involve interaction effects.

Firstly, in the life course perspective, individuals are studied by taking into account various aspects of their lives, including biological, psychological and social characteristics. In addition, individuals are not studied in isolation but as entities embedded in several contextual environments, including a micro social environment, a macro social environment, a geographical environment and several dimensions of time, including historical time, individual time and social time. Such a holistic approach involves the use of a larger number of variables in the analysis. By design, increasing the number of variables increases the number of possible interaction effects.
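To give a sense of how quickly the number of candidate interactions grows, the following minimal sketch counts the distinct sets of variables that could interact, for each interaction order. The function name `count_interactions` and the chosen numbers of variables are assumptions made for this illustration; the count ignores the number of modalities of each variable, which would increase the figures further.

```python
from math import comb


def count_interactions(n_variables: int, max_order: int) -> dict[int, int]:
    """Count the distinct variable sets of size 2..max_order.

    Each set of k variables is one candidate k-way interaction.
    """
    return {k: comb(n_variables, k) for k in range(2, max_order + 1)}


# Hypothetical numbers of predictors, chosen for illustration only.
for p in (5, 10, 20):
    print(p, count_interactions(p, 3))
```

With 10 predictors there are already 45 candidate two-way and 120 candidate three-way interactions.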

Secondly, as the title of the NCCR LIVES project “Overcoming vulnerability: life course perspective” suggests, the overall objective of studying vulnerability in life courses is not only to understand vulnerability better but also to find solutions to avoid or reduce it. At the data analysis level, a possible strategy is to look for interaction effects between a previously identified factor of vulnerability and another covariate that makes it possible to neutralize or reduce the outcomes related to that factor.

In addition, it is relevant to note that outcomes resulting from vulnerable situations are possibly infrequent, and sometimes rare, in a general population. Indeed, modern societies are organized around a number of laws and public institutions that aim to protect individuals against a number of life hazards. For instance, labor laws provide individuals with support to prevent them from falling unexpectedly into unemployment. National and regional employment offices provide unemployed individuals with financial support and accompany them in the steps of finding a new professional position. This organization of society minimizes, to some extent, the risk of falling into undesired life situations. As a fortunate consequence, outcomes resulting from vulnerability states are often experienced by a small proportion of the population. For instance, the unemployment rate of economically active youth living in Switzerland and aged 20 to 24 has been shown to be, depending on the country of origin, between 3% and 11% in 2000 (Fibbi et al., 2006). The share of the Swiss resident population in private households below the absolute poverty threshold in terms of disposable income (i.e. the gross household income minus compulsory expenditure such as social insurance contributions, taxes, basic health insurance premiums, alimony and other maintenance payments) has been estimated by the Federal Statistical Office to be about 6.6% in 2014 (Swiss Federal Statistical Office, 2016). In the United States, the probability of developing invasive cancer for cancer-free citizens aged 40 to 59 has been estimated by the American Cancer Society to be about 9% in 2012 (Siegel et al., 2012). These examples show that when studying vulnerability in the general population of an industrialized country, we have to expect an underrepresentation of the outcomes experienced by the population.

On an analytical level, this underrepresentation entails an imbalanced distribution of the dependent variable (see footnote 4). The imbalance among the classes of the dependent variable often leads to a poor prediction rate for the minority class. This issue is well known in the literature as the imbalanced data issue. However, the minority class is in most cases the class of interest (the class of vulnerable individuals) and, for this reason, a high recall rate is desired on this class.
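The following minimal sketch illustrates why imbalance is a problem for the class of interest. It uses a hypothetical sample of 100 individuals of whom 7 are vulnerable (a proportion of the same order as the rates quoted above, but invented for the example) and a trivial classifier that always predicts the majority class. The function names `accuracy` and `recall` are written here for illustration and do not refer to any particular library.

```python
def accuracy(y_true, y_pred):
    """Share of observations whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Share of truly positive observations that are predicted positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not predictions_for_positives:
        return 0.0
    hits = sum(p == positive for p in predictions_for_positives)
    return hits / len(predictions_for_positives)


# Hypothetical, invented sample: 7 vulnerable individuals out of 100.
y_true = ["vulnerable"] * 7 + ["not_vulnerable"] * 93
# A trivial classifier that always predicts the majority class.
y_pred = ["not_vulnerable"] * 100

print("accuracy:", accuracy(y_true, y_pred))
print("recall on the vulnerable class:", recall(y_true, y_pred, "vulnerable"))
```

The overall accuracy looks high while none of the vulnerable individuals is identified, which is why recall on the minority class is the quantity to watch.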

However, it has to be noted that a vulnerability outcome is not always associated with an underrepresentation in data. In particular, the individualisation of society that has been taking place in the past decades leads individuals to bear challenges and failures personally more often. In addition, the difficulties faced by individuals in their lives have been “democratized”. In particular, rather than being differentially distributed between social classes, the emergence of stressors and observable outcomes varies according to periods of life (Leisering and Leibfried, 2001; Spini et al., 2013, 2017; Oris, 2017).

In addition, depending on the nature of the phenomenon studied, the socio-economic parameters and the targeted population, the magnitude of the imbalance can be strongly reduced in particular situations. For example, in the past decades, Spanish unemployment rose dramatically. Observed at a reasonable level of 5 percent during the first half of the 1970s, Spanish unemployment successively increased to reach 24 percent during the 1990s (Dolado and Jimeno, 1997). In extreme situations, such as significant outbreaks, the proportion of people affected by a particular outcome can rise even higher. For example, during the Medieval Black Death, vulnerability to contracting the plague was unfortunately almost balanced in the population, as it is reported that up to 50-60 percent of the European population was killed by the disease between 1347 and 1351 (Benedictow, 2004, p. 383; DeWitte, 2014). Nowadays, comparable vulnerability rates regarding, for example, poverty or insecurity occur in specific areas, such as impoverished cities like Detroit, Michigan (C. A. Wilson, 1992), or refugee camps like Sangatte, France (Schwenken, 2014).

In this context, I will ensure that the contribution I put forward for exploring variable interactions with infrequent outcomes works in both balanced and imbalanced contexts. In other words, it has to be insensitive to class balance.

Common methods used to identify interaction effects in classification include, among others, classification tree methods (Kass, 1980; Breiman et al., 1984), regression methods (McCullagh and Nelder, 1989; Nelder and Baker, 2004), Bayesian networks (Pearl, 1986) and association rules (Agrawal et al., 1993). In the present work, I focus on classification tree methods. I make this choice on the basis of two criteria: (1) the ability to identify interactions among a large number of variables and (2) the ability to represent the identified interactions in a way that allows practitioners to quickly assess their relevance with respect to their theoretical model.

4 Such an imbalance also affects the observation stage: as there are fewer occasions to observe such vulnerable outcomes, researchers have fewer empirical examples and pieces of evidence from which to work out what successions of events or transitions led to the outcome.

Regression models are probably the most widely used tool in the social sciences, which would make them the preferred tool for a thesis aimed at providing methodological tools for social scientists. Regression methods allow interaction effects to be highlighted, provided that continuous variables are centered to avoid multicollinearity issues (Hayes, 2017). However, in a regression model, interaction effects consume a large number of degrees of freedom. Therefore, to ensure model convergence, it is necessary to limit both the number of interaction effects tested and the order of these interaction effects. For this reason, instead of using a model based on hypothesis testing, I recommend exploring the space of predictors with a statistics-free method. Also, as part of this thesis, we are interested in situations of vulnerability, and therefore the expected target variables correspond to potentially less frequent life situations (poverty, stress, exclusion, etc.). A small number of observations in the class of interest aggravates the model convergence issue and therefore further limits the number of interactions that can be tested simultaneously. This effect is even more pronounced for predictor variables with a large number of modalities. When observations are broken down into many classes, it is more difficult to obtain significant results than when classes are appropriately grouped together. Classification trees provide a solution to this point: by performing a recursive, step-by-step partitioning of the population, a large number of possible groupings are tested successively. Moreover, the fact that the splits at the same level of the tree are built independently of each other favors the emergence of interaction effects. Bayesian networks are an effective tool for identifying conditional dependencies among the set of descriptive variables and can thereby identify interaction effects.
In particular, Bayesian networks can simultaneously take into account expert knowledge, by acting a priori on the structure of the graph, and the empirical evidence contained in the data (Heckerman et al., 1995). For their part, association rules make it possible to identify associations among frequent sets of co-occurring modalities according to different measures of interest. By comparing the rules with each other, especially when the search involves both positive and negative rules (Wu et al., 2004), it is possible to identify interaction effects. However, to quickly identify an interaction effect, several types of information have to be made available to the practitioner, including how the class distribution of the target variable changes according to the modalities of the predictor variables. The results must also be presented so as not to overwhelm the practitioner with too much information. Yet Bayesian networks and association rules produce outputs that are often difficult to interpret (Bayat et al., 2009; Kotsiantis and Kanellopoulos, 2006). Regression models are also more complex to read in a multinomial context.

In contrast, decision trees can render both the splits and the distribution of the dependent variable within each node, making it easy for practitioners to assess changes at each level of the tree. Such an intuitive graphical representation of the results allows practitioners to easily identify interactions, even of higher orders. Another concern when working with life course data is the ability to handle temporality.
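The way a tree display exposes an interaction can be sketched with the leg fracture example from the beginning of this section. The code below is not a tree-growing algorithm: the two splits (age group, then leg fracture) are fixed by hand, and all counts are hypothetical and invented for illustration. It only reproduces what a practitioner reads off a tree, namely the distribution of the outcome within each node.

```python
from collections import Counter

# Hypothetical counts, invented for illustration only.
# Each record is (age_group, leg_fracture, outcome).
records = (
    [("young", "no", "independent")] * 48
    + [("young", "no", "dependent")] * 2
    + [("young", "yes", "independent")] * 19
    + [("young", "yes", "dependent")] * 1
    + [("elderly", "no", "independent")] * 45
    + [("elderly", "no", "dependent")] * 5
    + [("elderly", "yes", "independent")] * 12
    + [("elderly", "yes", "dependent")] * 8
)


def share_dependent(rows):
    """Proportion of rows whose outcome is 'dependent'."""
    counts = Counter(outcome for _, _, outcome in rows)
    return counts["dependent"] / len(rows)


def split(rows, index):
    """Partition rows by the value found at position `index`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row)
    return groups


def leaf_shares(rows):
    """Share of 'dependent' in each (age_group, leg_fracture) leaf."""
    leaves = {}
    for age, age_rows in split(rows, 0).items():
        for fracture, leaf_rows in split(age_rows, 1).items():
            leaves[(age, fracture)] = share_dependent(leaf_rows)
    return leaves


print(f"root: n={len(records)}, dependent={share_dependent(records):.2f}")
for (age, fracture), share in leaf_shares(records).items():
    print(f"  age={age}, fracture={fracture}: dependent={share:.2f}")
```

In this made-up data, a fracture barely changes the share of dependent individuals among the young (0.04 versus 0.05) but raises it sharply among the elderly (0.10 versus 0.40). Reading the node distributions side by side is what makes the interaction visible.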

Decision trees are able to handle temporal data. Considering longitudinal data organized in successive waves, two tabular representations are commonly used: the wide format and the long format. In the wide format, each row refers to a unique individual, and the same variable measured at different times is stored as separate variables. In the long format, each row refers to a unique combination of individual and measurement time, and each variable is stored in a single column. For biographical data, coming for instance from a retrospective survey or a life calendar, a long format is often adopted. With data stored in a wide format, the classification tree treats each variable independently of the others.

Temporal links that exist between variables representing successive measures of the same item are not taken into account when growing the model. However, the tree is able to extract from the whole set of variables the time points that maximize classification quality. But, to keep the tree at a reasonable size, the number of variables in play has to be limited. As a result, only the most significant associations emerge, and a number of other relevant associations may remain hidden. When using a classification tree on data stored in a long format, the emphasis is placed on the variables themselves, and the temporal information is used to assess whether temporality moderates the effects of a variable. Therefore, classification trees are able to handle longitudinal data but not to take into account all information about temporality.
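The difference between the wide and long formats can be made concrete with a small sketch. The helper `wide_to_long`, the variable name `employed` and the column naming scheme `<variable>_<wave>` are all assumptions made for this illustration; actual panel surveys use their own naming conventions.

```python
def wide_to_long(wide_rows, id_key, variable, waves):
    """Reshape one repeated variable from wide to long format.

    Assumes wide columns are named '<variable>_<wave>' (a naming
    convention chosen for this example only).
    """
    long_rows = []
    for row in wide_rows:
        for wave in waves:
            long_rows.append(
                {
                    id_key: row[id_key],
                    "wave": wave,
                    variable: row[f"{variable}_{wave}"],
                }
            )
    return long_rows


# Wide format: one row per individual, one column per wave (hypothetical data).
wide = [
    {"id": 1, "employed_1": "yes", "employed_2": "no"},
    {"id": 2, "employed_1": "yes", "employed_2": "yes"},
]

# Long format: one row per individual and wave, one column for the variable.
long_rows = wide_to_long(wide, "id", "employed", [1, 2])
for row in long_rows:
    print(row)
```

A tree grown on `wide` sees `employed_1` and `employed_2` as unrelated predictors, whereas a tree grown on `long_rows` sees a single `employed` predictor together with `wave`, which can then act as a moderator.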

Therefore, to address the second research question, I conducted a literature review on classification tree learning in the context of infrequent outcomes. This literature review is reported in Section 2.2.

1.2.3 Immersions as a social scientist

The research questions introduced in 1.2.2.1 and 1.2.2.2 focus on the exploratory analysis step (G/G’) of the data exploration stage (steps E to G/G’) introduced in Figure 1.1. To address the data exploration stage as a whole, I also investigate what improvements can be made to the data understanding (E) and data preparation (F) steps.

As the thesis focuses on the discovery of factors of vulnerability in life courses, a particular point of interest concerns the use of life course data. Life course data are expected to be more complex than the cross-sectional data traditionally used in social sciences. As the life course perspective involves studying trajectories, life course data are expected to contain measures repeated over time. This feature makes databases larger and, as a result, more difficult to handle. Additionally, although repeated measures inherently share several characteristics, they may also differ in others. Indeed, the survey design is likely to change over time.

For example, the phrasing of some questions may change to make them compatible with the national survey of another country. The rating scale of a variable is also likely to change from, say, a 7-item scale to a 5-item scale for a similar reason.

Such changes make the variables not directly comparable. As a result, additional preprocessing operations may be required to make data ready for analysis. The life course perspective also involves studying the microsocial environment of individuals. The microsocial environment is made up of many linkages, whose analysis requires egocentric network data. The macrosocial environment also plays an important role in the life course perspective. Taking the macrosocial environment into account in the analysis involves the use of administrative socio-economic data.
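Returning to the rating-scale change mentioned above, the sketch below shows one possible preprocessing operation: a linear rescaling of answers given on a 1 to 7 scale onto a 1 to 5 scale. This is only one option among several and is not a method prescribed in this thesis; the function name `rescale` and the sample answers are assumptions made for the example. Whether a linear mapping is appropriate is a substantive decision that depends on the wording of the response categories.

```python
def rescale(value, old_min, old_max, new_min, new_max):
    """Linearly map a value from [old_min, old_max] onto [new_min, new_max]."""
    if not old_min <= value <= old_max:
        raise ValueError(f"{value} is outside the scale [{old_min}, {old_max}]")
    return new_min + (value - old_min) * (new_max - new_min) / (old_max - old_min)


# Hypothetical answers collected on a 7-item scale in an earlier wave.
earlier_wave = [1, 4, 7]
# Mapped onto the 5-item scale used in a later wave.
harmonized = [rescale(v, 1, 7, 1, 5) for v in earlier_wave]
print(harmonized)
```

The endpoints and the midpoint of the old scale map onto the endpoints and the midpoint of the new one; intermediate values become non-integers and would need a rounding rule if integer categories are required.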

Therefore, I expect that both the increase in volume and the use of differently structured data complicate the tasks of understanding and preparing data. Taking both an information system and a data analysis point of view, my proposition is to investigate what technical difficulties a researcher in social sciences experiences due to statistical software limitations. To gain this understanding, a quantitative or a qualitative approach can be used. A possible quantitative approach is, for example, to administer a survey. A possible qualitative approach is, for example, to participate in practitioners’ activities and make observations. I chose the latter option as it allows one to start exploring needs with no prior assumptions and to successively orient observation choices based on the results of the previous observation stages. In addition, one of the innovative strategies of the NCCR LIVES is to encourage interdisciplinarity. Being the only IT researcher among a large number of researchers in social sciences, my strategy was therefore to immerse myself in the role of a social scientist during the first two years of this PhD thesis to get a better understanding of the issues practitioners face in their daily work.

To this end, I started three collaborations with researchers in social sciences in three different domains: health sociology, labour sociology, and family sociology. These three collaborations have in common that they study either a situation or a group of people seen as vulnerable and that they follow a life course perspective.

Regarding data understanding, the observations led me to focus on access to data documentation within statistical software and on the use of sampling weights to pay closer attention to data representativeness. Regarding data preparation, the observations led me to focus on tools for panel data and network data. These immersions and the associated findings are introduced in Chapter 4.

1.3 Contributions

Identifying the factors that lead to experiencing vulnerability is achieved by the researcher by confronting knowledge from the literature with empirical evidence obtained by means of software and data analysis methods. The software helps the researcher to handle data, and data analysis methods help the researcher to extract relevant information from data. However, the actual understanding of what happens to the population studied and the validation of potential causal links remain the responsibility of researchers. Most confirmatory analyses rely on regression techniques that can describe relationships but do not provide certainty about the underlying causal mechanism. Therefore, drawing conclusions about the validity of the tested hypotheses is up to the researcher. Bearing that in mind, even if the methodological contributions introduced in this research work aim to identify underlying factors of vulnerability, they actually only support the researcher in this identification. The responsibility for validating which underlying factors are connected with a particular vulnerability still lies with researchers.


1.3.1 Conceptual model of the vulnerability in life courses

Firstly, the thesis provides the human vulnerability research area with a conceptual contribution. Information science and analytical methods have to be elaborated with regard to well-defined conceptual frameworks and conceptual models. However, neither a framework nor a model of vulnerability in life courses for interdisciplinary research has yet reached consensus. I contribute to the discussion by proposing a conceptual model of the diffusion of vulnerability in life courses as a dynamic process. This model is generic enough to be instantiated in various research areas of social sciences and brings conceptual and terminological clarity and consistency. However, it is not a comprehensive model, as it is based on several simplifying assumptions. In the model, an underlying factor of vulnerability refers to a set of two or more individual or environmental resources for which there exists an interaction between their possible states that, possibly combined with an interaction effect with one or more stressors, induces a change in one or more vulnerability components: exposure to stressors, sensitivity to stressors, or resilience capacity. I introduce the model in Section 3.1. I assess the model by confronting it with the other frameworks and models of vulnerability in life courses previously reviewed. Results show that the proposed model covers, to a large extent, all the other approaches proposed in the literature.

1.3.2 Methodological contribution

Secondly, the thesis provides several methodological contributions. The originality of my approach lies in its interdisciplinarity. After having anchored the question of supporting the discovery of underlying factors of vulnerability in life courses within the quantitative research workflow traditionally used in social sciences, I formulated the issue as both an information system and a data analysis issue. Then, in my second research question, I focus on the exploration of variable interactions in a context of infrequent outcomes through the use of classification tree methods.

Technically, infrequent outcomes correspond to an imbalanced data context. I answer the second research question by providing two complementary methodologies for exploring interaction effects in such a context. The first methodology focuses on the exploratory process while the second methodology focuses on the tree growing process. The first methodology consists in dynamically exploring the attribute space by combining a preliminary step of bivariate analyses, which aims to orient the exploration process based on the association strength between the descriptors and the outcome variable, and then exploring the associations by dynamically tuning the
