Modelisation and Information System Tools to Support the Discovery of Interactive Factors of Vulnerabilities in Life Courses

ROUSSEAUX, Emmanuel

Abstract

In the past decades, the life course perspective and the vulnerability framework have grown in popularity to study how risks spread across people's lives. Such studies involve complex longitudinal and network data as well as specific analysis methods. This thesis aims to help the social scientist in managing and analyzing such data. To support the provided methodological contributions, the thesis starts by setting a conceptual model of the diffusion of vulnerability along the life course. Then, the thesis develops several complementary strategies for exploring the set of vulnerability descriptive variables with the aim of identifying interaction effects, such as when the gender effect depends on age. The strategies rely on classification trees and specifically focus on unexpected interaction effects and data imbalance. In an illustrative application focusing on vulnerability to poverty, the proposed methods successfully identified an unexpected interaction effect between ego's and father's educational resources on ego's unemployment. Several of the contributions are made available to the scientific community [...]

ROUSSEAUX, Emmanuel. Modelisation and Information System Tools to Support the Discovery of Interactive Factors of Vulnerabilities in Life Courses. Doctoral thesis: Univ. Genève, 2018, no. SdS 107

DOI : 10.13097/archive-ouverte/unige:120604 URN : urn:nbn:ch:unige-1206042

Available at:

http://archive-ouverte.unige.ch/unige:120604

Disclaimer: layout of this document may differ from the published version.

Modelisation and Information System Tools to Support the Discovery of Interactive Factors of Vulnerabilities in Life Courses

THESIS

presented to the Faculté des sciences de la société of the Université de Genève

by

Emmanuel Rousseaux

under the supervision of

Prof. Giovanna Di Marzo

and

Prof. Gilbert Ritschard

for the degree of

Doctor of Social Sciences (Docteur ès sciences de la société), specialization in information systems

Members of the thesis committee:

Ms. Giovanna Di Marzo, Professor

Mr. Michel Oris, Professor, chair of the committee

Mr. Gilbert Ritschard, Professor

Mr. Boris Wernli, FORS, Université de Lausanne, Switzerland

Mr. Djamel Zighed, Agence Universitaire de la Francophonie, Quebec, Canada & Paris, France; Université de Lyon 2, France

Thesis no. 107

Geneva, 12 December 2018

[...] which are stated therein and which engage only the responsibility of their author.

Geneva, 30 January 2019.

The Dean

Bernard Debarbieux

Printed from the author's manuscript.

Contents

Abstract
Résumé
Acknowledgements

1 Introduction
1.1 Context
1.2 Research framework
1.2.1 Research issue
1.2.2 Research questions
1.2.3 Immersions as a social scientist
1.3 Contributions
1.3.1 Conceptual model of the vulnerability in life courses
1.3.2 Methodological contribution
1.3.3 Software contribution

2 Literature review
2.1 The life course and vulnerability frameworks
2.1.1 The life course perspective
2.1.2 The framework of vulnerability
2.1.3 Connection with the life course perspective
2.1.4 Conclusion
2.2 Classification tree learning with infrequent outcomes
2.2.1 Standard classification tree learning approaches
2.2.2 The class imbalance issue in classification
2.2.3 Standard approaches for dealing with class imbalance
2.2.4 Classification tree specific approaches
2.2.5 Assignment rules
2.2.6 Conclusion

3 Conceptual Model
3.1 A modelisation of vulnerability in life courses
3.1.1 Connecting life course and vulnerability terminologies
3.1.2 Vulnerability and vulnerabilization
3.1.3 Underlying factors of vulnerability
3.1.4 Manifest and latent vulnerability
3.2 Exploring underlying factors in a semi-automatic way
3.2.1 Mixing up selected weak and strong descriptors
3.2.2 Interactive exploration
3.3 Three off-centered measures designed for imbalanced data
3.3.1 Residual effect induced by the off-centering operation
3.3.2 The ratio entropy
3.3.3 The MD score
3.3.4 The MDC score
3.4 Discussion

4 Life course data management: empirical strategies
4.1 Settings
4.2 Observations
4.2.1 Practical issues in data weighting
4.2.2 Practical issues in handling life course data
4.3 Propositions
4.3.1 Supporting data weighting
4.3.2 Supporting life course data handling

5 Implementation
5.1 Development framework
5.1.1 R as the hosting software
5.1.2 R package environment
5.1.3 Object-oriented programming
5.2 Software architecture
5.2.1 Modular design
5.2.2 The package Rsocialdata
5.2.3 The Rsocialdata.panel extension
5.2.4 The Rsocialdata.network extension
5.2.5 The package Trim
5.3 Implementation of the methodological contributions
5.3.1 Mixing up selected weak and strong descriptors
5.3.2 Interactive exploration
5.3.3 Off-centered measures designed for imbalanced data

6 Assessment
6.1 A model of vulnerability in life courses
6.1.1 Assessment setting
6.1.2 Results
6.1.3 Conclusion
6.2 Identifying underlying factors in a semi-automatic way
6.2.1 Assessment setting
6.2.2 Results
6.2.3 Conclusion
6.3 Three off-centered measures designed for imbalanced data
6.3.1 Assessment setting
6.3.2 Results
6.3.3 Conclusion

7 Conclusion

List of tables
List of figures
References

Abstract

The present Ph.D. thesis in information systems aims to provide social science research with a conceptual model and the associated analytic tools to support the discovery of interactive factors of vulnerability in life courses. The past decades have seen several social changes emerge, such as the individualization of society, growing inequalities due to social stratification, and the labor market's increased demand for flexibility and personal engagement. These social changes place individuals at risk of experiencing undesired situations such as social exclusion, poverty, or stress. To study how such risks spread across people's lives, the life course perspective and the vulnerability framework have recently grown in popularity. The life course perspective provides a conceptual structure for considering how past events in an individual's life history explain and influence future life. The framework of vulnerability provides a conceptual structure to study, as a dynamic process, how a system is exposed to and recovers from perturbations. An effective strategy to overcome social vulnerability is to set up social policies targeting vulnerable people. To do so, a preliminary step is to understand which factors are connected with vulnerability outcomes. The thesis addresses this issue by focusing on the discovery of underlying factors of vulnerability in life courses. In this context, the term underlying factor refers to an independent variable that is not available in the data as a single variable. Instead, underlying factors have to be built up by researchers in social sciences by combining several available variables. This operation often involves interaction effects between variables.
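To make the notion of an interaction effect concrete, the following minimal Python sketch uses purely hypothetical counts (the factor names, levels, and numbers are illustrative assumptions, not data from the thesis). It shows a configuration in which neither factor has a marginal effect on the outcome rate, yet the effect of one factor reverses depending on the level of the other.

```python
# Hypothetical 2x2 table: (factor_a, factor_b) -> (group size, cases with outcome).
# All names and counts are illustrative assumptions, not thesis data.
COUNTS = {
    ("low", "low"): (100, 10),
    ("low", "high"): (100, 30),
    ("high", "low"): (100, 30),
    ("high", "high"): (100, 10),
}


def outcome_rate(cells):
    """Share of cases with the outcome, pooled over the given cells."""
    total = sum(COUNTS[c][0] for c in cells)
    cases = sum(COUNTS[c][1] for c in cells)
    return cases / total


def marginal_rates(position):
    """Outcome rate per level of one factor (0 = factor_a, 1 = factor_b)."""
    levels = sorted({cell[position] for cell in COUNTS})
    return {
        level: outcome_rate([c for c in COUNTS if c[position] == level])
        for level in levels
    }


def interaction_contrast():
    """Difference of differences on the rate scale.

    Effect of factor_b when factor_a is high, minus the effect of factor_b
    when factor_a is low. A value of zero means no additive interaction.
    """
    effect_when_a_high = outcome_rate([("high", "high")]) - outcome_rate([("high", "low")])
    effect_when_a_low = outcome_rate([("low", "high")]) - outcome_rate([("low", "low")])
    return effect_when_a_high - effect_when_a_low


if __name__ == "__main__":
    print("marginal rates, factor_a:", marginal_rates(0))
    print("marginal rates, factor_b:", marginal_rates(1))
    print("interaction contrast:", round(interaction_contrast(), 3))
```

In such a configuration, screening each variable separately against the outcome would show no association at all, which illustrates why the variables have to be examined jointly to build an underlying factor.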

When addressing an information system issue, a first step is to take a holistic view of the activity the system is designed to support. Therefore, my first research question addresses the issue of integrating the life course and vulnerability frameworks to put forward, under some simplifying assumptions, a conceptual model of the process of vulnerability in life courses. Such a model is primarily expected to clarify what studying vulnerability means in practice, such as the difference between studying poverty and studying vulnerability to poverty. As a result, such a model would serve as a basis and a shared vocabulary for researchers in information systems interested in proposing strategies to address vulnerability in life courses. My second research question focuses on the use of classification tree methods for identifying interaction effects. As I postulate that, in modern societies, vulnerability leads to life situations that are infrequent compared to non-vulnerable situations, I focus on identifying interaction effects in a context of class imbalance.

After reviewing the literature of both the life course and vulnerability research fields, I answer the first research question by providing, under some simplifying assumptions, a conceptual model of the diffusion of vulnerability in life courses as a dynamic process. This approach uses the term vulnerability factor to bridge the life course and vulnerability frameworks. In this model, an underlying factor of vulnerability refers to a set of two or more individual or environmental resources for which there exists an interaction between their possible states that, possibly combined with an interaction effect with one or more stressors, involves a change in one or more vulnerability components: exposure to stressors, sensitivity to stressors, or resilience capacity. I assess this model by confronting it with the other frameworks and models of vulnerability in life courses previously reviewed. Results show that the proposed model covers, to a large extent, all the other approaches.

After reviewing the literature on growing classification trees in an imbalanced data context, I answer the second research question by providing two complementary methodologies for exploring the attribute space and identifying underlying interaction effects with imbalanced data. The first methodology focuses on the exploratory process, while the second focuses on the tree growing process. The first methodology uses a preliminary bivariate analysis step to separate strong from weak associations with the outcome variable, and then explores those associations by dynamically tuning the classification tree growing parameters through an interactive graphical interface. The second methodology aims to better separate vulnerability outcomes in an imbalanced data context. To do so, I put forward three classification tree growing measures designed for this purpose. I assess the first methodology by conducting an exploration of underlying factors of vulnerability on real data. The study focuses on vulnerability to poverty by targeting underlying factors of unemployment among young adult second-generation immigrants in Switzerland. The methods successfully identified an unexpected interaction effect between ego's and father's educational resources on ego's unemployment. This result has been published in Guarin and Rousseaux (2017). I assess the second methodology by comparing the performance of the proposed tree growing measures to the main data imbalance strategies discussed in the literature review. To this end, I conduct experiments on several datasets from the "UCI Machine Learning Repository" (Dheeru and Karra Taniskidou, 2017). The results tend to indicate that my contributions are able to achieve the same performance as the other strategies while building shallower trees. Such a feature is valued by social scientists, who are more interested in the interpretability of results than in raw prediction performance when using classification trees.

In a quantitative research process, the data exploration stage comprises three steps: data understanding, data preparation, and exploratory analysis. Both research questions focus on the exploratory analysis step. To address the data exploration stage globally, I investigate what improvements can be put forward regarding data understanding and data preparation. I carry out this investigation by immersing myself in the role of a social scientist, working as a coauthor on three real studies, respectively in health sociology (Cullati et al., 2014), labor sociology (Guarin and Rousseaux, 2017), and family sociology (work still in progress). During these immersions, I provided several methodological strategies that could improve access to data documentation within the statistical software and draw more attention to data representativeness. These methodological strategies were not formally assessed but are available for testing purposes in the software this thesis releases.

This Ph.D. thesis brings several contributions. Firstly, the thesis provides the human vulnerability research area with a conceptual model of the diffusion of vulnerability in life courses as a dynamic process. Secondly, the thesis provides researchers in social sciences with two complementary methodologies that facilitate the exploration of interaction effects in a context of class imbalance and aim to provide researchers with sets of relevant underlying factors in a semi-automatic way. The conceptual and methodological contributions combined represent my response to the issue of supporting researchers in social sciences in discovering vulnerability factors in the life course. In addition to these contributions, the thesis provides software contributions. The package Rsocialdata provides researchers in social sciences with tools for handling cross-sectional survey data in R and provides generic data structures that can be extended to more complex survey data types. As a key feature, it allows researchers to store, document, and prepare survey data. It provides tools for effectively exploring and recoding data, allowing researchers to proceed more quickly to analyses. The package Rsocialdata.panel extends the package Rsocialdata to provide ad hoc tools for effectively preparing panel survey data. The Rsocialdata.network package also extends the Rsocialdata package to provide ad hoc tools for effectively preparing egocentric network survey data. The package Trim provides an implementation of the main off-centered entropy measures as well as the measures introduced in this thesis as new methodological contributions.

Résumé

This doctoral thesis in information systems aims to provide social science research with a conceptual model and the associated analytical tools to support the discovery of interactions between vulnerability factors in life course studies. Recent decades have seen the emergence of several social changes, such as the individualization of society, growing inequalities due to social stratification, and the labor market's increased demand for flexibility and personal engagement. These social changes expose individuals to undesirable situations such as social isolation, poverty, or stress. To study how these risks spread through individuals' lives, the life course approach and the vulnerability framework have recently gained in popularity. The life course approach provides a conceptual structure for examining how past events in an individual's life can explain and influence future life. The vulnerability framework provides a conceptual structure for studying, as a dynamic process, how a system is exposed to perturbations and adapts to them. An effective strategy for overcoming social vulnerability is to set up social policies targeting vulnerable people. To do so, a preliminary step is to understand which factors are linked to the manifestations of vulnerability. This thesis addresses this problem by focusing on the discovery of underlying factors of vulnerability in life courses. In this context, the term underlying factor refers to an independent variable that is not available in the data as a single variable. Instead, underlying factors must be constructed by social science researchers by combining several variables available in the data. This operation often involves interaction effects between variables.

When addressing an information system problem, a first step is to model the activity that the system must support. Thus, my first research question addresses the joint integration of the conceptual frameworks of the life course and of vulnerability in order to propose, under some simplifying assumptions, a conceptual model of the process of vulnerability in life courses. Such a model could serve as a common basis for information systems researchers interested in proposing strategies to address vulnerability in life courses. My second research question concerns the use of decision tree methods to explore interaction effects. Postulating that, in modern societies, situations of vulnerability are less frequent than situations of non-vulnerability, I focus on exploring interaction effects in a context of imbalanced two-class data.

After reviewing the literature on the life course approach and on vulnerability respectively, I answer the first research question by providing, under some simplifying assumptions, a conceptual model of the diffusion of vulnerability in life courses as a dynamic process. This approach uses the term vulnerability factor to bridge the life course and vulnerability frameworks. In this model, an underlying factor of vulnerability refers to a set of two or more individual or environmental resources for which there exists an interaction between their possible states that, possibly combined with an interaction effect with one or more stressors, involves a change in one or more components of vulnerability: exposure to stressors, sensitivity to stressors, or resilience capacity. I assess this model by confronting it with the other frameworks and models of vulnerability in life courses discussed in the literature. The results show that the proposed model covers, to a large extent, all the other proposed approaches.

After reviewing the literature on decision tree methods in an imbalanced data context, I answer the second research question by providing two complementary methodologies for exploring interaction effects in this context. The first methodology focuses on the exploratory process, while the second focuses on the tree growing process. The first methodology consists of using a preliminary bivariate analysis step to separate strong from weak associations with the dependent variable, and then exploring these associations by dynamically adjusting the decision tree growing parameters through an interactive graphical interface. The second methodology consists of better predicting the reification of vulnerabilities in an imbalanced data context. To do so, I propose three decision tree growing measures. I assess the first methodology by conducting an exploration of underlying vulnerability factors on real data. The study focuses on vulnerability to poverty by targeting the underlying factors of unemployment in a population of young adult second-generation immigrants in Switzerland. The proposed methodology made it possible to extract an underlying vulnerability factor related to the educational resources of the individual and of the individual's father. This result has been published in Guarin and Rousseaux (2017). I assess the second methodology by comparing the performance of the proposed tree growing measures with the main strategies discussed in the literature review. To do so, I carry out experiments on several datasets from the "UCI Machine Learning Repository" (Dheeru and Karra Taniskidou, 2017). The results tend to indicate that my contributions are able to achieve the same performance as the other strategies while building shallower trees. Such a characteristic is favored by social science researchers, who are more interested in the interpretability of results than in raw performance where decision trees are concerned.

In a quantitative research process, the data exploration phase consists of three steps: data understanding, data preparation, and exploratory analysis. The two research questions I posed focus on the exploratory analysis step. To address the data exploration phase globally, I studied the improvements that can be made regarding data understanding and data preparation. To this end, I immersed myself in the role of a social science researcher by working as a coauthor on three real studies, respectively in health sociology (Cullati et al., 2014), labor sociology (Guarin and Rousseaux, 2017), and family sociology (study still in progress). During these immersions, I provided several methodological strategies that could improve access to data documentation within statistical software and draw more attention to data representativeness. These methodological strategies have not been formally assessed but are available for testing purposes in the software developed during this thesis work.

This doctoral thesis makes several contributions. First, the thesis provides the research field of human vulnerability with a conceptual model of the diffusion of vulnerability in life courses as a dynamic process. Second, the thesis provides social science researchers with two complementary methodologies that facilitate the exploration of interaction effects in a context of imbalanced two-class data and the semi-automatic detection of sets of potential underlying factors of vulnerability. These conceptual and methodological contributions combined form my response to the issue of supporting social science researchers in discovering vulnerability factors in life courses. In addition to these contributions, this thesis also provides software contributions. The Rsocialdata package provides social science researchers with tools for managing cross-sectional survey data in R and provides generic data structures that can be extended to more complex survey data types. It allows survey data to be stored, documented, and prepared. It also provides tools for efficiently exploring and recoding data, which allows researchers to start their analyses more quickly. The Rsocialdata.panel package extends the Rsocialdata package to provide specific tools for efficiently preparing panel survey data. The Rsocialdata.network package also extends the Rsocialdata package to provide specific tools for efficiently preparing egocentric network survey data. The Trim package provides an implementation of the main off-centered entropy measures as well as the measures introduced as new methodological contributions.

Acknowledgements

First of all, I would like to thank Gilbert Ritschard for giving me the opportunity to begin this thesis work and for co-supervising it. Gilbert, thank you for the guidance you gave me throughout this project and for your unfailing and kind support. Thank you also for your meticulous and insightful readings, which often compelled me to question my work and thereby allowed me to steer it in the right directions. At your side, I was also able to appreciate the full meaning of rigor and precision, and I thank you for that.

I would also like to thank Giovanna Di Marzo for co-supervising this thesis work. Giovanna, thank you for your follow-up throughout this project and particularly for your support in structuring my work. Thank you also for sharing with me, with a teacher's care, your vision of information systems and for guiding me in understanding what is meant by, and consequently what is expected of, a thesis in information systems.

I also extend my thanks to Michel Oris, who did me the honor of chairing my thesis committee. Michel, thank you for your constructive criticism of my conceptual model, and in particular of my proposed modelisation of vulnerability within life courses. I would also like to thank you for the various interdisciplinary environments that you actively helped to set up at the Université de Genève and in French-speaking Switzerland. It is clear that the work I present here sprouted in the fertile soil of this ecosystem.

My thanks also go to Djamel Zighed and Boris Wernli, who did me the honor of serving on the committee for this thesis. Djamel, through your master's course on classification methods, and in particular tree methods, you significantly inspired this thesis project, and I thank you for that. I also thank you for all the exchanges we had, through which you shared with me your broad knowledge of the various methodologies of modeling and knowledge extraction. Boris, I sincerely thank you for your meticulous reading of my work and for the enthusiasm you brought to it. Thank you also for your many wise pieces of advice and suggestions, which allowed me to clarify my argument and refine the scope of several of my propositions.

In addition, I thank the doctoral students and researchers of the NCCR LIVES and of the SES, SdS, and GSEM faculties of the Université de Genève for all the exchanges, the discussions, the questions asked during my presentations, and the enthusiasm shown, which motivated me to move forward. In particular, I thank Stéphane Cullati, Andrés Guarin, and Myriam Girardin for the time they devoted to me and for the rich discussions we had, discussions that actively helped shape my research project.

I cannot conclude without warmly thanking my family and loved ones, who supported and encouraged me throughout this project. I especially thank my parents, Jean-Louis and Francette, and my brother, Jean-François, for having cultivated my taste for learning and for having made me want to keep deepening my knowledge. I also deeply thank my partner, Delphine, for having accompanied and supported me over these last years and for having given me an environment conducive to completing this thesis work.

This thesis has been an exceptional and extremely enriching experience. To all those who made it possible for me to live it, I offer my most sincere thanks.

This publication benefited from the support of the Swiss National Centre of Competence in Research LIVES - Overcoming vulnerability: life course perspective (IP214), funded by the Swiss National Science Foundation. The author thanks the Swiss National Science Foundation for its financial support.

To my parents and my brother.

Chapter 1

Introduction

1.1 Context
1.2 Research framework
1.2.1 Research issue
1.2.2 Research questions
1.2.2.1 Expliciting a model of vulnerability in life courses
1.2.2.2 Identifying interaction effects with infrequent outcomes
1.2.3 Immersions as a social scientist
1.3 Contributions
1.3.1 Conceptual model of the vulnerability in life courses
1.3.2 Methodological contribution
1.3.3 Software contribution

1.1 Context

Individual life is a dynamic process affected by a multitude of events. Some life events may have positive consequences for individuals; others may have negative consequences. Life events also differ in terms of their impact on individual lives. Buying a new sofa or celebrating New Year's Eve are life events that are expected to have minor impacts on the individual life path. Conversely, moving to a new country, becoming a parent, or falling into long-term unemployment are life events that are expected to have significant impacts on the individual life path.

Depending on an individual's resources, such as economic resources, social capital, physical health, psychological resources, or social support, the same life event will not have the same consequences. For example, a short-term loss of employment may be easier to face for a married couple with no children than for a single woman with two dependent children. Resources themselves shape the kind of events individuals are likely to experience by moderating risks and opportunities. For example, one of the findings of the present thesis is that, depending on the educational level of the individual, the educational level of the father may turn out to moderate the individual's ability to find employment in early adulthood (Guarin and Rousseaux, 2017).¹

In addition, the timing of an event has a significant influence. Depending on its timing in the life path, a life event will not trigger the same consequences. For example, let us consider the impact of a leg fracture caused by a domestic accident. It can be assumed that for young people this event will not have significant consequences: a few weeks of rest are often sufficient to recover. But for elderly people, recovering from such an accident is often harder. In the worst cases, this domestic accident can precipitate the individual's transition to a loss of independence.

The past four decades have been characterized as a period of growing uncertainty (Beck, 1992) in which new social risks emerged, such as (1) family discontinuities and the labour market's increased demand for flexibility and personal engagement, (2) the individualization of society, which places individuals under high and continuous pressure to make the right choices for their own lives, (3) the diffusion of stress across life domains and between related individuals in a context of contingent work life courses, (4) growing inequalities related to social stratification, such as the "working poor", and (5) the emergence of new social risks that disproportionately affect specific sub-populations such as young adults or female-headed households (Spini et al., 2013). According to the authors, vulnerability in these "risk" or "uncertain" societies is a growing concern for individuals, political leaders, and academics.

The Swiss National Centre of Competence in Research (NCCR) LIVES “Overcoming vulnerabilities: Life Course Perspectives” began in 2011. The NCCR LIVES aims to better understand the phenomenon of vulnerability using a longitudinal and comparative approach. It also focuses on the means to overcome vulnerability so as to contribute to the emergence of innovative social policy measures.

1 This result has been shown on a population of second-generation immigrants in Switzerland. Second-generation migrants refer to children of immigrants who were educated and socialized in their parents’ host country. Please refer to Guarin and Rousseaux (2017) for detailed information.

This Ph.D. thesis in information systems was conducted within the NCCR LIVES. The Centre of Competence LIVES is divided into several thematic teams, and I worked within the methodological individual project (IP) 14/214. The primary objective of the methodological IP is to provide researchers in social sciences with effective tools for measuring vulnerability and for exploring, visualizing and analyzing life course data. This IP is titled “Measuring vulnerability” and is led by Gilbert Ritschard.

Within this team, my thesis project is to investigate methodological strategies that could be set up to support researchers in social sciences in discovering interactive factors of vulnerability in life courses. Having a background in both information systems and quantitative methods, my strategy was to draw on both approaches in this study. I also conceived this work from an interdisciplinary perspective firmly anchored in the human sciences. In this sense, the present work takes an enterprise architecture approach by integrating business strategic objectives and business processes in the modelisation of the information system. More specifically, I consider in this work the information system five-layer model of Longépé (2009), described in Table 1.1.

1.2 Research framework

1.2.1 Research issue

The thesis focuses on the factors that are connected with a particular vulnerability, which I call here factors of vulnerability.2 Interpreted literally, the term factor of vulnerability refers to an observable situation (factor) that (a) increases or decreases the likelihood of experiencing a particular vulnerability or (b) changes the level of vulnerability of an already vulnerable individual. For example, the factor “type of employment contract” is a potential factor of vulnerability to unemployment, as one of its possible values, “fixed-term contract”, may increase the probability of experiencing unemployment.

As Spini et al. (2013) note, vulnerability pertains to the interaction of individual and contextual dimensions. For example, having a fixed-term contract in times of full employment does not per se make individuals vulnerable to unemployment. Therefore, a better factor of vulnerability would result from the interaction between the individual resource “type of employment contract” and the contextual resource “unemployment rate of the area”.

In addition, as Adger (2006) notes, social processes are complex and with many linkages that are difficult to pin down. Therefore, I postulate that most of the relevant factors of vulnerability lie in the interaction of several variables.

But factors of vulnerability resulting from interaction effects are more difficult for researchers in social sciences to identify, as they are not directly available as a single variable in the data. This is especially the case when factors of vulnerability result from high-level interactions, such as those involving more than three variables. In addition, interaction effects may be buried under the main effects of some strong covariates. For those reasons, I refer to such factors as underlying factors of vulnerability.

2 I formalize the definition of a factor of vulnerability in Section 3.1.1.

Table 1.1 – Information system five-layer model of Longépé (2009).

Layer: Description

Strategic: The strategic layer defines the business objectives of the company. The business objectives concern both external objectives, such as new services to be provided or performance objectives to be achieved, and internal objectives, such as organizational changes or reducing operating costs.

Process: The process layer defines the different activities required to achieve the business objectives defined in the strategic layer, as well as the respective functions and skills needed to complete each activity. The activities are organized and orchestrated together by identifying both their sequence and sequencing conditions.

Functional: The functional layer defines the hierarchical structure of the different functions performing the activities defined in the process layer. For example, the responsibilities of the financial function include invoice and payment management, accounting management, and budget planning. This general function is broken down into more specific sub-functions related to each of the activities to be executed.

Applicative: The applicative layer defines the operations that have been automated by software units. Software units include the applications and services used by the functions of the company as well as the libraries that allow applications to work. The dependency relationships between each software component as well as data structures and data flows are modeled.

Infrastructure: The infrastructure layer defines the set of hardware resources required for the proper execution of the software units. The hardware resources include data storage components, network components, and physical servers. The physical storage locations of each element are referenced, and the connections between each component are modeled. Each software unit is linked to the hardware it requires.

The thesis addresses the issue of the discovery by researchers in social sciences of the underlying factors associated with vulnerability.

As a starting point, it has to be asked where this discovery stage takes place in the quantitative research process followed by researchers in social sciences. From a general point of view, all scientific research is an iterative process of observation, rationalization, and validation (Bhattacherjee, 2012). The observation phase consists in observing a natural or social phenomenon, event, or behavior that deserves consideration. The rationalization phase consists in logically connecting the different pieces of the puzzle that have been observed and integrating them into an existing theory with the purpose of building a new one that includes additional hypotheses.

Finally, the validation phase consists in testing the new theory by using a scientific method through a process of data analysis and, in doing so, possibly validating the new theory. The first two steps of observation and rationalization call for both inductive and deductive reasoning. Inductive reasoning takes place when starting from some observations and then attempting to rationalize them. Deductive reasoning takes place when starting from an ex ante rationalization or theory and attempting to integrate new hypotheses based on the observations. By conducting both stages in parallel or iteratively, the researcher ends with a conceptual model. To validate the model empirically, data are required. To acquire the data needed for this stage, either new data are collected or existing, previously collected data are retrieved. Especially when the study focuses on a large sample of individuals such as a national or international population, as in demography and sociology, researchers opt for data collected by a professional or national survey institution. In this case, a first stage for the researcher is to understand the data in order to prepare them correctly (Wirth and Hipp, 2000). Once the data are prepared in the statistical software, one could go directly to confirmatory analyses. But in the context of looking for relevant factors of vulnerability, it is much more appropriate to go through an exploratory analysis. Here, exploratory analysis must not be confused with descriptive analysis. While performing descriptive statistics is a passive stage limited to extracting some indicators in order to gain a first understanding of the data or to compare them with other data, exploratory analysis is an active stage involving a series of predictive analyses with a bottom-up strategy to extract the factors that most strongly affect the variables of interest (Kuonen, 2015). Although often under-used, exploratory analysis is a very important stage: as Tukey (1980) notes, new ideas come more often from previous explorations than from lightning strokes. Thanks to the exploratory analysis, the researcher may end with a refined model that goes further than the initial hypotheses. Then, the confirmatory analysis is performed on this final model.

This quantitative research process is illustrated in Figure 1.1. The issue of discovering factors of vulnerability clearly belongs to the earlier stages of this process. More precisely, I postulate that, to discover appropriate factors connected with a particular vulnerability, a researcher in social sciences has to focus on the steps A to C and E to G reported in the figure. Steps A to C refer to the observation and rationalization stages. Steps E to G refer to the data exploration stage. The thesis focuses on the data exploration stage by postulating that a more effective data exploration stage would facilitate the discovery of the underlying factors connected with the vulnerability studied. As shown by the steps E to G in Figure 1.1, this stage involves understanding data, preparing data, and the use of exploratory analyses.

Figure 1.1 – An example of traditional quantitative research process based on Bhattacherjee (2012) and Wirth and Hipp (2000). Stages of concern in the context of discovering some factors of vulnerability are reported in red.

The core of the thesis focuses on the exploratory step (G/G’) in the context of the discovery of underlying factors of vulnerability in life courses. I address this issue through two research questions. As information systems are designed to support processes, my first research question focuses on the possibility of formalizing a model of the process of vulnerability in life courses. To facilitate the design of information system strategies, the objective pursued is to provide researchers with an operationalizable model of vulnerability in life courses. I introduce this first research question (Q1) in Section 1.2.2.1. Once such a model is set up, my second research question focuses on supporting researchers in social sciences in exploring variable interactions to identify potential factors of vulnerability. As I postulate that outcomes resulting from vulnerable situations may be infrequent in a general population, I focus on identifying interaction effects with infrequent outcomes. I introduce this second research question (Q2) in Section 1.2.2.2.

But to address the issue globally, I also aim to investigate the data understanding step (E) and the data preparation step (F). The NCCR LIVES brings together a large number of researchers in social sciences. Taking advantage of this opportunity, my strategy is to immerse myself in the role of a social scientist by collaborating on real social science studies and to observe what technical difficulties researchers experience when understanding and preparing data. In this second line of work, my aim is not to assess but to identify and explore some strategies that could be studied in future work. Therefore, they will not be discussed in the assessment section of the thesis (Section 6). Instead, I made some of these strategies available to the scientific community by adding them to the software developed for the assessment of my conceptual model (Section 5). I introduce these immersions in Section 1.2.3.

1.2.2 Research questions

1.2.2.1 Making explicit a model of vulnerability in life courses

The life course perspective, also known as the life course approach, is a theoretical model that looks at how chronological age, relationships, common life transitions, and social change shape people’s lives from birth to death (Hutchison, 2010). The life course approach especially focuses on individuals’ life histories to explain how early events influence future decisions and events, such as marriage and divorce (White and Klein, 2008, p. 122). In its simplest form, the life course of an individual can be defined as “a sequence of socially defined events and roles that the individual enacts over time” (Giele and Elder, 1998). In this context, examining the life course is about analyzing changes (Hendricks, 2012). In addition, a key point of the life course approach is its focus on the connection between individuals and the historical and socioeconomic context in which these individuals lived (Elder et al., 2003). Indeed, although time and individual characteristics are two significant dimensions of human behavior, the environment in which the person lives also plays a part (Hutchison, 2010).

One would wish every individual the happiest life possible. However, life is not a long quiet river, and most individuals encounter difficulties in their life course. These difficulties often occur when experiencing changes and transitions that are a source of vulnerability (Fisher and Hood, 1987). They can emerge from an unfortunate combination of circumstances related, for example, to an undesired interaction between historical time and one’s personal time, or to insufficient support within one’s micro-social environment. Through social policies, societies aim to provide resources both for protecting individuals against the onset of negative events or transitions and for helping individuals to recover once they have fallen into a negative or undesired situation. To set up effective social policies, it is necessary to figure out which individuals are at risk and what kind of help these people would need to be able to stay on a happy trajectory.

To understand how to help individuals experiencing difficulties, a first approach is to analyze the negative state in itself. For example, regarding poverty, a strategy could be to examine what elements differentiate poor from non-poor people. But to go further, let us note that some individuals may be at high risk of falling into poverty without the poverty yet being observable. For example, a family with four children and a mortgage, whose two parents were working in the same company and are now both unemployed because the company was wound up the previous year, is at risk of financial hardship. They may be able to get by if they find a job quickly enough, but if the situation persists, it is likely that the financial situation will deteriorate. In such a configuration the family may need assistance to overcome this difficult situation. This means that studying the observable state of poverty is not sufficient; the risk of falling into this state also has to be addressed.


In addition, when facing adversity, some individuals are able to find solutions for getting out of the undesired situation they fell into, while others are stuck in this situation and still others fall further. This variability can be related to the different attitudes and behaviors that individuals adopt and to the individual resources that they effectively mobilize. In our previous example, if both parents have a strong professional social network, they may be able to find employment again relatively quickly. The capacity for resilience to a negative situation should also be studied to understand who will be the most vulnerable, so that social policies target them first.

Such a global consideration suggests that experiencing a negative life course event or transition should be seen as a dynamic process. Moser (1998) distinguishes between poverty and vulnerability to poverty, the former being a static concept while the latter should be able to capture change processes. To capture and formalize this dynamic process, the theoretical framework of vulnerability has, in recent years, increasingly been used in various fields of social science research.

However, although some basic concepts are common to most conceptualizations of vulnerability, there is currently no consensus on a formal definition of the concept of vulnerability in life courses (Schröder-Butterfill and Marianti, 2006; Spini et al., 2013). In particular, the life course perspective and the framework of vulnerability each have their own set of concepts. On the one hand, the life course perspective refers to concepts such as roles, life events, transitions, trajectories and life courses. On the other hand, the framework of vulnerability refers to concepts such as stressors, outcome, resources, exposure, sensitivity and resilience. My first research question is to study the possibility of integrating both terminologies and putting forward a conceptual model of the process of vulnerability in life courses. Adopting an operational perspective, such a model is intended to clarify the practical considerations connected with studying vulnerability and to support researchers interested in proposing strategies to address vulnerability in life courses. However, integrating all the concepts and processes that vulnerability in the life course implies and modeling them in a single system is an exercise that cannot be carried out holistically. Some simplifying assumptions have to be made to limit the scope of the study. In the present work, I limit the scope of the study to the following framework:

1. Vulnerability is defined with regard to one undesired life state (subsequently called outcome) instead of considering simultaneously the set of all possible undesired life states that could be associated with the study of human vulnerability in a broad sense. Therefore, the proposed model does not provide a holistic model of vulnerability in the life course. Three reasons lead me to consider vulnerability to a specific outcome: (a) I consider that providing a holistic model of vulnerability in the life course is a very complex task that is beyond the scope of a thesis in information systems, (b) the perspective I adopt with regard to human vulnerability is not so much to acquire a theoretical understanding as to acquire a practical understanding that enables social actors to act. As social actors are usually organized so that each targets a specific outcome (poverty, depression, etc.), it makes sense to adopt such a perspective in the present work, and (c) I consider that one can be vulnerable to some outcome without being vulnerable to another one. For example, one may be vulnerable to depression while not being vulnerable to poverty. Of course, both can be linked. In particular, depression can lead to losing one’s job. But my way of thinking about this connection is to consider that depression (or, to formulate it as a resource, the state of psychological health) is a factor of vulnerability to poverty.

2. Vulnerability is studied from an operational perspective. In particular, I aim to put forward a model that is relatively simple to instantiate for studying a particular vulnerability, even if such an objective may require some technical simplifications. As a result, the proposed model may be limited in its capacity to consider the various sociological models that address the study of human vulnerability.

3. As the present work targets vulnerability in the life course, the outcome is assumed to belong to one of the possible states individuals can experience in their life course.

4. The present work considers the environmental dimensions of the life course, including temporal dimensions. Therefore, the present work adopts a linear perspective of vulnerability as proposed by Spini et al. (2013, 2017). Such a perspective contrasts with circular models of vulnerability, like, for example, those used to model stress processes (Almeida, 2005).

5. Addressing the identification of risk factors that underlie the risk factors identified by conventional analyses, the present work puts an emphasis on the identification of interaction effects between two or more explanatory variables (manifest or latent). This research question is detailed in Section 1.2.2.2.

6. Vulnerability is considered from the perspective of risk. However, risk is here considered from a three-dimensional perspective.3 To make a clear distinction with a one-dimensional definition of risk, I use the term factor of vulnerability instead. This term will also allow me to incorporate the notion of interaction effects.

7. Vulnerability outcomes, such as poverty, exclusion or mental disorder, are expected to be less frequent than other life situations. This assumption is detailed in Section 1.2.2.2.

To address this research question, I conducted a literature review on both the life course perspective and the framework of vulnerability. This literature review is presented in Section 2.1.

1.2.2.2 Identifying interaction effects with infrequent outcomes

In data analysis or statistics, when analyzing the effect of some explanatory factors, also called predictors or independent variables, on a variable of interest, also called dependent variable, response, or predicted variable, the term interaction refers to a situation in which the impact on the dependent variable of one independent variable is different depending on the values of another independent variable. Another possible formulation is that the effect of one independent variable on the dependent variable is not the same at all levels of the other independent variable.

3 I formalize these dimensions in Section 3.1.1.

At the beginning of this chapter, I implicitly introduced an interaction effect when discussing the example of the leg fracture caused by a domestic accident. For young people, it can be assumed that such an event will not have significant consequences for the individual: a few weeks of rest are often sufficient to recover. But for elderly people, recovering from such an accident is often harder. In the worst cases, this domestic accident can precipitate the transition of the individual to the loss of independence. This means that there exists an interaction between the variable “experiencing a leg fracture” and age.
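The leg fracture example can be made concrete with a small numerical sketch. The proportions below are invented purely for illustration (they come neither from the thesis nor from any survey), and the names `risk` and `fracture_effect` are hypothetical. The sketch compares the effect of a fracture within each age group; a non-zero difference between these two effects is what the text calls an interaction.

```python
# Hypothetical proportions of individuals losing their independence
# within a year. These values are invented for illustration only.
risk = {
    ("young", "no fracture"): 0.01,
    ("young", "fracture"): 0.02,
    ("elderly", "no fracture"): 0.05,
    ("elderly", "fracture"): 0.30,
}


def fracture_effect(age_group):
    """Difference in risk between fracture and no fracture in one age group."""
    return risk[(age_group, "fracture")] - risk[(age_group, "no fracture")]


effect_young = fracture_effect("young")
effect_elderly = fracture_effect("elderly")

# Interaction contrast on the additive scale: it would be zero if the
# effect of a fracture were the same at every level of age.
interaction = effect_elderly - effect_young

print(f"effect among the young:   {effect_young:.2f}")
print(f"effect among the elderly: {effect_elderly:.2f}")
print(f"interaction contrast:     {interaction:.2f}")
```

With these assumed figures the fracture raises the risk by one percentage point among the young and by 25 points among the elderly, so neither variable alone summarizes the situation.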

From a more general point of view, there are two reasons for which I expect the study of vulnerability in life courses to involve interaction effects.

Firstly, in the life course perspective, individuals are studied by taking into account various aspects of their lives, including biological, psychological and social characteristics. In addition, individuals are not studied in isolation but as entities embedded in several contextual environments, including a micro-social environment, a macro-social environment, a geographical environment and several dimensions of time, including historical time, individual time and social time. Such a holistic approach involves the use of a larger number of variables in the analysis. By design, increasing the number of variables increases the number of possible interaction effects.
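To give an order of magnitude for this growth, the following sketch counts the candidate interactions among p variables, assuming that an interaction of order k is simply an unordered set of k distinct variables, so that there are C(p, k) of them. The function name `count_interactions` is hypothetical. The count ignores the number of modalities of each variable, which would increase the number of parameters further.

```python
from math import comb


def count_interactions(n_variables, max_order):
    """Count the sets of 2 to max_order distinct variables among n_variables."""
    return sum(comb(n_variables, k) for k in range(2, max_order + 1))


# Candidate two-way interactions, then two-way and three-way together.
for p in (5, 10, 20):
    print(p, count_interactions(p, 2), count_interactions(p, 3))
```

Under this assumption, moving from 5 to 20 variables takes the number of candidate two-way and three-way interactions from 20 to 1330.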

Secondly, as the title of the NCCR LIVES project “Overcoming vulnerability: life course perspective” suggests, the overall objective of studying vulnerability in life courses is not only to understand vulnerability better but also to find solutions to avoid or reduce it. On a data analysis level, a possible strategy for doing so is to look for interaction effects between a previously identified factor of vulnerability and another covariate that makes it possible to neutralize or reduce the outcomes related to the factor of vulnerability.

In addition, it is relevant to note that outcomes resulting from vulnerable situations are possibly infrequent, and sometimes rare, in a general population. Indeed, modern societies are organized around a number of laws and public institutions that aim to protect individuals against a number of life hazards. For instance, labor laws provide individuals with support to prevent them from falling unexpectedly into unemployment. National and regional employment offices provide unemployed individuals with financial support and accompany them in the steps of finding a new professional position. This organization of society minimizes, to some extent, the risk of falling into undesired life situations. As a fortunate consequence, outcomes resulting from vulnerability states are often experienced by a small proportion of the population. For instance, the unemployment rate of economically active youth living in Switzerland and aged 20 to 24 has been shown to be, depending on the country of origin, between 3% and 11% in 2000 (Fibbi et al., 2006). The share of the Swiss resident population in private households below the absolute poverty threshold in terms of disposable income (i.e. the gross household income minus compulsory expenditure such as social insurance contributions, taxes, basic health insurance premiums, alimony and other maintenance payments) has been estimated by the Federal Statistical Office to be about 6.6% in 2014 (Swiss Federal Statistical Office, 2016). In the United States, the probability of developing invasive cancer for cancer-free citizens aged 40 to 59 has been estimated by the American Cancer Society to be about 9% in 2012 (Siegel et al., 2012). These examples show that when studying vulnerability in the general population of an industrialized country, we have to expect an underrepresentation of the outcomes experienced by the population. On an analytical level, this underrepresentation entails an imbalanced distribution of the dependent variable.4 The imbalance among classes of the dependent variable often leads to a poor prediction rate for the minority class. This issue is well known in the literature as the imbalanced data issue. However, the minority class appears in most cases to be the class of interest (the class of vulnerable individuals), and for this reason a high recall rate is desired on this class.
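The following sketch illustrates why overall accuracy can be misleading under class imbalance. It assumes a hypothetical sample of 1000 individuals in which 66 belong to the class of interest, mirroring the 6.6% poverty rate quoted above, and a naive classifier that always predicts the majority class. The helper functions `accuracy` and `recall` are written here for illustration and are not taken from any library.

```python
def accuracy(y_true, y_pred):
    """Share of individuals whose class is predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of truly positive individuals that are predicted positive."""
    true_positives = sum(
        t == positive and p == positive for t, p in zip(y_true, y_pred)
    )
    actual_positives = sum(t == positive for t in y_true)
    return true_positives / actual_positives


# Hypothetical sample: 66 vulnerable individuals (1) out of 1000.
y_true = [1] * 66 + [0] * 934
# Naive classifier that always predicts the majority class (0).
y_pred = [0] * len(y_true)

print(f"accuracy: {accuracy(y_true, y_pred):.3f}")
print(f"recall on the minority class: {recall(y_true, y_pred):.3f}")
```

In this assumed setting the classifier is right for 93.4% of individuals while identifying none of the vulnerable ones, which is why recall on the minority class is the quantity of interest here.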

However, it has to be noted that a vulnerability outcome is not always associated with an underrepresentation in data. In particular, the individualisation of society that has been taking place in the past decades leads individuals to bear challenges and failures personally more often. In addition, difficulties faced by individuals in their lives have been “democratized”. In particular, rather than being differentially distributed between social classes, the emergence of stressors and observable outcomes varies according to periods of life (Leisering and Leibfried, 2001; Spini et al., 2013, 2017; Oris, 2017).

In addition, depending on the nature of the phenomenon studied, the socio-economic parameters and the targeted population, the magnitude of the imbalance can be strongly reduced in particular situations. For example, in the past decades, Spanish unemployment rose dramatically. Observed at a reasonable level of 5 percent during the first half of the 1970s, Spanish unemployment successively increased to reach 24 percent during the 1990s (Dolado and Jimeno, 1997). In extreme situations, such as significant outbreaks, the proportion of people affected by a particular outcome can rise even higher. For example, during the Medieval Black Death, vulnerability to contracting the plague was unfortunately almost balanced in the population, as it is reported that up to 50-60 percent of the European population was killed by the disease between 1347 and 1351 (Benedictow (2004, p. 383) and DeWitte (2014)). Nowadays, comparable vulnerability rates regarding, for example, poverty or insecurity occur in specific areas, such as impoverished cities like Detroit, Michigan (C. A. Wilson, 1992) or refugee camps like Sangatte, France (Schwenken, 2014). In this context, I will make sure that the contribution I put forward for exploring variable interactions with infrequent outcomes is able to work in both balanced and imbalanced contexts. In other words, it should be insensitive to class balance.

Common methods used to identify interaction effects in classification include, among others, classification tree methods (Kass, 1980; Breiman et al., 1984), regression methods (McCullagh and Nelder, 1989; Nelder and Baker, 2004), Bayesian networks (Pearl, 1986) and association rules (Agrawal et al., 1993). In the present work, I focus on classification tree methods. I make this choice on the basis of two criteria: (1) the ability to identify interactions among a large number of variables and (2) the ability to represent the interactions identified in such a way as to allow a quick diagnosis by practitioners regarding their relevance in connection with their theoretical model.

4 Such an imbalance also affects the observation stage: as there are fewer occasions to observe such vulnerable outcomes, researchers have fewer empirical examples and pieces of evidence from which to figure out what successions of events or transitions led to the outcome.

Regression models are probably the most used tool in the social sciences, and would therefore be the natural choice for a thesis aimed at providing methodological tools for social scientists. Regression methods allow interaction effects to be highlighted provided that continuous variables are centered to avoid multicollinearity issues (Hayes, 2017). However, in a regression model, interaction effects consume a large number of degrees of freedom. Therefore, to ensure model convergence, it is necessary to limit both the number of interaction effects tested and the order of these interaction effects. For this reason, instead of using a model based on hypothesis testing, I recommend exploring the space of predictors with a statistics-free method. Also, as part of this thesis, we are interested in situations of vulnerability, and therefore the expected target variables correspond to potentially less frequent life situations (poverty, stress, exclusion, etc.). A small number of observations in the class of interest aggravates the model convergence issue and therefore further limits the number of interactions that can be tested simultaneously. This effect will be even more pronounced for predictor variables with a large number of modalities. When observations are broken down into many classes, it is more difficult to obtain significant results than when classes are correctly grouped together.

Classification trees provide a solution to this point: by performing recursive, step-by-step partitioning of the population, a large number of possible groupings are tested successively. Moreover, because the splits at a given level of the tree are built independently of each other, interaction effects can emerge.

Bayesian networks are an effective tool for identifying conditional dependencies between the set of descriptive variables and can thereby identify interaction effects. In particular, Bayesian networks can simultaneously consider expert knowledge, by acting a priori on the structure of the graph, and the empirical evidence contained in data (Heckerman et al., 1995). For their part, association rules allow associations to be identified among the frequent sets of co-occurrences of modalities according to different measures of interest. By comparing the rules to each other, and especially when the search involves both positive and negative rules (Wu et al., 2004), it is possible to identify interaction effects. However, to quickly identify an interaction effect, several types of information have to be made available to the practitioner, including how the class distribution of the target variable changes according to the values taken by the modalities of the predictor variables. The results must also be presented so as not to overwhelm the practitioner with too much information. Yet Bayesian networks and association rules produce outputs that are often difficult to interpret (Bayat et al., 2009; S. Kotsiantis and Kanellopoulos, 2006). Regression models are also more complex to read in a multinomial context.

In contrast, decision trees can render both the splits and the distribution of the dependent variable within each node, making it easy for practitioners to assess changes at each level of the tree. Such an intuitive graphical representation of the results allows practitioners to easily identify interactions, even of multiple orders. Another concern when working with life course data is the ability to handle temporality.

Decision trees are able to handle temporal data. Considering longitudinal data organized in successive waves, there are two common ways of representing data in tabular form: the wide format and the long format. In the wide format, each row refers to a unique individual and the same variable measured at different times is stored as separate variables. In the long format, each row refers to a unique individual and time measurement, while each variable is stored in a single column. Considering biographical data, coming for instance from a retrospective survey or life calendar, a long data format is often adopted. With data stored in a wide format, the classification tree treats each variable independently of the others.

Temporal links that exist between variables representing successive measures of the same item are not taken into account when growing the model. However, the tree is able to extract from the whole set of variables the time points that maximize classification quality. To keep the tree at a reasonable size, however, the number of variables in play has to be limited. As a result, only the most significant associations emerge, and a number of other relevant associations may remain hidden. When using a classification tree on data stored in a long format, the stress is placed on the variables themselves and the temporal information is used to assess whether temporality moderates the effects of a variable. Therefore, classification trees are able to handle longitudinal data, but not to take into account all the information about temporality.
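The difference between the two layouts described above can be sketched in a few lines of Python. The column names (`income_w1`, `health_w2`, etc.), the `_w` wave suffix convention and the helper `wide_to_long` are assumptions made for this illustration only, and the values are invented.

```python
# Hypothetical wide-format table: one row per individual, one column per wave.
wide_table = [
    {"id": 1, "income_w1": 3000, "income_w2": 3200, "health_w1": 4, "health_w2": 3},
    {"id": 2, "income_w1": 2500, "income_w2": 2400, "health_w1": 5, "health_w2": 5},
]

def wide_to_long(rows, id_col="id", sep="_w"):
    """Return one record per (individual, wave) from wide-format rows.

    Assumes every non-id column is named <variable><sep><wave number>.
    """
    records = {}
    for row in rows:
        for col, value in row.items():
            if col == id_col:
                continue
            name, wave = col.rsplit(sep, 1)
            key = (row[id_col], int(wave))
            record = records.setdefault(key, {id_col: row[id_col], "wave": int(wave)})
            record[name] = value
    # Sort by individual, then by wave, to get a stable long-format table.
    return [records[key] for key in sorted(records)]

long_table = wide_to_long(wide_table)
for record in long_table:
    print(record)
```

In the wide table a tree sees `income_w1` and `income_w2` as two unrelated predictors, whereas in the long table it sees a single `income` predictor together with a `wave` predictor that it can split on.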

To address the second research question, I therefore conducted a literature review on classification tree learning in the context of infrequent outcomes. This literature review is reported in Section 2.2.

1.2.3 Immersions as a social scientist

The research questions introduced in 1.2.2.1 and 1.2.2.2 focus on the exploratory analysis step (G/G’) of the data exploration stage (steps E to G/G’) introduced in Figure 1.1. To address the data exploration stage as a whole, I also investigate what improvements can be proposed regarding the data understanding (E) and data preparation (F) steps.

As the thesis focuses on the discovery of factors of vulnerability in life courses, a particular point of interest concerns the use of life course data. Life course data are expected to be more complex than the cross-sectional data traditionally used in the social sciences. As the life course perspective involves studying trajectories, life course data are expected to contain measures repeated over time. This feature makes databases larger and, as a result, more difficult to handle. Additionally, although repeated measures inherently share several characteristics, they may also differ in others. Indeed, the survey design is likely to change over time.

For example, the phrasing of some questions may change to make them compatible with the national survey of another country. The rating scale of a variable is also likely to change from, say, a 7-item scale to a 5-item scale for a similar reason.

Such changes make the variables not directly comparable. As a result, additional preprocessing operations may be required to make data ready for analysis. The life course perspective also involves studying the microsocial environment of individuals. The microsocial environment is made up of many linkages, whose analysis requires the use of egocentric network data. The macrosocial environment also plays an important role in the life course perspective. Taking the macrosocial environment into account in the analysis involves the use of administrative socio-economic data.


Therefore, I expect that both the increase in volume and the use of differently structured data complicate the task of both understanding and preparing data. Taking both an information system and a data analysis point of view, my proposition is to investigate what technical difficulties a researcher in social sciences experiences due to statistical software limitations. To gain this understanding, a quantitative or a qualitative approach can be used. A possible quantitative approach is, for example, to administer a survey. A possible qualitative approach is, for example, to participate in practitioners’ activities and make observations. I chose the latter option as it allows one to start exploring needs with no assumptions and to successively orient observation choices based on the results of the previous observation stages. In addition, one of the innovative strategies of the NCCR LIVES is to encourage interdisciplinarity. Being the only IT researcher among a large number of researchers in social sciences, my strategy was therefore to immerse myself in the role of a social scientist during the first two years of this PhD thesis to get a better understanding of the issues practitioners face in their daily work.

To this purpose, I started three collaborations with researchers in social sciences in three different domains: health sociology, labour sociology, and family sociology.

These three collaborations have in common that they study either a situation or a group of people seen as vulnerable and that they follow a life course perspective.

Regarding data understanding, the observations led me to focus on access to data documentation within statistical software and on the use of sampling weights to pay closer attention to data representativeness. Regarding data preparation, the observations led me to focus on tools for panel data and network data. These immersions and the associated findings are introduced in Chapter 4.

1.3 Contributions

Identifying the factors that lead to experiencing vulnerability is achieved by the researcher by confronting knowledge coming from the literature with empirical evidence raised by means of software and data analysis methods. The software helps the researcher to handle data, and data analysis methods help the researcher to extract relevant information from data. However, the very understanding of what happens to the population studied and the validation of the potential causal links belong to researchers. Most confirmatory analyses rely on regression techniques that can describe relationships but do not provide certainty about the underlying causal mechanism. Therefore, drawing conclusions about the validity of the tested hypotheses belongs to the researcher. Bearing that in mind, even if the methodological contributions introduced in this research work aim to identify underlying factors of vulnerability, they actually only support the researcher in this identification. The responsibility for validating what the underlying factors are in connection with a particular vulnerability still belongs to researchers.
