Comparing inverse probability of treatment weighting methods and optimal nonbipartite matching for estimating the causal effect of a multicategorical treatment

Master's thesis

Serigne Arona Diop

Master's in statistics (with thesis)

Master of Science (M.Sc.)

Supervised by:

Denis Talbot, research supervisor
Thierry Duchesne, research co-supervisor

Summary

Covariate imbalances between treatment groups are common in observational studies and can bias comparisons between treatments. This bias can be corrected, in particular, with weighting or matching methods. These correction methods have rarely been compared for treatments with more than two categories.

We conducted a simulation study to compare an optimal nonbipartite matching method, inverse probability of treatment weighting, and a modified weighting scheme analogous to matching (matching weights). The comparisons were first carried out in a Monte Carlo simulation with a three-group exposure variable. A plasmode simulation based on real data, in which the treatment variable had five categories, was then conducted.

Of all the methods compared, matching weights appears to be the most robust according to the mean squared error criterion. The results also show that inverse probability of treatment weighting can sometimes be improved by truncating the weights. In addition, the performance of weighting depends on the degree of overlap between the treatment groups. The performance of optimal nonbipartite matching depends strongly on the maximum distance allowed for a pair to be formed (the caliper). However, choosing the optimal caliper is not easy and remains an open question. Moreover, the results of the plasmode simulation were positive, in that a substantial reduction in bias was observed: all methods considerably reduced confounding bias. Before using inverse probability of treatment weighting, it is recommended to check for violations of the positivity assumption, that is, to check that the treatment groups overlap.

Table of contents

Summary

Table of contents

List of tables

List of figures

Acknowledgements

Foreword

Introduction

1 The propensity score

1.1 Context, definitions and properties

1.2 Validity of propensity score-based adjustment methods

2 Comparing inverse probability of treatment weighting methods and optimal nonbipartite matching for estimating the causal effect of a multicategorical treatment

2.1 Introduction

2.2 Methods

2.3 A simulation study

2.4 Discussion

Conclusion

Bibliography

List of tables

2.1 Estimate of treatment effect in Scenario 1. True effects are λ1 = 1 and λ2 = 1.5.

2.2 Estimate of treatment effect in Scenario 2. True effects are λ1 = 1 and λ2 = 1.5.

2.5 Estimate of treatment effect in plasmode simulation using 2000 observations. True effects are λ1 = −0.513, λ2 = 0.436, λ3 = 0.099 and λ4 = 0.088. HAC1H is the reference category. λ1, λ2, λ3 and λ4 refer to the treatment effects associated with Air Tanker, Ground-based action, HAC1F and HAC1R, respectively.

B Estimate of null treatment effect in Scenario 1.

C Estimate of null treatment effect in Scenario 2.

D Estimate of null treatment effect in Scenario 3.

E Estimate of null treatment effect in Scenario 4.

List of figures

2.1 Overlap on fitted propensity scores before adjustment in Scenario 1 using 2000 observations. Legend: solid line refers to the treatment group and dashed line to the control groups. The dotted red line represents the median of the treatment group. (a): Group 1 refers to treatment and Groups 2-3 represent the control groups. (b): Group 2 refers to treatment and Groups 3-1 represent the control groups. (c): Group 3 refers to treatment and Groups 1-2 represent the control groups.

2.2 MSE of optimal nonbipartite matching as a function of caliper parameter in Scenario 1. Legend: solid line (λ1), dashed line (λ2).

2.3 MSE of optimal nonbipartite matching as a function of caliper parameter in Scenario 2. Legend: solid line (λ1), dashed line (λ2).

2.4 Overlap on fitted propensity scores before adjustment in Scenario 3 using 2000 observations. Legend: solid line refers to the treatment group and dashed line to the control groups. The dotted red line represents the median of the treatment group. (a): Group 1 refers to treatment and Groups 2-3 represent the control groups. (b): Group 2 refers to treatment and Groups 3-1 represent the control groups. (c): Group 3 refers to treatment and Groups 1-2 represent the control groups.

2.5 MSE of optimal nonbipartite matching as a function of caliper parameter in Scenario 3. Legend: solid line (λ1), dashed line (λ2).

2.6 Overlap on fitted propensity scores before adjustment in Scenario 4 using 2000 observations. Legend: solid line refers to the treatment group and dashed line to the control groups. The dotted red line represents the median of the treatment group. (a): Group 1 refers to treatment and Groups 2-3 represent the control groups. (b): Group 2 refers to treatment and Groups 3-1 represent the control groups. (c): Group 3 refers to treatment and Groups 1-2 represent the control groups.

2.7 Overlap on fitted propensity scores after adjustment with IPTW in Scenario 4 using 2000 observations. Legend: solid line refers to the treatment group and dashed line to the control groups. The dotted red line represents the median of the treatment group. (a): Group 1 refers to treatment and Groups 2-3 represent the control groups. (b): Group 2 refers to treatment and Groups 3-1 represent the control groups. (c): Group 3 refers to treatment and Groups 1-2 represent the control groups.

2.9 Overlap on fitted propensity scores before adjustment in plasmode simulation using 2000 observations. Legend: solid line refers to the treatment group and dashed line to the control groups. The dotted red line represents the median of the treatment group.

2.10 Overlap on fitted propensity scores after adjustment with IPTW in plasmode simulation using 2000 observations. Legend: solid line refers to the treatment group and dashed line to the control groups. The dotted red line represents the median of the treatment group.

A Overlap on fitted propensity scores before adjustment in Scenario 2 using 2000 observations. Legend: solid line refers to the treatment group and dashed line to the control groups. The dotted red line represents the median of the treatment group. (a): Group 1 refers to treatment and Groups 2-3 represent the control groups. (b): Group 2 refers to treatment and Groups 3-1 represent the control groups. (c): Group 3 refers to treatment and Groups 1-2 represent the control groups.

Acknowledgements

In the course of this work, I benefited greatly from the help, comments and support of several people.

Having completed this thesis, I wish to thank Mr. Denis Talbot, my research supervisor. I am grateful to him for agreeing to supervise my work, and for his availability, suggestions and encouragement, which were of great importance to me. I also thank Messrs. Thierry Duchesne and Steven G. Cumming for their valuable remarks; the fruitful discussions I had with them contributed greatly to the quality of my work. I must thank Mr. Leandro C. Coelho of the Faculté des sciences de l'administration, who welcomed me into his laboratory as a research assistant. His support and encouragement undoubtedly had a great influence on my research. I also take this opportunity to thank the whole laboratory team, in particular Mr. Hamza Heni.

I cannot fail to express my debt to my parents and my brothers, who, with very great patience, supported and encouraged me in my studies and, of course, in all matters of my life. Nor do I forget the important contribution of my friends.

Foreword

This thesis was written as part of a master's degree in statistics. It includes an article that was submitted on February 27, 2019 for publication in Computational Statistics & Data Analysis. The lead author is Serigne Arona Diop, whose contribution is approximately 55%. In particular, he developed the method presented in the article, implemented it in the R software, designed and carried out the simulation studies, conducted the literature search and produced the first draft of the article.

Denis Talbot, Thierry Duchesne and Steven G. Cumming identified the research problem studied in the article and supervised the work. Denis Talbot took part in the literature search and in the design of the simulation scenarios; his contribution is 25%. Thierry Duchesne and Steven G. Cumming contributed to the design of the illustrative analysis of forest fires and validated the suitability of the data used for the plasmode simulation.

All authors took part in the critical revision of the whole manuscript and approved the final version.

Introduction

Evaluating interventions, programs or policies is a major challenge for organizations. It is common for a decision-maker to have to choose between two (or more) competing interventions. The fundamental problem is that only one of the available actions can be taken, so that one can never observe what would have happened had a different action been chosen. One could imagine the decision-maker running several experiments simultaneously, all under identical conditions, each using a given type of intervention; the results of these experiments would then serve as a basis for comparing the interventions. This hypothetical situation, in which the experimental conditions are controlled, is not always achievable. It is even impossible most of the time, notably when controlled experimental conditions conflict with ethical considerations. This type of experiment, which is well suited to comparing treatments, is called a randomized experiment.

Causal inference methods make it possible to emulate this hypothetical situation using observed data. Such data are often challenging because they were not collected for the purpose of comparing interventions. In particular, the covariates are often imbalanced between the intervention groups. For example, to compare the effect of taking statins with that of a new drug used to lower a patient's cholesterol level, a practitioner may use an insurance database of reimbursement claims for drugs in the cholesterol-lowering class. In such an example, the indications for statins and for the new drug may not be exactly the same. Individuals taking one drug rather than the other may therefore have different characteristics. If these characteristics are also risk factors for the disease under study, direct comparisons on such data lead to biased results.

Since overall covariate balance between exposure groups is generally not guaranteed in observational data, statistical methods that adjust comparisons to account for or correct the imbalances can be used. The propensity score, which conceptually corresponds, for a given individual, to the probability of being exposed to a particular treatment given his or her covariates, was introduced for this purpose. Developed for a binary exposure by Rosenbaum & Rubin (1983), the propensity score has been widely accepted and used in the scientific literature. Indeed, a Google Scholar search on October 9, 2018 at 12:02 pm returned about 782,000 results related to this concept. Many propensity score-based adjustment methods have been proposed, notably matching and weighting. Matching methods seek subsets of the data in which balance holds, and the desired comparisons are then made on these subsets. Weighting methods aim to correct imbalances by assigning weights to the observations so that the intervention groups are similar to one another in the pseudo-population generated by the weighting.

For a long time, authors have been interested in comparing the performance of such methods for a binary exposure. This strong interest has produced a rich, varied and still growing literature. By way of illustration, Austin & Mamdani (2006) and Austin et al. (2007) conclude that propensity score-based adjustment methods reduce covariate imbalance between treated and untreated patients, and find that, in terms of performance, matching is likely preferable to the other methods, namely stratification and inverse probability of treatment weighting. Estimating the relative risk in the binary case, Austin (2008) finds that matching and stratification perform similarly in terms of mean squared error. In other, similar studies, Austin (2010, 2013) concludes that weighting performs better than matching and stratification. Huber et al. (2013) find that truncating the weights improves the performance of weighting methods. These authors also conclude that matching methods give better results than weighting. After introducing matching weights, a weighting method analogous to propensity score matching, Li & Greene (2013) find that their new method outperforms matching and the usual weighting.

Pioneering work (e.g. Imbens, 2000; Lechner, 2001; Imai & Van Dyk, 2004) generalized the properties of the propensity score to treatments with more than two categories. More than a decade after this generalization, the literature has paid little attention to comparing propensity score-based adjustment methods for multiple exposures (more than two treatment groups). Very few studies have addressed this question (e.g. Govindasamy, 2016; Yoshida et al., 2017), and their scope was limited to treatments with exactly three levels, using a matching method that attempts to form trios of individuals with similar propensity scores but from different treatment groups, namely tripartite matching. Moreover, these authors carried out their comparisons in Monte Carlo simulations. Govindasamy (2016) compares tripartite matching, stratification and the usual weighting for a three-level treatment and concludes that these methods perform similarly. Conducted with a three-group treatment, the study by Yoshida et al. (2017) confirms the trends obtained by Li & Greene (2013) in the binary case: the authors find that matching weights is preferable to matching and to the usual weighting for reducing covariate imbalance with three-group treatments.

In this thesis, we have attempted to contribute to the literature in several respects. Through an integrated article, the main objective of this thesis is to compare the performance of propensity score-based adjustment methods for exposures with more than two categories. More specifically, we conducted a simulation study to compare, in a multi-group treatment setting (three exposure groups, then five), an optimal nonbipartite matching method, inverse probability of treatment weighting (with and without weight truncation, with and without weight stabilization) and matching weights. These comparisons were carried out in a Monte Carlo simulation and also in a plasmode simulation. In the latter, using data on forest fires in Alberta, Canada, we compared different firefighting interventions with respect to their ability to prevent growth in burned area after the initial attack.

The document is organized as follows. The first chapter introduces the propensity score and presents its properties. The second chapter consists of an article submitted to the journal Computational Statistics & Data Analysis, which compares the adjustment methods across different scenarios. This article was selected for an oral presentation at the annual meeting of the Statistical Society of Canada, held in June 2018 at McGill University. It was also presented at the research day of the Centre interdisciplinaire en modélisation mathématique de l'Université Laval (CIMMUL) in May 2018. The thesis ends with a conclusion that interprets the results obtained in the article, places them in a broader context and opens up new research perspectives.

Chapter 1

The propensity score

1.1 Context, definitions and properties

As indicated in the introduction, observational studies are by nature characterized by imbalance between treatment groups in the values of their covariates. To make adequate comparisons between these treatment groups, this imbalance must be corrected.

To define and conceptualize the propensity score, we use notation similar to that of Imbens (2000). Consider a sample of $n$ observations, and let the index $i$ refer to an individual. Let $T_i = t$ denote that individual $i$ was assigned to treatment $t$, $t \in \mathcal{T} = \{0, \ldots, k\}$. Let $X_i$ and $Y_i$ denote, respectively, the vector of covariates and the observed outcome of individual $i$. The notation $Y_i(t)$ refers to the counterfactual outcome, that is, the outcome that would have occurred had individual $i$ been assigned to treatment $T_i = t$. Let $D_i(t)$ be the following indicator function:

\[
D_i(t) =
\begin{cases}
1 & \text{if } T_i = t, \\
0 & \text{otherwise.}
\end{cases}
\]

The consistency assumption states that when treatment $t$ is assigned to subject $i$, the outcome observed for this subject equals the counterfactual outcome associated with treatment $t$:

\[
D_i(t) = 1 \Rightarrow Y_i = Y_i(t).
\]

Definition 1. Weak ignorability, Imbens (2000)

Assignment to treatment $T$ is weakly ignorable conditional on the covariate vector $X_i$ if

\[
D_i(t) \perp\!\!\!\perp Y_i(t) \mid X_i, \quad \text{for all } t \in \mathcal{T}.
\]

This reads "$D_i(t)$ is independent of $Y_i(t)$ given $X_i$". Written in terms of probabilities:

\[
P(D_i(t) \mid Y_i(t), X_i) = P(D_i(t) \mid X_i).
\]

It follows that the information contained in the covariates $X_i$ is sufficient to obtain the information about the treatment. Indeed, once $X_i$ is known, knowing the counterfactual outcome $Y_i(t)$ adds nothing new about $D_i(t)$.

Definition 2. Propensity score for a binary exposure ($k = 1$), Rosenbaum & Rubin (1983); for a multi-group exposure ($k > 1$), Imbens (2000)

The propensity score of individual $i$ is defined as the probability of being exposed to a particular treatment ($T_i = t$) given his or her covariates ($X_i$):

\[
ps_{ti}(X_i) = P(T_i = t \mid X_i) = E[D_i(t) \mid X_i]. \tag{1.1}
\]
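As a minimal illustration of Definition 2, the sketch below evaluates the propensity scores of one individual under a multinomial logistic model. This model is only one common choice and is an assumption here, not something the definition prescribes. The function name `generalized_propensity_scores` and the coefficient values are hypothetical; in practice the coefficients would be estimated from data.

```python
import math

def generalized_propensity_scores(x, coefs):
    """Return [P(T=0 | x), ..., P(T=k | x)] under a multinomial logit model.

    coefs[t-1] = (intercept, slopes) for treatment t = 1, ..., k.
    Treatment 0 is the reference category, so its linear predictor is 0.
    """
    linear = [0.0]
    for intercept, slopes in coefs:
        linear.append(intercept + sum(b * v for b, v in zip(slopes, x)))
    shift = max(linear)  # subtract the maximum for numerical stability
    expo = [math.exp(v - shift) for v in linear]
    total = sum(expo)
    return [e / total for e in expo]

# Hypothetical coefficients: k = 2 (three treatment groups), two covariates.
COEFS = [(-0.5, (0.8, -0.3)), (0.2, (-0.4, 0.6))]

ps = generalized_propensity_scores((1.0, 2.0), COEFS)
print([round(p, 3) for p in ps])  # one score per treatment level; they sum to 1
```

Each individual thus carries a vector of $k + 1$ scores, one per treatment level, and the scores sum to one.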

Theorem 1. Balancing property of the propensity score for a binary exposure ($k = 1$), Rosenbaum & Rubin (1983); for a multi-group exposure ($k > 1$), Imbens (2000)

If assignment to treatment $T$ is weakly ignorable given the values of the covariates, then, conditional on the propensity score, treatment assignment is independent of the covariates:

\[
D_i(t) \perp\!\!\!\perp Y_i(t) \mid X_i \;\Rightarrow\; D_i(t) \perp\!\!\!\perp X_i \mid ps_{ti}(X_i), \quad \text{for all } t \in \mathcal{T}.
\]

On its own, the propensity score carries enough information to learn about the treatment. In other words, once the vector of propensity scores $ps_i$ is known, knowing the covariates $X_i$ adds nothing new about $D_i(t)$. This result highlights the importance of balance between treatment groups conditional on the propensity score: if this balance is not achieved, weak ignorability with respect to the potential outcome certainly does not hold. Moreover, balance can be checked, whereas weak ignorability cannot. However, the relation is one-directional, not bidirectional: balance does not guarantee the absence of confounding (that is, that ignorability holds).

Before proving Theorem 1, we introduce and prove an important result known as the law of total expectation. This result is then used in all the proofs that follow.

Theorem 2. Law of total expectation

\[
E_B\big(E_{A \mid B}(A \mid B)\big) = E(A), \tag{1.2}
\]

where $A$ is an integrable random variable, $B$ is any random variable, and $A$ and $B$ are defined on the same probability space.

Proof. Law of total expectation

We prove the theorem for discrete random variables $A$ and $B$. The proof in the continuous case is analogous: it suffices to replace the operator $\sum$ with $\int$.

Let $a$ and $b$ denote the values taken by $A$ and $B$, respectively.

\[
\begin{aligned}
E_B\big(E_{A \mid B}(A \mid B)\big)
&= \sum_b E_{A \mid B}(A \mid B = b)\, P(B = b) && \text{(by definition)} \\
&= \sum_b \sum_a a\, P(A = a \mid B = b)\, P(B = b) && \text{(by definition)} \\
&= \sum_b \sum_a a\, P(B = b \mid A = a)\, P(A = a) && \text{(conditional probabilities)} \\
&= \sum_b \sum_a a\, P(A = a)\, P(B = b \mid A = a) \\
&= \sum_a \sum_b a\, P(A = a)\, P(B = b \mid A = a) \\
&= \sum_a a\, P(A = a) \sum_b P(B = b \mid A = a) && \text{(since } a\, P(A = a) \text{ does not depend on } b\text{)} \\
&= \sum_a a\, P(A = a) && \text{(since } \textstyle\sum_b P(B = b \mid A = a) = 1\text{)} \\
&= E(A). && \text{(by definition)}
\end{aligned}
\]

In what follows, we use the following simplified notation for equation 1.2:

\[
E(E(A \mid B)) = E(A). \tag{1.3}
\]

In this notation, the inner expectation is the conditional expectation of $A$ given $B$, while the outer expectation is taken with respect to $B$.
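The identity in equation 1.3 can be checked numerically on a small discrete example. The sketch below is only an illustration: the joint distribution `JOINT` and the function names are hypothetical, and exact fractions are used so that the two sides can be compared without rounding error.

```python
from fractions import Fraction

# Hypothetical joint distribution P(A = a, B = b), keyed by (a, b).
JOINT = {
    (0, 0): Fraction(1, 8), (1, 0): Fraction(1, 8), (2, 0): Fraction(1, 4),
    (0, 1): Fraction(1, 4), (1, 1): Fraction(1, 8), (2, 1): Fraction(1, 8),
}

def marginal_b(joint):
    """P(B = b) for every value b."""
    out = {}
    for (_, b), p in joint.items():
        out[b] = out.get(b, 0) + p
    return out

def cond_expectation_a_given_b(joint, b):
    """E(A | B = b) = sum_a a * P(A = a | B = b)."""
    pb = marginal_b(joint)[b]
    return sum(a * p for (a, bb), p in joint.items() if bb == b) / pb

def iterated_expectation(joint):
    """E(E(A | B)) = sum_b E(A | B = b) * P(B = b)."""
    return sum(cond_expectation_a_given_b(joint, b) * pb
               for b, pb in marginal_b(joint).items())

def expectation_a(joint):
    """E(A) = sum_a a * P(A = a), computed directly from the joint distribution."""
    return sum(a * p for (a, _), p in joint.items())

print(iterated_expectation(JOINT), expectation_a(JOINT))
```

Both quantities coincide for any valid joint distribution, which is what the proof establishes in general.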

Proof. Theorem 1

We know that $D_i(t)$ takes its values in $\{0, 1\}$ for all $t \in \mathcal{T}$. Proving the theorem therefore amounts to proving that

\[
E[D_i(t) \mid X_i, ps_{ti}(X_i)] = E[D_i(t) \mid ps_{ti}(X_i)]. \tag{1.4}
\]

To do so, we show that the left-hand and right-hand sides of equation 1.4 are both equal to the propensity score, which establishes the equality. We begin with the left-hand side:

\[
E[D_i(t) \mid X_i, ps_{ti}(X_i)] = E[D_i(t) \mid X_i]. \tag{1.5}
\]

This holds because $ps_{ti}(X_i)$ is a function of $X_i$ that is generally not injective. That is, each value of $X$ is associated with one and only one value of the propensity score $ps$, but the converse is false: the same value of $ps$ may correspond to different values of the covariates $X$. Thus $ps$ contains no information that is not already in $X$, whereas $X$ may contain information that is not in $ps$. Conditioning simultaneously on $X_i$ and $ps_{ti}(X_i)$ is therefore redundant, and we may condition on $X_i$ alone, omitting $ps_{ti}(X_i)$, as in the right-hand side of equation 1.5. Substituting the definition of the propensity score from equation 1.1 into equation 1.5 gives:

\[
E[D_i(t) \mid X_i, ps_{ti}(X_i)] = ps_{ti}(X_i). \tag{1.6}
\]

We now turn to the right-hand side of equation 1.4, using the law of total expectation. Setting $A = D_i(t) \mid ps_{ti}(X_i)$ and $B = X_i \mid ps_{ti}(X_i)$, we obtain:

\[
E[D_i(t) \mid ps_{ti}(X_i)] = E\{E[D_i(t) \mid X_i, ps_{ti}(X_i)] \mid ps_{ti}(X_i)\}. \tag{1.7}
\]

Applying equations 1.5 and 1.1 successively to the right-hand side of equation 1.7, we obtain:

\[
\begin{aligned}
E[D_i(t) \mid ps_{ti}(X_i)] &= E\{E[D_i(t) \mid X_i] \mid ps_{ti}(X_i)\} \\
&= E\{ps_{ti}(X_i) \mid ps_{ti}(X_i)\} \\
&= ps_{ti}(X_i).
\end{aligned}
\tag{1.8}
\]

From 1.6 and 1.8, we conclude that

\[
E[D_i(t) \mid X_i, ps_{ti}(X_i)] = E[D_i(t) \mid ps_{ti}(X_i)].
\]
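The balancing property just proved can be illustrated on a toy discrete example. In the sketch below, the covariate distribution `P_X`, the propensity score `PS_T` for one fixed treatment level $t$, and the function names are all hypothetical. The score is deliberately not injective in $x$, so each stratum of the score contains several covariate values.

```python
from fractions import Fraction

# Hypothetical distribution of a covariate X with four values.
P_X = {0: Fraction(1, 10), 1: Fraction(2, 10), 2: Fraction(3, 10), 3: Fraction(4, 10)}
# Hypothetical propensity score for one treatment level t (not injective in x).
PS_T = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 2), 3: Fraction(1, 2)}

def _normalize(weights):
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

def marginal_dist_x(treated):
    """Distribution of X among individuals with D(t) = 1 (treated) or D(t) = 0."""
    return _normalize({x: px * (PS_T[x] if treated else 1 - PS_T[x])
                       for x, px in P_X.items()})

def dist_x_given_ps(p, treated=None):
    """Distribution of X within the stratum ps_t(X) = p.

    treated=None: everyone in the stratum; True: D(t) = 1; False: D(t) = 0.
    """
    weights = {}
    for x, px in P_X.items():
        if PS_T[x] != p:
            continue
        if treated is None:
            weights[x] = px
        else:
            weights[x] = px * (PS_T[x] if treated else 1 - PS_T[x])
    return _normalize(weights)

# Without conditioning on the score, the covariate distributions differ ...
print(marginal_dist_x(True) == marginal_dist_x(False))
# ... but within each stratum of the score they coincide.
for p in sorted(set(PS_T.values())):
    print(p, dist_x_given_ps(p, True) == dist_x_given_ps(p, False))
```

Within a stratum, both groups weight every covariate value by the same constant ($p$ or $1 - p$), which cancels on normalization; this is the mechanism behind equation 1.4.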

Theorem 3. Imbens (2000). Under weak ignorability conditional on the covariate vector $X_i$, we have

\[
D_i(t) \perp\!\!\!\perp Y_i(t) \mid ps_{ti}(X_i), \quad \text{for all } t \in \mathcal{T}.
\]

In other words, for all $t \in \mathcal{T}$,

\[
E[D_i(t) \mid Y_i(t), ps_{ti}(X_i)] = E[D_i(t) \mid ps_{ti}(X_i)]. \tag{1.9}
\]

Proof. We have already shown, in 1.8, that the right-hand side of equation 1.9 equals $ps_{ti}(X_i)$. To prove Theorem 3, we therefore show that the left-hand side of equation 1.9 also equals $ps_{ti}(X_i)$:

\[
\begin{aligned}
E[D_i(t) \mid Y_i(t), ps_{ti}(X_i)]
&= E\{E(D_i(t) \mid Y_i(t), ps_{ti}(X_i), X_i) \mid Y_i(t), ps_{ti}(X_i)\} \\
&= E\{E(D_i(t) \mid Y_i(t), X_i) \mid Y_i(t), ps_{ti}(X_i)\} \\
&= E\{E(D_i(t) \mid X_i) \mid Y_i(t), ps_{ti}(X_i)\} \\
&= E\{ps_{ti}(X_i) \mid Y_i(t), ps_{ti}(X_i)\} \\
&= ps_{ti}(X_i).
\end{aligned}
\tag{1.10}
\]

From 1.8 and 1.10, we conclude that

\[
E[D_i(t) \mid Y_i(t), ps_{ti}(X_i)] = E[D_i(t) \mid ps_{ti}(X_i)].
\]

1.2 Validity of propensity score-based adjustment methods

1.2.1 Weighting

With inverse probability of treatment weighting, each individual in the data set is assigned a weight equal to the inverse of his or her propensity score for the treatment actually received. The weight has the form:

\[
w_i = \frac{1}{\sum_{t=0}^{k} D_i(t)\, ps_{ti}(X_i)} = \frac{1}{f(T \mid X_i)}.
\]

This weighting corrects or reduces the imbalance by creating a pseudo-population in which individuals in the different treatment groups are similar with respect to the covariates.
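To make the notion of a pseudo-population concrete, the sketch below applies these weights to a small artificial sample. Everything in it is hypothetical: one binary covariate, three treatment groups, and cell counts in `COUNTS` chosen so that the empirical propensity scores are simple fractions.

```python
from fractions import Fraction

# Hypothetical sample with one binary covariate x and three treatment groups.
# COUNTS[x][t] is the number of individuals with covariate x who received treatment t.
COUNTS = {0: (6, 3, 1), 1: (2, 3, 5)}
SAMPLE = [(x, t) for x, row in COUNTS.items()
          for t, n in enumerate(row) for _ in range(n)]

def propensity(x):
    """Empirical P(T = t | X = x) for t = 0, 1, 2, as exact fractions."""
    total = sum(COUNTS[x])
    return [Fraction(n, total) for n in COUNTS[x]]

def iptw_weight(x, t):
    """w_i = 1 / ps_ti(X_i), evaluated at the treatment actually received."""
    return 1 / propensity(x)[t]

def pseudo_population(sample):
    """Sum of the weights in each (treatment, covariate) cell."""
    cells = {}
    for x, t in sample:
        cells[(t, x)] = cells.get((t, x), 0) + iptw_weight(x, t)
    return cells

for (t, x), size in sorted(pseudo_population(SAMPLE).items()):
    print(f"treatment {t}, x = {x}: weighted count {size}")
```

In the raw sample, group 0 contains six individuals with $x = 0$ and two with $x = 1$, whereas group 2 contains one and five. After weighting, every treatment group has the same weighted count in each covariate cell, so the covariate is balanced across groups.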

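In what follows, we show that this weighting makes it possible to estimate the average causal effect. We use the results obtained previously, in particular those based on the law of total expectation. Indeed,

\[
\begin{aligned}
E[D_i(t) Y_i w_i]
&= E\left[\frac{D_i(t) Y_i}{f(T \mid X_i)}\right] && \text{(by definition)} \\
&= E\left[\frac{D_i(t) Y_i(t)}{f(T \mid X_i)}\right] && \text{(since } T = t \text{ implies } Y_i = Y_i(t)\text{)} \\
&= E\left\{E\left[\frac{D_i(t) Y_i(t)}{f(T \mid X_i)} \,\Big|\, Y_i(t), X_i\right]\right\} && \text{(law of total expectation)} \\
&= E\left\{\frac{Y_i(t)}{f(T \mid X_i)}\, E[D_i(t) \mid Y_i(t), X_i]\right\} \\
&= E\left\{\frac{Y_i(t)}{f(T \mid X_i)}\, E[D_i(t) \mid Y_i(t), X_i, ps_{ti}(X_i)]\right\} && \text{(since } ps_{ti}(X_i) \text{ is a function of } X_i\text{)} \\
&= E\left\{\frac{Y_i(t)}{f(T \mid X_i)}\, E[D_i(t) \mid X_i, ps_{ti}(X_i)]\right\} && \text{(by Theorem 3)} \\
&= E\left\{\frac{Y_i(t)}{f(T \mid X_i)}\, ps_{ti}(X_i)\right\} && \text{(by equation 1.6)} \\
&= E\left\{\frac{Y_i(t)}{f(T \mid X_i)} \sum_{t=0}^{k} D_i(t)\, ps_{ti}(X_i)\right\} && \text{(since only one } T = t \text{ is observed)} \\
&= E\left\{\frac{Y_i(t)}{f(T \mid X_i)}\, f(T \mid X_i)\right\} && \text{(by definition)} \\
&= E[Y_i(t)].
\end{aligned}
\]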
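The result of this derivation can be checked by simulation. The sketch below is an illustration under assumptions that are not taken from the article: a single binary confounder, known propensity scores `PS`, and potential outcomes $Y(t) = t + 2x + \varepsilon$, so that $E[Y(t)] = t + 1$. It compares the weighted estimator with the unadjusted group mean.

```python
import random

# Hypothetical known propensity scores P(T = t | X = x) for t = 0, 1, 2.
PS = {0: (0.6, 0.3, 0.1), 1: (0.2, 0.3, 0.5)}

def simulate(n, seed=1):
    """Draw (x, t, y) with X ~ Bernoulli(1/2), T | X ~ PS[x], Y = t + 2x + noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.randint(0, 1)
        t = rng.choices((0, 1, 2), weights=PS[x])[0]
        y = t + 2 * x + rng.gauss(0, 1)  # so that E[Y(t)] = t + 1
        data.append((x, t, y))
    return data

def iptw_estimate(data, t):
    """Sample analogue of E[D(t) Y w]: (1/n) * sum over T_i = t of Y_i / ps_ti(X_i)."""
    return sum(y / PS[x][t] for x, ti, y in data if ti == t) / len(data)

def naive_estimate(data, t):
    """Unadjusted mean of Y among individuals with T_i = t."""
    ys = [y for _, ti, y in data if ti == t]
    return sum(ys) / len(ys)

data = simulate(20000)
for t in (0, 1, 2):
    print(t, round(iptw_estimate(data, t), 2), round(naive_estimate(data, t), 2))
```

Because $x$ affects both the treatment received and the outcome, the unadjusted means are pulled away from $t + 1$ (towards 0.5 for $t = 0$ in expectation), while the weighted estimates stay close to the target up to Monte Carlo error.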

A similar approach is used to show the validity of the other weighting methods, notably stabilized weights (for a proof, see Hernán & Robins, 2006) and matching weights (for a proof, see Yoshida et al., 2017). Stabilized weights are obtained as follows:

\[
sw_i = \frac{P(T_i = t)}{\sum_{t=0}^{k} D_i(t)\, ps_{ti}(X_i)}.
\]

For matching weights, the weights are given by:

\[
mw_i = \frac{\min\{ps_{0i}(X_i), ps_{1i}(X_i), \ldots, ps_{ki}(X_i)\}}{\sum_{t=0}^{k} D_i(t)\, ps_{ti}(X_i)}.
\]

1.2.2 Matching

The objective of matching methods is to pair individuals who belong to different treatment groups but have similar propensity scores. Similarity in propensity scores should also translate into similar characteristics, or covariates.
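The pairing idea can be sketched on a handful of units. The code below is a brute-force illustration for very small samples, not the algorithm used in the article. It rests on several assumptions: each unit is summarized by a single scalar score, the distance is the absolute difference between scores, a pair is admissible only if the two units come from different treatment groups and lie within a caliper, and the preferred matching is the one with the most pairs, ties being broken by the smallest total distance. The function name `best_matching` and the data are hypothetical.

```python
def best_matching(units, caliper):
    """Exhaustively search for the best set of disjoint pairs.

    units: list of (treatment, score). Returns (pairs, total_distance), where
    pairs is a list of index pairs. Exponential time: small samples only.
    """
    def solve(remaining):
        if len(remaining) < 2:
            return (0, 0.0, [])
        first, rest = remaining[0], remaining[1:]
        # Option 1: leave `first` unmatched.
        best = solve(rest)
        # Option 2: pair `first` with each admissible partner in turn.
        for j in rest:
            dist = abs(units[first][1] - units[j][1])
            if units[first][0] != units[j][0] and dist <= caliper:
                n, total, pairs = solve(tuple(k for k in rest if k != j))
                cand = (n + 1, total + dist, [(first, j)] + pairs)
                # Prefer more pairs, then a smaller total distance.
                if (-cand[0], cand[1]) < (-best[0], best[1]):
                    best = cand
        return best

    _, total, pairs = solve(tuple(range(len(units))))
    return pairs, total

# Hypothetical units: (treatment group, scalar score).
UNITS = [(0, 0.10), (1, 0.12), (2, 0.50), (0, 0.55), (1, 0.90), (2, 0.15)]
pairs, total = best_matching(UNITS, caliper=0.1)
print(pairs, round(total, 3))
```

With a caliper of 0.1, the unit with score 0.90 has no admissible partner and stays unmatched; with a much smaller caliper no pair can be formed at all, which illustrates how strongly the result depends on this parameter.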

We introduce the following notation:

\[
E_{ti}(p) = E[Y_i \mid D_i(t) = 1, ps_{ti}(X_i) = p].
\]

Finally, the average causal effect ($E[Y(t)] - E[Y(l)]$) is obtained by the law of total expectation, by averaging this quantity ($E_{ti} - E_{li}$) over all possible values of the propensity score.
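As a recap of Section 1.2.1, the three weighting schemes defined there can be computed side by side for one individual. The sketch below is only an illustration: the function name `weights`, the propensity score vector and the marginal treatment probabilities are hypothetical.

```python
def weights(ps_row, t, marginal):
    """Weights for one individual.

    ps_row:   [ps_0i(X_i), ..., ps_ki(X_i)], the individual's propensity scores
    t:        treatment actually received
    marginal: [P(T = 0), ..., P(T = k)], marginal treatment probabilities
    """
    own = ps_row[t]  # equals sum_t D_i(t) * ps_ti(X_i)
    return {
        "iptw": 1 / own,
        "stabilized": marginal[t] / own,
        "matching": min(ps_row) / own,
    }

# An individual who received the treatment that was least likely for them.
w = weights([0.6, 0.3, 0.1], t=2, marginal=[0.4, 0.3, 0.3])
print({name: round(value, 3) for name, value in w.items()})
```

Since the numerator of the matching weight is the smallest of the individual's propensity scores, this weight never exceeds 1, whereas the usual inverse probability weight grows without bound as the score of the received treatment approaches 0.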

Chapter 2

Comparing inverse probability of treatment weighting methods and optimal nonbipartite matching for estimating the causal effect of a multicategorical treatment

Authors

S. Arona Diop, Thierry Duchesne, Steven G. Cumming, Denis Talbot

Abstract

Imbalances in covariates between treatment groups are frequent in observational studies and can lead to biased treatment comparisons. This bias can notably be corrected using matching or weighting methods. These approaches have seldom been compared in the context of a treatment with multiple categories (>2). We conducted a simulation study to compare an optimal nonbipartite matching approach, an inverse probability of treatment weighting approach, and matching weights. Our simulations include a completely synthetic component based on a Monte Carlo design as well as a plasmode component that does not require knowing the full data-generating model. These comparisons are illustrated first with simulated data with 3 treatment categories and then with a real data example with 5 treatment groups. The inverse probability of treatment weighting and matching weights approaches performed better than optimal nonbipartite matching in scenarios with small or moderate bias. In scenarios with large bias, matching weights produced the best estimates, while mixed results were obtained for the nonbipartite matching approach. We also found that truncation and stabilization of the weights can sometimes improve the performance of the weighting approach. The matching weights method was the most robust and yielded good estimates in all scenarios, regardless of the bias level. Before using the usual inverse probability weighting, it is recommended to check the overlap between treatment groups, that is, to check for violations of the positivity assumption.

Keywords

propensity score; multiple treatment; inverse probability of treatment weighting; matching weight; optimal nonbipartite matching; plasmode simulation


2.1 Introduction

Many empirical studies seek to evaluate the effect of a treatment or intervention. This evaluation can be attempted using randomized or observational experiments. In the former, pre-randomization characteristics are expected to be similar across treatment groups. As such, outcome differences can be causally attributed to the treatment. However, randomized trials are often difficult to realize because they may be unethical, impractical, or untimely (Hernan & Robins, 2017). Thus, relying on observational experiments is often necessary. Unfortunately, imbalances in covariates between treatment groups are frequent in observational studies and can lead to biased treatment comparisons. This major challenge is known as confounding, in which differences in outcomes between treatment groups are due, at least in part, to systematic differences in baseline covariates between the treatment groups (Austin & Stuart, 2017). Without adjustment, the estimator based on simple averages in the treatment groups is biased for the average causal effect. This issue can notably be corrected by creating a balanced sample in which treatment groups are similar with respect to the observed covariates, thus emulating a randomized design as far as the observed covariates are concerned.

Introduced by Rosenbaum & Rubin (1983), the propensity score is one of the most important concepts in causal inference and plays a pivotal role in creating this balanced sample. Originally proposed in the context of a binary treatment, propensity scores were extended to treatments with more than two categories following the pioneering work of Imbens (2000), Lechner (2001) and Imai & Van Dyk (2004). To define the propensity score, we introduce some notation. We consider a sample of $n$ observations and let $i$ refer to an individual. Let $T_i = t$ denote the multicategorical treatment indicating the membership of individual $i$ in group $t$, $t = 0, \ldots, k$. Let $X_i$ be a set of covariates associated with individual $i$. The propensity score is defined as the conditional probability of assignment to a particular treatment ($T_i = t$) given a vector of observed covariates ($X_i$):

$$ps_{ti}(X_i) = P(T_i = t \mid X_i),$$

and can be used to adjust for confounding variables in observational studies. According to Rosenbaum & Rubin (1983), the balancing score property of propensity scores implies that if a group of observations is homogeneous in both $ps_{ti}(X_i)$ and certain chosen components of $X_i$, it is still reasonable to expect balance on the other components of $X_i$ within this refined group of observations. In other words, the treatment decision is « random » conditional on the vector of propensity scores $ps_t$, with probabilities given by $ps_t$. As such, the treatment decision does not depend on $X_i$ conditional on $ps_t$, and data can be analyzed as if they arose from a conditional randomized controlled trial, where randomization is conditional on $ps_t$ and randomization probabilities are equal to $ps_t$. Methods based on propensity scores, such as matching and weighting, are commonly used in the scientific literature to address confounding (e.g. Imai & Ratkovic, 2014; Lopez et al., 2017). Briefly, matching consists in forming groups of observations with similar propensity scores but different treatments. Observations for which no match can be found are discarded. When using weighting methods, each observation receives a weight equal to the inverse of the propensity score of the treatment it received.
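To make the definition $P(T_i = t \mid X_i)$ concrete, the following is a minimal sketch that estimates propensity scores for a single discrete covariate by within-stratum treatment frequencies. This frequency estimator is a simplification chosen for illustration, not the model used in this study, and the function name and toy data are hypothetical.

```python
from collections import Counter


def empirical_propensity_scores(treatments, covariates, levels):
    """Estimate P(T = t | X = x) by treatment frequencies within each stratum of X.

    Only meaningful for a discrete covariate (assumption of this sketch).
    """
    stratum_sizes = Counter(covariates)
    joint_counts = Counter(zip(covariates, treatments))
    # One dictionary {treatment level: estimated probability} per observation.
    return [
        {t: joint_counts[(x, t)] / stratum_sizes[x] for t in levels}
        for x in covariates
    ]


# Hypothetical toy data: three treatment levels and one binary covariate.
treatments = [0, 1, 2, 0, 0, 1, 2, 2]
covariates = ["a", "a", "a", "a", "b", "b", "b", "b"]
ps = empirical_propensity_scores(treatments, covariates, levels=(0, 1, 2))
```

Each entry of `ps` holds the whole vector of scores for one observation, and the scores sum to one across treatment levels, so one component is redundant.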

The performance of these methods in the case of a binary treatment has been compared in many studies (e.g. Austin & Mamdani, 2006; Austin et al., 2007; Austin, 2009; Li & Greene, 2013). However, few studies have compared their performance for $(k+1)$-level treatments with $k > 1$. Govindasamy (2016) conducted a Monte Carlo simulation study to compare the performance of propensity score techniques, namely the triplet matched pairs method, stratification and propensity score weighting, with three treatment groups under various circumstances (under overt and hidden types of selection bias) and considering three sample sizes (200, 500 and 1000). The author concluded that these methods provide similar results under overt and hidden biases. Also, while they performed differently in the small sample size settings, their performances were similar for medium and large sample sizes.

Yoshida et al. (2017) examined the performance, in the three-group setting, of the matching weights, the triplet matched pairs and the inverse probability of treatment weighting (IPTW) methods. They conducted a simulation study considering samples of 6000 subjects for all combinations of exposure prevalences (33:33:33, 10:45:45 and 10:10:80) with weak and strong covariate-treatment associations. The authors found that matching weights performed best, in terms of mean squared error, in all scenarios. They also pointed out that IPTW’s performance was highly dependent on covariate overlap, which is a function of the strength of the association between covariates and treatments.

These two previous studies considered only the triplet matched pairs approach for matching. This approach forms triplets of observations having similar characteristics in terms of their covariates but being in different treatment groups. Extending this approach to treatments with more than three categories is challenging. In fact, when the treatment has multiple categories, it might be difficult to form groups of observations having different treatments but similar propensity scores. This approach may thus result in a loss of information due to unmatched observations being discarded. Moreover, all the aforementioned studies conducted simulations based exclusively on a Monte Carlo design. It has been argued that models derived from such simulations are not sufficiently complex to reflect reality (Gadbury et al., 2008). Further limitations of Monte Carlo simulations include the lack of consensus on how model parameters should be tuned, as well as on how to incorporate a correlation structure among the covariates. Last but not least, these authors considered only the case of three treatment groups. This means that, to our knowledge, propensity score methods have never been tested in a context of more than three treatment groups.

The goal of the current study is to provide additional empirical evidence concerning the relative performance of confounding adjustment methods in a context of multiple treatment groups. We present a simulation study that compares an optimal nonbipartite matching approach, an inverse probability of treatment weighting approach (with and without weight truncation, with and without stabilization of weights) as well as matching weights. Because previous simulations have demonstrated that matching weights outperforms tripartite matching, we do not consider the latter approach. Using Monte Carlo simulations, we first examine the case of three treatment groups. Although this context has been investigated before, some of the methods we consider have never been explored in it. We then consider a plasmode simulation technique that does not require complete specification of the data-generating process. In fact, simulating datasets with the plasmode technique is an interesting alternative to Monte Carlo simulation. The term plasmode (Gadbury et al., 2008) describes a data set that has been derived from real data but for which some truth is known because it has been generated. Our plasmode simulation is derived from a study comparing five fire attack methods for preventing forest fire growth in Alberta, Canada. As such, this simulation allows us to compare the adjustment methods in the context of a treatment with 5 categories.

The remainder of this article is structured as follows. After a description of the compared propensity score methods in Section 2, we present our simulation settings and results, based on four Monte Carlo simulation scenarios and a plasmode approach, in Section 3. We conclude with a discussion in Section 4.

2.2 Methods

2.2.1 Optimal nonbipartite matching

Optimal nonbipartite matching (ONBM) finds the set of matches that minimizes the sum of within-match distances based on a given distance matrix between all observations (Lu et al., 2011). The method seeks matched pairs of observations that have similar characteristics in terms of their covariates (i.e., propensity scores) but belong to different treatment groups. Based on these distances, observations found to be sufficiently similar are kept, whereas those for which no match can be found are removed. This method estimates the average treatment effect for the matched observations.

Unlike the triplet matched pairs method, optimal nonbipartite matching has the advantage of potentially using more observations, since observations are matched pairwise. As a consequence, however, retained observations are not necessarily similar to each other between treatment groups overall, but only within a « type » of pair. For example, if we consider a treatment with three categories, 0, 1 and 2, pairs of type (0, 1), (0, 2) and (1, 2) are formed. Observations from treatment groups 0 and 1 within pairs of type (0, 1) are expected to be similar by construction. However, observations from group 0 in pairs of type (0, 2) are not necessarily expected to be similar to observations from group 1 in pairs of type (1, 2). As such, observations from groups 0 and 1 are not expected to be similar among all matched observations, but only to contain similar subsets. Intuitively, this approach thus emulates a conditional randomized controlled trial, where randomization to treatment group is performed conditional on pair type. To adequately combine the information of all pairs, it is thus required to adjust for a « type of pair » variable.
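The pairing logic described above can be sketched on a toy example. The sketch below enumerates pairings exhaustively, so it is only practical for a handful of observations; dedicated software relies on efficient matching algorithms instead. It assumes that a pair is admissible only if it joins two different treatment groups at a distance no larger than the caliper, and that the preferred solution has the most pairs and, among those, the smallest total distance. These tie-breaking rules, the absolute-difference distance and the toy scores are assumptions made for illustration.

```python
from functools import lru_cache


def optimal_nonbipartite_matching(dist, groups, caliper):
    """Exhaustive search over pairings; only practical for a handful of units."""

    def admissible(i, j):
        # A pair must join two different treatment groups within the caliper.
        return groups[i] != groups[j] and dist[i][j] <= caliper

    @lru_cache(maxsize=None)
    def best(remaining):
        # `remaining` is a sorted tuple of indices that are still unmatched.
        if len(remaining) < 2:
            return (0, 0.0, ())
        first, rest = remaining[0], remaining[1:]
        result = best(rest)  # option 1: leave `first` unmatched (discarded)
        for pos, j in enumerate(rest):  # option 2: pair `first` with unit j
            if admissible(first, j):
                n_pairs, total, pairs = best(rest[:pos] + rest[pos + 1:])
                candidate = (n_pairs + 1, total + dist[first][j],
                             ((first, j),) + pairs)
                # Prefer more pairs, then a smaller total distance.
                if (-candidate[0], candidate[1]) < (-result[0], result[1]):
                    result = candidate
        return result

    _, total, pairs = best(tuple(range(len(groups))))
    return list(pairs), total


# Hypothetical toy data: one score per unit and its treatment group.
scores = [0.10, 0.13, 0.50, 0.52, 0.90, 0.11]
groups = [0, 1, 0, 2, 1, 2]
dist = [[abs(a - b) for b in scores] for a in scores]

pairs, total = optimal_nonbipartite_matching(dist, groups, caliper=0.05)
# The pair type is the variable to adjust for when pooling all pairs.
pair_types = [tuple(sorted((groups[i], groups[j]))) for i, j in pairs]
unmatched = sorted(set(range(len(groups))) - {k for p in pairs for k in p})
```

In this toy example units 1 and 4 end up unmatched and would be discarded, and both retained pairs are of type (0, 2).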

2.2.2 Matching weighting

Recently proposed by Li & Greene (2013), matching weights (MW) is considered as an alternative to matching on the propensity score. MW was extended to multiple treatment groups by Yoshida et al. (2017). MW is a variant of IPTW: both share the same denominator, while MW’s numerator is the smallest propensity score among all the treatment groups:

$$mw_i = \frac{\min\{ps_{0i}(X_i), ps_{1i}(X_i), \ldots, ps_{ki}(X_i)\}}{\sum_{t=0}^{k} I(T_i = t)\, ps_{ti}(X_i)}.$$

The estimands obtained with this method are asymptotically equivalent to those from exact matching across all treatment groups (Yoshida et al., 2017).
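As a small numerical illustration of the matching weight, the sketch below computes it for one observation from its vector of propensity scores. The function name and the example scores are hypothetical.

```python
def matching_weight(ps_row, received):
    """Smallest propensity score divided by the score of the received treatment.

    `ps_row[t]` is the propensity score for treatment level t and `received`
    is the level actually received by the observation.
    """
    return min(ps_row) / ps_row[received]


# Hypothetical propensity scores for treatment levels 0, 1 and 2.
ps_row = [0.2, 0.3, 0.5]
mw_if_treated_2 = matching_weight(ps_row, received=2)
mw_if_treated_0 = matching_weight(ps_row, received=0)
```

Because the numerator is the smallest score, the weight never exceeds 1, which is consistent with the method being less sensitive than IPTW to propensity scores close to zero.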

2.2.3 Inverse probability of treatment weighting

Inverse probability of treatment weighting (IPTW) creates a pseudo-population in which baseline covariates and treatment are not associated (Hernan & Robins, 2017). The IPTW weight is defined as

$$w_i = \frac{1}{\sum_{t=0}^{k} I(T_i = t)\, ps_{ti}(X_i)},$$

where $I(\cdot)$ denotes the usual indicator function. One important assumption of IPTW is the positivity assumption, which entails that $ps_{ti}(X_i) > 0$ for all $t \in \{0, \ldots, k\}$ and all $X_i$ such that $P(X_i) > 0$. That is, each individual should have a positive probability of receiving each possible treatment. Contrasts between treatments can then be performed on the weighted data as if they arose from a randomized trial, without further adjustment. For instance, the effect of treatment $t$ as compared to treatment $t'$ is estimated by computing the simple weighted average of the outcome:

$$\frac{\sum_{i=1}^{n} w_i Y_i \left(I(T_i = t) - I(T_i = t')\right)}{\sum_{i=1}^{n} I(T_i \in \{t, t'\})\, w_i}.$$
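The following sketch illustrates how IPTW weights and a weighted contrast between two treatment levels might be computed. It uses a weighted difference of group means in which each group is normalized by its own total weight. This normalization is a common choice and an assumption of the sketch, so it may differ from the exact expression above. The function names and the toy data are hypothetical.

```python
def iptw_weight(ps_row, received):
    """Inverse of the propensity score of the treatment actually received."""
    return 1.0 / ps_row[received]


def weighted_mean_difference(outcomes, treatments, weights, t, t_ref):
    """Weighted mean outcome under level t minus weighted mean under level t_ref."""

    def weighted_mean(level):
        num = sum(w * y for y, g, w in zip(outcomes, treatments, weights)
                  if g == level)
        den = sum(w for g, w in zip(treatments, weights) if g == level)
        return num / den

    return weighted_mean(t) - weighted_mean(t_ref)


# Hypothetical toy data: propensity score vectors for levels 0, 1 and 2.
ps_rows = [[0.5, 0.3, 0.2], [0.25, 0.5, 0.25], [0.3, 0.5, 0.2], [0.5, 0.25, 0.25]]
treatments = [0, 0, 1, 1]
outcomes = [1.0, 3.0, 2.0, 6.0]

weights = [iptw_weight(row, t) for row, t in zip(ps_rows, treatments)]
effect_1_vs_0 = weighted_mean_difference(outcomes, treatments, weights, t=1, t_ref=0)
```

Observations that received a treatment they were unlikely to receive get the largest weights, which is why scores close to zero can dominate the estimate.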

The IPTW weights can take very large values when the associated propensity score is close to zero. This situation, known in causal inference as a practical or near violation of the positivity assumption, occurs when certain subgroups in a sample rarely or never receive some treatments of interest (Petersen et al., 2012). Lee et al. (2011) indicated that weight truncation can improve the performance of propensity score weighting when the propensity scores are estimated using a logistic regression. Xiao et al. (2013) also suggested that weight truncation can be used to increase IPTW’s performance when the positivity assumption is not guaranteed. Lee et al. (2013) suggested applying truncation at high percentiles, such as the 99th or the 99.5th, of the distribution of weights.

When groups are of unequal sizes, weight truncation is particularly likely to affect observations from small treatment groups. Indeed, since the unconditional probability of their respective treatment is small, these observations are more likely to have larger weights. This may be seen as an undesirable property. To avoid this, we have also considered stabilizing weights prior to truncation. Unlike standard IPTW, stabilized weights create an artificial population in which the relative importance of treatment groups remains the same as in the original data:

$$sw_i = \frac{P(T_i = t)}{\sum_{t=0}^{k} I(T_i = t)\, ps_{ti}(X_i)},$$

where the numerator is the unconditional probability of the treatment actually received by individual $i$.

Intuitively, weight stabilization thus removes the influence of treatment group size on the weight values.
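Stabilization and truncation can be sketched as follows. The sketch makes three assumptions that are not stated in the text: the unconditional treatment probabilities are estimated by sample proportions, truncation resets weights beyond a percentile to the value of that percentile, and percentiles are computed by linear interpolation. All names and toy values are hypothetical.

```python
def stabilized_weights(ps_rows, treatments):
    """Marginal probability of the received treatment divided by its propensity score."""
    n = len(treatments)
    # Assumption: marginal probabilities are estimated by sample proportions.
    marginal = {t: treatments.count(t) / n for t in set(treatments)}
    return [marginal[t] / row[t] for row, t in zip(ps_rows, treatments)]


def percentile(values, pct):
    """Percentile by linear interpolation between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


def truncate(weights, lower_pct, upper_pct):
    """Reset weights outside the two percentiles to the percentile values."""
    low, high = percentile(weights, lower_pct), percentile(weights, upper_pct)
    return [min(max(w, low), high) for w in weights]


# Hypothetical toy data with two treatment levels.
ps_rows = [[0.9, 0.1], [0.6, 0.4], [0.75, 0.25], [0.5, 0.5]]
treatments = [0, 0, 0, 1]
sw = stabilized_weights(ps_rows, treatments)

# Truncation of a hypothetical weight vector at the 1st and 99th percentiles.
raw_weights = [float(v) for v in range(1, 102)]
truncated = truncate(raw_weights, 1, 99)
```

When the propensity score of an observation equals the marginal probability of its treatment, its stabilized weight is exactly 1, as for the third observation in the toy data.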

2.3 A simulation study

2.3.1 Monte Carlo simulation

Simulation design and scenarios

For our data-generating process, we consider three covariates $X_1$, $X_2$, $X_3$ arbitrarily generated as follows: $X_1 \sim N(0, \sigma = \sqrt{8})$, $X_2 \sim N(2X_1, \sigma = \sqrt{2})$ and $X_3 \sim \mathrm{Bernoulli}(0.4)$.

The treatment variable $T$ is simulated according to a multinomial logistic regression with three levels (0, 1 and 2), where the first is the reference category. The probability of membership in each group is given by:

$$\pi_1 = \frac{1}{1 + \exp(\beta_{12}X_1 + \beta_{22}X_2) + \exp(\beta_{13}X_1 + \beta_{23}X_2)}, \quad \pi_2 = \exp(\beta_{12}X_1 + \beta_{22}X_2)\,\pi_1, \quad \pi_3 = \exp(\beta_{13}X_1 + \beta_{23}X_2)\,\pi_1.$$

The outcome $Y$ is generated from a normal distribution, $Y \sim N(\mu_3, 1)$, where

$$\mu_3 = \gamma_0 + \gamma_1 X_1 + \gamma_2 X_2 + \gamma_3 X_3 + \lambda_1 I(T = 1) + \lambda_2 I(T = 2).$$

We defined various simulation scenarios in terms of the choice of parameter vectors $\beta$ and $\gamma$. The $\beta$ parameters drive the strength of the association between the covariates and the treatments, whereas the $\gamma$ parameters represent the strength of the association between the covariates and the outcome. Parameters $\lambda_1$ and $\lambda_2$ represent the true treatment effects of levels 1 and 2, respectively, and were set to either $\lambda_1 = 1.0$ and $\lambda_2 = 1.5$ or to $\lambda_1 = \lambda_2 = 0$. The scenarios have been devised to feature varying levels of confounding bias.

Scenario 1 was built to present a very small amount of confounding. The following parameter values were used for generating the data: $\beta_{12} = 0.000001$, $\beta_{22} = 0.00003$, $\beta_{13} = 0.0019$, $\beta_{23} = -0.00007$, $\gamma_0 = 0$, $\gamma_1 = 0.01$, $\gamma_2 = 0.001$, $\gamma_3 = 0.01$.

Scenario 2 features a moderate level of confounding that is produced through a weak association between the confounders and the treatment and a relatively strong association between the confounders and the outcome: $\beta_{12} = 0.003$, $\beta_{22} = 0.01$, $\beta_{13} = 0.08$, $\beta_{23} = -0.02$, $\gamma_0 = 0.1$, $\gamma_1 = 1$, $\gamma_2 = 2$, $\gamma_3 = 0.3$.

Scenario 3 is similar to Scenario 2 in terms of confounding level, but is built through a relatively strong association between the confounders and the treatment and a weak association between the confounders and the outcome: $\beta_{12} = 0.2$, $\beta_{22} = 0.05$, $\beta_{13} = 0.09$, $\beta_{23} = 0.2$, $\gamma_0 = 0.04$, $\gamma_1 = 0.1$, $\gamma_2 = 0.2$, $\gamma_3 = 0.2$.

For Scenario 4, a large level of bias has been introduced. Data generation has been performed with the following parameters: $\beta_{12} = 0.2$, $\beta_{22} = 0.3$, $\beta_{13} = 0.4$, $\beta_{23} = 0.5$, $\gamma_0 = 0$, $\gamma_1 = 1$, $\gamma_2 = 2$, $\gamma_3 = 0.3$.
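The data-generating process of this section can be sketched as follows, using the Scenario 2 parameter values and the non-null treatment effects given in the text. The function names, the dictionary keys, the seed and the use of Python’s standard random number generator are illustrative assumptions, not a description of the software used for the study.

```python
import math
import random

# Scenario 2 parameter values as listed in the text.
SCENARIO_2 = {"b12": 0.003, "b22": 0.01, "b13": 0.08, "b23": -0.02,
              "g0": 0.1, "g1": 1.0, "g2": 2.0, "g3": 0.3}


def treatment_probabilities(x1, x2, p):
    """Multinomial logistic probabilities for levels 0 (reference), 1 and 2."""
    e2 = math.exp(p["b12"] * x1 + p["b22"] * x2)
    e3 = math.exp(p["b13"] * x1 + p["b23"] * x2)
    pi1 = 1.0 / (1.0 + e2 + e3)
    return [pi1, e2 * pi1, e3 * pi1]


def simulate(n, p, lam1=1.0, lam2=1.5, seed=0):
    """Return n tuples (x1, x2, x3, t, y) drawn from the described process."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1 = rng.gauss(0.0, math.sqrt(8))        # standard deviation sqrt(8)
        x2 = rng.gauss(2.0 * x1, math.sqrt(2))   # mean 2*x1, sd sqrt(2)
        x3 = 1 if rng.random() < 0.4 else 0      # Bernoulli(0.4)
        probs = treatment_probabilities(x1, x2, p)
        t = rng.choices([0, 1, 2], weights=probs)[0]
        mu = (p["g0"] + p["g1"] * x1 + p["g2"] * x2 + p["g3"] * x3
              + lam1 * (t == 1) + lam2 * (t == 2))
        y = rng.gauss(mu, 1.0)
        data.append((x1, x2, x3, t, y))
    return data


data = simulate(500, SCENARIO_2)
probs_at_origin = treatment_probabilities(0.0, 0.0, SCENARIO_2)
```

At $X_1 = X_2 = 0$ the three treatment probabilities are equal, since both linear predictors vanish. Note that $X_3$ affects the outcome but not the treatment assignment in this design.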

For each of the scenarios, we simulated 1000 independent data sets of sample sizes 500, 1000 and 2000 under the above conditions. For each simulated data set, the propensity scores were first estimated using a main effects multinomial logistic regression model. The estimated propensity scores were then used to estimate the parameters $\lambda_1$ and $\lambda_2$ utilizing the ONBM, MW and IPTW methods.

In the case of IPTW, we considered non-truncated weights, weights truncated at the 0.5th and 99.5th percentiles and weights truncated at the 1st and 99th percentiles, with and without stabilization of weights. For ONBM, to the best of our knowledge, there is no formal rule for tuning the caliper. We computed the Mahalanobis distance between the vectors of the first two propensity scores (since $ps_{2i}(X_i) = 1 - \sum_{t=0}^{1} ps_{ti}(X_i)$). We considered caliper values ranging from the minimum to the maximum distance between paired subjects in increments of 0.01 and present detailed results for calipers of 0.05 and 0.1 as well as for the caliper that minimizes the maximum of the mean squared errors (MSE) of the two treatment parameter estimators over all simulations. The latter caliper is denoted the « optimal » caliper henceforth. We note, however, that there exist various ways in which a caliper might be optimal (e.g., if it minimizes the average of the MSE of the two estimators instead of the maximum). Graphs depicting the MSE of the parameter estimators according to the caliper value are also provided. We computed the following measures to assess and compare the performance of these three methods: mean, bias, standard deviation (Std), MSE and the proportion of times that the 95% confidence intervals included the true value of the treatment effect (Coverage CI). To assess bias levels, we calculated a crude or unadjusted estimate of the treatment effect. For each scenario, we also plotted the overlap of the estimated propensity scores before adjustment, using a single simulated sample of 2000 observations.
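The performance measures and the « optimal » caliper rule described above can be sketched as follows. The sketch assumes that the standard deviation is the sample standard deviation of the estimates across simulated data sets and that the MSE is the average squared deviation from the true value. The function names and the toy estimates are hypothetical. The MSE values passed to the caliper rule are those reported in Table 2.1 for a sample size of 500.

```python
import statistics


def performance(estimates, ci_bounds, truth):
    """Summarize one estimator over simulated data sets."""
    mean = statistics.fmean(estimates)
    return {
        "mean": mean,
        "bias": mean - truth,
        "std": statistics.stdev(estimates),
        "mse": statistics.fmean((e - truth) ** 2 for e in estimates),
        # Proportion of confidence intervals that contain the true value.
        "coverage": statistics.fmean(
            1.0 if low <= truth <= high else 0.0 for low, high in ci_bounds),
    }


def optimal_caliper(mse_by_caliper):
    """Caliper minimizing the larger of the two MSEs (lambda_1, lambda_2)."""
    return min(mse_by_caliper, key=lambda c: max(mse_by_caliper[c]))


# Hypothetical estimates and confidence intervals from four simulated data sets.
estimates = [0.9, 1.0, 1.1, 1.2]
ci_bounds = [(0.7, 1.1), (0.8, 1.2), (0.9, 1.3), (1.05, 1.35)]
summary = performance(estimates, ci_bounds, truth=1.0)

# MSE pairs for three calipers, taken from Table 2.1 (sample size 500).
best_caliper = optimal_caliper({0.05: (0.171, 0.176),
                                0.10: (0.052, 0.060),
                                3.00: (0.016, 0.017)})
```

Replacing `max` by an average in `optimal_caliper` gives the alternative notion of optimality mentioned in the text.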

Simulation results

Tables 2.1, 2.2, 2.3 and 2.4 compare the aforementioned performance criteria across the ONBM, MW, IPTW and stabilized weights methods. We only present and discuss simulation results obtained with true treatment effects $\lambda_1 = 1.0$ and $\lambda_2 = 1.5$. The results for the null treatment effect ($\lambda_1 = \lambda_2 = 0$), which do not differ substantially, are presented in Appendices B, C, D and E.

By construction, Scenario 1 guarantees an adequate unadjusted estimate of the treatment effect that is almost equal to the true value. In fact, Figure 2.1 depicts a near-perfect overlap of the propensity scores across the treatment groups, which indicates that the groups are very similar in terms of their covariate distributions. As such, all methods yield estimates with bias close to 0 and coverage of confidence intervals close to the nominal rate (Table 2.1). However, there is a slight advantage for MW and the usual weighting approaches in terms of MSE as compared to ONBM, even when an « optimal » value of the caliper is chosen. This difference attenuates as sample size grows. A more detailed analysis of the relation between caliper and MSE in Scenario 1 highlights that ONBM accuracy increases with caliper values (see Figure 2.2).

The results for Scenario 2 are presented in Table 2.2. A moderate amount of bias is present in this scenario: unadjusted estimates are roughly twice as large as the true parameter values. All the methods considered properly correct this bias. The 95% confidence intervals are, however, too conservative, with coverage probabilities close to 1. As in Scenario 1, weighting methods outperform ONBM in terms of MSE, even when the caliper is « optimally » chosen. This is mainly due to the high variance of the estimates produced by ONBM. Also, IPTW has a slightly smaller MSE than MW. Figure 2.3 shows that the MSE first decreases up to calipers of roughly 0.51 and then increases very slightly. The « optimal » caliper varies with sample size: it was 0.51, 0.35 and 0.42 for sample sizes of 500, 1000 and 2000, respectively.

Table 2.1. Estimate of treatment effect in Scenario 1. True effects are $\lambda_1 = 1$ and $\lambda_2 = 1.5$.

| Sample size | Approach | Mean $\hat\lambda_1$ | Mean $\hat\lambda_2$ | Std $\hat\lambda_1$ | Std $\hat\lambda_2$ | Bias $\hat\lambda_1$ | Bias $\hat\lambda_2$ | MSE $\hat\lambda_1$ | MSE $\hat\lambda_2$ | Coverage CI $\hat\lambda_1$ | Coverage CI $\hat\lambda_2$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 500 | Crude estimation | 0.994 | 1.497 | 0.107 | 0.114 | -0.006 | -0.003 | 0.011 | 0.013 | 0.955 | 0.938 |
| 500 | ONBM, caliper 0.05 | 1.008 | 1.502 | 0.413 | 0.420 | 0.008 | 0.002 | 0.171 | 0.176 | 0.951 | 0.942 |
| 500 | ONBM, caliper 0.10 | 1.006 | 1.499 | 0.228 | 0.244 | 0.006 | -0.001 | 0.052 | 0.060 | 0.964 | 0.941 |
| 500 | ONBM, « optimal » caliper 3.00 | 0.995 | 1.495 | 0.125 | 0.130 | -0.005 | -0.005 | 0.016 | 0.017 | 0.945 | 0.942 |
| 500 | Matching weights | 0.994 | 1.497 | 0.108 | 0.115 | -0.006 | -0.003 | 0.012 | 0.013 | 0.957 | 0.943 |
| 500 | IPTW, 99th percentile | 0.994 | 1.496 | 0.108 | 0.115 | -0.006 | -0.004 | 0.012 | 0.013 | 0.954 | 0.936 |
| 500 | IPTW, 99.5th percentile | 0.994 | 1.496 | 0.108 | 0.115 | -0.006 | -0.004 | 0.012 | 0.013 | 0.954 | 0.936 |
| 500 | IPTW, no truncation | 0.994 | 1.496 | 0.108 | 0.115 | -0.006 | -0.004 | 0.012 | 0.013 | 0.955 | 0.936 |
| 500 | Stabilized weights, 99th percentile | 0.994 | 1.496 | 0.108 | 0.115 | -0.006 | -0.004 | 0.012 | 0.013 | 0.954 | 0.936 |
| 500 | Stabilized weights, 99.5th percentile | 0.994 | 1.497 | 0.108 | 0.115 | -0.006 | -0.003 | 0.012 | 0.013 | 0.954 | 0.936 |
| 1000 | Crude estimation | 1.004 | 1.504 | 0.079 | 0.081 | 0.004 | 0.004 | 0.006 | 0.007 | 0.955 | 0.945 |
| 1000 | ONBM, caliper 0.05 | 1.014 | 1.510 | 0.210 | 0.217 | 0.014 | 0.010 | 0.044 | 0.047 | 0.948 | 0.943 |
| 1000 | ONBM, caliper 0.10 | 1.009 | 1.509 | 0.135 | 0.133 | 0.009 | 0.009 | 0.018 | 0.018 | 0.940 | 0.944 |
| 1000 | ONBM, « optimal » caliper 3.00 | 1.004 | 1.504 | 0.091 | 0.093 | 0.004 | 0.004 | 0.008 | 0.009 | 0.950 | 0.939 |
| 1000 | Matching weights | 1.004 | 1.503 | 0.079 | 0.081 | 0.004 | 0.003 | 0.006 | 0.007 | 0.951 | 0.944 |
| 1000 | IPTW, 99th percentile | 1.004 | 1.503 | 0.079 | 0.081 | 0.004 | 0.003 | 0.006 | 0.007 | 0.953 | 0.941 |
| 1000 | IPTW, 99.5th percentile | 1.004 | 1.503 | 0.079 | 0.081 | 0.004 | 0.003 | 0.006 | 0.007 | 0.953 | 0.941 |
| 1000 | IPTW, no truncation | 1.004 | 1.503 | 0.079 | 0.081 | 0.004 | 0.003 | 0.006 | 0.007 | 0.953 | 0.941 |
| 1000 | Stabilized weights, 99th percentile | 1.004 | 1.503 | 0.079 | 0.081 | 0.004 | 0.003 | 0.006 | 0.007 | 0.953 | 0.941 |
| 1000 | Stabilized weights, 99.5th percentile | 1.004 | 1.503 | 0.079 | 0.081 | 0.004 | 0.003 | 0.006 | 0.007 | 0.953 | 0.941 |
| 2000 | Crude estimation | 1.002 | 1.502 | 0.054 | 0.055 | 0.002 | 0.002 | 0.003 | 0.003 | 0.952 | 0.938 |
| 2000 | ONBM, caliper 0.05 | 1.003 | 1.503 | 0.116 | 0.116 | 0.003 | 0.003 | 0.013 | 0.014 | 0.948 | 0.946 |
| 2000 | ONBM, caliper 0.10 | 1.004 | 1.505 | 0.081 | 0.080 | 0.004 | 0.005 | 0.007 | 0.006 | 0.944 | 0.952 |
| 2000 | ONBM, « optimal » caliper 3.00 | 1.003 | 1.505 | 0.064 | 0.064 | 0.003 | 0.005 | 0.004 | 0.004 | 0.951 | 0.959 |
| 2000 | Matching weights | 1.002 | 1.502 | 0.055 | 0.055 | 0.002 | 0.002 | 0.003 | 0.003 | 0.949 | 0.937 |
| 2000 | IPTW, 99th percentile | 1.002 | 1.502 | 0.054 | 0.055 | 0.002 | 0.002 | 0.003 | 0.003 | 0.950 | 0.936 |
| 2000 | IPTW, 99.5th percentile | 1.002 | 1.502 | 0.054 | 0.055 | 0.002 | 0.002 | 0.003 | 0.003 | 0.950 | 0.936 |
| 2000 | IPTW, no truncation | 1.002 | 1.502 | 0.054 | 0.055 | 0.002 | 0.002 | 0.003 | 0.003 | 0.949 | 0.936 |
| 2000 | Stabilized weights, 99th percentile | 1.002 | 1.502 | 0.054 | 0.055 | 0.002 | 0.002 | 0.003 | 0.003 | 0.950 | 0.936 |
| 2000 | Stabilized weights, 99.5th percentile | 1.002 | 1.502 | 0.054 | 0.055 | 0.002 | 0.002 | 0.003 | 0.003 | 0.950 | 0.936 |

Table 2.2. Estimate of treatment effect in Scenario 2. True effects are $\lambda_1 = 1$ and $\lambda_2 = 1.5$.

| Sample size | Approach | Mean $\hat\lambda_1$ | Mean $\hat\lambda_2$ | Std $\hat\lambda_1$ | Std $\hat\lambda_2$ | Bias $\hat\lambda_1$ | Bias $\hat\lambda_2$ | MSE $\hat\lambda_1$ | MSE $\hat\lambda_2$ | Coverage CI $\hat\lambda_1$ | Coverage CI $\hat\lambda_2$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 500 | Crude estimation | 2.061 | 2.961 | 1.542 | 1.618 | 1.061 | 1.461 | 3.505 | 4.755 | 0.904 | 0.844 |
| 500 | ONBM, caliper 0.05 | 1.148 | 1.511 | 2.067 | 2.029 | 0.148 | 0.011 | 4.293 | 4.117 | 0.999 | 0.999 |
| 500 | ONBM, caliper 0.10 | 1.037 | 1.451 | 1.096 | 1.127 | 0.037 | -0.049 | 1.202 | 1.271 | 0.999 | 0.999 |
| 500 | ONBM, « optimal » caliper 0.51 | 1.085 | 1.582 | 0.585 | 0.603 | 0.085 | 0.082 | 0.349 | 0.371 | 1 | 1 |
| 500 | Matching weights | 0.984 | 1.497 | 0.220 | 0.217 | -0.016 | -0.003 | 0.048 | 0.047 | 1 | 1 |
| 500 | IPTW, 99th percentile | 1.011 | 1.525 | 0.200 | 0.197 | 0.011 | 0.025 | 0.040 | 0.039 | 1 | 1 |
| 500 | IPTW, 99.5th percentile | 1.027 | 1.547 | 0.203 | 0.200 | 0.027 | 0.047 | 0.042 | 0.042 | 1 | 1 |
| 500 | IPTW, no truncation | 0.995 | 1.501 | 0.201 | 0.197 | -0.005 | 0.001 | 0.041 | 0.039 | 1 | 1 |
| 500 | Stabilized weights, 99th percentile | 1.013 | 1.526 | 0.200 | 0.196 | 0.013 | 0.026 | 0.040 | 0.039 | 1 | 1 |
| 500 | Stabilized weights, 99.5th percentile | 1.030 | 1.549 | 0.203 | 0.200 | 0.030 | 0.049 | 0.042 | 0.042 | 1 | 1 |
| 1000 | Crude estimation | 1.910 | 3.000 | 1.133 | 1.118 | 0.910 | 1.500 | 2.112 | 3.501 | 0.875 | 0.730 |
| 1000 | ONBM, caliper 0.05 | 0.945 | 1.442 | 1.027 | 1.062 | -0.055 | -0.058 | 1.059 | 1.132 | 1 | 0.999 |
| 1000 | ONBM, caliper 0.10 | 0.985 | 1.495 | 0.608 | 0.622 | -0.015 | -0.005 | 0.370 | 0.387 | 1 | 1 |
| 1000 | ONBM, « optimal » caliper 0.35 | 1.031 | 1.557 | 0.372 | 0.359 | 0.031 | 0.057 | 0.139 | 0.132 | 1 | 1 |
| 1000 | Matching weights | 1.004 | 1.496 | 0.128 | 0.125 | 0.004 | -0.004 | 0.016 | 0.016 | 1 | 1 |
| 1000 | IPTW, 99th percentile | 1.015 | 1.519 | 0.119 | 0.122 | 0.015 | 0.019 | 0.014 | 0.015 | 1 | 1 |
| 1000 | IPTW, 99.5th percentile | 1.028 | 1.542 | 0.124 | 0.126 | 0.028 | 0.042 | 0.016 | 0.018 | 1 | 1 |
| 1000 | IPTW, no truncation | 1.001 | 1.495 | 0.118 | 0.120 | 0.001 | -0.005 | 0.014 | 0.014 | 1 | 1 |
| 1000 | Stabilized weights, 99th percentile | 1.016 | 1.521 | 0.119 | 0.122 | 0.016 | 0.021 | 0.014 | 0.015 | 1 | 1 |
| 1000 | Stabilized weights, 99.5th percentile | 1.030 | 1.545 | 0.124 | 0.126 | 0.030 | 0.045 | 0.016 | 0.018 | 1 | 1 |
| 2000 | Crude estimation | 1.941 | 3.034 | 0.776 | 0.766 | 0.941 | 1.534 | 1.488 | 2.940 | 0.788 | 0.507 |
| 2000 | ONBM, caliper 0.05 | 1.009 | 1.510 | 0.533 | 0.531 | 0.009 | 0.010 | 0.284 | 0.282 | 1 | 0.999 |
| 2000 | ONBM, caliper 0.10 | 1.020 | 1.521 | 0.326 | 0.333 | 0.020 | 0.021 | 0.106 | 0.111 | 1 | 1 |
| 2000 | ONBM, « optimal » caliper 0.42 | 1.040 | 1.551 | 0.224 | 0.225 | 0.040 | 0.051 | 0.052 | 0.053 | 1 | 1 |
| 2000 | Matching weights | 1.003 | 1.503 | 0.081 | 0.081 | 0.003 | 0.003 | 0.007 | 0.007 | 1 | 1 |
| 2000 | IPTW, 99th percentile | 1.016 | 1.526 | 0.075 | 0.076 | 0.016 | 0.026 | 0.006 | 0.007 | 1 | 1 |
| 2000 | IPTW, 99.5th percentile | 1.030 | 1.550 | 0.079 | 0.079 | 0.030 | 0.050 | 0.007 | 0.009 | 1 | 1 |
| 2000 | IPTW, no truncation | 1.002 | 1.501 | 0.074 | 0.077 | 0.002 | 0.001 | 0.005 | 0.006 | 1 | 1 |
| 2000 | Stabilized weights, 99th percentile | 1.017 | 1.528 | 0.075 | 0.077 | 0.017 | 0.028 | 0.006 | 0.007 | 1 | 1 |
| 2000 | Stabilized weights, 99.5th percentile | 1.031 | 1.552 | 0.079 | 0.079 | 0.031 | 0.052 | 0.007 | 0.009 | 1 | 1 |

Figure 2.1. Overlap on fitted propensity scores before adjustment in Scenario 1 using 2000 observations. Legend: the solid line refers to the treatment group and the dashed lines to the control groups. The dotted red line represents the median of the treatment group. (a) Group 1 is the treatment group and Groups 2 and 3 are the control groups. (b) Group 2 is the treatment group and Groups 3 and 1 are the control groups. (c) Group 3 is the treatment group and Groups 1 and 2 are the control groups.

Although Scenario 3 has a level of confounding similar to that of Scenario 2, the differing associations between confounders and treatments lead to a very small area of overlap between treated and control groups (Figure 2.4). Despite the low degree of similarity between the propensity scores, all methods correct this bias at all sample sizes considered (Table 2.3). The matching weights method gives the best results: IPTW’s MSE without truncation is at least twice as large as MW’s, whereas IPTW’s MSE with truncation is at least 50% larger than MW’s. Similar differences between ONBM’s and MW’s MSE are also observed. An analysis of ONBM’s performance according to the caliper shows a rapid drop of the MSE up to calipers around 0.15, followed by a roughly linear increase in MSE whose slope is sharper for larger sample sizes (Figure 2.5).

Figure 2.2. MSE of optimal nonbipartite matching as a function of the caliper parameter in Scenario 1. Panels: (a) sample size = 500; (b) sample size = 1000; (c) sample size = 2000. Legend: solid line ($\lambda_1$), dashed line ($\lambda_2$).

Figure 2.4. Overlap on fitted propensity scores before adjustment in Scenario 3 using 2000 observations. Legend: the solid line refers to the treatment group and the dashed lines to the control groups. The dotted red line represents the median of the treatment group. (a) Group 1 is the treatment group and Groups 2 and 3 are the control groups. (b) Group 2 is the treatment group and Groups 3 and 1 are the control groups. (c) Group 3 is the treatment group and Groups 1 and 2 are the control groups.

Table 2.3. Estimate of treatment effect in Scenario 3. True effects are $\lambda_1 = 1$ and $\lambda_2 = 1.5$.

| Sample size | Approach | Mean $\hat\lambda_1$ | Mean $\hat\lambda_2$ | Std $\hat\lambda_1$ | Std $\hat\lambda_2$ | Bias $\hat\lambda_1$ | Bias $\hat\lambda_2$ | MSE $\hat\lambda_1$ | MSE $\hat\lambda_2$ | Coverage CI $\hat\lambda_1$ | Coverage CI $\hat\lambda_2$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 500 | Crude estimation | 1.947 | 3.087 | 0.179 | 0.174 | 0.947 | 1.587 | 0.929 | 2.548 | 0 | 0 |
| 500 | ONBM, caliper 0.05 | 1.005 | 1.498 | 0.459 | 0.441 | 0.005 | -0.002 | 0.211 | 0.195 | 0.986 | 0.990 |
| 500 | ONBM, caliper 0.10 | 1.027 | 1.528 | 0.262 | 0.269 | 0.027 | 0.028 | 0.069 | 0.073 | 0.994 | 0.994 |
| 500 | ONBM, « optimal » caliper 0.23 | 1.057 | 1.591 | 0.186 | 0.181 | 0.057 | 0.091 | 0.038 | 0.041 | 0.991 | 0.986 |
| 500 | Matching weights | 0.995 | 1.497 | 0.140 | 0.141 | -0.005 | -0.003 | 0.020 | 0.020 | 0.982 | 0.985 |
| 500 | IPTW, 99th percentile | 1.069 | 1.625 | 0.170 | 0.191 | 0.069 | 0.125 | 0.034 | 0.052 | 0.982 | 0.951 |
| 500 | IPTW, 99.5th percentile | 1.108 | 1.693 | 0.161 | 0.177 | 0.108 | 0.193 | 0.037 | 0.069 | 0.978 | 0.912 |
| 500 | IPTW, no truncation | 1.001 | 1.500 | 0.227 | 0.281 | 0.001 | 0.0001 | 0.051 | 0.079 | 0.988 | 0.973 |
| 500 | Stabilized weights, 99th percentile | 1.070 | 1.626 | 0.170 | 0.192 | 0.070 | 0.126 | 0.034 | 0.053 | 0.982 | 0.948 |
| 500 | Stabilized weights, 99.5th percentile | 1.109 | 1.696 | 0.160 | 0.177 | 0.109 | 0.196 | 0.038 | 0.070 | 0.979 | 0.904 |
| 1000 | Crude estimation | 1.947 | 3.089 | 0.132 | 0.120 | 0.947 | 1.589 | 0.915 | 2.540 | 0 | 0 |
| 1000 | ONBM, caliper 0.05 | 1.001 | 1.501 | 0.250 | 0.231 | 0.001 | 0.001 | 0.063 | 0.054 | 0.989 | 0.994 |
| 1000 | ONBM, caliper 0.10 | 1.022 | 1.532 | 0.166 | 0.156 | 0.022 | 0.032 | 0.028 | 0.025 | 0.991 | 0.992 |
| 1000 | ONBM, « optimal » caliper 0.16 | 1.039 | 1.574 | 0.129 | 0.128 | 0.039 | 0.074 | 0.018 | 0.022 | 0.987 | 0.987 |
| 1000 | Matching weights | 0.998 | 1.497 | 0.101 | 0.093 | -0.002 | -0.003 | 0.010 | 0.009 | 0.982 | 0.988 |
| 1000 | IPTW, 99th percentile | 1.068 | 1.630 | 0.115 | 0.123 | 0.068 | 0.130 | 0.018 | 0.032 | 0.982 | 0.938 |
| 1000 | IPTW, 99.5th percentile | 1.104 | 1.690 | 0.111 | 0.119 | 0.104 | 0.190 | 0.023 | 0.050 | 0.967 | 0.851 |
| 1000 | IPTW, no truncation | 1.001 | 1.512 | 0.147 | 0.165 | 0.001 | 0.012 | 0.022 | 0.028 | 0.987 | 0.988 |
| 1000 | Stabilized weights, 99th percentile | 1.070 | 1.631 | 0.115 | 0.124 | 0.070 | 0.131 | 0.018 | 0.032 | 0.979 | 0.937 |
| 1000 | Stabilized weights, 99.5th percentile | 1.105 | 1.693 | 0.110 | 0.119 | 0.105 | 0.193 | 0.023 | 0.051 | 0.962 | 0.843 |
| 2000 | Crude estimation | 1.945 | 3.089 | 0.089 | 0.087 | 0.945 | 1.589 | 0.900 | 2.533 | 0 | 0 |
| 2000 | ONBM, caliper 0.05 | 1.012 | 1.514 | 0.139 | 0.130 | 0.012 | 0.014 | 0.020 | 0.017 | 0.992 | 0.999 |
| 2000 | ONBM, caliper 0.10 | 1.023 | 1.542 | 0.094 | 0.094 | 0.023 | 0.042 | 0.009 | 0.011 | 0.992 | 0.990 |
| 2000 | ONBM, « optimal » caliper 0.11 | 1.025 | 1.547 | 0.097 | 0.091 | 0.025 | 0.047 | 0.010 | 0.011 | 0.990 | 0.986 |
| 2000 | Matching weights | 0.998 | 1.501 | 0.068 | 0.069 | -0.002 | 0.001 | 0.005 | 0.005 | 0.976 | 0.987 |
| 2000 | IPTW, 99th percentile | 1.069 | 1.628 | 0.079 | 0.095 | 0.069 | 0.128 | 0.011 | 0.025 | 0.979 | 0.857 |
| 2000 | IPTW, 99.5th percentile | 1.105 | 1.691 | 0.075 | 0.088 | 0.105 | 0.191 | 0.017 | 0.044 | 0.941 | 0.647 |
| 2000 | IPTW, no truncation | 1.002 | 1.505 | 0.104 | 0.129 | 0.002 | 0.005 | 0.011 | 0.017 | 0.996 | 0.983 |
| 2000 | Stabilized weights, 99th percentile | 1.070 | 1.629 | 0.078 | 0.095 | 0.070 | 0.129 | 0.011 | 0.026 | 0.980 | 0.854 |
| 2000 | Stabilized weights, 99.5th percentile | 1.106 | 1.693 | 0.075 | 0.088 | 0.106 | 0.193 | 0.017 | 0.045 | 0.942 | 0.632 |

Scenario 4 exhibits a large imbalance in covariates. Figure 2.6 shows the lack of overlap in the propensity scores between treatment groups. The crude estimates confirm that a very large amount of bias is present. Only the MW method recovers the true parameters in this scenario, essentially eliminating the bias (Table 2.4). A substantial reduction of the bias is observed after adjustment utilizing ONBM. Standard IPTW adjustment yields some reduction of the bias, but substantial bias remains, even at large sample sizes. Moreover, the coverage probabilities of its 95% confidence intervals are far below their nominal level, indicating that invalid inferences are produced. Truncation of the weights does not improve the results either: in most cases, IPTW’s MSE after truncation is higher than without truncation. Figure 2.7 illustrates that balance in propensity scores is improved after IPTW adjustment, but that some amount of imbalance remains. In terms of MSE, however, the results do not reveal a clear trend that would identify the best approach. In fact, for estimating $\lambda_1$, ONBM performs better than MW in terms of MSE, while matching weights performs better for estimating $\lambda_2$. Arguably, if the focus is on reducing bias, instead of compromising between bias and variance, then MW would be preferred. With regard to the relationship between ONBM’s MSE and the caliper, Figure 2.8 presents trends similar to those observed in Scenario 3.

| n | Method | Mean λ̂₁ | Mean λ̂₂ | SD λ̂₁ | SD λ̂₂ | Bias λ̂₁ | Bias λ̂₂ | MSE λ̂₁ | MSE λ̂₂ | Coverage λ̂₁ | Coverage λ̂₂ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 500 | Crude estimation | 14.126 | 24.881 | 1.164 | 1.042 | 13.126 | 23.381 | 173.648 | 547.739 | 0 | 0 |
| 500 | ONBM, caliper 0.05 | 1.181 | 1.919 | 0.528 | 0.578 | 0.181 | 0.419 | 0.312 | 0.510 | 1 | 1 |
| 500 | ONBM, caliper 0.10 | 1.257 | 2.061 | 0.396 | 0.449 | 0.257 | 0.561 | 0.223 | 0.517 | 1 | 1 |
| 500 | ONBM, « optimal » caliper 0.07 | 1.224 | 1.976 | 0.436 | 0.500 | 0.224 | 0.476 | 0.241 | 0.476 | 1 | 1 |
| 500 | Matching weights | 0.972 | 1.476 | 0.544 | 0.570 | -0.028 | -0.024 | 0.297 | 0.325 | 0.993 | 0.996 |
| 500 | IPTW, 99th percentile | 5.832 | 9.339 | 2.288 | 2.523 | 4.832 | 7.839 | 28.585 | 67.810 | 0.587 | 0.155 |
| 500 | IPTW, 99.5th percentile | 6.676 | 10.600 | 1.722 | 1.916 | 5.676 | 9.100 | 35.184 | 86.489 | 0.333 | 0.030 |
| 500 | IPTW, no truncation | 3.569 | 5.884 | 5.247 | 5.768 | 2.569 | 4.384 | 34.131 | 52.489 | 0.741 | 0.509 |
| 500 | Stabilized weights, 99th percentile | 5.876 | 9.751 | 2.561 | 2.574 | 4.876 | 8.251 | 30.336 | 74.707 | 0.611 | 0.121 |
| 500 | Stabilized weights, 99.5th percentile | 6.817 | 11.200 | 1.964 | 1.960 | 5.817 | 9.700 | 37.690 | 97.933 | 0.393 | 0.016 |
| 1000 | Crude estimation | 14.021 | 24.871 | 0.830 | 0.713 | 13.021 | 23.371 | 170.245 | 546.715 | 0 | 0 |
| 1000 | ONBM, caliper 0.05*** | 1.129 | 1.839 | 0.281 | 0.307 | 0.129 | 0.339 | 0.096 | 0.209 | 1 | 1 |
| 1000 | ONBM, caliper 0.10 | 1.207 | 1.966 | 0.235 | 0.278 | 0.207 | 0.466 | 0.098 | 0.295 | 1 | 1 |
| 1000 | Matching weights | 0.999 | 1.496 | 0.370 | 0.389 | -0.001 | -0.004 | 0.137 | 0.151 | 0.996 | 0.998 |
| 1000 | IPTW, 99th percentile | 5.626 | 9.178 | 1.555 | 1.713 | 4.626 | 7.678 | 23.816 | 61.886 | 0.379 | 0.042 |
| 1000 | IPTW, 99.5th percentile | 6.469 | 10.445 | 1.231 | 1.333 | 5.469 | 8.945 | 31.421 | 81.794 | 0.087 | 0 |
| 1000 | IPTW, no truncation | 2.639 | 4.797 | 4.541 | 5.186 | 1.639 | 3.297 | 23.305 | 37.767 | 0.798 | 0.564 |
| 1000 | Stabilized weights, 99th percentile | 5.650 | 9.638 | 1.761 | 1.745 | 4.650 | 8.138 | 24.726 | 69.274 | 0.430 | 0.027 |
| 1000 | Stabilized weights, 99.5th percentile | 6.579 | 11.083 | 1.393 | 1.355 | 5.579 | 9.583 | 33.060 | 93.667 | 0.147 | 0 |
| 2000 | Crude estimation | 14.081 | 24.879 | 0.569 | 0.503 | 13.081 | 23.379 | 171.434 | 546.817 | 0 | 0 |
| 2000 | ONBM, caliper 0.05*** | 1.117 | 1.804 | 0.179 | 0.199 | 0.117 | 0.304 | 0.046 | 0.132 | 1 | 1 |
| 2000 | ONBM, caliper 0.10 | 1.187 | 1.907 | 0.160 | 0.197 | 0.187 | 0.407 | 0.060 | 0.205 | 1 | 1 |
| 2000 | Matching weights | 1.017 | 1.499 | 0.246 | 0.260 | 0.017 | -0.001 | 0.061 | 0.068 | 0.997 | 1 |
| 2000 | IPTW, 99th percentile | 5.601 | 9.113 | 1.126 | 1.263 | 4.601 | 7.613 | 22.435 | 59.548 | 0.121 | 0.005 |
| 2000 | IPTW, 99.5th percentile | 6.458 | 10.399 | 0.845 | 0.948 | 5.458 | 8.899 | 30.505 | 80.090 | 0.003 | 0 |
| 2000 | IPTW, no truncation | 2.497 | 4.138 | 4.085 | 4.936 | 1.497 | 2.638 | 18.929 | 31.318 | 0.748 | 0.569 |
| 2000 | Stabilized weights, 99th percentile | 5.636 | 9.578 | 1.306 | 1.303 | 4.636 | 8.078 | 23.198 | 66.947 | 0.189 | 0.004 |
| 2000 | Stabilized weights, 99.5th percentile | 6.585 | 11.051 | 0.990 | 0.983 | 5.585 | 9.551 | 32.175 | 92.190 | 0.014 | 0 |

*** « Optimal » caliper.

2.3.2 Plasmode simulation

Plasmode datasets

Figure 2.6. Overlap on fitted propensity scores before adjustment in Scenario 4 using 2000 observations. Legend: the solid line refers to the treatment group and the dashed line to the control groups. The dotted red line represents the median of the treatment group. (a) Group 1 is the treatment group and Groups 2 and 3 are the control groups. (b) Group 2 is the treatment group and Groups 3 and 1 are the control groups. (c) Group 3 is the treatment group and Groups 1 and 2 are the control groups.

We used data on forest fires in Alberta, Canada, as a basis for our plasmode simulation. The data are produced and published online by the Government of Alberta (Wildfire Management Branch, Alberta Agriculture and Forestry; Agriculture & Forestry, 2015). The purpose of the analysis is to compare various interventions for fighting wildfires in Alberta with respect to their probability of preventing a fire from growing after its initial assessment. A more detailed explanation of the problem and of the variables is available in Tremblay et al. (2018); see also the references therein for more subject-matter details. We considered only fires caused by lightning from 1996 to 2014. Observations for which the size at « being held » was smaller than the size at initial attack were also removed. « Being held » is defined by Tremblay et al. (2018) as a state in which no further increase in size is expected. Each fire was associated with an ecological region to account for large-scale geographic variation in, e.g., climate, fuels, or economic value of forests. Finally, we created a variable that counts the number of fires active at the time of initial assessment of each fire. Given that the resources available for fire attack are limited, this variable is also likely to influence decision making. Some levels of categorical variables with few instances were dropped, and observations in those categories were removed.

The final database contained 8591 observations and 7 covariates (see Appendix F). The treatment variable is the method of intervention used by the firefighters to suppress the wildfire. Its categories were heli-attack crew with helicopter but no rappel capability (HAC1H; 53.8%), heli-attack crew with helicopter and rappel capability (HAC1R; 15.8%), fire-attack crew with or without a helicopter and no rappel capability (HAC1F; 6.3%), air tanker (15.3%) and ground-based action (8.8%). The binary response variable was 1 if the fire increased in size between initial attack and « being held » and 0 otherwise (n = 1982 and 6645, respectively).
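The data preparation steps described above can be sketched as follows. This is only an illustration on toy records: the field names are hypothetical and do not correspond to the column names of the Alberta dataset, and the definition of an « active » fire (assessed earlier and not yet over) is an assumption made for the example, as is the choice to count active fires among all records rather than among retained ones.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Fire:
    # Hypothetical field names, not those of the Alberta dataset.
    cause: str
    assessed: datetime      # time of initial assessment
    ended: datetime         # time at which the fire is no longer active
    size_initial: float     # size at initial attack
    size_held: float        # size at "being held"


def keep(fire):
    """Exclusion rules from the text: lightning fires, 1996 to 2014,
    and size at "being held" not smaller than size at initial attack."""
    return (
        fire.cause == "lightning"
        and 1996 <= fire.assessed.year <= 2014
        and fire.size_held >= fire.size_initial
    )


def active_at_assessment(fire, fires):
    """Number of other fires active when `fire` is assessed.
    Assumption: a fire is active from its assessment until `ended`."""
    return sum(
        1
        for other in fires
        if other is not fire and other.assessed <= fire.assessed < other.ended
    )


def grew(fire):
    """Binary response: 1 if the fire grew between initial attack and "being held"."""
    return int(fire.size_held > fire.size_initial)


# Toy records, invented for illustration only.
fires = [
    Fire("lightning", datetime(2010, 7, 1, 12), datetime(2010, 7, 5, 12), 0.5, 2.0),
    Fire("lightning", datetime(2010, 7, 2, 9), datetime(2010, 7, 3, 9), 1.0, 1.0),
    Fire("human", datetime(2010, 7, 2, 10), datetime(2010, 7, 4, 10), 0.2, 0.3),
    Fire("lightning", datetime(2010, 7, 3, 8), datetime(2010, 7, 6, 8), 2.0, 1.5),
    Fire("lightning", datetime(1995, 8, 1, 8), datetime(1995, 8, 2, 8), 1.0, 3.0),
]

retained = [f for f in fires if keep(f)]
n_active = [active_at_assessment(f, fires) for f in retained]
response = [grew(f) for f in retained]
print(len(retained), n_active, response)
```

With these toy records, only the first two fires are retained: the third was not caused by lightning, the fourth was smaller at « being held » than at initial attack, and the fifth falls outside the study period.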