Semester 1, Academic year 2016-2017, Rémi Bazillier, Programme evaluation, Matching, RDD

Empirical Methods in Development Economics
Université Paris 1 Panthéon-Sorbonne

Programme evaluation, Matching, RDD

Rémi Bazillier

I Programme evaluation: identify the causal effects of a ‘treatment’ or a ‘programme’

I For an individual i with observed characteristics x_i, assigned to treatment w ∈ {0, 1} and with observed outcome y_i, what would individual i have looked like if they had received treatment w′ instead?

I See the Roy-Rubin model (ch. 1)

I The evaluation problem can be divided into two distinct parts:

I A set of potential outcomes

I An assignment mechanism that assigns each unit to one and only one treatment at each point in time

I The fundamental problem of causal inference (Holland 1986):

I At any point in time, only one of these potential outcomes will actually be observed, depending on the assignment mechanism

I For individuals whom we observe under treatment, we have to form an estimate of what they would have looked like if they had not been treated

I The difference in mean outcomes between treated and non-treated can be decomposed into the difference in outcomes for those who are treated (the average treatment effect on the treated, ATT) plus the difference in the potential outcome without treatment between those who actually received treatment and those who did not (the selection bias)

I When potential outcomes are uncorrelated with treatment status, the selection bias is equal to zero (the case of randomized experiments)
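The decomposition can be written out explicitly. The following is a sketch in the notation used in the rest of these slides, where T ∈ {0, 1} is the treatment indicator and Y(1), Y(0) are the potential outcomes with and without treatment; it is obtained by adding and subtracting E[Y(0) | T = 1]:

```latex
\begin{align*}
E[Y \mid T=1] - E[Y \mid T=0]
  &= E[Y(1) \mid T=1] - E[Y(0) \mid T=0] \\
  &= \underbrace{E[Y(1) - Y(0) \mid T=1]}_{\text{ATT}}
   + \underbrace{E[Y(0) \mid T=1] - E[Y(0) \mid T=0]}_{\text{selection bias}}
\end{align*}
```

Under random assignment the second term is zero, so the simple difference in means equals the ATT.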

I Average Treatment Effect (ATE):

τ_ATE = E(τ) = E[Y(1) − Y(0)].

I Average Treatment Effect on the Treated (ATT):

τ_ATT = E[Y(1) − Y(0) | T = 1].

I τ_ATE = τ_ATT when there is no selection bias

I This holds when the treatment assignment is uncorrelated with potential outcomes (the hypothesis of unconditional unconfoundedness)

I It implies the zero conditional mean assumption (see ch. 2)

I Unconfoundedness will hold if we carry out an experiment by randomly assigning the treatment

I The simple difference-in-means estimator will be biased if there is selection into treatment

I However, it is possible to obtain an unbiased estimate of the average treatment effect if the selection is on observables

I A simple example: differences in achievement between students in private and public schools

I Obviously: the ‘treatment’ (attending a private school) will be confounded by other factors (like the wealth of the household)

I The conditional independence assumption (CIA), also called ‘ignorability of treatment’ or ‘unconfoundedness’

I Assumption that the potential outcomes are independent of actual treatment status, conditional on a vector of observables

I In that case, the ATE and ATT are the same and we can identify these effects under the overlap assumption

I Overlap holds if we have both treated and untreated individuals, conditional on the vector of observables

I To obtain the ATE, we need to be able to evaluate:

E(Achievement | Wealth = high, School = Private)
E(Achievement | Wealth = high, School = State)
E(Achievement | Wealth = low, School = Private)
E(Achievement | Wealth = low, School = State)

I But if no household with low wealth goes to private school, we have no overlap: E(Achievement | Wealth = low, School = Private) cannot be evaluated

I Strong ignorability (Rosenbaum and Rubin, 1983):

I The combination of the assumptions of unconfoundedness and overlap

I In that case, we can estimate the average treatment effect without bias by regressing the outcome variable y on the treatment dummy w and the observable variable x

I It implies that there is no omitted variable! (not very likely...)

I If we want to allow for the possibility that the effect of the observable differs across the groups, we would need to include in the regression the interaction term
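One common way to write this regression is sketched below. The coefficient names β and γ, the error term u_i and the centring of x at its sample mean are assumptions made here for illustration, not taken from the slides:

```latex
y_i = \alpha + \tau\, w_i + \beta\, x_i + \gamma\, w_i \,(x_i - \bar{x}) + u_i
```

With x centred in the interaction term, τ can still be read as the average treatment effect, while γ captures how the effect of the treatment varies with x.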

2.1. Matching based on a multiple linear regression...

I Dehejia and Wahba (1999) want to estimate the impact of a labor training program on post intervention income levels.

I More precisely, they aim at showing that a matching approach offers estimates that are close to those stemming from a randomized experiment...

I ... provided:

I unconfoundedness is credible (the dataset is rich: it allows one to control for key pre-intervention variables or variables that are

I To achieve their objective, Dehejia and Wahba (1999) rely on two groups of individuals:

I those who are treated (data for this group stem from Lalonde’s (1986) experimental dataset, called NSW (National Supported Work), which is based on a randomized experiment);

I those who are untreated (data for this group notably stem from the PSID (Panel Study of Income Dynamics): they are observational).

I Dataset NSW-PSID is a combination of these two sources of data.

I The outcome of interest is the earnings of individuals in 1978 (post intervention) in terms of 1978 dollars.

I Lalonde (1986) estimates the following equation, based on NSW data (i.e. data stemming from a randomized experiment):

Y⁷⁸ = α + τT + u. (1)

I What does τ stand for?

I Given that NSW data are experimental, one can write:

I E[Y | T = 1] = E[Y(1)] = α + τ (the average outcome for treated individuals when the treatment is random);

I E[Y | T = 0] = E[Y(0)] = α (the average outcome for untreated individuals when the treatment is random).

I Indeed, E[u | T] = 0.

I Hence,

τ = E[Y(1)] − E[Y(0)] = E[Y(1) − Y(0)] = ATE.

I Lalonde (1986) finds an ATE of 1,794 dollars (OLS estimation) that is significant at the 1% level.

I Let’s now rely on the NSW-PSID dataset and estimate Equation (1) with OLS.
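In Stata, this regression could be run as follows. This is a minimal sketch assuming the NSW-PSID data are in memory with the variable names that appear in the graph command used later in these slides (re78 for 1978 earnings, treat for the treatment dummy); vce(robust) is an optional choice made here, not something the slides specify.

```stata
* OLS estimate of Equation (1): the coefficient on treat is the estimate of tau
regress re78 treat, vce(robust)
```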

I Coefficient τ is equal to -15204.78 (and significant at the 1% level)!

I Do you see why?

I This is because the assignment of individuals to the treated and to the non treated group is not random anymore!

I We are now working on observational, not on experimental data!

I Put differently, there are characteristics (such as being unemployed before the intervention) that:

I positively impact the probability of being enrolled in the labor training program;

I negatively impact post treatment earnings.

I Therefore, even in the absence of the labor training program, those who enrolled in this program would have ended up with lower post treatment earnings than those who did not enroll.

I Hence, the selection bias captured by

E[Y(0)|T =1]−E[Y(0)|T =0]

is negative... it works against our finding a positive impact of the labor training program on post treatment earnings.

I How could we reduce this bias?

I One could introduce in Equation (1) the variables that likely influence both treatment assignment and potential outcomes (we denote this set of variables by X):

Y⁷⁸ = α + τT + X′β + u. (2)

I Assume that you convince the reader that all of these variables are included, and hence that the unconfoundedness assumption is satisfied.

I We can write:

I E[Y | T = 1, X] = E[Y(1) | X] = α + τ + X′β (the average outcome for treated individuals when the treatment is random, conditional on observables);

I E[Y | T = 0, X] = E[Y(0) | X] = α + X′β (the average outcome for untreated individuals when the treatment is random, conditional on observables).

I Indeed, E[u | T, X] = 0.

I Hence,

τ = E[Y(1) | X] − E[Y(0) | X] = E[Y(1) − Y(0) | X] = CATE.

I CATE stands for Conditional Average Treatment Effect.

I What is the OLS estimate of τ equal to when X includes a large set of pre-intervention variables?

I Coefficient τ is now equal to +751.9464.

I We get closer to Lalonde’s estimate.

I But we are not quite there: the estimate is much smaller and not statistically significant...

2.2. ... is problematic

I This is because common support usually does not hold when one implements matching based on a multiple linear regression.

I In the NSW-PSID dataset, which contains only a few observations (N=2675), the common support assumption already fails to hold when we control for only one critical observable: education.

I Let’s type the following command in Stata:

twoway (scatter re78 education if treat==0, mcolor(black)) (scatter re78 education if treat==1, mcolor(red)), legend(order(1 "not trained" 2 "trained"))

I It is striking that there is no common support for many values of the variable “education”.

I This is the case when education is equal to 0, 2, 3, and 17.

I What does τ capture when we only control for education in Equation (2)?

I For each value of the variable “education”, Stata (or any statistical software) computes the difference in outcome (re78) between those who are treated and those who are not treated.

I The parameter τ captures the average of these differences.

I But how can these differences be meaningfully computed for values of the variable “education” where there are no treated observations?
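A quick way to see which values of education lack treated or untreated observations is to cross-tabulate the two variables. This is a minimal sketch, assuming the same dataset and variable names (education, treat) as in the graph command above:

```stata
* Cell counts by years of education and treatment status;
* a zero in either column means no common support at that value
tabulate education treat
```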

I The failure of the common support assumption leads to:

I a biased estimate of the treatment effect (the difference in outcome between those who are treated and those who are not treated cannot always be computed);

I a large variance of this estimate (since there are, in some instances, no or very few observations to construct the counterfactual).

I Obviously, ensuring that the common support assumption holds is even more difficult when one controls for a set of observables.

I Assume that one controls for education and married.

I The variable “education” ranges from 0 to 17 while the variable “married” is a dummy.

I This means that:

there are now 18*2=36 different categories of individuals

3.1. Three, not two identifying assumptions

I We’ve just seen that implementing a matching strategy based on a multiple linear regression is clearly not a good approach.

I The alternative is to rely on a balancing score matching approach.

I This approach has been defined by Rosenbaum and Rubin (1983):

I it does not consist in matching treated and non treated individuals based on a set of observables X;

I it entails matching treated and non treated individuals based on only one variable called a balancing score.

I A balancing score is a function of X, denoted by b(X), that must satisfy the following balancing assumption:

T ⊥ X | b(X).

I This assumption asserts that, conditional on the balancing score, the set of observables X is independent of assignment to the treatment.

I Put differently, for observations with the same balancing score, the distribution of X is the same in the treated and in the non treated group.

I The balancing assumption is important because it ensures that one only needs to match treated and non treated individuals based on the balancing score (i.e. matching these individuals on a set of observables is not required anymore).

I Rosenbaum and Rubin (1983) show that a possible balancing score is the propensity score.

I The propensity score is the probability that an individual participates in a treatment given their observed characteristics X.

I It is denoted by

P(T = 1 | X) = P(X).

I An approach that consists in matching treated and non treated individuals based on the propensity score is called propensity score matching.

I Clearly, for propensity score matching to isolate the treatment effect, three identifying assumptions are needed:

1. the unconfoundedness assumption: (Y(0), Y(1)) ⊥ T | X;

2. the common support assumption: 0 < P(X) < 1;

3. the balancing assumption: T ⊥ X | P(X).
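Rosenbaum and Rubin (1983) show that when the first two assumptions hold given X, unconfoundedness also holds given the propensity score alone, which is what justifies matching on a single variable. A sketch of the statement in the notation above (not a proof):

```latex
\bigl(Y(0), Y(1)\bigr) \perp T \mid X
\quad\text{and}\quad 0 < P(X) < 1
\;\Longrightarrow\;
\bigl(Y(0), Y(1)\bigr) \perp T \mid P(X)
```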


3.2. Choosing the propensity score function

I The propensity score matching approach builds on the unconfoundedness assumption, which requires that the potential outcomes be independent of treatment assignment conditional on observables.

I Hence, implementing this approach requires choosing a set of variables X as predictors of the probability of being treated that credibly satisfy this condition.

I Put differently, all the variables that influence both treatment assignment and the outcome variable should be included.

I However, these variables should be those for which there are no feedback effects.

I They should be those that are unaffected by the treatment (or by the anticipation of the treatment).

I Therefore, they are:

I either fixed over time;

I or pre-treatment (i.e., measured before treatment).

I Clearly, economic theory, a sound knowledge of previous research and also information about the institutional settings of the policy whose effect is estimated should guide the researcher in building up the model.

I It is important to mention that the final specification that is chosen by the researcher should be the one that

I satisfies the conditions mentioned above;

I ensures that the common support and the balancing assumptions are satisfied.

I Remark: a discrete choice model (logit or probit) must be used to estimate the propensity score function since the dependent variable (being treated or not) is binary.

3.3. Computing the ATT

I Propensity score matching consists in computing the average difference between:

I the mean outcome of treated individuals characterized by a specific propensity score

I and the mean outcome of untreated individuals that are characterized by a similar propensity score.
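A generic way to write this estimator is sketched below. N_T (the number of matched treated individuals), C(i) (the set of untreated individuals matched to treated individual i) and the weights w_ij are notation introduced here for illustration, not taken from the slides:

```latex
\hat{\tau}_{ATT}
  = \frac{1}{N_T} \sum_{i:\, T_i = 1}
    \Bigl( Y_i - \sum_{j \in C(i)} w_{ij}\, Y_j \Bigr),
\qquad \sum_{j \in C(i)} w_{ij} = 1
```

With a single nearest neighbour, C(i) contains one untreated individual and w_ij = 1.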

I Hence, the ATT, not the ATE, is estimated (since individuals are

4.1. Step 1: Questioning the plausibility of the unconfoundedness assumption

I What are the variables that influence both treatment assignment and potential outcomes?

I Is your dataset rich enough to control for all of them?

I If yes, proceed to Step 2.

4.2. Step 2: Estimating the propensity score function

I Estimate the probability of getting the treatment as a function of variables (either fixed over time or truly pre-treatment) that influence both treatment assignment and potential outcomes.

I Rely on a logit (or probit) model.

4.3. Step 3: Generating propensity scores

I Do so for all treated and non treated observations.

I After your logit estimation, type the following command:

predict pscore, pr

4.4. Step 4: Testing the common support assumption

I First, discard:

I observations (if any) in the control group whose propensity score is less than the minimum propensity score in the treatment group or higher than the maximum propensity score in the treatment group;

I observations (if any) in the treatment group whose propensity score is less than the minimum propensity score in the control group or higher than the maximum propensity score in the control group.

I Second, rely on the histogram command to graphically test, for each stratum (i.e. interval) of your propensity score, whether there are observations both in the treated and in the non treated group.

I Third, rely on the tabulate command to numerically test for the common support assumption.

I If the common support assumption is violated, go back to step 2.
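The three checks above might look as follows. This is a sketch, assuming the variables treat and pscore created in the previous steps; the scalar names, the variable ps_bin and the choice of five intervals are illustrative.

```stata
* 1. Trim observations outside the overlap of the two score ranges
*    (observations with a missing score are dropped too, since Stata
*    treats missing values as larger than any number)
summarize pscore if treat == 1
scalar min_t = r(min)
scalar max_t = r(max)
summarize pscore if treat == 0
scalar min_c = r(min)
scalar max_c = r(max)
drop if treat == 0 & (pscore < scalar(min_t) | pscore > scalar(max_t))
drop if treat == 1 & (pscore < scalar(min_c) | pscore > scalar(max_c))

* 2. Graphical check: distribution of the score in each group
histogram pscore, by(treat)

* 3. Numerical check: counts of treated and untreated by score interval
xtile ps_bin = pscore, nquantiles(5)
tabulate ps_bin treat
```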

4.5. Step 5: Testing the balancing assumption

I For each stratum of your propensity score, and for each observable characteristic that enters the propensity score function, run a difference of means analysis across treated and non treated observations.

I Rely on the ttest command.

I If the balancing assumption is violated, go back to step 2.
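A sketch of this check, assuming treat and pscore exist and using education as one example covariate (the name comes from the graph command earlier in the slides); the variable block and the five strata are illustrative choices:

```stata
* Split the propensity score into five strata
xtile block = pscore, nquantiles(5)

* Within each stratum, compare the covariate mean across treated and untreated;
* capture noisily keeps the loop running if a stratum contains only one group
forvalues b = 1/5 {
    display "Stratum `b'"
    capture noisily ttest education if block == `b', by(treat)
}
```

The same loop would be repeated for each covariate in the propensity score model.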

4.6. Step 6: Matching treated with non treated individuals based on propensity scores

I To do so, rely on the psmatch2 command.

I psmatch2 is being continuously improved and developed.

I Make sure to keep your version up-to-date by typing the following command:

ssc install psmatch2, replace

I The psmatch2 command allows one to perform many matching methods (type help psmatch2 for a full description).

I The most straightforward matching estimator is nearest neighbour (“NN” hereafter) matching.

I The individual from the comparison group is chosen as a matching partner for a treated individual because it is the closest in terms of the propensity score.

I Without replacement, an untreated individual can be used only once as a match.

I In this case, the estimate depends on the order in which observations get matched when more than one observation has the same propensity score.

I Hence, if you want to be able to replicate your results, it is critical to ensure that observations in the dataset are randomly ordered, with a fixed seed so that the ordering itself is reproducible.

I To do so, type the following commands (the seed value is arbitrary):

set seed 12345

generate random=runiform()

sort random

I With replacement, an untreated individual can be used more than once as a match.

I Matching with replacement involves a trade-off between bias and variance:

I the average quality of matching increases and the bias decreases;

I the number of distinct non treated individuals used to construct the counterfactual outcome decreases and therefore the variance increases.

I Conducting NN matching with replacement is of particular interest with data where the propensity score distribution is very different in the treatment and the control group.

I For example, if we have a lot of treated individuals with high propensity scores but only few comparison individuals with high propensity scores, we get bad matches as some of the high-score treated individuals will get matched to low-score non treated individuals.

I In this case NN matching with replacement may be a good solution.

I Note that one can use more than one NN, which is called oversampling (in this case, the matching is performed with replacement).

I Finally, it is worth emphasizing that NN matching faces the risk of bad matches if the closest neighbour is far away.

I This can be avoided by imposing a tolerance level on the maximum propensity score distance (a so-called caliper):

I bad matches are avoided and the matching quality rises (bias decreases);
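The variants discussed in this step could be requested as follows. This is a sketch, assuming psmatch2 is installed and that treat, re78 and pscore exist as in the earlier steps; the caliper width of 0.01 and the five neighbours are arbitrary illustrative values, and the option names should be checked against help psmatch2.

```stata
* One nearest neighbour, with replacement (the default), on the common support
psmatch2 treat, outcome(re78) pscore(pscore) neighbor(1) common

* One nearest neighbour, without replacement
psmatch2 treat, outcome(re78) pscore(pscore) neighbor(1) noreplacement common

* Five nearest neighbours (oversampling) within a caliper of 0.01
psmatch2 treat, outcome(re78) pscore(pscore) neighbor(5) caliper(0.01) common
```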

4.7. Step 7: Ensuring that the balancing assumption is satisfied after matching

I Check whether the observable characteristics of treated and non treated individuals that were matched during the matching procedure are indeed similar.

I To do so, one needs to rely on the command pstest (after psmatch2).

I The difference of means analysis is provided after matching.

I For good balancing, it should be non significant.

I At the end of the output of pstest, the mean of the absolute value of the “standardized percentage bias” after matching is provided.

I This bias should be less than 5%.

I If it is greater than 5%, go back to step 6 (find another way of matching treated and non treated individuals), or to step 2 if necessary.
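A sketch of the call, to be run immediately after psmatch2. Here education and married stand in for the covariates used in the propensity score model, and the both option (to be confirmed with help pstest) requests the comparison before and after matching:

```stata
* Balance table and standardized bias, before and after matching
pstest education married, both
```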

I You can think of relying on propensity score matching if a rich and large dataset is available.

I However, this should be your least preferred option.

I The unconfoundedness assumption is indeed very difficult to buy... no empirical research based on PSM is published in the highest ranking journals anymore.

I At any rate, if you rely on PSM, you must do so in a very rigorous way (i.e. implement each of the steps mentioned above very carefully).

I Dehejia, Rajeev H. and Sadek Wahba. 1999. Causal effects in nonexperimental studies: reevaluating the evaluation of training programs. Journal of the American Statistical Association 94(448): 1053-1062.

I Lalonde, R. 1986. Evaluating the econometric evaluations of training programs. American Economic Review 76(4): 604-620.

I Rosenbaum, P. and Rubin, D. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70(1): 41-55.

I The RDD is considered by scholars as an evaluation strategy that provides results which are as compelling as the estimates derived from randomized experiments, which are themselves widely seen as the gold standard of impact evaluation.

I Therefore, once one is aware of the specific features of the RDD, it is critical to know when and how this valuable evaluation strategy can be used.

1. What are the specific features of the RDD?

2. When can one implement the RDD?

3. How must one implement the RDD?

I Like randomized experiments, the RDD consists in comparing 2 outcome variables: the outcome variable related to individuals who were treated (i.e. who received a treatment) and the outcome variable related to individuals who were not treated (i.e. who did not receive the treatment).

I Yet, the RDD has 2 specific features.

I First, the treatment of the population depends on whether an observed variable exceeds a critical value denoted c, knowing that this variable is not orthogonal to the observed and unobserved characteristics of individuals.

I This variable is called the “assignment” variable or the “forcing” variable.

I Denote this assignment (or forcing) variable by X.

I The first specific feature of the RDD is therefore given by:

D = 0, if X < c
D = 1, if X ≥ c,

where D = 0 means that the population is not treated and D = 1 means that it is treated.

I For instance, the RDD was first used by Donald L. Thistlethwaite and Donald T. Campbell (Journal of Educational Psychology, 1960) in order to analyze the impact of merit awards on future academic outcomes (career aspirations, enrollment in postgraduate programs, etc.).

I The RDD was particularly suitable to this research objective since the allocation of the merit awards depended on individuals’ observed test scores:

I Students with test scores X greater than or equal to a cutoff value c received the award;

I Students with test scores X below the cutoff c were denied the award.

I Clearly, the assignment (or forcing) variable that consists in individuals’ test scores is not orthogonal:

I either to individuals’ observed characteristics (for instance, the socio-economic status of their parents);

I or to individuals’ unobserved characteristics (for instance, their Intelligence Quotient, their “taste for working”, etc.).

I Second, the RDD estimates the causal impact of the treatment (e.g. receiving the merit award) on the outcome (e.g. academic achievements) by computing the difference between the outcome of treated individuals characterized by an X located just above c and the outcome of non treated individuals characterized by an X located just below c.
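In practice, this comparison is often carried out with a regression fitted separately on each side of the cutoff within a narrow window. This approach is not described in the slides; the sketch below assumes hypothetical variables y (outcome) and x (assignment variable), a hypothetical cutoff of 50 and an arbitrary bandwidth of 5.

```stata
* Hypothetical cutoff and bandwidth, for illustration only
scalar cutoff = 50
scalar bw     = 5

* Treatment dummy and assignment variable centred at the cutoff
generate D  = (x >= scalar(cutoff)) if !missing(x)
generate xc = x - scalar(cutoff)

* Linear fit on each side of the cutoff; the coefficient on 1.D estimates tau
regress y i.D##c.xc if abs(xc) <= scalar(bw), vce(robust)
```

The estimate can be sensitive to the bandwidth, so it would normally be reported for several values of bw.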

I Assume that the relationship between y (academic achievements) and X (test scores) is as follows:

[Figure not reproduced: y as a function of X, with point A″ just below the cutoff c and point B′ just above it]

I To measure the causal impact of the merit award, the RDD consists in computing, for the same individual, the y he would get with and without the treatment.

I To do so, the RDD focusses on individuals who score c and reasons as follows:

I B′ (which corresponds to a score c′ located just above c) would be a reasonable guess for the value of y of the individual scoring c in case he receives the treatment;

I A″ (which corresponds to a score c″ located just below c) would be a reasonable guess for the value of y of the individual scoring c in the counterfactual case where he doesn’t receive the treatment.

I As a consequence, the RDD considers B′ − A″ = τ as the causal impact of merit awards on academic achievements.
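As the scores just above and just below the cutoff approach c, this reasoning corresponds to the usual expression for the sharp RDD estimand (a sketch in the notation of the slides):

```latex
\tau = \lim_{x \downarrow c} E[\,y \mid X = x\,] \;-\; \lim_{x \uparrow c} E[\,y \mid X = x\,]
```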

I Two conditions must be satisfied.

I First and obviously, the treatment of the population must depend on whether an observed variable exceeds a critical value denoted c.

I Second, for τ to be considered as capturing the impact of the merit award on academic achievements, one must make sure that individuals do not have precise control over the assignment (or forcing) variable.

I If individuals have precise control over the assignment (or forcing) variable, this means that individuals of different types (characterized by different sets of observed and unobserved characteristics) will reach distinct outcomes.

I More precisely, individuals on one side of the cutoff c (i.e. at X = c″ = c − e when e → 0) will be systematically different from those on the other side (i.e. at X = c′ = c + e when e → 0), both with respect to observed and unobserved characteristics.

I Let’s call individuals who, given their characteristics, reach X = c″ for sure “type A individuals” and individuals who, given their characteristics, reach X = c′ “type B individuals”.

I Put differently, when individuals have precise control over the assignment (or forcing) variable, it is not possible to attribute the jump in y (the fact that y is a discontinuous function of the test score) to the impact of the merit award only.

I Indeed, the jump in y also reflects the jump in individuals’ observed and unobserved characteristics in that case.

I On the contrary, if individuals have no precise control over the assignment (or forcing) variable, then τ can be considered as the impact of the merit award on academic achievements.

I The expression “no precise control” means that individuals have only imprecise control over the assignment (or forcing) variable.

I In other words, among those scoring near the threshold, it is a matter of “luck” as to which side of the threshold they land.

I Put differently, type A individuals have the same probability as type B individuals of being just above rather than just below the threshold.

I This allows us to say that those who marginally fail (those characterized by a grade just below the cutoff) and those who marginally pass (those characterized by a grade just above the cutoff) are identical.

I This is the reason why, if the “no precise control” assumption is satisfied, the RDD is considered as a local randomized experiment.

Asadullah (2005), “The effect of class size on student achievement: evidence from Bangladesh”, Applied Economics Letters

I In Bangladesh, a Ministry of Education (MoE) circular maintains that registered secondary schools can recruit a new teacher if class enrolment exceeds 60

I Such a teacher allocation rule results in an abrupt drop in class size whenever observed grade enrolment exceeds 60 or an integer multiple of 60 → discontinuity

I The true causal effect is recoverable if one uses the class size predicted by the rule as an instrument for observed class size in the achievement function

I Effect on aggregate pass rate
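To see where the discontinuity comes from, suppose (as an illustrative assumption in the spirit of the rule described above, not necessarily the paper's exact formula) that a grade with enrolment E is split into the smallest number of equal-sized classes of at most 60 students. The predicted class size is then:

```latex
CS^{pred}(E) = \frac{E}{\left\lfloor \dfrac{E-1}{60} \right\rfloor + 1}
\qquad
\begin{array}{c|cccc}
E & 60 & 61 & 120 & 121 \\ \hline
CS^{pred}(E) & 60 & 30.5 & 60 & 40.33
\end{array}
```

Predicted class size drops sharply each time enrolment crosses a multiple of 60, while enrolment itself changes by a single student.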

\[
P_j = \alpha + \theta\, Comp_j + \delta E_{j10} + \beta_{IV}\, \widehat{CS}_{j10} + \sum_{i} \varphi_i\, SchType_{ij} + e_j
\]

where P_j is the aggregate pass rate in the SSC examination in the jth school (fraction of grade 10 students passing the examination by securing more than 60% marks); Comp_j is a competition index; E_j10 is total enrolment in grade 10; ĈS_j10 is the instrumented class size for grade 10 in the jth school; SchType_ij is the school type (public, private aided, girls, boys, co-education and double shift) of the jth school.

I Instrument for class size: P−Csize_j is a prediction of class size as a function of grade enrolment

Edmonds (2004), “Does Illiquidity Alter Child Labor and Schooling Decisions? Evidence from Household Responses to Anticipated Cash Transfers in South Africa”, NBER WP 10265

I The response of child labour supply and school attendance to anticipated social pension income in South Africa

I Pension benefits are largely determined by age for the black population (extension of the Old Age Pension (OAP) program after the end of apartheid)

I The paper uses the age discontinuity in the pension benefit formula for identification

I More precisely, the paper examines the response of child labour to the timing of income by comparing child labour supply and schooling in households that are eligible for the OAP to households

I The RDD provides a highly credible and transparent way of estimating treatment effects.

I You can rely on it as a substitute for randomized experiments whenever there exists an assignment (or forcing) variable over which individuals have no precise control.

I Think about using it as an impact evaluation strategy for your Master 2 dissertation!