Research in Applied Econometrics Chapter 1. R
Pr. Philippe Polomé, Université Lumière Lyon 2
M1 APE Analyse des Politiques Économiques M1 RISE Gouvernance des Risques Environnementaux
2020 – 2021
Outline
R basic operations
R graphics
Linear Regressions
Discussing Regressors and Model Building
Are you operational?
- Create a project "RAE" in the folder for your courses
- Download RAE2020.R from my courses' site
  - Into the same directory as your project
  - Open it from the R-Studio editor
- R-Studio remembers your projects
  - You can switch from one to another
  - All the files written on disk remain available
- swirl should be installed
  - Load it now, e.g. library(swirl)
  - Type swirl() in the console
  - Do course 1: R Programming, Lesson 1
  - I would like to see how you are doing
  - You should have done it and 13 lessons by yourself, from home
Use the code file
- Execute some commands from the code file to see the output
  - The packages AER & DCchoice must be installed for this course
- Usual math functions: log, exp, sign, sqrt, abs, min, max
  - log(exp(sin(pi/4)^2)*exp(cos(pi/4)^2)) — type in the console
- Special vectors
  - ones <- rep(1, 10)
  - even <- seq(from = 2, to = 20, by = 2)
  - trend <- 1981:2005
  - diag(4) identity matrix of size 4
Matrix Operations
- A <- matrix(1:6, nrow = 2)
  - A — look at what it prints & how R gives the position of the elements
  - Look at your Environment window: A is now there
  - It remains in your project until erased (the brush icon)
- t(A) = transpose of A (not A')
- dim(A) = dimensions of A (rows, then columns)
  - nrow(A); ncol(A) number of rows; of columns
- A[i,j] extracts element (i,j)
  - Does not remove it from the matrix
- A[,j] extracts column j (all the rows) into one vector
  - A[i,] same for row i
- A1 <- A[1:2, c(1, 3)] — A1 has 2 rows containing the elements in rows 1 to 2 and columns 1 & 3 of A
  - For this particular matrix, same result with A[,-2]
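A minimal sketch of these extraction rules, using the matrix A defined above:

```r
# Build A as on the slide: 2 rows, filled column by column
A <- matrix(1:6, nrow = 2)

A[2, 3]                  # element in row 2, column 3 -> 6
A[, 2]                   # column 2 as a vector -> c(3, 4)
A[1, ]                   # row 1 as a vector -> c(1, 3, 5)
A1 <- A[1:2, c(1, 3)]    # keep rows 1-2 and columns 1 & 3
identical(A1, A[, -2])   # TRUE: dropping column 2 gives the same result here
```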
Matrix Operations
- det(A1) determinant
  - solve(A1) inverse
- A %*% B matrix product
  - A*A element-by-element product
- crossprod(A, B) efficient calculation of A'B
  - diag(A1) extracts the diagonal
- cbind(1, A1) "combines" one column of ones and A1
- rbind(A1, diag(4, 2)) "stacks" A1 & a diagonal matrix of size 2 with 4 on the diagonal
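A short sketch of cbind and rbind with the A1 defined earlier:

```r
A1 <- matrix(1:6, nrow = 2)[, c(1, 3)]  # a 2x2 matrix, as on the earlier slide

X <- cbind(1, A1)           # prepend a column of ones (the 1 is recycled to 2 rows)
dim(X)                      # 2 x 3

S <- rbind(A1, diag(4, 2))  # stack A1 on top of a 2x2 diagonal matrix with 4s
dim(S)                      # 4 x 2
```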
Dataframe
- "Frame" = "context"
- In R, a "dataframe" is a data matrix
  - a collection of vectors of the same length
  - stacked together horizontally
- Each vector = 1 column = "variable"
  - Possibly of different natures
    - quantitative (numeric) but also qualitative, characters, dates...
  - It may contain meta-data
    - e.g. variable type or category names
- Each row = 1 observation in the sample
Dataframe Creation
- Several ways
  - keyboard (cf. swirl R Programming, lesson 7)
  - read an R file
  - import
- Keyboard example
  - alternative 1
    - mydata <- data.frame(one = 1:10, two = 11:20, three = 21:30)
  - alternative 2
    - mydata <- as.data.frame(matrix(1:30, ncol = 3)) and names(mydata) <- c("one", "two", "three")
- R is not very good for encoding data manually
  - But we use this example to explain attachment (below)
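The two keyboard alternatives above can be run and compared directly:

```r
# Alternative 1: directly with data.frame()
mydata <- data.frame(one = 1:10, two = 11:20, three = 21:30)

# Alternative 2: convert a matrix (filled column by column), then set the names
mydata2 <- as.data.frame(matrix(1:30, ncol = 3))
names(mydata2) <- c("one", "two", "three")

all.equal(mydata, mydata2)  # TRUE: the two routes give the same dataframe
```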
attach
- A dataframe is "attached"
  - with the command attach
  - then variable names in the dataframe may be used directly in commands
- For example
  - mean(two) produces an error message
  - attach(mydata) and then mean(two) produces the average of variable "two"
- detach(mydata) is self-explanatory
  - Why detach? e.g. to avoid confusions
Subset Selection
- As seen in swirl, a subset of a dataframe can be accessed by [ or $
  - $ extracts a single variable
- The command subset sometimes works better (e.g. conditional selection)
  - e.g. mydata.sub <- subset(mydata, two <= 16, select = -two)
  - selects all the obs. of variables one & three
  - for which the obs. of variable two are <= 16
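Running the subset example on the mydata frame built earlier:

```r
mydata <- data.frame(one = 1:10, two = 11:20, three = 21:30)

# Keep rows where two <= 16, and drop the variable two itself
mydata.sub <- subset(mydata, two <= 16, select = -two)

names(mydata.sub)  # "one" "three": variable two was dropped by select = -two
nrow(mydata.sub)   # 6: two takes the values 11..16 in the first 6 rows
```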
Export (write) a dataframe
- write.table(mydata, file = "mydata.txt", col.names = TRUE)
  - creates a .txt file mydata.txt in the working directory
    - normally where your project is
  - Meta-data are not passed
- The text file format is

  "one" "two" "three"
  "1" 1 11 21
  "2" 2 12 22
  ...

  - So it looks like the column headers are shifted left (the row-name column has no header)
  - Take that into account with the software you use to open it
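A small round-trip sketch; it writes to a temporary file rather than the working directory's mydata.txt, purely to keep the example self-contained:

```r
mydata <- data.frame(one = 1:3, two = 11:13, three = 21:23)

f <- tempfile(fileext = ".txt")   # temporary path; the slide uses "mydata.txt"
write.table(mydata, file = f, col.names = TRUE)

# read.table understands the shifted header: the extra first column on each
# data line is taken as row names
back <- read.table(f, header = TRUE)
identical(back$two, mydata$two)   # TRUE: the values survive the round trip
```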
Import (read) a dataframe
- The Environment window has a button that makes it easier
  - a preview is generated
  - Use the "Import Dataset" button in the Environment window to read "mydata.txt" back into R
- Import from another software: Excel, Stata, SAS...
  - Easiest: if you have access to the software, export the data file in txt or csv
    - loss of meta-data
  - R-Studio proposes several formats
    - It often does not work, as these softwares change their formats often
  - Use Google
    - e.g. "R import Stata 17 data"
  - Also www.statmethods.net/input/importingdata.html
    - for a few formats
Outline
R basic operations
R graphics
Linear Regressions
Discussing Regressors and Model Building
Plot
- You have seen some plots in the swirl course R Programming, lesson 15: Base Graphics
- A few additional graphic elements using the base plot command
  - Packages lattice & ggplot2 are better
    - varianceexplained.org/RData/code/code_lesson2/
  - R has many publication-quality graphics
  - But they are not very intuitive
- plot() is the default graphic command for many objects:
  - dataframes, time series, fitted linear models
  - it is also an old, crude command
  - but many R programmers have connected their packages with it
Examples with data("CPS1988")
- Data file CPS1988 is preloaded in the AER package
  - Population survey, March 1988, US Census Bureau
  - 28,155 obs., cross-section
  - Men, 18-70 years old
  - Income > US$ 50 in 1988
  - Not self-employed, not working without salary
- summary(CPS1988)
- Quantitative data
  - wage in $/week
  - education & (potential work) experience in years
"Scatterplots" – dispersion – XY
- Probably the most common in statistics, with histograms
  - We use CPS1988: a census data file on wage and its determinants
  - From the AER package
- attach(CPS1988)
  - plot(education, log(wage))
  - The first argument goes on the x-axis, the 2nd on the y-axis
  - To export a plot: "Export" button in the Plots window
  - There are several formats
    - png is easiest to use in word processing
- detach(CPS1988)
- plot(log(wage)~education, data=CPS1988)
  - alternative to avoid attaching the dataframe

R Graphic Parameters
- A plot may be modified in many ways
  - E.g. the argument type controls whether the plot is made of points (type = "p"), lines (type = "l"), both (type = "b"), steps (type = "s") or others
- Several dozen parameters may be modified
  - See ?par
  - They may be modified after the plot with the command par()
  - Or they can be supplied in the plot() command, e.g.
    plot(log(wage)~education, data=CPS1988, pch=20, col="blue", ylim=c(4,10), xlim=c(0,20), main="Wage by education years")
R Additional Graphics
- Add layer(s) to a plot: lines(), points(), text(), legend()
  - Add a straight line: abline(a, b)
    - a intercept, b slope
- Barplots, pie charts, boxplots, QQ plots & histograms
  - barplot(), pie(), boxplot(), qqplot(), hist()
  - We'll see below

Histograms & boxplots
- Continue with the CPS1988 database on wage & its determinants
  - summary(CPS1988) reveals that some variables are categorical
  - Categorical: called factors in R
- Factors are vectors of categories
  - sometimes with metadata
    - e.g. category names
  - g <- rep(0:1, c(2,4))
  - g <- factor(g, levels=0:1, labels=c("male", "female"))
    - Names the categories (0,1) of g as "male" (=0) & "female" (=1)
  - so g is [1] male male female female female female
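The factor construction above, run end to end:

```r
g <- rep(0:1, c(2, 4))   # 0 0 1 1 1 1: two 0s then four 1s
g <- factor(g, levels = 0:1, labels = c("male", "female"))

g            # [1] male male female female female female
levels(g)    # "male" "female"
table(g)     # counts per category: male 2, female 4
```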
Factors in CPS1988
- In CPS1988, the factors are
  - ethnicity: caucasian "cauc" & african-american "afam"
  - smsa: residence in urban area
  - region
  - parttime: works part-time
- Plots according to data type
  - Numerical/quantitative or categorical
  - Single variable or 2 in relation
One numerical variable: histogram & density
- hist(wage, freq=FALSE)
  - option freq=FALSE
    - relative frequencies, else absolute (counts)
  - option breaks=zzz
    - "bin" = container: choose the number of rectangles, hence the length of their bases
- hist(log(wage), freq=FALSE)
  - lines(density(log(wage)), col=4)
  - The command density is actually a non-parametric estimate of the density function
- Remarks
  - the log distribution is less asymmetrical than the raw data
  - data in logs are often closer to a normal
  - That is often the case with economic data & a rationale for the normality hypothesis
One categorical
- With categorical data
  - Mean & variance have no meaning
  - But frequencies do
- summary(region): absolute frequencies (counts)
- tab <- table(region): stores these frequencies in a table called tab
  - prop.table(tab) computes the proportions (relative frequencies)
- Barplots & pie charts often visualise categorical data quite well
  - barplot(tab)
  - pie(tab)
  - These plots can be modified using parameters
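A self-contained sketch of the same steps; the small region factor here is made up for illustration, standing in for CPS1988's region variable:

```r
# A stand-in for CPS1988's region factor
region <- factor(c("northeast", "south", "south", "west", "south", "midwest"))

tab <- table(region)   # absolute frequencies (counts)
prop.table(tab)        # relative frequencies, summing to 1

barplot(tab)           # bar heights = counts
pie(tab)               # slice angles = proportions
```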
2 categorical
- Usually presented in a contingency table
  - xtabs() with a formula interface:
    - e.g. xtabs(~ ethnicity + region, data = CPS1988)
    - data is optional since it is still attached
  - table(ethnicity, region): same results
- A plot of that is a "spine plot"
  - plot(ethnicity ~ region) — formula
  - plot(ethnicity, region) — what differences?
2 numerical
- The correlation coefficient r is often used
  - For positive & asymmetrical variables: Spearman's ρ
  - rank correlation, instead of values, is often preferred because r is not robust to asymmetry
- cor(log(wage), education)
- cor(log(wage), education, method="spearman")
  - Results differ a bit
- plot(log(wage)~education)
  - the scatterplot shows little correlation
  - but the log makes it difficult to see graphically
1 numerical & 1 categorical
- Often, conditional moments are calculated
  - e.g. average wage by ethnicity
    - tapply(log(wage), ethnicity, mean)
  - "Applies" the command mean to log(wage) within each category of ethnicity
  - mean may be replaced by any valid command, e.g. quantile
- Box plots & QQ (quantile-quantile) plots are often used
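A sketch of tapply on made-up data (six log wages and a two-level factor assumed here, standing in for CPS1988):

```r
# Stand-in data for log(wage) and ethnicity
lw  <- c(5.0, 5.5, 6.0, 4.5, 5.2, 6.3)
eth <- factor(c("cauc", "afam", "cauc", "afam", "cauc", "afam"))

tapply(lw, eth, mean)      # one conditional mean per category
tapply(lw, eth, quantile)  # any valid function works, e.g. quantile
```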
1 numerical & 1 categorical : Box plot
I A box plot is a crude representation of an empirical distribution
I The box is limited by “hinges” (1º & 3º quartiles) and show the median
I Outside of the box, 2 lines indicate the smallest & largest obs.
I within 1.5×size of the box from the closest hinge I Any obs. outside is represented by separate points I
boxplot(log(wage)~ethnicity)
1 numerical & 1 categorical: QQ plot
- A QQ plot matches the quantiles of 2 (empirical) distributions
  - Recall that quantiles are quantities
    - e.g. the 1st quartile of afam wage is the wage s.t. 25% of afam make less & 75% make more
  - If the 2 distributions are identical: QQ plot = diagonal
  - Otherwise, if e.g. cauc make more than afam, then
    - with cauc on the x-axis, the QQ plot will be below the diagonal
  - A bit like a plot of income inequality, but with 2 variables
  - awage <- subset(CPS1988, ethnicity == "afam")$wage
  - cwage <- subset(CPS1988, ethnicity == "cauc")$wage
  - qqplot(awage, cwage)
  - abline(0,1) overlays the diagonal (intercept 0, slope 1)
- detach(CPS1988) to close CPS1988
Outline
R basic operations
R graphics
Linear Regressions
Discussing Regressors and Model Building
Basic Regression Commands in R
- Linear Regression Model (LRM)

  y_i = x_i'β + ε_i, with i = 1...n

  - In matrix form: y = Xβ + ε
- Typical hypotheses in cross-sections
  - E(ε|X) = 0 (exogeneity)
  - Var(ε|X) = σ²I ("sphericity": homoscedasticity & no autocorrelation)
- In R, models are usually fitted by calling a command
  - For the LRM in cross-section: fm <- lm(formula, data, ...)
  - The argument ... replaces a series of arguments
    - describing the model
    - or choosing the computation mode (algorithm)
    - or options
Basic Regression Commands in R
- The lm command returns an object
  - Here: the fitted model under the name fm
  - It may be visualised in many ways or summarized
- The lm object can be used to compute/extract:
  - Predictions & fitted values, residuals, ... by means of fm$...
  - Tests & several post-estimation diagnostics
- Most estimation commands work the same way
  - Ideally do the swirl course Regression Models, lessons 1-7
  - My course will then be easier
Multivariate Linear Regression with Factors
- The purpose of this example is to demonstrate various R tools that are used to transform & combine regressors
- Dataframe: CPS1988, as before
Wage Equation
- Wage equation

  log(wage) = β1 + β2 experience + β3 experience² + β4 education + β5 ethnicity + ε

  cps_lm <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity, data = CPS1988)

- "Insulation function" I()
  - indicates to R that ^2 is to be understood as the square of experience
  - otherwise, R is unsure of the meaning and drops experience^2
  - That is because the model written in R is symbolic, not mathematical
  - This might be clearer with a formula y ~ a + b + c
    - In a regression context, this means y against 3 regressors
    - Instead y ~ a + I(b+c) means y against 2 regressors
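A hedged sketch on simulated data (not the CPS1988 regression itself) showing what I() changes in a formula:

```r
# Simulated data: y truly depends on x and x squared
set.seed(1)
x <- runif(100)
y <- 1 + 2 * x + 3 * x^2 + rnorm(100, sd = 0.1)

m1 <- lm(y ~ x + I(x^2))  # intercept, x, and x squared: 3 coefficients
m2 <- lm(y ~ x + x^2)     # bare x^2 is symbolic and collapses to x: 2 coefficients

length(coef(m1))  # 3
length(coef(m2))  # 2
```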
Results & Testing
- summary(cps_lm)
  - The return to education (on wage) is 8.57%/year
  - % interpretation because wage enters the model in log
  - Categorical variables are managed by R
    - which selects the reference category
- Compare nested models: ANOVA (Analysis of Variance) table
  - More constrained model: cps_noeth <- lm(log(wage) ~ experience + I(experience^2) + education, data = CPS1988)
  - Usually, the test is on more than one variable
  - anova(cps_noeth, cps_lm)
  - Compare non-nested models? next year
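The nested comparison can be sketched on simulated data (variable names here are made up, standing in for the CPS1988 pair cps_noeth/cps_lm):

```r
# Simulated sketch of a nested-model F test
set.seed(2)
n <- 200
educ <- rnorm(n)
eth  <- factor(sample(c("a", "b"), n, replace = TRUE))
y <- 1 + 0.5 * educ + 0.3 * (eth == "b") + rnorm(n)

small <- lm(y ~ educ)        # constrained model (eth restricted out)
big   <- lm(y ~ educ + eth)  # unconstrained model

anova(small, big)  # F test of the restriction(s) dropped in `small`
```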
Interactions: effects of combined regressors
- e.g. in labor economics: the combined effect of education & ethnicity
  - Does one year of education have the same return for different ethnicities?
- This is modeled with multiplicative terms
  - Consider log(wage) = β1 + β2 ethnicity + β3 ethnicity × education + β4 education + ε
  - Then ∂log(wage)/∂education = β3 ethnicity + β4
  - If ethnicity = 0, then the effect of 1 year of education is β4
  - If ethnicity = 1, then the effect of 1 year of education is β3 + β4
- More examples
  - Let a, b, c be three factors, each with several discrete levels
  - and x, y two continuous (quantitative) variables
Several Models/Formulas with Interactions
- y ~ a + x: no interaction
  - A single slope (of x) but one intercept for each level of factor a
- y ~ a*x: same as the previous model +
  - one interaction term for each level of a with x (different slopes)
  - In a more formal notation, let d_ai = I(a = i):

    [y ~ a*x]  ≡  [ y = Σ_i β_ai d_ai + Σ_i γ_ai x d_ai ]

  - Note: y ~ a*x is the same as y ~ a + x + a:x
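One way to see this expansion concretely is model.matrix, which shows the design matrix R builds from a formula (the toy factor and variable here are made up):

```r
# How R expands y ~ a*x: one intercept and one slope per level of a
a <- factor(c("low", "low", "high", "high"))  # levels sorted: "high" is reference
x <- c(1, 2, 3, 4)

colnames(model.matrix(~ a * x))
# "(Intercept)" "alow" "x" "alow:x"
```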
Formulas with Interactions
- y ~ (a+b+c)^2
  - models all the pairwise (2-variable) interactions
  - but not the 3-variable one
  - So, for a & b, this is like as many dichotomous variables as there are level pairs: d_ai,bj = I(a = i ∧ b = j)
  - and similarly for a & c and for c & b
- swirl course Regression Models
  - Lesson 8: MultiVar Examples3

Interactions Wage eq.: ethnicity & education
- cps_int <- lm(log(wage) ~ experience + I(experience^2) + education*ethnicity, data = CPS1988)
  - Only one of the "+" from cps_lm has been replaced by *
- coeftest(cps_int)
  - A more compact version of summary()
  - It can also be used on some other regression commands
- The regression outputs the effects of education & ethnicity
  - called "main effects"
  - and the product of education & an indicator for the level "afam" of ethnicity
  - Why afam? Probably because it is less numerous than cauc
Interactions Wage eq.: ethnicity & education
- afam has a negative effect on the intercept
  - lower average wage for african-americans
  - AND on the slope of education
    - lower return of education for african-americans
- The effect is not very significant though
  - since 5% significance with a sample of nearly 30,000 individuals is not very convincing
  - Next year, we'll see specifications in which this effect disappears
Predictions
- First define the values for which you want to predict
  - We simplify the model to experience & education for ease of presentation
  - Let's say we want to show the effect of experience at an average level of education
- Create a new data frame with a column of average education & a column of all the possible values of experience
  - Note that in the Census, some people have negative experience!
  - This is because experience is computed as age - education - 6, so people who complete their studies early may have "negative" experience
- Use a predict() command on
  - the lm object of interest: cps_lm here
  - the new data set for which we want a prediction: cps2 here
  - predict() can give not only a prediction but also bounds
  - Plot that on the data
- detach(CPS1988) when you are done, to avoid confusion
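The same steps on simulated data (fit, new frame with education at its mean, prediction with bounds); the names fit/nd/pr are made up, standing in for cps_lm/cps2:

```r
# Sketch of predict() with a newdata frame
set.seed(3)
exper <- runif(100, 0, 40)
educ  <- rnorm(100, 12, 2)
lwage <- 4 + 0.03 * exper + 0.08 * educ + rnorm(100, sd = 0.2)
fit <- lm(lwage ~ exper + educ)

# New frame: education fixed at its mean, experience over a grid
nd <- data.frame(exper = 0:40, educ = mean(educ))
pr <- predict(fit, newdata = nd, interval = "prediction")

head(pr)  # columns fit, lwr, upr: point prediction and bounds
```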
Outline
R basic operations
R graphics
Linear Regressions
Discussing Regressors and Model Building
When building a model, there are 2 contradictory forces
- If we omit a regressor, and it is in fact relevant
  - unobserved heterogeneity & inconsistency of LS estimators
    - if it is correlated with included regressors
  - we can sometimes deal with that using instruments or panel data
- If we include irrelevant regressors that are correlated with relevant ones
  - we create multicollinearity, with the consequence that both relevant & irrelevant regressors may appear non-significant
  - That may even occur with 2 relevant regressors, e.g. in a quantity-price relation, the prices of substitute goods are relevant, but may be correlated with the own price
Collinearity – Endogeneity Trade-off
- From a statistical point of view, 2 collinear variables carry the same information
  - Their separate influence on the dependent variable cannot be assessed in the present sample
  - Be pragmatic: reject one of the 2 or merge them in some way that makes sense in context
- It is not really possible to escape such a trade-off
  - Especially since, in a particular sample, a relevant regressor may coincidentally appear non-significant (if the sample is not large)
- Theory does not help, by nature
  - since an empirical model is a trial of a theory
  - theory helps interpret results, not guide them
Progressive Inclusion
- is an old way of looking at model building
1. Among the potential regressors x, take the one with the highest correlation with y
2. Regress y on that single regressor
   - Is it significant?
     - No: you don't have a model
     - Yes: estimate the one-regressor model & compute its residuals
3. Among the remaining regressors, take the one with the highest correlation with the residuals
4. Repeat the previous steps with progressively more regressors
   - Until one is non-significant
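The first iteration of these steps can be sketched on simulated data (the names X, first, nxt are made up for illustration):

```r
# A crude sketch of one round of progressive inclusion
set.seed(4)
n <- 100
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 1 + 2 * X$x1 + 0.5 * X$x2 + rnorm(n)

# Steps 1-2: pick the regressor most correlated with y and regress y on it
c1 <- abs(cor(X, y))
first <- rownames(c1)[which.max(c1)]
m <- lm(y ~ ., data = X[, first, drop = FALSE])

# Step 3: among the rest, pick the one most correlated with the residuals
rest <- setdiff(names(X), first)
c2 <- abs(cor(X[, rest], resid(m)))
nxt <- rownames(c2)[which.max(c2)]
```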
Progressive Inclusion
- The issue with this approach is that if there are several relevant regressors
  - then at least the first step might be inconsistent
  - because at least one relevant regressor is missing
- This is a very serious issue that leads to nonsensical results
Progressive Elimination
- Instead, consider the "largest reasonable set of regressors"
  - it can be linked to the theory you want to test or to previous experience
- It is risky to just run this "encompassing" regression and report the results
  - because of multicollinearity
Progressive Elimination
- Gradually remove regressors one by one
  - Examine how the estimates of the remaining regressors evolve
  - Use ANOVA to test over several regressors
  - If there is a noticeable increase in significance
    - but not so much change in estimates
    - collinearity was an issue
  - If estimated coefficients change wildly
    - omitted-regressor endogeneity
- However
  - dropping a collinear regressor could lead to jumps in coefficient estimates
    - after all, collinearity affects their variance
  - dropping a relevant regressor does not necessarily lead to major changes in the other coefficients
    - when that regressor is not much correlated with the others
Summing up
- Model: y_i = β0 + β1 x1i + β2 x2i + ε_i (no missing relevant regressor)
  - estimation by OLS when x2 and x1 are correlated
  - if they are not, there are NO serious consequences for β̂1
  - "not relevant but correlated with a relevant regressor" might not be empirically common

  x2                        | Consequences on β̂1                        | on β̂2
  relevant, included        | Best case, but may still appear insignif.  |
  relevant, not included    | Inconsistent                               | –
  not relevant, included    | May appear insignificant                   | should → 0
  not relevant, not incl.   | No problem presumably                      | –
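The "relevant, not included" row can be illustrated with a small simulation (coefficients and the 0.8 correlation structure are made up for the sketch): omitting a relevant x2 that is correlated with x1 biases β̂1 away from its true value.

```r
# Sketch: omitted-variable bias when x2 is relevant AND correlated with x1
set.seed(5)
n <- 5000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)        # correlated with x1
y  <- 1 + 1 * x1 + 1 * x2 + rnorm(n)

coef(lm(y ~ x1 + x2))["x1"]      # close to the true value 1
coef(lm(y ~ x1))["x1"]           # omitting x2: drifts toward 1 + 0.8 = 1.8
```

The drift matches the textbook bias formula: the omitted-variable estimate converges to β1 + β2 · cov(x1, x2)/var(x1) = 1 + 0.8 here.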
Essay #1
In R, type data(package = "AER")
From the list of data sets, select the one closest to your family name. Use help(dataset name) for a description of the data set.
Write an R code that uses this dataset to:
1.
Plot a scatterplot, a histogram, a spine plot (between categorical variables) and a boxplot.
2.
Select a dependent variable and a couple of explanatory variables. Regress the former against the latter, with and without interactions (so: 2 regressions). This need not make much economic sense, as the focus is on the R code, but it is better when it is a sensible economic regression.
3.