Research in Applied Econometrics Chapter 1. R
Pr. Philippe Polomé, Université Lumière Lyon 2
M1 APE Analyse des Politiques Économiques M1 RISE Gouvernance des Risques Environnementaux
2020 – 2021
Outline
R basic operations
R graphics
Linear Regressions
Discussing Regressors and Model Building
Are you operational?
- Create a project "RAE" in the folder for your courses
- Download RAE2020.R from my courses' site
  - Into the same directory as your project
  - Open it from the R-Studio editor
- R-Studio remembers your projects
  - You can switch from one to another
  - All the files written on disk remain available
- swirl should be installed
  - Load it now, e.g. library(swirl)
  - Type swirl() in the console
  - Do course 1: R Programming, Lesson 1
  - I would like to see how you are doing
  - You should have done it and 13 lessons by yourself, from home
Use the code file
- Execute some commands from the code file to see the output
  - The packages AER & DCchoice must be installed for this course
- Usual math functions: log, exp, sign, sqrt, abs, min, max
  - log(exp(sin(pi/4)^2)*exp(cos(pi/4)^2)) — type in the console
- Special vectors
  - ones <- rep(1, 10)
  - even <- seq(from = 2, to = 20, by = 2)
  - trend <- 1981:2005
  - diag(4) identity matrix of size 4
Matrix Operations
- A <- matrix(1:6, nrow = 2)
  - A — look at what it prints & how R gives the position of the elements
  - Look at your Environment window: A is now there
  - It remains in your project until erased (the brush icon)
- t(A) = transpose of A (not A')
- dim(A) = dimensions of A (rows, then columns)
  - nrow(A); ncol(A) number of rows; of columns
- A[i,j] extracts element (i,j)
  - Does not remove it from the matrix
- A[,j] extracts column j (all the rows) into one vector
  - A[i,] same for row i
- A1 <- A[1:2, c(1, 3)] — A1 has 2 rows containing the elements in rows 1 to 2 and columns 1 & 3 of A
  - For this particular matrix, same result with A[,-2]
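A minimal sketch of these extraction rules, using the matrix A defined above:

```r
# Build A as on the slide: 2 rows, filled column by column
A <- matrix(1:6, nrow = 2)

A[2, 3]                  # element in row 2, column 3 -> 6
A[, 2]                   # column 2 as a vector -> c(3, 4)
A[1, ]                   # row 1 as a vector -> c(1, 3, 5)
A1 <- A[1:2, c(1, 3)]    # keep rows 1-2 and columns 1 & 3
identical(A1, A[, -2])   # TRUE: dropping column 2 gives the same result here
```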
Matrix Operations
- det(A1) determinant
  - solve(A1) inverse
- A %*% B matrix product
  - A*A element-by-element product
- crossprod(A, B) efficient calculation of A'B
  - diag(A1) extracts the diagonal
- cbind(1, A1) "combines" one column of ones and A1
- rbind(A1, diag(4, 2)) "stacks" A1 & a diagonal matrix of size 2 with 4 on the diagonal
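A short sketch of cbind and rbind with the A1 defined earlier:

```r
A1 <- matrix(1:6, nrow = 2)[, c(1, 3)]  # a 2x2 matrix, as on the earlier slide

X <- cbind(1, A1)           # prepend a column of ones (the 1 is recycled to 2 rows)
dim(X)                      # 2 x 3

S <- rbind(A1, diag(4, 2))  # stack A1 on top of a 2x2 diagonal matrix with 4s
dim(S)                      # 4 x 2
```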
Dataframe
- "Frame" = "context"
- In R, a "dataframe" is a data matrix
  - a collection of vectors of the same length
  - stacked together horizontally
- Each vector = 1 column = "variable"
  - Possibly of different natures
    - quantitative (numeric) but also qualitative, characters, dates...
  - It may contain meta-data
    - e.g. variable type or category names
- Each row = 1 observation in the sample
Dataframe Creation
- Several ways
  - keyboard (cf. swirl R Programming, lesson 7)
  - read an R file
  - import
- Keyboard example
  - alternative 1
    - mydata <- data.frame(one = 1:10, two = 11:20, three = 21:30)
  - alternative 2
    - mydata <- as.data.frame(matrix(1:30, ncol = 3)) and names(mydata) <- c("one", "two", "three")
- R is not very good for encoding data manually
  - But we use this example to explain attachment (below)
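The two keyboard alternatives above can be run and compared directly:

```r
# Alternative 1: directly with data.frame()
mydata <- data.frame(one = 1:10, two = 11:20, three = 21:30)

# Alternative 2: convert a matrix (filled column by column), then set the names
mydata2 <- as.data.frame(matrix(1:30, ncol = 3))
names(mydata2) <- c("one", "two", "three")

all.equal(mydata, mydata2)  # TRUE: the two routes give the same dataframe
```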
attach
- A dataframe is "attached"
  - with the command attach
  - then variable names in the dataframe may be used directly in commands
- For example
  - mean(two) produces an error message
  - attach(mydata) and then mean(two) produces the average of variable "two"
- detach(mydata) is self-explanatory
  - Why detach? e.g. to avoid confusions
Subset Selection
- As seen in swirl, a subset of a dataframe can be accessed by [ or $
  - $ extracts a single variable
- The command subset sometimes works better (e.g. conditional selection)
  - e.g. mydata.sub <- subset(mydata, two <= 16, select = -two)
  - selects all the obs. of variables one & three
  - for which the obs. of variable two are <= 16
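Running the subset example on the mydata frame built earlier:

```r
mydata <- data.frame(one = 1:10, two = 11:20, three = 21:30)

# Keep rows where two <= 16, and drop the variable two itself
mydata.sub <- subset(mydata, two <= 16, select = -two)

names(mydata.sub)  # "one" "three": variable two was dropped by select = -two
nrow(mydata.sub)   # 6: two takes the values 11..16 in the first 6 rows
```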
Export (write) a dataframe
- write.table(mydata, file = "mydata.txt", col.names = TRUE)
  - creates a .txt file mydata.txt in the working directory
    - normally where your project is
  - Meta-data are not passed
- The text file format is

  "one" "two" "three"
  "1" 1 11 21
  "2" 2 12 22
  ...

  - So it looks like the column headers are shifted left (the row-name column has no header)
  - Take that into account with the software you use to open it
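A small round-trip sketch; it writes to a temporary file rather than the working directory's mydata.txt, purely to keep the example self-contained:

```r
mydata <- data.frame(one = 1:3, two = 11:13, three = 21:23)

f <- tempfile(fileext = ".txt")   # temporary path; the slide uses "mydata.txt"
write.table(mydata, file = f, col.names = TRUE)

# read.table understands the shifted header: the extra first column on each
# data line is taken as row names
back <- read.table(f, header = TRUE)
identical(back$two, mydata$two)   # TRUE: the values survive the round trip
```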
Import (read) a dataframe
- The Environment window has a button that makes it easier
  - a preview is generated
  - Use the "Import Dataset" button in the Environment window to read "mydata.txt" back into R
- Import from another software: Excel, Stata, SAS...
  - Easiest: if you have access to the software, export the data file in txt or csv
    - loss of meta-data
  - R-Studio proposes several formats
    - It often does not work, as these softwares change their formats often
  - Use Google
    - e.g. "R import Stata 17 data"
  - Also www.statmethods.net/input/importingdata.html
    - for a few formats
Outline
R basic operations
R graphics
Linear Regressions
Discussing Regressors and Model Building
Plot
- You have seen some plots in the swirl course R Programming, lesson 15: Base Graphics
- A few additional graphic elements using the base plot command
  - Packages lattice & ggplot2 are better
    - varianceexplained.org/RData/code/code_lesson2/
  - R has many publication-quality graphics
  - But they are not very intuitive
- plot() is the default graphic command for many objects:
  - dataframes, time series, fitted linear models
  - it is also an old, crude command
  - but many R programmers have connected their packages with it
Examples with data("CPS1988")
- Data file CPS1988 is preloaded in the AER package
  - Population survey, March 1988, US Census Bureau
  - 28,155 obs., cross-section
  - Men, 18-70 years old
  - Income > US$ 50 in 1988
  - Not self-employed, not working without salary
- summary(CPS1988)
- Quantitative data
  - wage in $/week
  - education & (potential work) experience in years
"Scatterplots" – dispersion – XY
- Probably the most common in statistics, with histograms
  - We use CPS1988: a census data file on wage and its determinants
  - From the AER package
- attach(CPS1988)
  - plot(education, log(wage))
  - The first argument goes on the x-axis, the 2nd on the y-axis
  - To export a plot: "Export" button in the Plots window
  - There are several formats
    - png is easiest to use in word processing
- detach(CPS1988)
- plot(log(wage)~education, data=CPS1988)
  - alternative to avoid attaching the dataframe

R Graphic Parameters
- A plot may be modified in many ways
  - E.g. the argument type controls whether the plot is made of points (type = "p"), lines (type = "l"), both (type = "b"), steps (type = "s") or others
- Several dozen parameters may be modified
  - See ?par
  - They may be modified after the plot with the command par()
  - Or they can be supplied in the plot() command, e.g.
    plot(log(wage)~education, data=CPS1988, pch=20, col="blue", ylim=c(4,10), xlim=c(0,20), main="Wage by education years")
R Additional Graphics
- Add layer(s) to a plot: lines(), points(), text(), legend()
  - Add a straight line: abline(a, b)
    - a intercept, b slope
- Barplots, pie charts, boxplots, QQ plots & histograms
  - barplot(), pie(), boxplot(), qqplot(), hist()
  - We'll see below

Histograms & boxplots
- Continue with the CPS1988 database on wage & its determinants
  - summary(CPS1988) reveals that some variables are categorical
  - Categorical: called factors in R
- Factors are vectors of categories
  - sometimes with metadata
    - e.g. category names
  - g <- rep(0:1, c(2,4))
  - g <- factor(g, levels=0:1, labels=c("male", "female"))
    - Names the categories (0,1) of g as "male" (=0) & "female" (=1)
  - so g is [1] male male female female female female
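The factor construction above, run end to end:

```r
g <- rep(0:1, c(2, 4))   # 0 0 1 1 1 1: two 0s then four 1s
g <- factor(g, levels = 0:1, labels = c("male", "female"))

g            # [1] male male female female female female
levels(g)    # "male" "female"
table(g)     # counts per category: male 2, female 4
```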
Factors in CPS1988
- In CPS1988, the factors are
  - ethnicity: caucasian "cauc" & african-american "afam"
  - smsa: residence in urban area
  - region
  - parttime: works part-time
- Plots according to data type
  - Numerical/quantitative or categorical
  - Single variable or 2 in relation
One numerical variable: histogram & density
- hist(wage, freq=FALSE)
  - option freq=FALSE
    - relative frequencies, else absolute (counts)
  - option breaks=zzz
    - "bin" = container: choose the number of rectangles, hence the length of their bases
- hist(log(wage), freq=FALSE)
  - lines(density(log(wage)), col=4)
  - The command density is actually a non-parametric estimate of the density function
- Remarks
  - the log distribution is less asymmetrical than the raw data
  - data in logs are often closer to a normal
  - That is often the case with economic data & a rationale for the normality hypothesis
One categorical
- With categorical data
  - Mean & variance have no meaning
  - But frequencies do
- summary(region): absolute frequencies (counts)
- tab <- table(region): stores these frequencies in a table called tab
  - prop.table(tab) computes the proportions (relative frequencies)
- Barplots & pie charts often visualise categorical data quite well
  - barplot(tab)
  - pie(tab)
  - These plots can be modified using parameters
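A self-contained sketch of the same steps; the small region factor here is made up for illustration, standing in for CPS1988's region variable:

```r
# A stand-in for CPS1988's region factor
region <- factor(c("northeast", "south", "south", "west", "south", "midwest"))

tab <- table(region)   # absolute frequencies (counts)
prop.table(tab)        # relative frequencies, summing to 1

barplot(tab)           # bar heights = counts
pie(tab)               # slice angles = proportions
```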
2 categorical
- Usually presented in a contingency table
  - xtabs() with a formula interface:
    - e.g. xtabs(~ ethnicity + region, data = CPS1988)
    - data is optional since it is still attached
  - table(ethnicity, region): same results
- A plot of that is a "spine plot"
  - plot(ethnicity ~ region) — formula
  - plot(ethnicity, region) — what differences?
2 numerical
- The correlation coefficient r is often used
  - For positive & asymmetrical variables: Spearman's ρ
  - rank correlation, instead of values, is often preferred because r is not robust to asymmetry
- cor(log(wage), education)
- cor(log(wage), education, method="spearman")
  - Results differ a bit
- plot(log(wage)~education)
  - the scatterplot shows little correlation
  - but the log makes it difficult to see graphically
1 numerical & 1 categorical
- Often, conditional moments are calculated
  - e.g. average wage by ethnicity
    - tapply(log(wage), ethnicity, mean)
  - "Applies" the command mean to log(wage) within each category of ethnicity
  - mean may be replaced by any valid command, e.g. quantile
- Box plots & QQ (quantile-quantile) plots are often used
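A sketch of tapply on made-up data (six log wages and a two-level factor assumed here, standing in for CPS1988):

```r
# Stand-in data for log(wage) and ethnicity
lw  <- c(5.0, 5.5, 6.0, 4.5, 5.2, 6.3)
eth <- factor(c("cauc", "afam", "cauc", "afam", "cauc", "afam"))

tapply(lw, eth, mean)      # one conditional mean per category
tapply(lw, eth, quantile)  # any valid function works, e.g. quantile
```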
1 numerical & 1 categorical : Box plot
I A box plot is a crude representation of an empirical distribution
I The box is limited by “hinges” (1º & 3º quartiles) and show the median
I Outside of the box, 2 lines indicate the smallest & largest obs.
I within 1.5×size of the box from the closest hinge I Any obs. outside is represented by separate points I
boxplot(log(wage)~ethnicity)
1 numerical & 1 categorical: QQ plot
- A QQ plot matches the quantiles of 2 (empirical) distributions
  - Recall that quantiles are quantities
    - e.g. the 1st quartile of afam wage is the wage s.t. 25% of afam make less & 75% make more
  - If the 2 distributions are identical: QQ plot = diagonal
  - Otherwise, if e.g. cauc make more than afam, then
    - with cauc on the x-axis, the QQ plot will be below the diagonal
  - A bit like a plot of income inequality, but with 2 variables
  - awage <- subset(CPS1988, ethnicity == "afam")$wage
  - cwage <- subset(CPS1988, ethnicity == "cauc")$wage
  - qqplot(awage, cwage)
  - abline(0,1) overlays the diagonal (intercept 0, slope 1)
- detach(CPS1988) to close CPS1988
Outline
R basic operations
R graphics
Linear Regressions
Discussing Regressors and Model Building
Basic Regression Commands in R
- Linear Regression Model (LRM)

  y_i = x_i'β + ε_i, with i = 1...n

  - In matrix form: y = Xβ + ε
- Typical hypotheses in cross-sections
  - E(ε|X) = 0 (exogeneity)
  - Var(ε|X) = σ²I ("sphericity": homoscedasticity & no autocorrelation)
- In R, models are usually fitted by calling a command
  - For the LRM in cross-section: fm <- lm(formula, data, ...)
  - The argument ... replaces a series of arguments
    - describing the model
    - or choosing the computation mode (algorithm)
    - or options
Basic Regression Commands in R
- The lm command returns an object
  - Here: the fitted model under the name fm
  - It may be visualised in many ways or summarized
- The lm object can be used to compute/extract:
  - Predictions & fitted values, residuals, ... by means of fm$...
  - Tests & several post-estimation diagnostics
- Most estimation commands work the same way
  - Ideally do the swirl course Regression Models, lessons 1-7
  - My course will then be easier
Multivariate Linear Regression with Factors
- The purpose of this example is to demonstrate various R tools that are used to transform & combine regressors
- Dataframe: CPS1988, as before
Wage Equation
- Wage equation

  log(wage) = β1 + β2 experience + β3 experience² + β4 education + β5 ethnicity + ε

  cps_lm <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity, data = CPS1988)

- "Insulation function" I()
  - indicates to R that ^2 is to be understood as the square of experience
  - otherwise, R is unsure of the meaning and drops experience^2
  - That is because the model written in R is symbolic, not mathematical
  - This might be clearer with a formula y ~ a + b + c
    - In a regression context, this means y against 3 regressors
    - Instead y ~ a + I(b+c) means y against 2 regressors
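A hedged sketch on simulated data (not the CPS1988 regression itself) showing what I() changes in a formula:

```r
# Simulated data: y truly depends on x and x squared
set.seed(1)
x <- runif(100)
y <- 1 + 2 * x + 3 * x^2 + rnorm(100, sd = 0.1)

m1 <- lm(y ~ x + I(x^2))  # intercept, x, and x squared: 3 coefficients
m2 <- lm(y ~ x + x^2)     # bare x^2 is symbolic and collapses to x: 2 coefficients

length(coef(m1))  # 3
length(coef(m2))  # 2
```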
Results & Testing
- summary(cps_lm)
  - The return to education (on wage) is 8.57%/year
  - % interpretation because wage enters the model in log
  - Categorical variables are managed by R
    - which selects the reference category
- Compare nested models: ANOVA (Analysis of Variance) table
  - More constrained model: cps_noeth <- lm(log(wage) ~ experience + I(experience^2) + education, data = CPS1988)
  - Usually, the test is on more than one variable
  - anova(cps_noeth, cps_lm)
  - Compare non-nested models? next year
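The nested comparison can be sketched on simulated data (variable names here are made up, standing in for the CPS1988 pair cps_noeth/cps_lm):

```r
# Simulated sketch of a nested-model F test
set.seed(2)
n <- 200
educ <- rnorm(n)
eth  <- factor(sample(c("a", "b"), n, replace = TRUE))
y <- 1 + 0.5 * educ + 0.3 * (eth == "b") + rnorm(n)

small <- lm(y ~ educ)        # constrained model (eth restricted out)
big   <- lm(y ~ educ + eth)  # unconstrained model

anova(small, big)  # F test of the restriction(s) dropped in `small`
```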
Interactions: effects of combined regressors
- e.g. in labor economics: the combined effect of education & ethnicity
  - Does one year of education have the same return for different ethnicities?
- This is modeled with multiplicative terms
  - Consider log(wage) = β1 + β2 ethnicity + β3 ethnicity × education + β4 education + ε
  - Then ∂log(wage)/∂education = β3 ethnicity + β4
  - If ethnicity = 0, then the effect of 1 year of education is β4
  - If ethnicity = 1, then the effect of 1 year of education is β3 + β4
- More examples
  - Let a, b, c be three factors, each with several discrete levels
  - and x, y two continuous (quantitative) variables
Several Models/Formulas with Interactions
- y ~ a + x: no interaction
  - A single slope (of x) but one intercept for each level of factor a
- y ~ a*x: same as the previous model +
  - one interaction term for each level of a with x (different slopes)
  - In a more formal notation, let d_ai = I(a = i):

    [y ~ a*x]  ≡  [ y = Σ_i β_ai d_ai + Σ_i γ_ai x d_ai ]

  - Note: y ~ a*x is the same as y ~ a + x + a:x
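One way to see this expansion concretely is model.matrix, which shows the design matrix R builds from a formula (the toy factor and variable here are made up):

```r
# How R expands y ~ a*x: one intercept and one slope per level of a
a <- factor(c("low", "low", "high", "high"))  # levels sorted: "high" is reference
x <- c(1, 2, 3, 4)

colnames(model.matrix(~ a * x))
# "(Intercept)" "alow" "x" "alow:x"
```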
Formulas with Interactions
- y ~ (a+b+c)^2
  - models all the pairwise (2-variable) interactions
  - but not the 3-variable one
  - So, for a & b, this is like as many dichotomous variables as there are level pairs: d_ai,bj = I(a = i ∧ b = j)
  - and similarly for a & c and for c & b
- swirl course Regression Models
  - Lesson 8: MultiVar Examples3

Interactions Wage eq.: ethnicity & education
- cps_int <- lm(log(wage) ~ experience + I(experience^2) + education*ethnicity, data = CPS1988)
  - Only one of the "+" from cps_lm has been replaced by *
- coeftest(cps_int)
  - A more compact version of summary()
  - It can also be used on some other regression commands
- The regression outputs the effects of education & ethnicity
  - called "main effects"
  - and the product of education & an indicator for the level "afam" of ethnicity
  - Why afam? Probably because it is less numerous than cauc
Interactions Wage eq.: ethnicity & education
- afam has a negative effect on the intercept
  - lower average wage for african-americans
  - AND on the slope of education
    - lower return of education for african-americans
- The effect is not very significant though
  - since 5% significance with a sample of nearly 30,000 individuals is not very convincing
  - Next year, we'll see specifications in which this effect disappears
Predictions
- First define the values for which you want to predict
  - We simplify the model to experience & education for ease of presentation
  - Let's say we want to show the effect of experience at an average level of education
- Create a new data frame with a column of average education & a column of all the possible values of experience
  - Note that in the Census, some people have negative experience!
  - This is because experience is computed as age - education - 6, so people who complete their studies early may have "negative" experience
- Use a predict() command on
  - the lm object of interest: cps_lm here
  - the new data set for which we want a prediction: cps2 here
  - predict() can give not only a prediction but also bounds
  - Plot that on the data
- detach(CPS1988) when you are done, to avoid confusion
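The same steps on simulated data (fit, new frame with education at its mean, prediction with bounds); the names fit/nd/pr are made up, standing in for cps_lm/cps2:

```r
# Sketch of predict() with a newdata frame
set.seed(3)
exper <- runif(100, 0, 40)
educ  <- rnorm(100, 12, 2)
lwage <- 4 + 0.03 * exper + 0.08 * educ + rnorm(100, sd = 0.2)
fit <- lm(lwage ~ exper + educ)

# New frame: education fixed at its mean, experience over a grid
nd <- data.frame(exper = 0:40, educ = mean(educ))
pr <- predict(fit, newdata = nd, interval = "prediction")

head(pr)  # columns fit, lwr, upr: point prediction and bounds
```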
Outline
R basic operations
R graphics
Linear Regressions
Discussing Regressors and Model Building
When building a model, there are 2 contradictory forces
- If we omit a regressor, and it is in fact relevant
  - unobserved heterogeneity & inconsistency of LS estimators
    - if it is correlated with included regressors
  - we can sometimes deal with that using instruments or panel data
- If we include irrelevant regressors that are correlated with relevant ones
  - we create multicollinearity, with the consequence that both relevant & irrelevant regressors may appear non-significant
  - That may even occur with 2 relevant regressors, e.g. in a quantity-price relation, the prices of substitute goods are relevant, but may be correlated with the own price
Collinearity – Endogeneity Trade-off
- From a statistical point of view, 2 collinear variables carry the same information
  - Their separate influence on the dependent variable cannot be assessed in the present sample
  - Be pragmatic: reject one of the 2 or merge them in some way that makes sense in context
- It is not really possible to escape such a trade-off
  - Especially since, in a particular sample, a relevant regressor may coincidentally appear non-significant (if the sample is not large)
- Theory does not help, by nature
  - since an empirical model is a trial of a theory
  - theory helps interpret results, not guide them
Progressive Inclusion
- is an old way of looking at model building
1. Among the potential regressors x, take the one with the highest correlation with y
2. Regress y on that single regressor
   - Is it significant?
     - No: you don't have a model
     - Yes: estimate the one-regressor model & compute its residuals
3. Among the remaining regressors, take the one with the highest correlation with the residuals
4. Repeat the previous steps with progressively more regressors
   - Until one is non-significant
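The first iteration of these steps can be sketched on simulated data (the names X, first, nxt are made up for illustration):

```r
# A crude sketch of one round of progressive inclusion
set.seed(4)
n <- 100
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 1 + 2 * X$x1 + 0.5 * X$x2 + rnorm(n)

# Steps 1-2: pick the regressor most correlated with y and regress y on it
c1 <- abs(cor(X, y))
first <- rownames(c1)[which.max(c1)]
m <- lm(y ~ ., data = X[, first, drop = FALSE])

# Step 3: among the rest, pick the one most correlated with the residuals
rest <- setdiff(names(X), first)
c2 <- abs(cor(X[, rest], resid(m)))
nxt <- rownames(c2)[which.max(c2)]
```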
Progressive Inclusion
- The issue with this approach is that if there are several relevant regressors
  - then at least the first step might be inconsistent
  - because at least one relevant regressor is missing
- This is a very serious issue that leads to nonsensical results
Progressive Elimination
- Instead, consider the "largest reasonable set of regressors"
  - it can be linked to the theory you want to test or to previous experience
- It is risky to just run this "encompassing" regression and report the results
  - because of multicollinearity
Progressive Elimination
- Gradually remove regressors one by one
  - Examine how the estimates of the remaining regressors evolve
  - Use ANOVA to test over several regressors
  - If there is a noticeable increase in significance
    - but not so much change in estimates
    - collinearity was an issue
  - If estimated coefficients change wildly
    - omitted-regressor endogeneity
- However
  - dropping a collinear regressor could lead to jumps in coefficient estimates
    - after all, collinearity affects their variance
  - dropping a relevant regressor does not necessarily lead to major changes in the other coefficients
    - when that regressor is not much correlated with the others
Summing up
- Model: y_i = β0 + β1 x1i + β2 x2i + ε_i (no missing relevant regressor)
  - estimation by OLS when x2 and x1 are correlated
  - if they are not, there are NO serious consequences for β̂1
  - "not relevant but correlated with a relevant regressor" might not be empirically common

  x2                        | Consequences on β̂1                        | on β̂2
  relevant, included        | Best case, but may still appear insignif.  |
  relevant, not included    | Inconsistent                               | –
  not relevant, included    | May appear insignificant                   | should → 0
  not relevant, not incl.   | No problem presumably                      | –
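The "relevant, not included" row can be illustrated with a small simulation (coefficients and the 0.8 correlation structure are made up for the sketch): omitting a relevant x2 that is correlated with x1 biases β̂1 away from its true value.

```r
# Sketch: omitted-variable bias when x2 is relevant AND correlated with x1
set.seed(5)
n <- 5000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)        # correlated with x1
y  <- 1 + 1 * x1 + 1 * x2 + rnorm(n)

coef(lm(y ~ x1 + x2))["x1"]      # close to the true value 1
coef(lm(y ~ x1))["x1"]           # omitting x2: drifts toward 1 + 0.8 = 1.8
```

The drift matches the textbook bias formula: the omitted-variable estimate converges to β1 + β2 · cov(x1, x2)/var(x1) = 1 + 0.8 here.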
Essay #1
In R, type data(package = "AER")
From the list of data sets, select the one closest to your family name. Use help(dataset name) for a description of the data set.
Write an R code that uses this dataset to:
1.
Plot a scatterplot, a histogram, a spine plot (between categorical variables) and a boxplot.
2.
Select a dependent variable and a couple of explanatory variables. Regress the former against the latter, with and without interactions (so: 2 regressions). This need not make much economic sense, as the focus is on the R code, but it is better when it is a sensible economic regression.
3.