Research in Applied Econometrics Chapter 1. R
Research in Applied Econometrics Chapter 1. R
Pr. Philippe Polomé, Université Lumière Lyon 2
M1 APE Analyse des Politiques Économiques M1 RISE Gouvernance des Risques Environnementaux
2018 – 2019
Research in Applied Econometrics Chapter 1. R SWIRL
Outline
SWIRL
Data Management R graphics
Linear Regressions
Discussing Regressors and Model Building
Document Edition Functionalities
Research in Applied Econometrics Chapter 1. R SWIRL
SWIRL
I
Do Course 1 : R programming, Lessons 1-9 + 14 by yourself
I To quit a lesson : escI Answer “no” to any proposition to “register”
I Following ...
I press←-
I Sometimes, much text is to be read – that is a good exercice
I
Follow the commands in the RAE2017.R
I They follow the slidesI
We do just Lesson 1
I To make sure you can start the other lessons by yourself
Research in Applied Econometrics Chapter 1. R SWIRL
SWIRL R programming overview
1 : Basic Building Blocks 2 : Workspace and Files 3 : Sequences of Numbers 4 : Vectors
5 : Missing Values 6 : Subsetting Vectors 7 : Matrices and Data Frames 8 : Logic
9 : Functions 10 : lapply and sapply
11 : vapply and tapply 12 : Looking at Data
13 : Simulation 14 : Dates and Times
15 : Base Graphics
Research in Applied Econometrics Chapter 1. R SWIRL
Use RAE.r : A few commands outside of SWIRL
I
In R-Studio, you should have created a new project (upper right button)
I and called it “RAE” for example I Stored it where you can find it back I If not, do it now
I
Execute the commands on RAE2017.R to see the output
IUsual math functions : log, exp, sign, sqrt, abs, min, max
I log(exp(sin(pi/4)^2)*exp(cos(pi/4)^2))Type in Console←- I
Special vectors
I ones <- rep(1, 10)
I even <- seq(from = 2, to = 20, by =2) I trend <- 1981 :2005
I
diag(4) Identity mtx of size 4
Research in Applied Econometrics Chapter 1. R SWIRL
Mtx Operations
I
A<-matrix(1 :6, nrow = 2)
I Alook what it looks like & how R gives the position of the elements
I Look @ your environment window : A is now there I It remains in you project until erased (the brush)
I
t(A) = transpose of A (
not A’)
Idim(A) = dimensions of A (R then C)
Inrow(A) ; ncol(A) nbr R ; C
I
A[i,j] extract element (i,j)
I Does not remove it from the mtx
I
A[,j] extract C j (all the R) into one vector
I A[i,]same for R iI
A1<-A[1 :2, c(1, 3)] A1 has 2 R containing the elts in R 1 to 2 and C 1 & 3 from A
I For this particular mtx, same result w/A[,-2]
Research in Applied Econometrics Chapter 1. R SWIRL
Mtx Operations
I
det(A1) determinant
Isolve(A1) inverse
IA %*% B mtx product
I A*Aelement-by-element product
I
crossprod(A, B) efficient calculation of A’B
Idiag(A1) extract diag
I
cbind(1, A1) “combine” one C of ones and A1
. . .
. → . .
I
rbind(A1, diag(4, 2)) “stack” A1 & a diag mtx of size 2 w/ 4 on the diag
. . . .
↑ . . . .
Research in Applied Econometrics Chapter 1. R Data Management
Outline
SWIRL
Data Management
R graphics
Linear Regressions
Discussing Regressors and Model Building
Document Edition Functionalities
Research in Applied Econometrics Chapter 1. R Data Management
Dataframe
I
“Frame” = “context”
I In R, a “Dataframe” is a data mtx I a collection of vectors of same length I Stacked together horizontaly
I
Each vector = 1 C = “variable”
I Possibly of different natures
I quantitative, numeric but qualitative, characters, dates...
I it may further contain meta-data I e.g. variable type or categories name
I
Each row = 1 obs in the sample
Research in Applied Econometrics Chapter 1. R Data Management
Dataframe Creation
I
Several ways
I keyboard (cfr Swirl programming lesson 7) I read R file
I import I
keyboard example
I alternative 1
I mydata <- data.frame(one = 1 :10, two = 11 :20, three = 21 :30)
I alternative 2
I mydata <- as.data.frame(matrix(1 :30, ncol=3))and names(mydata) <- c(“one”, “two”, “three”)
I
R is not very good for encoding data manually
I But we use this example to explain attachment (below)
Research in Applied Econometrics Chapter 1. R Data Management
attach
I
A dataframe is “attached”
I w/ commandattach
I then variables’ names in the dataframe maybe used directly in commands
I
For example
I mean(two)produce an error message
I attach(mydata) and thenmean(two)produces the average of variable “two”
I
detach(mydata) is self-explanatory
I Why detach ? e.g. to avoid confusionsResearch in Applied Econometrics Chapter 1. R Data Management
Subset Selection
I
As seen in
swirla subset of a Dataframe can be accessed by [ or $
I $ extract a single variable
I
The command
subsetsometimes work better (e.g.
conditional selection)
I e.g.mydata.sub<-subset(mydata, two<=16, select = -two) I selects all the obs. of variables one & three
I fow which the obs of variable 2 are≤16
Research in Applied Econometrics Chapter 1. R Data Management
Export ( write ) a dataframe
I
write.table(mydata, file=“mydata.txt”, col.names=TRUE)
I create a txt file mydata.txt in the working directoryI normally where your project is I Meta-data are not passed
I The text file format is
“one” “two” “three”
“1” 1 11 21
“2” 2 12 22
...
I So that it looks like the C headers are shifted left
I Take that into account accordingly w/ the software you use to open it
Research in Applied Econometrics Chapter 1. R Data Management
Import ( read ) a dataframe
I
The Environment window has a button that makes it easy
I a preview is generatedI Use the "import dataset" button in the Environment window to read "mydata.txt" back into R
I
Import from another software : excel, stata, sas...
I Easiest : if you have access to the software, export the data file in txt or csv
I loss of meta-data
I R-Studio proposes several formats
I It does not work often as these softwares change their formats often
I Use Google
I e.g. “R import Stata 17 data”
I Alsowww.statmethods.net/input/importingdata.html I for a few formats
Research in Applied Econometrics Chapter 1. R R graphics
Outline
SWIRL
Data Management
R graphicsLinear Regressions
Discussing Regressors and Model Building
Document Edition Functionalities
Research in Applied Econometrics Chapter 1. R R graphics
Plot
I
First SWIRL
I course R-programming, lesson 15 Base graphics I
A few additional graphic elements using package
plotI Packageslattice ggplot2are better
I varianceexplained.org/RData/code/code_lesson2/
I R has many publication-quality graphics I But they are not very intuitive
I
plot( ) is the default graphic command for many objects :
I dataframes, time series, fitted linear modelsI it is also an old, crude, command
Research in Applied Econometrics Chapter 1. R R graphics
Examples with data("CPS1988")
I
Data file is
cps1988preloaded in the AER package
I Pop. survey March 1988, US Census Bureau I 28 155 obs., cross-sectionI Men, 18-70 y-o
I Income > US$ 50 in 1988
I Not self-employed, not working w/o salary I
summary(CPS1988)
I
Quantitative data
I wage$/weekI education&experience(=age-education-6) in years
Research in Applied Econometrics Chapter 1. R R graphics
“Scatterplots” – dispersion – XY
I
Probably the + common in stat, with histograms
I We use CPS1988 : a census data file on wage and itsdeterminants
I From the AER package I
attach(CPS1988)
I plot(education, log(wage))
I First is on arg in x-axis, 2nd in y-axis
I To export a plot : “Export” button in Plots window I There are several formats
I png is easiest to use in word processing
I
detach(CPS1988)
I
plot(log(wage)~education, data=CPS1988)
I alternative to avoid attaching the dataframeResearch in Applied Econometrics Chapter 1. R R graphics
R Graphic Parameters
I
A plot results may be modified in many ways
I E.g. argumenttypecontrols if the plot is made points (type = p), lines (type = l), both (type = b), steps (type = s) or others
I
Several dozens parameters may be modified
I See ?parI They may be modifiedafterthe plot w/ commandpar( ) I Or they can be supplied in theplot( ) command e.g.
plot(log(wage)~education, data=CPS1988, pch=20, col="blue", ylim=c(4,10), xlim=c(0,20), main="Wage by education years")
Research in Applied Econometrics Chapter 1. R R graphics
R Graphic Parameters
I
Add layer(s) to a plot : lines( ), points( ), text( ), legend( )
I Add a straight lineabline(a, b)I a intercept, b slope
I
1 plot over another
I x <- rnorm(50) I x2 <- rnorm(50, -1)I plot(ecdf(x), xlim = range(c(x, x2))) I ecdf empirical cumulative density function I plot(ecdf(x2), add = TRUE, lty = "dashed")
I
Barplots, pie charts, boxplots, QQ plots & histograms
I barplot( ), pie( ), boxplot( ), qqplot( ), hist( ) I We’ll see laterResearch in Applied Econometrics Chapter 1. R R graphics
Histograms & boxplots
I
Continue w/ CPS1988 data base on wage & its determinants
I summary(CPS1988)reveals that some variables are categorical I Categorical : calledfactorsin RI Factors
are vectors of categories
I sometimes w/metadataI e.g. categories names I g <- rep(0 :1, c(2,4))
I g <- factor(g, levels=0 :1, labels=c("male", "female")) I Name categories (0,1) of g into “Male”(=0) & “Female”
I so g is [1] male male female female female female
Research in Applied Econometrics Chapter 1. R R graphics
Factors in CPS1988
I
In CPS1988, the factors are
I ethnicityis caucasian “cauc” & african-american “afam”
I smsaresidence in urban area I region
I parttimeworks part-time I
Plots according to data type
I Numerical/Quantitative or categorical I Single variable or 2 in relation
Research in Applied Econometrics Chapter 1. R R graphics
One numerical variable : histogram & density
I
hist(wage, freq=FALSE)
I optionfreq=FALSEI relative frequencies, else absolute (counting) I optionbinwidth=zzz
I “bin” = container : chose the length of the base of the rectangles
I
hist(log(wage), freq=FALSE)
Ilines(density(log(wage)), col=4)
I Commanddensity is actually a non-parametric estimate of the density function (next year)
I
Remarks
I log distribution is less asymetrical than the raw data I data in log are often closer to a normal
I That is often the case w/ econ. data & a rationale for the normal hypothesis
Research in Applied Econometrics Chapter 1. R R graphics
One categorical
I
W/ categorical data
I Mean & variance have no meaning I But frequencies do
I
summary(region) : absolute frequencies (counts)
I
tab <- table(region) : stores these freq. in a table called tab
Iprop.table(tab) computes the proportions (relative freq.)
IBarplots & pie visualise often quite well cat. data
I barplot(tab) I pie(tab)
I These plots can be modified using parameters
Research in Applied Econometrics Chapter 1. R R graphics
2 categorical
I
Usually presented in a Contingency Table
I xtabs( )w/ aformula interface :I e.g.xtabs(~ ethnicity + region, data = CPS1988) I data is optional si it is stillattached
I table(ethnicity, region)mêmes résultats I
A plot of that is a “spine plot”
I plot(ethnicity ~ region)Formula
I plot(ethnicity, region)What differences ?
Research in Applied Econometrics Chapter 1. R R graphics
2 numerical
I
The Correlation Coefficient
ris typical
I For positive & asymetrical variables : Spearman’sρ
I rankscorrelation, instead of values, is often prefered becauser is not robust to asymetry
I
cor(log(wage), education)
I
cor(log(wage), education, method="spearman")
I Results differ a bitI
plot(log(wage)~education)
I scatterplot shows little correlation
I but log makes it difficult to see graphically
Research in Applied Econometrics Chapter 1. R R graphics
1 numerical & 1 categorical
I
Often, conditionnal moments are calculated
I e.g. average wage by ethnicityI tapply(log(wage), ethnicity, mean)
I “Applies” the cmd “mean” on the 2 variables ethnicity &
log(wage)
I Mean maybe replaced by any valid cmd, e.g quantile
I
The Box plots & QQ (quantile-quantile) plots are often used
Research in Applied Econometrics Chapter 1. R R graphics
1 numerical & 1 categorical : Box plot
I A box plot is a crude representation of an empirical distribution
I The box is limited by “hinges” (1º & 3º quartiles) and show the median
I Outside of the box, 2 lines indicate the smallest & largest obs.
I within 1.5×size of the box from the closest hinge I Any obs. outside is represented by separate points I
boxplot(log(wage)~ethnicity)
Research in Applied Econometrics Chapter 1. R R graphics
1 numerical & 1 categorical : QQ plot
I A QQ plot matchesthe quantiles of 2 (empirical) distributions
I Recall that quantiles are quantities
I e.g. the 1º quartile of afam wage is the wage s.t. 25% of afam make less & 75% +
I If the 2 distributions are identical : QQ plot = diagonal I Otherwise, if e.g. cauc make more than afam, then
I with cauc on the x-axis, the QQ plot will be below the diag.
I A bit like the plot of income inequality, but w/ 2 var.
I awage <- subset(CPS1988, ethnicity == "afam")$wage I cwage <- subset(CPS1988, ethnicity == "cauc")$wage I qqplot(awage, cwage)
I abline(0,1)overlay the diag (intercept 0, slope 1) I
detach(CPS1988) to close CPS1988
Research in Applied Econometrics Chapter 1. R Linear Regressions
Outline
SWIRL
Data Management R graphics
Linear Regressions
Discussing Regressors and Model Building
Document Edition Functionalities
Research in Applied Econometrics Chapter 1. R Linear Regressions
Basic Regression Commands in R
I
Linear Regression Model LRM
yi
=
xi0β+
iw/
i= 1...n
I In mtx formy=Xβ+ I
Typical Hyp. in cross-sections
I E(|X) = 0 (exogeneity)
I Var(|X) =σ2I (“sphericity” : homoscedasticity & no autoc.) I
In R, models are usually fitted by calling a cmd
I For the LRM in cross-section :fm <- lm(formula, data,...) I Argument ... replace a series of arguments
I describing the model
I or choosing the computation mode (algorithm) I or options
Research in Applied Econometrics Chapter 1. R Linear Regressions
Basic Regression Commands in R
I
The lm cmd returns an
objectI Here : the fitted model under the namefm I Maybe visualised in many ways or summarized
I
The lm object can be used to compute :
I Predictions & fitted values, residuals, ... by means of fm$... see RAE2017
I Tests & several postestimations diagnostics I
Most estimation commands work the same way
Research in Applied Econometrics Chapter 1. R Linear Regressions
SWIRL
I
Do Lessons 1-6, course « Regression Models » in Swirl
I The others : laterI Concentrate on code, you know the econometrics
I Think of closing files that may have remained opened from the previous session
1.
“Introduction”
I To remember “A coefficient will be within 2 standard errors of its estimate about 95% of the time”
2.
“Residuals” is + difficult (reading + programming +concepts)
I Explains loopsI Forces to re-read previous cmds
I Make sure to execute program res_eqn.r when it shows up
Research in Applied Econometrics Chapter 1. R Linear Regressions
SWIRL
3.
“Least Squares Estimation” – nothing in particular
4.Introduction to Multivariable Regression
I Installmanipulatepreviously
I I am not sure of the stability of this lesson
I Do not edit the functionmyplotwhich will show up
I ! cor(gpa_nor, gch_nor) will be6= ˆβ, SWIRL expects =, so a bug
5.
“Residual Variation”
I “Gaussian elimination” shows that a k-regressors regression I may be seen as a succession of k 1-regressor regressions I DO NOT interpret this as model building or presentation or a
way to select results
6.
“MultiVar Examples” – nothing in particular
Research in Applied Econometrics Chapter 1. R Linear Regressions
Multivariare Linear Regression w/ Factors
I
The purpose of this example is to demonstrate various R tools
I that are used to transform & combine regressorsI
Dataframe :
cps1988as before
ISWIRL Course « Regression Models »
I lesson 7 : “MultiVar Examples2”
I Plots window for BoxPlot I sapply : use help in help window
Research in Applied Econometrics Chapter 1. R Linear Regressions
Wage Equation
I
Wage Equation
log (wage) =
β1+β
2exp+β3exp2+β
4education+β5ethnicity+
cps_lm<-lm(log(wage)~experience+I(experience^2) +education+ethnicity, data=CPS1988)
I
“Insulation function” I( )
I indicates to R that ^2 be understood as the square of exp I otherwise, R is unsure of the meaning and withdraws
experience^2
I This might be clearer w/ a formula y ~ a + (b+c)
I Are there 2 variables on the RHS of the formula : a et (b+c), or are there 3 ?
I To clarify, write y ~ a + I(b+c)
Research in Applied Econometrics Chapter 1. R Linear Regressions
Results & Testing
I
summary(cps_lm)
I The return of education (to the wage) is 8.57%/year I % interpretation because wage is in log model I Categorical variables are managed by R
I that selects the reference cat.
I
Compare Nested Models : Anova (Analysis of Variance) Table
I Regression + constraintI cps_noeth<-lm(log(wage)~experience+
I(experience^2)+education, data=CPS1988) I Usually, the test is on + than one variable I anova(cps_noeth,cps_lm)
Research in Applied Econometrics Chapter 1. R Linear Regressions
Interactions : effects of combined regressors
I
e.g. in labor econ : the combined effect of education &
ethnicity
I Does one year of Education have the same return for different ethnicities ?
I
This is modeled w/
multiplicativeterms
I Considerlog (wage) =β1+β2ethnicity+β3ethnicity×education+β4education+
I Then∂log (wage)/∂education=β3ethinicity+β4
I Ifethinicity = 0, then the effect of 1 year of education isβ4
I Ifethinicity = 1, then the effect of 1 year of education is β3+β4
I
Let a, b, c three factors
I so that each has several discrete levels
I
and x, y two continuous variables (quantitative)
Research in Applied Econometrics Chapter 1. R Linear Regressions
Several Models/Formulas with Interactions
I
y~a+x : no interaction
I A single slope (of x) but one intercept for each level of factor a I
y~a*x : same as previous model +
I one interaction term for each level of a with x (different slopes) I In a more formal notation, letdai =I(a=i) :
[y∼a∗x]≡
"
y =βaiX
i
dai+γaixX
i
dai
#
Research in Applied Econometrics Chapter 1. R Linear Regressions
Formulas with Interactions
I
y~(a+b+c)^2
I models all the interactions at 2 variables I but not at 3
I So this is like as many dichotomous var. as the nbr of levels dai−bj=I(a=i∧b=j) for a & b
I and similarly for a & c and for c & b
I
SWIRL course Regression Models
I Lesson 8 : MultiVar Examples3Research in Applied Econometrics Chapter 1. R Linear Regressions
Interactions Wage eq. : ethnicity & education
I cps_int<-lm(log(wage)~experience+I(experience^2) +education*ethnicity, data=CPS1988)
I Only one of the “+” fromcps_lm has been replaced by* I
coeftest(cps_int)
I A + compact version of summary( )
I That can also be used on some other regression cmds I
The regression outputs the effects of education & ethnicity
I called “main effects”
I and the product of education & an indicator for the level
“afam” ofethnicity
I Why afam ? Probably because it is less numerous than cauc
Research in Applied Econometrics Chapter 1. R Linear Regressions
Interactions Wage eq. : ethnicity & education
I
afam has a neg. effect on the intercept
I lower average wage for african-american I AND on the slope ofeducationI lower return ofeducationfor african-american
I
The effect is not much significant though
I since a 5% significance with a sample of nearly 30 000 individuals is not much convincing
Research in Applied Econometrics Chapter 1. R Linear Regressions
Predictions
I
First define the values for which you want to predict.
I We simplify the model to exp. & educ. for ease of presentation I Let’s say we want to show the effect of Exp. at an average
level of Educ.
I
Create a new data frame w/ a C of average Educ & a C of all the possible values of Exp
I Note that in the Census, some people have negative experience !
I This is due to the way we compute Exp.
I
Use a predict( ) cmd on
I the lm object of interest : cps_lm here
I the new data set for which we want prediction : cps2 here I predict( ) can not only gives a prediction but also bounds I Plot that on the data
I
detach(CPS1988) when you are done to avoid confusion
Research in Applied Econometrics Chapter 1. R Discussing Regressors and Model Building
Outline
SWIRL
Data Management R graphics
Linear Regressions
Discussing Regressors and Model Building
Document Edition Functionalities
Research in Applied Econometrics Chapter 1. R Discussing Regressors and Model Building
When building a model, there are 2 contradictory forces
I
If we omit a regressor, and it is in fact relevant
I unobserved heterogeneity & inconsistency of LS estimators I if it is correlated to included regressors
I we sometimes can deal w/ that using instruments or panel I
If we include irrelevant regressor that are correlated w/
relevant ones
I we create multicollinearity w/ the csqce that both relevant &
irrelevant regressors may appear non-signif.
I That may even occur w/ 2 relevant regressors, e.g. in a Quantity-Price relation, the price of the subtitutes goods are relevant, but may be correlated w/ own price
Research in Applied Econometrics Chapter 1. R Discussing Regressors and Model Building
Collinearity – Endogeneity Trade-off
I
From a statistical point of view, 2 collinear variables carry the same information
I Their separate influence on the dependant variable cannot be assessed in the present sample
I Be pragmatic : reject one of the 2 or merge them in some way that makes sense in context
I
It is not really possible to escape such a trade-off
I Especially since in a particular sample, a relevant regressor may coincidentally appear non significant (if the sample is not large) I
Theory does not help by nature
I since an empirical model is a trial of a model I theory helps interpreting results, not guide them
Research in Applied Econometrics Chapter 1. R Discussing Regressors and Model Building
Progressive Inclusion
I
is an old way of looking at model building
1.
Among potential regressors
x, take the one w/ highestcorrelation w/
y2.
Regress
yon that single regressor
I Is it significant ?I No : you don’t have a model
I Yes : estimate the one-regressor model & compute its residuals
3.
Among the remaining regressors, take the one w/ highest correlation w/ the residuals
4.
Repeat previous steps with progressively more regressors
I Until one that is non-significantResearch in Applied Econometrics Chapter 1. R Discussing Regressors and Model Building
Progressive Inclusion
I
The issue w/ this approach is that if there are several relevant regressors
I then at least the first step might be inconsistent I because at least one relevant regressor is missing
I
This is a very serious issue that leads to non-sensical results
Research in Applied Econometrics Chapter 1. R Discussing Regressors and Model Building
Progressive Elimination
I
Instead, consider the “largest reasonnable set of regressors”
I can be linked to the theory you want to test or to previous experience
I
It is risky to just run this “encompassing” regression and report the results
I because of multicollinearity
Research in Applied Econometrics Chapter 1. R Discussing Regressors and Model Building
Progressive Elimination
I
Gradually remove regressors one by one
I Examine how the estimates of the remaining regressors evolve I If there is a noticeable increase in significance
I but not so much change in estimates I collinearity was an issue
I If estimated coefficients change wildly I omitted regressor endogeneity
I
However
I dropping collinear regressor could lead to jumps in coef estimates
I after all, collinearity affects their variance
I dropping a relevant regressor does not necessarily lead to major changes in the other coef
I when that regressor is not much correlated to the others
Research in Applied Econometrics Chapter 1. R Discussing Regressors and Model Building
Summing up
I
Model
yi=
β0+
β1x1i+
β2x2i+
i(no missing relevant regressor)
I estimation by MCO whenx2 andx1 are correlated I if they are not, there is NO serious consequences for ˆβ1
I “not relevant but correlated to a relevant regressor” might not be empirically common
x2
Consequences on ˆ
β1on ˆ
β2relevant
included May appear insignificant
not incl. Inconsistent –
not relev.
included May appear insignificant should
→0
not incl. ? ? –
Research in Applied Econometrics Chapter 1. R Document Edition Functionalities
Outline
SWIRL
Data Management R graphics
Linear Regressions
Discussing Regressors and Model Building
Document Edition FunctionalitiesResearch in Applied Econometrics Chapter 1. R Document Edition Functionalities
Writing with R
I
A few packages are designed to use R to write reports directly
1.The text is written directly in the script in the Editor window
I Math formulas in latex may be included
I Of course, R commands (graphics, regressions...) 2.
If the data change, or the model, everything is adjusted
automatically
3.
L
ATEX helps choose an appropriate format
I report, paper, presentationResearch in Applied Econometrics Chapter 1. R Document Edition Functionalities
SWeave – Knitr – Markdown
I SWeave
simply send the whole script to L
ATEX
I knitr
does the same but combine other packages and solve some issues in SWeave
I Markdown
is the current standard
I The script is directly printed using LATEX or .doc (Word) or html (webpage)
I Self-teach (I won’t look into it)
I http ://rmarkdown.rstudio.com/lesson-1.html
I https ://www.r-bloggers.com/how-to-create-reports-with-r- markdown-in-rstudio/