Research in Applied Econometrics Chapter 1. R

Academic year: 2022


Pr. Philippe Polomé, Université Lumière Lyon 2

M1 APE Analyse des Politiques Économiques M1 RISE Gouvernance des Risques Environnementaux

2020 – 2021


Outline

R basic operations

R graphics

Linear Regressions

Discussing Regressors and Model Building

Are you operational?

- Create a project "RAE" in the folder for your courses
  - Download the RAE2020.R file from my course site
  - Save it in the same directory as your project
  - Open it from the RStudio editor
- RStudio remembers your projects
  - You can switch from one to another
  - All the files written to disk remain available
- swirl should be installed
  - Load it now, e.g. library(swirl)
  - Type swirl() in the console
  - Do course 1: R Programming, Lesson 1
  - I would like to see how you are doing
  - You should have done it and 13 lessons by yourself, from home

Use the code file

- Execute some commands from the code file to see the output
- The packages AER & DCchoice must be installed for this course
- Usual math functions: log, exp, sign, sqrt, abs, min, max
  - log(exp(sin(pi/4)^2)*exp(cos(pi/4)^2)) Type it in the console, then press Enter
- Special vectors
  - ones <- rep(1, 10)
  - even <- seq(from = 2, to = 20, by = 2)
  - trend <- 1981:2005
- diag(4): identity matrix of size 4
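The commands on this slide can be tried together as a short script. This is a minimal sketch in base R; the name id4 is introduced here only for illustration.

```r
# sin^2 + cos^2 = 1, so the expression reduces to log(exp(1)) = 1
x <- log(exp(sin(pi/4)^2) * exp(cos(pi/4)^2))
print(x)

ones  <- rep(1, 10)                      # ten ones
even  <- seq(from = 2, to = 20, by = 2)  # 2, 4, ..., 20
trend <- 1981:2005                       # 25 consecutive years
id4   <- diag(4)                         # 4 x 4 identity matrix (illustrative name)

print(length(trend))
print(sum(even))
```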

Matrix Operations

- A <- matrix(1:6, nrow = 2)
  - Type A to see what it looks like & how R gives the position of the elements
  - Look at your Environment window: A is now there
  - It remains in your project until erased (the brush icon)
- t(A): transpose of A (not A')
- dim(A): dimensions of A (rows, then columns)
- nrow(A); ncol(A): number of rows; number of columns
- A[i,j] extracts element (i,j)
  - Does not remove it from the matrix
- A[,j] extracts column j (all the rows) into one vector
  - A[i,] does the same for row i
- A1 <- A[1:2, c(1, 3)]: A1 has 2 rows containing the elements in rows 1 to 2 and columns 1 & 3 of A
  - For this particular matrix, same result with A[,-2]

Matrix Operations (continued)

- det(A1): determinant
- solve(A1): inverse
- A %*% B: matrix product
  - A*A: element-by-element product
- crossprod(A, B): efficient calculation of A'B
- diag(A1): extracts the diagonal
- cbind(1, A1): "combines" a column of ones and A1
- rbind(A1, diag(4, 2)): "stacks" A1 & a diagonal matrix of size 2 with 4 on the diagonal
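The matrix commands above can be checked on the small matrix A from the slides. In this sketch, B is a hypothetical second matrix (the slides do not define one), taken as t(A) so that the product A %*% B is defined.

```r
A  <- matrix(1:6, nrow = 2)      # filled column by column
A1 <- A[1:2, c(1, 3)]            # columns 1 and 3: a 2 x 2 matrix
B  <- t(A)                       # hypothetical 3 x 2 matrix, so that A %*% B exists

det_A1 <- det(A1)                # 1*6 - 5*2 = -4
inv_A1 <- solve(A1)              # inverse: inv_A1 %*% A1 is the identity
AB     <- A %*% B                # 2 x 2 matrix product
AtA    <- crossprod(A, A)        # t(A) %*% A, a 3 x 3 matrix
with1  <- cbind(1, A1)           # column of ones, then A1: 2 x 3
stack  <- rbind(A1, diag(4, 2))  # A1 above a 2 x 2 diagonal matrix of 4s: 4 x 2

print(det_A1)
print(stack)
```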

Dataframe

- "Frame" = "context"
  - In R, a "dataframe" is a data matrix
  - a collection of vectors of the same length
  - stacked together horizontally
- Each vector = 1 column = "variable"
  - Possibly of different natures
    - quantitative (numeric), but also qualitative, characters, dates...
  - It may contain metadata
    - e.g. variable type or category names
- Each row = 1 observation in the sample

Dataframe Creation

- Several ways
  - keyboard (cf. swirl, R Programming, lesson 7)
  - read an R file
  - import
- Keyboard example
  - Alternative 1
    - mydata <- data.frame(one = 1:10, two = 11:20, three = 21:30)
  - Alternative 2
    - mydata <- as.data.frame(matrix(1:30, ncol = 3)) and names(mydata) <- c("one", "two", "three")
- R is not very good for entering data manually
  - But we use this example to explain attachment (below)
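A quick way to see that the two alternatives build the same dataframe is to create both and compare them. In this sketch the second copy is called mydata2 (a name chosen here) so that both can coexist.

```r
mydata  <- data.frame(one = 1:10, two = 11:20, three = 21:30)

mydata2 <- as.data.frame(matrix(1:30, ncol = 3))
names(mydata2) <- c("one", "two", "three")   # straight quotes are required in R code

str(mydata)
same <- all(mydata == mydata2) && identical(names(mydata), names(mydata2))
print(same)
```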

attach

- A dataframe is "attached"
  - with the command attach
  - then the variable names in the dataframe may be used directly in commands
- For example
  - mean(two) produces an error message
  - attach(mydata) and then mean(two) produces the average of variable "two"
- detach(mydata) is self-explanatory
  - Why detach? E.g. to avoid confusion
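The attach / detach sequence can be traced step by step. This sketch assumes a fresh session in which no other object is called two; tryCatch is used only so that the expected error does not stop the script.

```r
mydata <- data.frame(one = 1:10, two = 11:20, three = 21:30)

# Before attaching, the name "two" is unknown outside the dataframe
before <- tryCatch(mean(two), error = function(e) "error")

attach(mydata)
during <- mean(two)      # now resolves to mydata$two
detach(mydata)

after <- exists("two")   # FALSE again once detached (no other object named "two")
print(c(before = before, during = during, after = after))
```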

Subset Selection

- As seen in swirl, a subset of a dataframe can be accessed by [ or $
  - $ extracts a single variable
- The command subset sometimes works better (e.g. conditional selection)
  - e.g. mydata.sub <- subset(mydata, two <= 16, select = -two)
  - selects all the observations of variables one & three
  - for which the observations of variable two are ≤ 16
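The three access methods can be compared on mydata. A minimal sketch; the names v_two, first3 and same_rows are introduced here for illustration.

```r
mydata <- data.frame(one = 1:10, two = 11:20, three = 21:30)

v_two  <- mydata$two                      # one variable, returned as a vector
first3 <- mydata[1:3, c("one", "three")]  # rows 1 to 3 of two chosen columns

# Rows where two <= 16, dropping column "two" from the result
mydata.sub <- subset(mydata, two <= 16, select = -two)
print(mydata.sub)

# The same selection written with [ ]
same_rows <- mydata[mydata$two <= 16, c("one", "three")]
```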

Export (write) a dataframe

- write.table(mydata, file = "mydata.txt", col.names = TRUE)
  - creates a text file mydata.txt in the working directory
    - normally where your project is
  - Metadata are not passed
  - The text file format is

    "one" "two" "three"
    "1" 1 11 21
    "2" 2 12 22
    ...

  - so it looks like the column headers are shifted left
  - Take that into account, depending on the software you use to open it

Import (read) a dataframe

- The Environment window has a button that makes it easier
  - a preview is generated
  - Use the "Import Dataset" button in the Environment window to read "mydata.txt" back into R
- Import from other software: Excel, Stata, SAS...
  - Easiest: if you have access to the software, export the data file as txt or csv
    - loss of metadata
  - RStudio offers several formats
    - It often fails, as these programs change their formats frequently
  - Use Google
    - e.g. "R import Stata 17 data"
    - Also www.statmethods.net/input/importingdata.html
      - for a few formats
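The export and import steps can also be done from code. This sketch writes to a temporary file instead of mydata.txt in the project folder (an assumption made so that it runs anywhere) and reads it back with read.table.

```r
mydata <- data.frame(one = 1:10, two = 11:20, three = 21:30)

f <- tempfile(fileext = ".txt")      # stand-in for "mydata.txt" in the project folder
write.table(mydata, file = f, col.names = TRUE)

head_lines <- readLines(f, n = 2)    # header has 3 fields, data rows have 4 (row name first)
print(head_lines)

# The header is one field shorter, so the first column is read as row names
back <- read.table(f, header = TRUE)
unlink(f)
```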

R graphics

Plot

- You have seen some plots in the swirl course R Programming, lesson 15: Base Graphics
- A few additional graphic elements using the base command plot
  - The packages lattice & ggplot2 are better
    - varianceexplained.org/RData/code/code_lesson2/
  - R has many publication-quality graphics
  - But they are not very intuitive
- plot() is the default graphic command for many objects:
  - dataframes, time series, fitted linear models
  - it is also an old, crude command
  - but many R programmers have connected their packages with it

Examples with data("CPS1988")

- The dataset CPS1988 is included in the AER package
  - Population survey of March 1988, US Census Bureau
  - 28 155 observations, cross-section
  - Men, 18-70 years old
  - Income > US$ 50 in 1988
  - Not self-employed, not working without salary
- summary(CPS1988)
- Quantitative data
  - wage: $/week
  - education & (potential work) experience: years

"Scatterplots" – dispersion – XY

- Probably the most common plot in statistics, along with histograms
- We use CPS1988: a census data file on wage and its determinants
  - From the AER package
- attach(CPS1988)
  - plot(education, log(wage))
    - The first argument goes on the x-axis, the second on the y-axis
  - To export a plot: "Export" button in the Plots window
    - There are several formats
    - png is the easiest to use in word processing
- detach(CPS1988)
- plot(log(wage) ~ education, data = CPS1988)
  - alternative that avoids attaching the dataframe

R Graphic Parameters

- The result of a plot may be modified in many ways
  - E.g. the argument type controls whether the plot is made of points (type = "p"), lines (type = "l"), both (type = "b"), steps (type = "s") or others
- Several dozen parameters may be modified
  - See ?par
  - They may be set with the command par(), which applies to the plotting commands that follow
  - Or they can be supplied in the plot() command, e.g.
    plot(log(wage)~education, data=CPS1988, pch=20, col="blue", ylim=c(4,10), xlim=c(0,20), main="Wage by education years")

R Additional Graphics

- Add layer(s) to a plot: lines(), points(), text(), legend()
  - Add a straight line: abline(a, b)
    - a intercept, b slope
- Bar plots, pie charts, box plots, QQ plots & histograms
  - barplot(), pie(), boxplot(), qqplot(), hist()
  - We'll see below
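Graphic parameters and layers can be combined as follows. This is a sketch on simulated data (x and y are not CPS1988 variables), drawn on a null PDF device so that it runs without a graphics window; remove the pdf(NULL) and dev.off() lines to see the plot in RStudio.

```r
set.seed(1)
x <- runif(200, 0, 18)                     # simulated regressor (illustrative only)
y <- 5 + 0.08 * x + rnorm(200, sd = 0.5)   # simulated outcome

pdf(NULL)                                  # null device: nothing is written to disk
plot(y ~ x, pch = 20, col = "blue", xlim = c(0, 20), ylim = c(3, 9),
     main = "Simulated data", xlab = "x", ylab = "y")
cf <- coef(lm(y ~ x))                      # intercept and slope of the fitted line
abline(a = cf[1], b = cf[2], col = "red")  # straight line layer
points(mean(x), mean(y), pch = 4, cex = 2) # one extra point at the sample means
text(2, 8.5, "fitted line in red", pos = 4)
legend("bottomright", legend = c("data", "fit"), pch = c(20, NA),
       lty = c(NA, 1), col = c("blue", "red"))
invisible(dev.off())
```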

Histograms & boxplots

- Continue with the CPS1988 dataset on wage & its determinants
  - summary(CPS1988) reveals that some variables are categorical
  - Categorical variables are called factors in R
- Factors are vectors of categories
  - sometimes with metadata
    - e.g. category names
  - g <- rep(0:1, c(2, 4))
  - g <- factor(g, levels = 0:1, labels = c("male", "female"))
    - Labels the categories (0, 1) of g as "male" (= 0) & "female" (= 1)
    - so g is [1] male male female female female female
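The factor example can be run as is; the extra calls below only display what a factor stores.

```r
g <- rep(0:1, c(2, 4))                 # 0 0 1 1 1 1
g <- factor(g, levels = 0:1, labels = c("male", "female"))

print(g)
print(levels(g))       # the category names stored as metadata
print(as.integer(g))   # the underlying integer codes (1 = first level)
print(table(g))        # counts per category
```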

Factors in CPS1988

- In CPS1988, the factors are
  - ethnicity: caucasian "cauc" & african-american "afam"
  - smsa: residence in an urban area
  - region
  - parttime: works part-time
- Plots according to data type
  - Numerical/quantitative or categorical
  - Single variable or 2 in relation

One numerical variable: histogram & density

- hist(wage, freq = FALSE)
  - option freq = FALSE
    - densities (relative frequencies scaled so that the total area is 1), else absolute counts
  - option breaks = zzz
    - "bin" = container: sets the number of bins or their cut points, hence the length of the base of the rectangles
- hist(log(wage), freq = FALSE)
- lines(density(log(wage)), col = 4)
  - The command density computes a non-parametric (kernel) estimate of the density function
- Remarks
  - the distribution of the log is less asymmetrical than that of the raw data
  - data in log are often closer to a normal
  - That is often the case with economic data & a rationale for the normality hypothesis
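The remark on asymmetry can be illustrated on simulated data. This sketch assumes a log-normal variable w (not the CPS1988 wage) and a simple moment-based skewness function written here for the purpose.

```r
set.seed(42)
w <- rlnorm(5000, meanlog = 6, sdlog = 0.6)   # simulated positive, right-skewed variable

skew <- function(v) mean((v - mean(v))^3) / sd(v)^3   # moment-based skewness

pdf(NULL)                                     # null device, no file written
hist(w, freq = FALSE, breaks = 40)            # breaks suggests the number of bins
hist(log(w), freq = FALSE, breaks = 40)
lines(density(log(w)), col = 4)               # kernel density estimate over the histogram
invisible(dev.off())

print(c(raw = skew(w), logged = skew(log(w))))
```

The skewness of the raw variable should come out clearly positive, and that of its log close to zero.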

One categorical

- With categorical data
  - Mean & variance have no meaning
  - But frequencies do
- summary(region): absolute frequencies (counts)
- tab <- table(region): stores these frequencies in a table called tab
- prop.table(tab): computes the proportions (relative frequencies)
- Bar plots & pie charts often visualise categorical data quite well
  - barplot(tab)
  - pie(tab)
  - These plots can be modified using parameters

2 categorical

- Usually presented in a contingency table
  - xtabs() with a formula interface:
    - e.g. xtabs(~ ethnicity + region, data = CPS1988)
    - data is optional since CPS1988 is still attached
  - table(ethnicity, region): same results
- A plot of that is a "spine plot"
  - plot(ethnicity ~ region): formula
  - plot(ethnicity, region): what differences?
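Frequency tables for one and two categorical variables can be sketched on simulated factors. The dataframe d and its variables grp and area are hypothetical stand-ins for ethnicity and region.

```r
set.seed(7)
d <- data.frame(
  grp  = factor(sample(c("a", "b"), 300, replace = TRUE)),
  area = factor(sample(c("north", "south", "west"), 300, replace = TRUE))
)

tab <- table(d$area)                  # absolute frequencies
print(prop.table(tab))                # relative frequencies, summing to 1

xt <- xtabs(~ grp + area, data = d)   # contingency table, formula interface
tb <- table(d$grp, d$area)            # same counts

pdf(NULL)                             # null device, no file written
barplot(tab)
plot(grp ~ area, data = d)            # spine plot: shares of grp within each area
invisible(dev.off())
```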

2 numerical

- The correlation coefficient r is often used
  - For positive & asymmetrical variables: Spearman's ρ
    - rank correlation, instead of correlation of values, is often preferred because r is not robust to asymmetry
- cor(log(wage), education)
- cor(log(wage), education, method = "spearman")
  - Results differ a bit
- plot(log(wage) ~ education)
  - the scatterplot shows little correlation
  - but the log makes it difficult to see graphically
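One reason rank correlation is robust: it does not change under a monotone transformation such as the log, whereas r does. A sketch on simulated data (educ and wage here are generated, not taken from CPS1988):

```r
set.seed(3)
educ <- sample(8:18, 1000, replace = TRUE)              # simulated years of schooling
wage <- exp(5 + 0.08 * educ + rnorm(1000, sd = 0.6))    # simulated right-skewed outcome

pearson_raw  <- cor(wage, educ)
pearson_log  <- cor(log(wage), educ)
spearman_raw <- cor(wage, educ, method = "spearman")
spearman_log <- cor(log(wage), educ, method = "spearman")

print(c(pearson_raw, pearson_log, spearman_raw, spearman_log))
```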

1 numerical & 1 categorical

- Often, conditional moments are calculated
  - e.g. average wage by ethnicity
  - tapply(log(wage), ethnicity, mean)
    - "Applies" the command mean to log(wage) within each category of ethnicity
    - mean may be replaced by any valid command, e.g. quantile
- Box plots & QQ (quantile-quantile) plots are often used

1 numerical & 1 categorical: Box plot

- A box plot is a crude representation of an empirical distribution
  - The box is limited by "hinges" (1st & 3rd quartiles) and shows the median
  - Outside of the box, 2 lines indicate the smallest & largest observations
    - within 1.5 × the size of the box from the closest hinge
    - Any observation outside is represented by a separate point
- boxplot(log(wage) ~ ethnicity)
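Conditional means and the numbers behind a box plot can be inspected together. This sketch uses a simulated numeric variable lw and a simulated two-level factor grp, both hypothetical.

```r
set.seed(5)
grp <- factor(rep(c("g1", "g2"), times = c(300, 100)))   # simulated two-level factor
lw  <- rnorm(400, mean = ifelse(grp == "g1", 6.2, 5.9), sd = 0.6)

means <- tapply(lw, grp, mean)        # one mean per level of grp
quart <- tapply(lw, grp, quantile)    # any summary command can replace mean

pdf(NULL)                             # null device, no file written
bp <- boxplot(lw ~ grp)               # also returns the numbers behind the plot
invisible(dev.off())

print(means)
print(bp$stats)   # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
```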

1 numerical & 1 categorical: QQ plot

- A QQ plot matches the quantiles of 2 (empirical) distributions
  - Recall that quantiles are quantities
    - e.g. the 1st quartile of afam wage is the wage such that 25% of afam make less & 75% make more
  - If the 2 distributions are identical: QQ plot = diagonal
  - Otherwise, if e.g. cauc make more than afam, then
    - with cauc on the x-axis, the QQ plot will be below the diagonal
  - A bit like the plot of income inequality, but with 2 variables
  - awage <- subset(CPS1988, ethnicity == "afam")$wage
  - cwage <- subset(CPS1988, ethnicity == "cauc")$wage
  - qqplot(awage, cwage)
    - in this call afam is on the x-axis, so if cauc make more the points lie above the diagonal
  - abline(0, 1) overlays the diagonal (intercept 0, slope 1)
- detach(CPS1988) to close CPS1988
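The position of a QQ plot relative to the diagonal can be checked numerically. This sketch uses two simulated samples, low and high (hypothetical names, not the afam and cauc wages), with the lower-valued group on the x-axis.

```r
set.seed(11)
low  <- rlnorm(500, meanlog = 5.8, sdlog = 0.6)   # simulated group with lower values
high <- rlnorm(500, meanlog = 6.1, sdlog = 0.6)   # simulated group with higher values

pdf(NULL)                  # null device, no file written
q <- qqplot(low, high)     # x-axis: quantiles of low; y-axis: quantiles of high
abline(0, 1)               # diagonal: where the two distributions would coincide
invisible(dev.off())

share_above <- mean(q$y > q$x)   # share of points lying above the diagonal
print(share_above)
```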

Linear Regressions

Basic Regression Commands in R

- Linear Regression Model (LRM): $y_i = x_i'\beta + \epsilon_i$ with $i = 1 \dots n$
  - In matrix form: $y = X\beta + \epsilon$
- Typical hypotheses in cross-sections
  - $E(\epsilon \mid X) = 0$ (exogeneity)
  - $\mathrm{Var}(\epsilon \mid X) = \sigma^2 I$ ("sphericity": homoscedasticity & no autocorrelation)
- In R, models are usually fitted by calling a command
  - For the LRM in cross-section: fm <- lm(formula, data, ...)
  - The argument ... stands for a series of arguments
    - describing the model
    - or choosing the computation mode (algorithm)
    - or options

Basic Regression Commands in R (continued)

- The lm command returns an object
  - Here: the fitted model under the name fm
  - It may be visualised in many ways or summarised
- The lm object can be used to compute/extract:
  - Predictions & fitted values, residuals, ... by means of fm$...
  - Tests & several post-estimation diagnostics
- Most estimation commands work the same way
- Ideally do swirl, course: Regression Models, lessons 1-7
  - My course will then be easier
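What the lm object contains can be explored on a small simulated example. The dataframe d and its coefficients are made up for illustration.

```r
set.seed(2)
d <- data.frame(x = rnorm(100))
d$y <- 1 + 2 * d$x + rnorm(100)      # simulated data with known coefficients

fm <- lm(y ~ x, data = d)

print(coef(fm))                      # same as fm$coefficients
print(head(residuals(fm)))           # same as fm$residuals
print(head(fitted(fm)))              # same as fm$fitted.values
print(summary(fm)$r.squared)         # summary() returns an object too
```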

Multivariate Linear Regression with Factors

- The purpose of this example is to demonstrate various R tools that are used to transform & combine regressors
- Dataframe: CPS1988 as before

Wage Equation

- Wage equation

  $\log(\text{wage}) = \beta_1 + \beta_2\,\text{exp} + \beta_3\,\text{exp}^2 + \beta_4\,\text{education} + \beta_5\,\text{ethnicity} + \epsilon$

  cps_lm <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity, data = CPS1988)
- "Insulation function" I()
  - tells R that ^2 must be understood as the square of experience
  - otherwise, R reads ^ as a formula operator (crossing of terms), experience^2 reduces to experience, and the squared term is dropped
  - That is because the model written in R is symbolic, not mathematical
  - This might be clearer with a formula y ~ a + b + c
    - In a regression context, this means y against 3 regressors
    - Instead, y ~ a + I(b+c) means y against 2 regressors (a and the sum b+c)
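The effect of I() is visible in the columns of the design matrix that R builds from a formula. A sketch with simulated variables a, b, c and y; model.matrix is the base R function that returns that matrix.

```r
set.seed(4)
d <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20), y = rnorm(20))

cols_sep  <- colnames(model.matrix(y ~ a + b + c, data = d))     # 3 regressors + intercept
cols_sum  <- colnames(model.matrix(y ~ a + I(b + c), data = d))  # 2 regressors + intercept
cols_pow  <- colnames(model.matrix(y ~ a + a^2, data = d))       # a^2 collapses to a
cols_ipow <- colnames(model.matrix(y ~ a + I(a^2), data = d))    # the square is kept

print(list(cols_sep, cols_sum, cols_pow, cols_ipow))
```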

Results & Testing

- summary(cps_lm)
  - The return to education (in terms of wage) is 8.57% per year
    - % interpretation because wage is in log in the model
  - Categorical variables are managed by R
    - which selects the reference category
- Compare nested models: Anova (Analysis of Variance) table
  - More constrained model:
    cps_noeth <- lm(log(wage) ~ experience + I(experience^2) + education, data = CPS1988)
  - Usually, the test is on more than one variable
  - anova(cps_noeth, cps_lm)
  - Comparing non-nested models? Next year
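The nested-model comparison works the same way on any pair of lm objects. This sketch uses simulated data in which the factor g matters by construction; small and large are hypothetical names playing the roles of cps_noeth and cps_lm.

```r
set.seed(6)
n <- 200
d <- data.frame(x = rnorm(n), g = factor(sample(c("u", "v"), n, replace = TRUE)))
d$y <- 1 + 0.5 * d$x + 1 * (d$g == "v") + rnorm(n)   # g matters by construction

small <- lm(y ~ x, data = d)        # constrained model: g excluded
large <- lm(y ~ x + g, data = d)    # unconstrained model

cmp <- anova(small, large)          # F test of the restriction
print(cmp)
p_value <- cmp[["Pr(>F)"]][2]       # p-value of the test, second row of the table
```

The coefficient of g is reported as gv: R took the first level, u, as the reference category.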

Interactions: effects of combined regressors

- e.g. in labor economics: the combined effect of education & ethnicity
  - Does one year of education have the same return for different ethnicities?
- This is modeled with multiplicative terms
  - Consider

    $\log(\text{wage}) = \beta_1 + \beta_2\,\text{ethnicity} + \beta_3\,\text{ethnicity} \times \text{education} + \beta_4\,\text{education} + \epsilon$

  - Then $\partial \log(\text{wage}) / \partial\,\text{education} = \beta_3\,\text{ethnicity} + \beta_4$
    - If ethnicity = 0, then the effect of 1 year of education is $\beta_4$
    - If ethnicity = 1, then the effect of 1 year of education is $\beta_3 + \beta_4$
- More examples
  - Let a, b, c be three factors, each with several discrete levels
  - and x, y two continuous (quantitative) variables

Several Models/Formulas with Interactions

- y ~ a + x: no interaction
  - A single slope (of x) but one intercept for each level of factor a
- y ~ a*x: same as previous model +
  - one interaction term for each level of a with x (different slopes)
  - In a more formal notation, let $d_{a_i} = I(a = i)$:

    $[y \sim a*x] \quad y = \sum_i \beta_{a_i} d_{a_i} + \sum_i \gamma_{a_i}\, x\, d_{a_i}$

  - Note: y ~ a*x is the same as y ~ a + x + a:x
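The equivalence of the two notations can be verified on the design matrices. A sketch with a simulated three-level factor a and a simulated continuous x:

```r
set.seed(8)
d <- data.frame(a = factor(rep(c("l1", "l2", "l3"), each = 10)), x = rnorm(30))
d$y <- rnorm(30)

m_add  <- model.matrix(y ~ a + x, data = d)         # one slope, level-specific intercepts
m_star <- model.matrix(y ~ a * x, data = d)         # adds level-specific slopes
m_long <- model.matrix(y ~ a + x + a:x, data = d)   # same model, written out

print(colnames(m_add))
print(colnames(m_star))
```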

Formulas with Interactions

- y ~ (a+b+c)^2
  - models all the interactions between 2 variables
  - but not between 3
  - So this is like having as many dichotomous variables as there are combinations of levels, $d_{a_i b_j} = I(a = i,\ b = j)$, for a & b
  - and similarly for a & c and for c & b
- swirl course Regression Models
  - Lesson 8: MultiVar Examples3
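The terms that R derives from such a formula can be listed without any data, using terms() from base R. The three-way formula f3 is added here only for contrast.

```r
f <- y ~ (a + b + c)^2
labels2 <- attr(terms(f), "term.labels")   # main effects and 2-way interactions
print(labels2)

f3 <- y ~ a * b * c                        # for contrast: includes the 3-way term
labels3 <- attr(terms(f3), "term.labels")
print(labels3)
```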

Interactions in the wage equation: ethnicity & education

- cps_int <- lm(log(wage) ~ experience + I(experience^2) + education*ethnicity, data = CPS1988)
  - Only one of the "+" from cps_lm has been replaced by *
- coeftest(cps_int) (from the package lmtest, loaded with AER)
  - A more compact version of summary()
  - That can also be used on some other regression commands
- The regression outputs the effects of education & ethnicity
  - called "main effects"
  - and the product of education & an indicator for the level "afam" of ethnicity
    - Why afam? Because R takes the first level of a factor, here cauc, as the reference category

Interactions in the wage equation: ethnicity & education (continued)

- afam has a negative effect on the intercept
  - lower average wage for African-Americans
  - AND on the slope of education
    - lower return to education for African-Americans
- The effect is not very significant though
  - since significance at 5% with a sample of nearly 30 000 individuals is not very convincing
  - Next year, we'll see specifications in which this effect disappears

Predictions

- First define the values for which you want to predict
  - We simplify the model to experience & education for ease of presentation
  - Let's say we want to show the effect of experience at an average level of education
- Create a new data frame with a column of average education & a column of all the possible values of experience
  - Note that in the Census, some people have negative experience!
    - This is because experience is computed as age - education - 6, so that people who complete their studies early may have "negative" experience
- Use a predict() command on
  - the lm object of interest: cps_lm here
  - the new dataset for which we want predictions: cps2 here
  - predict() can give not only a prediction but also bounds
  - Plot that on the data
- detach(CPS1988) when you are done to avoid confusion
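The prediction steps can be sketched as follows. The data are simulated (sim is not CPS1988, and its coefficients are made up), and new plays the role of the cps2 dataframe mentioned above.

```r
set.seed(9)
n <- 500
sim <- data.frame(experience = sample(0:40, n, replace = TRUE),
                  education  = sample(8:18, n, replace = TRUE))
sim$lwage <- 4.5 + 0.08 * sim$experience - 0.0013 * sim$experience^2 +
  0.09 * sim$education + rnorm(n, sd = 0.5)     # simulated, not CPS1988

fit <- lm(lwage ~ experience + I(experience^2) + education, data = sim)

# New data: every experience value in range, education held at its mean
new <- data.frame(experience = min(sim$experience):max(sim$experience),
                  education  = mean(sim$education))
pr <- predict(fit, newdata = new, interval = "prediction")  # columns fit, lwr, upr

pdf(NULL)                                       # null device, no file written
plot(lwage ~ experience, data = sim, col = "grey")
lines(new$experience, pr[, "fit"])              # predicted profile
lines(new$experience, pr[, "lwr"], lty = 2)     # lower bound
lines(new$experience, pr[, "upr"], lty = 2)     # upper bound
invisible(dev.off())
```

Passing the data through the data argument, as here, avoids attach altogether.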

Discussing Regressors and Model Building

When building a model, there are 2 contradictory forces

- If we omit a regressor, and it is in fact relevant
  - unobserved heterogeneity & inconsistency of LS estimators
    - if it is correlated with included regressors
  - we can sometimes deal with that using instruments or panel data
- If we include irrelevant regressors that are correlated with relevant ones
  - we create multicollinearity, with the consequence that both relevant & irrelevant regressors may appear non-significant
  - That may even occur with 2 relevant regressors, e.g. in a quantity-price relation, the prices of substitute goods are relevant, but may be correlated with the own price

Collinearity – Endogeneity Trade-off

- From a statistical point of view, 2 collinear variables carry the same information
  - Their separate influence on the dependent variable cannot be assessed in the present sample
  - Be pragmatic: drop one of the 2 or merge them in some way that makes sense in context
- It is not really possible to escape such a trade-off
  - Especially since in a particular sample, a relevant regressor may coincidentally appear non-significant (if the sample is not large)
- Theory does not help by nature
  - since an empirical model is a trial of a model
  - theory helps to interpret results, not to guide them

Progressive Inclusion

- is an old way of looking at model building
  1. Among potential regressors x, take the one with the highest correlation with y
  2. Regress y on that single regressor
     - Is it significant?
       - No: you don't have a model
       - Yes: estimate the one-regressor model & compute its residuals
  3. Among the remaining regressors, take the one with the highest correlation with the residuals
  4. Repeat the previous steps with progressively more regressors
     - Until one is non-significant

Progressive Inclusion (continued)

- The issue with this approach is that if there are several relevant regressors
  - then at least the first step might be inconsistent
  - because at least one relevant regressor is missing
- This is a very serious issue that leads to nonsensical results
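The inconsistency of the first inclusion step can be seen in a small simulation. This is a sketch under assumed values: two relevant regressors x1 and x2, correlated by construction, each with a true coefficient of 1.

```r
set.seed(10)
n  <- 5000
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)      # relevant and correlated with x1
y  <- 1 + 1 * x1 + 1 * x2 + rnorm(n)     # true coefficient on x1 is 1

b_short <- coef(lm(y ~ x1))["x1"]        # first inclusion step: x2 is missing
b_full  <- coef(lm(y ~ x1 + x2))["x1"]   # both relevant regressors included

print(c(short = unname(b_short), full = unname(b_full)))
```

With x2 omitted, the coefficient on x1 also picks up the part of x2 that moves with x1, so it settles well above 1 however large the sample.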

Progressive Elimination

- Instead, consider the "largest reasonable set of regressors"
  - can be linked to the theory you want to test or to previous experience
- It is risky to just run this "encompassing" regression and report the results
  - because of multicollinearity

Progressive Elimination (continued)

- Gradually remove regressors one by one
  - Examine how the estimates of the remaining regressors evolve
  - Use ANOVA to test over several regressors
  - If there is a noticeable increase in significance
    - but not so much change in estimates
    - collinearity was an issue
  - If estimated coefficients change wildly
    - omitted regressor endogeneity
- However
  - dropping a collinear regressor could lead to jumps in coefficient estimates
    - after all, collinearity affects their variance
  - dropping a relevant regressor does not necessarily lead to major changes in the other coefficients
    - when that regressor is not much correlated with the others

Summing up

- Model $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i$ (no missing relevant regressor)
  - estimation by OLS when $x_2$ and $x_1$ are correlated
    - if they are not, there are NO serious consequences for $\hat\beta_1$
  - "not relevant but correlated with a relevant regressor" might not be empirically common

| $x_2$ | | Consequences on $\hat\beta_1$ | on $\hat\beta_2$ |
| relevant | included | Best case, but still may appear insignificant | |
| relevant | not included | Inconsistent | – |
| not relevant | included | May appear insignificant | should be ≈ 0 |
| not relevant | not included | No problem presumably | – |

Essay #1

In R, type data(package = "AER")

From the list of datasets, select the one closest to your family name. Use help(dataset name) for a description of the dataset.

Write R code that uses this dataset to:

1. Plot a scatterplot, a histogram, a spine plot (between categorical variables) and a box plot.
2. Select a dependent variable and a couple of explanatory variables. Regress the former against the latter, with and without interactions (so: 2 regressions). This need not make much economic sense, as the focus is on the R code, but it is better when it is a sensible economic regression.
3. Discuss all results, plots and regressions (including significance) in the R code, in comments immediately following each command.
