
(1)

Programmation dans

Ch. 4. Maximum Likelihood

M2 CEE

Pr. Philippe Polomé, Université Lumière Lyon 2

2017 – 2018


(2)

Outline

- Maximizing a Likelihood
- Maximum Simulated Likelihood
- NLogit

(3)

Contents

Maximum Likelihood
- Maximum Likelihood Reminder
- Programming your Own Likelihood

Maximum Simulated Likelihood
- Definitions
- Unobserved heterogeneity example
- Illustration

(4)

Reminder on Maximum Likelihood

I think it’s important to offer a reminder of ML before writing a non-standard ML. It will take some time.

(5)

ML 1

The probability density function, or pdf, of a random variable y, conditioned on a set of parameters θ, is denoted f(y|θ).

- This function identifies the data generating process that underlies an observed sample of data, and
- provides a mathematical description of the data that the process will produce.

The joint density of n independent and identically distributed (iid) observations from this process is the product of the individual densities:

f(y_1, ..., y_n|θ) = ∏_{i=1}^n f(y_i|θ) = L(θ|y)

This joint density is the likelihood function, defined as a function of the unknown parameter vector θ, where y is used to indicate the collection of sample data.

(6)

ML 2

It is usually simpler to work with the log of the likelihood function:

ln L(θ|y) = Σ_{i=1}^n ln f(y_i|θ)

It will usually be necessary to generalize the concept of the likelihood function to allow the density to depend on conditioning variables.

- Suppose the disturbance ε_i in the classical linear regression model y_i = x_i'β + ε_i is normally distributed with mean 0 and variance σ².
- Then, conditioned on its specific x_i, y_i is normally distributed with mean μ_i = x_i'β and variance σ².
- Thus the observed random variables are not iid: they have different means.

(7)

ML 3

But the observations are independent, and once each is centered at its own mean x_i'β they share the same distribution, so the log-likelihood is

ln L(θ|y, X) = Σ_{i=1}^n ln f(y_i|x_i, θ) = −(1/2) Σ_{i=1}^n [ln σ² + ln(2π) + (y_i − x_i'β)²/σ²]

where X is the n × K matrix of data with ith row equal to x_i'. The rest of this reminder is concerned with obtaining estimates of the parameters θ and with testing hypotheses about them.

(8)

ML 4

First consider the question of identification: whether estimation of the parameters is possible at all.

- Identification is an issue related to the formulation of the model.
- The question is: suppose we had an infinitely large sample, could we uniquely determine the values of θ from such a sample? The answer is sometimes no.

Identification. The parameter vector θ is identified (estimable) if, for any other parameter vector θ* ≠ θ, for some data y,

L(θ*|y) ≠ L(θ|y).

(9)

ML 5 : Example 1 Multicollinearity

For the linear regression model y_i = x_i'β + ε_i, suppose that there is a nonzero vector a such that x_i'a = 0 ∀ x_i.

- That is the case when there is perfect multicollinearity.
- Then there is another "parameter" vector, γ = β + a ≠ β, such that x_i'γ = x_i'β ∀ x_i.
- When this is the case, the log-likelihood is the same whether it is evaluated at β or at γ.
- As such, it is not possible to consider estimation of β in this model, since β cannot be distinguished from γ.
- Here identification (or the lack thereof) is associated with the data.
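A quick way to see this failure in practice: when one regressor is an exact linear function of another, R's lm() cannot distinguish β from γ = β + a and reports NA for the redundant coefficient. A minimal sketch on simulated data (all names are illustrative):

set.seed(1)
x1 <- rnorm(100)
x2 <- 2 * x1                      # exact collinearity: a = (2, -1)' gives 2*x1 - x2 = 0 for every i
y  <- 1 + x1 + rnorm(100)
coef(lm(y ~ x1 + x2))             # the coefficient on x2 is NA: beta and gamma fit the data equally well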

(10)

ML 6 : Example 2 Identification via normalization

Consider the LRM y_i = β_1 + β_2 x_i + ε_i, where ε_i|x_i ∼ N(0, σ²).

- Consider the context of a consumer's purchase of a large commodity such as a car, where
  - x_i is the consumer's income,
  - y_i is the difference between what the consumer is willing to pay for the car, p_i*, and the price of the car, p_i.
- Suppose that rather than observing p_i* or p_i, we observe only whether the consumer actually purchases the car, which, assume, occurs when y_i = p_i* − p_i > 0.
- Thus, the model states that the consumer will purchase the car if y_i > 0 and not purchase otherwise.
- The random variable in this model is "purchase" or "not purchase": there are only two outcomes.

(11)

ML 7 : Example 2 Identification via normalization

The probability of a purchase is

Pr{purchase|β_1, β_2, σ, x_i} = Pr{y_i > 0|β_1, β_2, σ, x_i}
 = Pr{β_1 + β_2 x_i + ε_i > 0|β_1, β_2, σ, x_i}
 = Pr{ε_i > −β_1 − β_2 x_i|β_1, β_2, σ, x_i}
 = Pr{ε_i/σ > (−β_1 − β_2 x_i)/σ|β_1, β_2, σ, x_i}
 = Pr{z_i > (−β_1 − β_2 x_i)/σ|β_1, β_2, σ, x_i}

where z_i has a standard normal distribution.

The probability of no purchase is one minus this probability.

(12)

ML 8 : Example 2 Identification via normalization

Thus the likelihood function is

∏_{i = purchase} Pr{purchase|β_1, β_2, σ, x_i} × ∏_{i = not purchase} [1 − Pr{purchase|β_1, β_2, σ, x_i}]

This is often rewritten as

∏_i Pr{purchase|β_1, β_2, σ, x_i}^{y_i} [1 − Pr{purchase|β_1, β_2, σ, x_i}]^{(1−y_i)}

The parameters of this model are not identified:

- If β_1, β_2 and σ are all multiplied by the same nonzero constant, regardless of what it is, then Pr{purchase} and the likelihood function do not change.
- This model requires a normalization. The one usually used is σ = 1.
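With the normalization σ = 1 and a normal disturbance, this is the probit log-likelihood. A minimal R sketch on simulated data (names and the data generating process are illustrative):

set.seed(42)
x <- rnorm(500)
ystar <- 0.5 + 1.0 * x + rnorm(500)        # latent y* with sigma = 1
purchase <- as.numeric(ystar > 0)          # we only observe purchase / not purchase

probit_nlogL <- function(par) {
  xb <- par[1] + par[2] * x
  # Pr{purchase} = pnorm(xb); log.p = TRUE keeps the computation numerically stable
  -sum(purchase * pnorm(xb, log.p = TRUE) + (1 - purchase) * pnorm(-xb, log.p = TRUE))
}
opt <- optim(c(0, 0), probit_nlogL, hessian = TRUE)
opt$par                                     # close to (0.5, 1): only beta/sigma is identified, and sigma is set to 1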

(13)

ML 9 Interpretation

- With a discrete rv, f(y_i|θ) is the probability of observing y_i conditionally on θ.
- The likelihood function is then the probability of observing the sample y (conditionally on θ).
- We assume that the sample that we have observed is the most likely one.
- What value of θ makes the observed sample most likely? Answer: the value of θ that maximizes the likelihood function, since then the observed sample has maximum probability.
- When y is a continuous rv, instead of a discrete one, we can no longer say that f(y_i|θ) is the probability of observing y_i conditionally on θ, but we retain the same principle.

(14)

ML 10 Interpretation

- The value of the parameter vector that maximizes L(θ|data) (or its log) is the maximum likelihood estimate, denoted θ̂.
- Since the logarithm is a monotonic function, the value that maximizes L(θ|data) is the same as the one that maximizes ln L(θ|data).
- The necessary condition for maximizing ln L(θ|data) is ∂ln L(θ|data)/∂θ = 0. This is called the likelihood equation.

(15)

Example : Likelihood Function and Equations for the Normal

Assume a sample y from a N(μ, σ²).

- The ln L function is
  ln L(μ, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − (1/2) Σ_{i=1}^n (y_i − μ)²/σ²
- The likelihood equations are
  ∂ln L/∂μ = (1/σ²) Σ_{i=1}^n (y_i − μ) = 0 and
  ∂ln L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (y_i − μ)² = 0
- These equations have an explicit solution:
  μ̂_ML = (1/n) Σ_{i=1}^n y_i = ȳ and
  σ̂²_ML = (1/n) Σ_{i=1}^n (y_i − ȳ)²
- Thus the sample mean is the ML estimator, while the ML estimator of the variance is not the usual sample variance (which has an n − 1 denominator).
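A minimal R check of these closed-form solutions on simulated data (a sketch; the sample is illustrative): numerical maximization of the same log-likelihood should reproduce ȳ and the n-denominator variance.

set.seed(123)
y <- rnorm(500, mean = 2, sd = 1.5)

nll <- function(par) -sum(dnorm(y, mean = par[1], sd = sqrt(par[2]), log = TRUE))
opt <- optim(c(0, 1), nll, method = "L-BFGS-B", lower = c(-Inf, 1e-6))  # bound keeps sigma^2 positive

opt$par                                    # numerical ML estimates of (mu, sigma^2)
c(mean(y), mean((y - mean(y))^2))          # closed-form ML estimates: ybar and (1/n)*sum((y - ybar)^2)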

(16)

ML Properties

Conditionally on correct distributional assumptions and under regularity conditions, ML has the following very good properties (proofs: see Greene). Notation: θ̂ is the ML estimator; θ_0 is the true value of the parameter vector; θ is any other value.

- Consistency: plim θ̂ = θ_0.
- Asymptotic normality: θ̂ ∼ N[θ_0, {I(θ_0)}^{−1}] asymptotically,
  - where I(θ_0) = −E[∂²ln L/∂θ_0 ∂θ_0'] is the information matrix,
  - and ∂f/∂θ_0 indicates ∂f/∂θ evaluated at θ_0.

(17)

ML Properties 2

- Asymptotic efficiency: θ̂ is asymptotically efficient if
  - it is consistent, asymptotically normally distributed,
  - and has an asymptotic covariance matrix that is not larger than the asymptotic covariance matrix of any other consistent, asymptotically normally distributed estimator;
  - θ̂ achieves the Cramér–Rao lower bound for consistent estimators.
- Invariance: the ML estimator of γ_0 = c(θ_0) is c(θ̂) if c(θ) is a continuous and continuously differentiable function.
- ML has only asymptotic properties: in small samples, it may be biased or inefficient.

(18)

Conditional Likelihoods *

The properties of ML have been studied for the density of an observed random variable and a vector of parameters, f(y_i|θ).

- But econometric models will involve exogenous or predetermined variables x; how does that change the results?
- Let the joint density of y_i and x_i be f(y_i, x_i|α).
- By Bayes' law, f(y_i, x_i|α) = f(y_i|x_i, α) g(x_i|α).
- g is the density of x_i; we assume that it is not of interest to the analysis, that is, that the parameter vector α can be partitioned into [θ, δ].

(19)

Conditional likelihoods 2 *

- The log-likelihood function may then be written
  ln L(θ, δ|data) = Σ_{i=1}^n ln f(y_i, x_i|α) = Σ_{i=1}^n ln f(y_i|x_i, θ) + Σ_{i=1}^n ln g(x_i|δ)
- As long as θ and δ have no element in common and no restriction connects them (such as θ + δ = 1), the two parts of the likelihood may be analysed separately.

Now that we have added covariates and parameters to the analysis, certain conditions must be met to maintain the properties of the ML estimator.

(20)

Conditional likelihoods 3 *

Below is a "minimal set" of conditions that suffice for a large majority of empirical studies.

- Parameter space. There must be no gaps or non-convexities in the parameter space; e.g. discrete parameters are not feasible.
- Identifiability.
- Well-behaved data. Primarily (Grenander conditions), as the sample grows:
  - the data do not converge to a sequence of zeros,
  - no single observation ever dominates a sequence of one variable (each observation becomes less important),
  - the data matrix always has full rank.
- Endogeneity is still an issue and must be dealt with appropriately (when feasible).

(21)

Standard Application : Linear Regression Model

The Linear Regression Model is y_i = x_i'β + ε_i.

- The likelihood function for a sample of n independent, identically and normally distributed disturbances is
  L = (2πσ²)^{−n/2} e^{−ε'ε/(2σ²)}
- The sample consists not of ε but of y. The transformation from ε to y is ε_i = y_i − x_i'β.
- What is the distribution of y? A reminder on the distribution of a transformation of a rv follows.

(22)

The distribution of a transformation of a rv

- If x is a continuous rv with pdf f_x(x) and if y = g(x) is a continuous monotonic function of x, then the density of y is obtained by using the change-of-variable technique to find the cdf of y.
- We start from x, then we change the variable to y via g(x).
- Let Pr(x ≤ a) = ∫_{−∞}^{a} f_x(x) dx; changing the variable to y = g(x), assuming g(−∞) = −∞ and writing g(a) = b, we obtain

  Pr(y ≤ b) = ∫_{−∞}^{b} f_x(g^{−1}(y)) |∂g^{−1}(y)/∂y| dy

(23)

The distribution of a transformation of a rv

- But by definition Pr(y ≤ b) = ∫_{−∞}^{b} f_y(y) dy, so
  f_y(y) = f_x(g^{−1}(y)) |∂g^{−1}(y)/∂y|.
- The term in absolute value is called the Jacobian of the transformation.
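A quick numerical check of this formula (a sketch; the choice g(x) = exp(x) with x standard normal is only illustrative): then g^{−1}(y) = log(y), |∂g^{−1}(y)/∂y| = 1/y, and the constructed density must coincide with R's built-in lognormal density.

y_grid <- seq(0.1, 5, by = 0.1)
f_y <- dnorm(log(y_grid)) * (1 / y_grid)   # f_x(g^{-1}(y)) times the Jacobian
max(abs(f_y - dlnorm(y_grid)))             # essentially zero: the two densities coincide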

(24)

Standard Application : Linear Regression Model

The pdf of y_i is then f(y_i − x_i'β)·|∂ε_i/∂y_i|, where f is the N(0, σ²) density; the Jacobian |∂ε_i/∂y_i| is one.

Therefore, the likelihood for the n observations of the sample is

L = (2πσ²)^{−n/2} e^{−(y − Xβ)'(y − Xβ)/(2σ²)}

Taking logs yields the familiar sum of squares

ln L = −(n/2) ln 2π − (n/2) ln σ² − (y − Xβ)'(y − Xβ)/(2σ²)

Computing the FOC yields

β̂_ML = (X'X)^{−1} X'y and σ̂²_ML = ε̂'ε̂/n

That is, the familiar OLS for β, while the ML estimator of σ² is biased (but consistent).
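A minimal R sketch on simulated data (names and the data generating process are illustrative) confirming that the ML solution reproduces the OLS coefficients while the ML variance divides by n rather than n − K:

set.seed(7)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)

fm <- lm(y ~ x)
coef(fm)                           # OLS = ML estimates of beta
sum(residuals(fm)^2) / n           # ML estimate of sigma^2 (divides by n, biased but consistent)
summary(fm)$sigma^2                # usual unbiased estimate (divides by n - K)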

(25)

Programming your Own Likelihood

The Generalized Cobb-Douglas production function (Zellner and Revankar, 1969) allows for returns to scale that vary with the level of output:

Y_i e^{θ Y_i} = e^{β_1} K_i^{β_2} L_i^{β_3}

where Y is output, K is capital and L is labor.

From a statistical point of view, this is a transformation of the dependent variable.

(26)

Programming your Own Likelihood

Introducing a multiplicative error leads to a kind of logarithmic form:

log Y_i + θ Y_i = β_1 + β_2 log K_i + β_3 log L_i + ε_i

- This model is non-linear in the parameters, and only for known values of θ can it be estimated by OLS.
- Instead, we can write the likelihood function and estimate all the parameters of the equation simultaneously (Zellner & Ryu, 1998).
- Assume that ε_i ∼ N(0, σ²), iid.
- We can write ε_i = log Y_i + θ Y_i − β_1 − β_2 log K_i − β_3 log L_i, so that ε_i/σ is standard normal, with density φ.
- However, the Jacobian of the transformation is ∂ε_i/∂Y_i = (1 + θ Y_i)/Y_i.

(27)

Programming your Own Likelihood

The likelihood of the model is

L = ∏_{i=1}^n [φ(ε_i/σ)/σ] · (1 + θ Y_i)/Y_i

- where ε_i = log Y_i + θ Y_i − β_1 − β_2 log K_i − β_3 log L_i.
- The log-likelihood is then

ℓ = Σ_{i=1}^n [log(1 + θ Y_i) − log Y_i] + Σ_{i=1}^n log[φ(ε_i/σ)/σ]

(28)

Programming your Own Likelihood

Write a function maximizing this log-likelihood with respect to the parameter vector (β_1, β_2, β_3, θ, σ²). Three steps:

1. code the objective function,
2. obtain starting values for an iterative optimization,
3. optimize the objective function using the starting values.

(29)

Step 1 : Code the Objective Function

To optimize, we will use the function optim(); by default it performs minimization, so we minimize the negative of the log-likelihood.

- data("Equipment", package = "AER")
- nlogL <- function(par) {   # defines the function to be minimized
- beta <- par[1:3]
- theta <- par[4]
- sigma2 <- par[5]   # five parameters: (β_1, β_2, β_3, θ, σ²)
- Y <- with(Equipment, valueadded/firms)   # with() evaluates the expression using the data passed to it; here we just divide some variables by another and give them short names (Y, K, L)
- K <- with(Equipment, capital/firms)
- L <- with(Equipment, labor/firms)

(30)

Step 1 : Code the Objective Function (continued)

- rhs <- beta[1] + beta[2] * log(K) + beta[3] * log(L)   # the right-hand side of the equation
- lhs <- log(Y) + theta * Y   # the transformed dependent variable
- rval <- sum(log(1 + theta * Y) - log(Y) + dnorm(lhs, mean = rhs, sd = sqrt(sigma2), log = TRUE))   # the log-likelihood; dnorm(..., log = TRUE) is the log of the normal density with the given mean and sd, so each term corresponds to (log Y_i + θY_i − β_1 − β_2 log K_i − β_3 log L_i)/√σ², i.e. (rv − mean)/sd, evaluated in vectorized form
- return(-rval)   # returns the negative of the log-likelihood
- }

(31)

Step 2 : Starting Values

- optim() proceeds iteratively, and thus (good) starting values are needed.
- They are obtained from fitting the classical Cobb-Douglas form by OLS:
  fm0 <- lm(log(valueadded/firms) ~ log(capital/firms) + log(labor/firms), data = Equipment)
- The resulting vector of coefficients, coef(fm0), is amended by
  - 0, the starting value for θ, and
  - the mean of the squared residuals from the Cobb-Douglas fit, the starting value for σ²:
- par0 <- as.vector(c(coef(fm0), 0, mean(residuals(fm0)^2)))

(32)

Step 3 : Optimize

- The new vector par0 containing all the starting values is used in the call to optim():
  opt <- optim(par0, nlogL, hessian = TRUE)
- hessian = TRUE requests the Hessian matrix (second derivatives, from which the covariance matrix of the parameters is obtained); we set it in order to obtain standard errors.
- Type opt to see the output; use ?optim for detailed explanations.
- By default, optim() uses the Nelder-Mead method (no gradient), but further algorithms are available.

(33)

Step 3 : Optimize

Parameter estimates, standard errors, and the value of the objective function at the estimates can now be extracted via

- opt$par
- sqrt(diag(solve(opt$hessian)))
- -opt$value

For practical purposes the solution above needs to be verified: several sets of starting values must be examined in order to confirm that the algorithm did not terminate in a local optimum.
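One way to carry out that check (a sketch; the perturbation scheme is only illustrative, and it assumes par0 and nlogL from the previous steps are still in the workspace): re-run optim() from several perturbed starting vectors and compare the attained objective values and estimates.

set.seed(1)
restarts <- lapply(1:5, function(r) {
  par_r <- par0
  par_r[1:4] <- par_r[1:4] + rnorm(4, sd = 0.1)   # perturb the betas and theta
  par_r[5]   <- par_r[5] * runif(1, 0.5, 2)       # rescale the variance start, keeping it positive
  optim(par_r, nlogL, hessian = TRUE)
})
sapply(restarts, function(o) o$value)             # should all reach (nearly) the same minimum
sapply(restarts, function(o) o$par)               # and (nearly) the same parameter estimates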

(34)

Homework 4

The Poisson estimator. The Poisson distribution is appropriate for a dependent variable y that takes only nonnegative integer values 0, 1, 2, ... It can be used to model the number of occurrences of an event, such as the number of patent applications by a firm or the number of doctor visits by an individual. The density of a Poisson random variable is f(y|λ) = e^{−λ} λ^y / y! with y = 0, 1, 2, ... The usual Poisson specification has λ = e^{x'β}, where x is a vector of regressors and β is a parameter vector; this guarantees that λ > 0. It can also be shown that E(y) = var(y) = λ. Program the likelihood of the Poisson estimator. Find data in R to apply your estimator.

(35)

Contents

Maximum Likelihood
- Maximum Likelihood Reminder
- Programming your Own Likelihood

Maximum Simulated Likelihood
- Definitions
- Unobserved heterogeneity example
- Illustration

(36)

Maximum Simulated Likelihood

We now consider the application of the ideas on simulation to ML estimation when no analytical expression is available for the density.

- Key result: simulation can lead to an estimator with the same distribution as the MLE, provided that the number of simulation draws made to compute the density for each observation goes to ∞.

(37)

Maximum Simulated Likelihood 2

Assume independence over observations and that y has conditional density f(y|x, θ), but suppose f(y|x, θ) involves an intractable integral, that is: there is no closed-form expression for f(y|x, θ).

- Instead, we replace the integral by a numerical approximation f̃(y|x, θ), and
- we maximize ln L̃_N(θ) = Σ_{i=1}^N ln f̃(y_i|x_i, θ) with respect to θ.
- The estimator will be
  - consistent and
  - have the same asymptotic distribution as ML,
  - if f̃(y|x, θ) is a good approximation to f(y|x, θ).
- The resulting first-order conditions are usually nonlinear and are solved by iterative methods.

(38)

Maximum Simulated Likelihood 3

There are several ways to compute the numerical approximation f̃(y|x, θ) – see Cameron & Trivedi ch. 12; we examine only a simulation approach.

- Suppose that we need to estimate the following expression, for which there is no closed-form solution:

  f(y_i|x_i, θ) = ∫ h(y_i|x_i, θ, u_i) g(u_i) du_i

- u_i is unobservable, so we cannot condition on it to estimate the parameter vector θ; we say u_i must be integrated out.

(39)

Maximum Simulated Likelihood 4

The direct simulator for f(y_i|x_i, θ) is the Monte Carlo integral estimate

f̃(y_i|x_i, u_i^S, θ) = (1/S) Σ_{s=1}^S h(y_i|x_i, θ, u_is)    (1)

- where u_i^S is a vector of S draws u_is, s = 1, ..., S,
- that are independent (observed!) draws from the unobserved g(u_i).
- We therefore must assume a distribution for the unobserved g(u_i).

We simply average h(y_i|x_i, θ, u_is) over the S draws.

f̃_i is unbiased for f_i, and consistent as the number of draws S → ∞.
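A minimal numerical sketch of such a direct simulator (illustrative choices: h is a normal density in y − θ − u and g is standard normal, so the true integral is available in closed form for comparison):

set.seed(99)
theta <- 1
y0 <- 0.5                                  # a single observation, for illustration
S <- 1e5
u <- rnorm(S)                              # draws from the assumed g(u), here N(0, 1)

f_tilde <- mean(dnorm(y0, mean = theta + u, sd = 1))   # direct simulator: (1/S) * sum of h
f_exact <- dnorm(y0, mean = theta, sd = sqrt(2))        # closed form: marginally, y ~ N(theta, 2)
c(f_tilde, f_exact)                                     # the two values should be very close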

(40)

Maximum Simulated Likelihood 5

- The direct simulator is one particular simulator.
- Other simulators exist, in some cases doing a better job at approximating f_i, depending on the distribution g(u_i).
- Generally we want the simulator f̃_i to be differentiable, so that gradient methods may be used to optimize the likelihood function.
- To eliminate the "chatter" caused by simulation and to help numerical convergence, the underlying Monte Carlo draws used to construct f̃_i should not be redrawn as θ changes across iterations.

(41)

Maximum Simulated Likelihood 6

The Maximum Simulated Likelihood estimator is then simply the θ̂_MSL that maximises

ln L̃_N(θ) = Σ_{i=1}^N ln f̃(y_i|x_i, θ) = Σ_{i=1}^N ln [(1/S) Σ_{s=1}^S h(y_i|x_i, θ, u_is)]

(42)

What is unobserved heterogeneity and why is it a critical issue ?

Given a regression model, e.g. y_i = G(x_i, ε_i, θ):

- Unobserved heterogeneity is a missing-regressor issue, z_i.
- The missing regressor is therefore captured by the error term ε_i(z_i).
- When the missing regressor is correlated with one or more of the x_i regressors, an endogeneity issue arises.
- Without properly addressing the issue, most estimators of θ will be inconsistent.
- Method-of-moments estimators are specifically designed to address this issue.

(43)

Unobserved heterogeneity example 2

One cannot know in general whether some regressors are missing.

- A Hausman test is a good precaution, but not a panacea.
- Consider a site choice model (i.e. a multinomial logit in which individuals choose holiday locations).
- If the regressors are only site characteristics, they are usually orthogonal to any individual characteristics,
- although one could think of people choosing their residence to be close to their preferred holiday site.

(44)

Unobserved heterogeneity example 2

When the regressors include a "part" of the endogenous variable, an endogeneity issue necessarily arises, as the error term will be correlated with the regressor via the same unobserved factors that affect the endogenous variable.

- Example: assume you have data on stated donations to Greenpeace and you want to regress them on a dichotomous variable stating whether the respondent is a member of an association for environmental protection.

(45)

Unobserved heterogeneity example 3

When the missing regressor is orthogonal to the included ones, endogeneity does not arise, but there is still an issue that may be addressed by MSL.

- Suppose that y_i ∼ N(θ_i, 1), where the scalar parameter θ_i varies across individuals with θ_i = θ + u_i, with u_i representing unobserved heterogeneity that is assumed to have a known distribution.
- The density of y conditional on u is simply
  f(y|u, θ) = (1/√(2π)) exp{−(y − θ − u)²/2}
- Inference on θ needs to be based on the marginal density of y (i.e. marginal with respect to u), which requires integrating out u.

(46)

Unobserved heterogeneity example 4

- Assume that u has the extreme value density g(u) = e^{−u} exp(−e^{−u}), a skewed distribution that has nonzero mean and, for simplicity, does not depend on unknown parameters.
- Maximum likelihood estimation is not possible, as the marginal density f(y|θ), which equals ∫ h(y|x, θ, u) g(u) du, has no closed-form solution in u.
- We instead use the MSL estimator based on the direct simulator (1), so that θ̂_MSL maximizes

  ln L̂_N(θ) = (1/N) Σ_{i=1}^N ln [(1/S) Σ_{s=1}^S (1/√(2π)) exp{−(y_i − θ − u_is)²/2}]

- where u_is, s = 1, ..., S, are draws from the extreme value density g(u_i) above.

(47)

Unobserved heterogeneity example 5

- There is no closed-form solution for θ̂, but standard iterative methods can be used to compute θ̂_MSL.
- Consistency of the MSL estimator requires the number of draws S → ∞, in addition to the usual sample size N → ∞,
- so the method is computationally intensive.
- The MSL estimator is then asymptotically normally distributed.

(48)

Illustration

We carry on with a complete illustration of the unobserved heterogeneity example

(49)

Illustration : Generating Values

- Generate random values; take a sample of size n:
  n <- 200
- Generate n extreme value (0,1) draws; create functions that can be reused later on:
  dext <- function(x) exp(x)*exp(-exp(x))   # density of the extreme value distribution
  rext <- function(n) dext(runif(n))        # vector of extreme value draws
  z <- rext(n)                              # actually generates the values
- Generate n values of y ~ N(1 + z, 1):
  y <- rnorm(n, mean = 1 + z, sd = 1)       # 200 values of y_i with mean 1 plus the true unobserved heterogeneity z_i; this makes a normal with mean > 1 because z has nonzero mean

(50)

Illustration : Generating Values

I Simulate n x S values for purpose of estimation S=10000

nS=n*S zss=rext(nS)

zsmtx <- matrix(zss, nrow=n, ncol=S)

(51)

Illustration : Write the SML function

- The maximum simulated likelihood function:
  MSlogL <- function(par) {
    theta  <- par[1]
    sigma2 <- par[2]
    rval <- sum(log(rowSums(dnorm(y, mean = theta + zsmtx, sd = sqrt(sigma2), log = FALSE)) / S)) / n
    return(-rval)
  }
  # This avoids a loop: dnorm() is evaluated over the whole n x S matrix of means,
  # and rowSums() sums the S simulated densities of each observation (one row per observation).

(52)

Illustration : Optimization

- Starting values: simply the mean and variance of y:
  par0 <- as.vector(c(mean(y), var(y)))
- Optimization:
  opt <- optim(par0, MSlogL, hessian = TRUE)
  opt$par
  sqrt(diag(solve(opt$hessian)))
  -opt$value

(53)

Illustration : Important results

- The estimation procedure "solves" for the true parameter θ = 1, not the average of y (approximately 1.3).
- However, it estimates the total variance of the process, approximately 1.14, which is simply the variance of y.
- Compare the results across several values of S and n to show that the parameter estimate converges to 1, with its standard deviation going to zero, as S grows large.

(54)

Homework 5

Consider the standard binary logit regression model from any textbook.

1. Write down and program the log-likelihood function.
2. Introduce a random intercept assumption in which the intercept is drawn from a normal distribution with finite mean and variance. What justification can you offer for introducing an unobserved heterogeneity term in this way?
3. Rewrite the likelihood function conditional on unobserved heterogeneity. Next write down the likelihood function with unobserved heterogeneity integrated out.
4. Program a maximum simulated likelihood estimation procedure to estimate this model.
