Discrete choice pseudo panel data models


Thesis

Reference


BALDE, Thierno

Abstract

Panel data are today of paramount importance for analyzing the behavior of micro-units. In many countries, however, such data do not yet exist. Researchers can use repeated surveys instead. In such a case, since the same unit cannot be followed over time, the analysis moves to the cohort level while introducing individual heterogeneity into the model. Cohorts built according to homogeneity criteria are the "individuals" of the new panel. This is known as the pseudo-panel data approach. The thesis comprises three chapters. The first deals with the estimation of binary choice models with individual effects when only pseudo-panel data are available. The second proposes an approximation, by the beta distribution, of the exact theoretical distribution obtained in the first chapter. The third chapter analyzes the impact of women's autonomy on health care utilization in India using pseudo-panel data.

BALDE, Thierno. Discrete choice pseudo panel data models. Doctoral thesis: Univ. Genève, 2014, no. SES 873

URN : urn:nbn:ch:unige-482371

DOI : 10.13097/archive-ouverte/unige:48237

Available at:

http://archive-ouverte.unige.ch/unige:48237

Disclaimer: layout of this document may differ from the published version.


Discrete Choice Pseudo Panel Data Models

Thesis presented to the Faculty of Economic and Social Sciences of the University of Geneva

by Thierno BALDE

for the degree of

Doctor of Economic and Social Sciences, specialization in Econometrics

Members of the thesis jury

Prof. Jaya Krishnakumar, Thesis Director, University of Geneva

Prof. Stefan Sperlich, President of the Jury, University of Geneva

Dr. Jean-Paul Chaze, University of Geneva

Prof. Jean-Marie Dufour, McGill University, Canada

Geneva, 17 December 2014

Thesis No. 873


has authorized the printing of this thesis, without thereby expressing any opinion on the propositions stated in it, which are the sole responsibility of their author.

Geneva, 17 December 2014

The Dean

Bernard MORARD

Printed from the author's manuscript.


Acknowledgements

I finally reach the end of this endeavour, which has been such a big part of my life.

The time has come to look back over the road travelled and reflect on the obstacles I had to overcome, but even more on the individuals who helped me along the way.

First and foremost I wish to express my profound gratitude to my thesis director, professor Jaya Krishnakumar. She played a central role in mapping out the path I was to follow when I attended her econometrics classes as a student. She nurtured my passion for econometrics, and her expertise in that field was decisive in shaping the direction taken by this work. I am thankful for her human and scientific strengths, her priceless help, and the confidence, kindness, and support she showed me during all these years of working on my thesis. Her many readings, comments, suggestions, and corrections made writing this paper possible. Without her help, I would have been unable to produce this thesis. My intellectual debt to her is immense and can never be repaid: I will always be grateful.

I also wish to thank professor Stefan Sperlich, who presided over the jury and whose comments allowed me to improve the quality of this work. I was his assistant and thank him for his confidence in me. He can be assured of my deepest respect and profound gratitude.

I had near-daily interactions with Doctor Jean-Paul Chaze, who agreed to sit on the jury. I wish especially to extend my sincere gratitude for the time he spent reading my work: his comments and corrections were of great help when I was finalizing this paper. I am also very thankful for the pleasant moments we shared over a coffee... and will not soon forget them!

My deepest gratitude also goes out to professor Jean-Marie Dufour of McGill University (Canada), who agreed to sit on my thesis committee despite his busy agenda.

Just a few years ago it would have been unimaginable to me that he would be on my thesis committee. This was truly my good fortune and a great honour. I would also like to thank him for his constructive comments and insightful suggestions during the pre-defense. He, too, can rest assured of my deep gratitude!

These years of working on the thesis are intimately linked with the teaching I gave as Assistant at the University of Geneva. I would like to thank the various professors with whom I worked, including: professor Christian Gouriéroux, professor Gérard Antille, professor Daniel Royer, and professor Gilbert Ritschard.


server at Uni Dufour (University of Geneva), particularly Yan Sagon and Jean-Luc Falcone. Without this powerful computer I would have been unable to run all my Matlab programs.

I am further grateful to all my friends in Geneva, Guinea, and Canada, who were generous in encouraging me and always ready to lend an ear. I particularly wish to thank my best friend, David Kasdas, for his loyal companionship. Our discussions on all subjects, our jokes, our outings, and our trips were like deep breaths of fresh air for me. He was always there when I needed him, especially during the most trying periods in my personal life while I was writing this thesis . . . I am in his debt. Thank you from the bottom of my heart!

I was fortunate enough to be accompanied on this journey by my dear family and, especially, by my parents, my brothers and sisters. Despite the fact that I was absent for many years, their trust, their caring, their love, and their unfailing support always enabled me to overcome obstacles. Thanks to my father and mother for having made me the person I am today. I also think of my maternal grandmother, who never ceased encouraging me in my studies and with whom I would like to share the end of this journey.

Finally, I conclude with my most heart-felt thanks, which must naturally be for my dear companion, Anne, for her support, caring, strength, humour, joy, kindness, originality, love . . . and especially for the understanding she always showed me during the most difficult times. So great was her contribution that this is, truly, our thesis. Thank you for everything!


Abstract

Panel data are important for analyzing micro-level behavior. Why? They enable us to model and estimate individual heterogeneity. However, even today, these data do not exist for many countries. Instead, the researcher may find annual household surveys based on a large random sample of the population. In such a case, given the impossibility of following the same unit over time, heterogeneity is introduced by following cohorts rather than individuals. Deaton (1985) suggests building cohorts according to chosen criteria of homogeneity and treating these cohorts as ‘individuals’ in a new panel. This approach is called a pseudo-panel data approach. The literature in this area has mainly focused on linear models (static and dynamic) with individual fixed (random) effects. However, many economic investigations require nonlinear models. This research is therefore devoted to nonlinear pseudo panel models (Discrete Choice Pseudo Panel Data Models).

The motivation of this thesis is firstly to take account of individual heterogeneity in a model when we only have pseudo panels, and secondly, to deal with situations where individual data are not available for the response variable but only at the aggregate level.

In the first paper, we derive a probability distribution for the mean of binary variables from cohort data following a discrete approach. We propose an appropriate estimation method and study the properties of estimators by Monte Carlo experiments.

Given the complexity of the distribution of the aggregate variable, in the second paper, we make a comparative study of the discrete approach with a continuous approach based on a beta law. This is in order to investigate to what extent our exact discrete distribution can be approximated by the (fitted) beta distribution.

The third paper is an empirical application of the methodology developed by analyzing the impact of women’s autonomy on the use of health care by children in India.


Summary

Panel data are today of paramount importance for analyzing the behavior of micro-units. Why? Because they make it possible to account for individual heterogeneity. In many countries, however, such data still do not exist. Researchers can instead use annual surveys based on a large random sample of the population. In such a case, since the same unit cannot be followed over time, how can individual heterogeneity be introduced into the model? By moving to the cohort level. Deaton (1985) recommended building cohorts according to homogeneity criteria and treating these cohorts as the "individuals" of the new panel. This is known as the pseudo-panel data approach. The literature in this area has focused mainly on linear pseudo-panel models (static and dynamic) with fixed (random) individual effects, whereas many economic situations now call for nonlinear models. This thesis therefore deals with nonlinear pseudo-panel models (discrete choice models).

The motivation of this thesis is therefore:

- To account for individual heterogeneity in the model when only pseudo panels are available.

- To handle situations in which individual data are not available for the response variable but only at the aggregate level.

In the first paper, we derive, by a discrete approach, the distribution of the mean of the choices in a given cohort conditional on the individual explanatory variables of that cohort. We propose an appropriate estimation method and study the properties of the estimators through Monte Carlo experiments.

Given the complexity of the distribution of the aggregate variable, in the second paper we carry out a comparative study of this discrete approach and a continuous approach based on a beta distribution. The aim is to investigate how well the exact (generic) theoretical distribution is approximated by the beta distribution.

Finally, the third paper is an empirical application of the methodology developed, analyzing the impact of women's autonomy on the use of health care by children in India.


Chapter 1

Estimation of a model with grouped binary dependent variables from repeated cross-sections data

Thierno BALDE ∗

University of Geneva, Switzerland

Abstract

In this paper we discuss the estimation of a binary choice model with individual effects using a time series of independent cross-sections. We propose a new approach to parametrizing the individual effects that accounts for ‘cohort effects’ as well as purely individual effects. Drawing on Mundlak’s (1978) approach, and postulating certain conditions, we express the ‘cohort effect’ as a linear function of the means of the explanatory variables. In a first setting, we assume that individuals’ choices are not observed and derive the probability that a certain number of individuals choose ‘1’ among the total number of individuals in a given cohort. Then we go on to the special case in which the individual choices are assumed to be observed.

Based on the probabilities of cohort means, we estimate the model using the maximum likelihood method and implement it using a heuristic optimization technique (genetic algorithm). Finally, we carry out Monte Carlo simulations to analyze the finite-sample properties of our estimators, in terms of both bias and mean squared error (MSE).

∗I would like to express my sincere gratitude to Professor Jaya Krishnakumar for her constructive comments. I remain solely responsible for any errors or omissions.


1.1 Introduction

In this paper we analyze a binary choice model with individual effects in the context of clustered data drawn from repeated cross-sectional data. The issues we encounter in this study include the nonlinearity of the models, a nonlinearity that is exacerbated by group-level analysis of the dependent variable. This is attributable to the absence of true panel data, which are typically preferred in econometric studies.

Binary choice models are widely used today in economics, the social sciences, political science, and medical research. For example, assume we wish to study the participation of married women in the labor market. In this case, the dependent variable takes one of two values: one (1) if the married woman is in the labor market and zero (0) if she is not. Researchers in labor economics suggest that the decision to participate is partly a function of observable characteristics, either of the individual, such as education level and family income, or of the economy, such as the unemployment rate, and partly a function of factors that are not observable by the researcher. If these unobservable effects are correlated with the explanatory variables, the model cannot be identified without resorting to external instruments in a simple cross-sectional setting. However, if they are time invariant, the model can be identified using panel data. In another example, consider presidential elections in the United States, where there are two dominant political parties: Republicans and Democrats. The dependent variable is the choice of which of these two parties to vote for. Let us say that it takes the value one (1) if the chosen candidate is a Democrat, and zero (0) for a Republican. This issue has been the focus of much work by the economist Ray Fair of Yale University, with the publication “Econometrics and Presidential Elections,” and by other political scientists. Variables that are often used in voter choice models include the individual characteristics of voters, the inflation and unemployment rates, etc.

Despite widespread interest in these models, the binary dependent variable is often not available at the individual level. This may be because information on individual choices is very costly to obtain or because the data provided to researchers are restricted owing to privacy concerns. Only aggregate data on choices at the group level are typically published. The most common types of aggregate data are sums and proportions. In political science the behavior of the elector in an election is frequently cited as an example. Because of voters’ privacy rights, their individual choices are masked and we are only given information on the number of votes obtained by each candidate. In medical research, individual-level hospitalization data are usually protected, and only aggregate data (proportions) are accessible. If the explanatory variables for all individuals in a given group are the same, then aggregated binary choice models are easy to estimate (Greene, 2004; Maddala, 1983).

Conversely, if the explanatory variables take different values from one individual in a group to the next, Miller and Plantinga (1999) recommend using group means of all variables to estimate the model. In this case, we lose some information by using the mean of the variables. Also, the estimated parameters are interpreted at the group level rather than at the individual level. Consequently, this approach does not allow us to make inferences at the individual level. A situation in which


at the individual level was illustrated by the Pennsylvania gubernatorial election in 2006. In this election an incumbent Democratic governor faced off against a black Republican candidate. The data were collected by the Inter-University Consortium for Political and Social Research (ICPSR) using questionnaires that were completed by voters on the day of the election. For each voter who participated in the survey, several characteristics such as race, sex, and age were observed. Moreover, Pennsylvania is divided into five (5) geographical districts. For each one we have information on the number of survey participants, the number of electors who voted for the Democratic candidate, and individual characteristics of each participant, but not how he or she voted.

Panel data currently offer a wide variety of benefits for analyzing behavior at the micro level, but they are not available for many countries. Instead, there are annual household surveys that are based on a large sample of the population, such as the British “Family Expenditure Survey” or the “Labor Supply Survey.” In the case of these repeated cross-sectional surveys we are unable to follow a specific household over time, as would be required for a true panel. Thus, the estimation methods currently used for analyzing panel data are inapplicable. To address this problem, Deaton (1985) suggests using cohorts to estimate linear models. His approach is based on aggregating individuals or households into cohorts and treating the population means of these cohorts as “individuals.” In this fashion, this new panel (called a “pseudo panel”) allows us to track a representative sample from the same cohort of individuals or households over time. The “pseudo panel” approach has been used not only in applied microeconomics, such as for studying income and savings (see, for example, Beach and Finnie, 2004; Bourguignon et al., 2004; Baldini and Mazzaferro, 1999), but also in many areas of research in the social sciences, including healthcare, education, and employment (e.g. Garner et al., 2002; Glied, 2002; Lauer, 2003; Anderson and Hussey, 2000; Weir, 2003). To construct pseudo-panel datasets, cohorts need to be defined on the basis of a certain number of shared characteristics. The control variable has to be constant for each individual at all points in time or the individual will cease to belong to the group. Also, the control variable must be observable for all individuals in the sample: year of birth, sex, level of education, and geographical region are all good criteria for the formation of cells (Dargay and Vythoulkas, 1999). In short, the construction of these cells has significant ramifications for the magnitude of the bias and the variance of the estimators in a sample of finite size (Verbeek and Nijman, 1992). The problem with the cohort approach is that we have to replace the cohort population means with empirical observations from the samples, creating an issue with measurement error in the variables. In the case of linear models, the classic “within” estimator is biased. Since the variance of measurement errors can be estimated from individual data, Deaton (1985) and Collado (1997) suggest corrections that account for measurement error. Fuller (1987) proposes a more general estimator that includes Deaton’s as a special case. Verbeek and Nijman (1992) demonstrate that we can ignore measurement error if we have a large number of observations per cohort.
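The construction of cohort cells described above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the text: the record layout, the variable names (`period`, `birth_year`, `income`, `works`), the ten-year birth bands, and the helper `build_pseudo_panel` are all hypothetical.

```python
from collections import defaultdict

def build_pseudo_panel(records, cohort_key, value_keys):
    """Average each variable within (cohort, period) cells.

    records    : iterable of dicts, one per surveyed individual
    cohort_key : function mapping a record to its cohort label
    value_keys : names of the variables to average
    Returns {(cohort, period): {"n": cell size, var: cell mean, ...}}.
    """
    cells = defaultdict(list)
    for rec in records:
        cells[(cohort_key(rec), rec["period"])].append(rec)
    panel = {}
    for cell, members in cells.items():
        n = len(members)
        row = {"n": n}
        for key in value_keys:
            row[key] = sum(m[key] for m in members) / n
        panel[cell] = row
    return panel

# Hypothetical survey: different individuals are interviewed in each period.
survey = [
    {"period": 1, "birth_year": 1962, "income": 30.0, "works": 1},
    {"period": 1, "birth_year": 1967, "income": 20.0, "works": 0},
    {"period": 1, "birth_year": 1975, "income": 25.0, "works": 1},
    {"period": 2, "birth_year": 1961, "income": 32.0, "works": 1},
    {"period": 2, "birth_year": 1968, "income": 24.0, "works": 1},
    {"period": 2, "birth_year": 1973, "income": 27.0, "works": 0},
]

def birth_decade(rec):
    """Cohort label: ten-year birth band, a time-invariant characteristic."""
    return rec["birth_year"] // 10 * 10

panel = build_pseudo_panel(survey, birth_decade, ["income", "works"])
for cell in sorted(panel):
    print(cell, panel[cell])
```

Each key of `panel` plays the role of one "individual-period" observation in the pseudo panel, and the stored cell size `n` is the quantity that governs how much sampling error the cell means carry.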

In the case of linear models with individual effects in a true panel, the classical estimation method consists of transforming the model by using the deviation from the mean to eliminate individual effects (see, for example, Hsiao [1986] or Arellano and Bover [1990]). This method is not relevant to our study, since our model is “strongly” nonlinear. For nonlinear models, Mundlak (1978) and Chamberlain (1984) suggest parameterizing individual effects as a linear function of the explanatory variables. Once again, this will not work for us because we do not have observations on the same individuals over time. Collado (1998) demonstrates how to overcome this difficulty in parameterizing these effects by using cohort means. In her approach, we have a model with measurement error on the variables causing the error term to be correlated with the explanatory variables. The covariance between the errors and the explanatory variables is a function of the variance of the measurement error, which can be estimated from individual data. To be able to estimate her model, Collado imposes a further restriction that the joint distribution of the error terms and the cohort means be normal. In this fashion she is able to estimate her binary choice model using pseudo-maximum likelihood and the minimum distance method. In our analysis we circumvent this problem of parameterization by expressing the individual-specific effect as a function of the cohort-specific effect while still following the general thrust of Mundlak’s approach.

Subsequently, unlike Collado we estimate a model whose dependent binary variable is an aggregate, but whose explanatory variables are at the individual level, thus capitalizing on all the information available for individuals.

Our paper is organized as follows. In Section 1.2, we specify the model of our study starting from a binary model with individual effects. In Section 1.3 we construct the likelihood function for the variable of interest (the aggregate dependent variable) and perform the estimation using maximum likelihood. Section 1.4 presents the optimization, which uses the genetic algorithm. In Section 1.5 we report the results of our Monte Carlo simulations. Conclusions are presented in Section 1.6.

1.2 Model specification

When studying discrete choice with panel data the following linear model is often postulated for the underlying latent variable:

$$y^*_{it} = x'_{it}\beta + \alpha_i + u_{it}, \qquad i = 1, \dots, N, \quad t = 1, \dots, T \tag{1.1}$$

where $y^*_{it}$ is an unobserved variable, $x_{it}$ is a $(k \times 1)$ vector of observed explanatory variables, $\alpha_i$ is the unobserved individual effect, and $u_{it}$ the error term. In specification (1.1) we start from the assumption that we retain the same individuals over time, i.e. individual $i$ in period 1 is the same person as individual $i$ in period 2, and so forth. If we assume that $x_{it}$, $\alpha_i$, $u_{it}$ are pairwise independent we have a random effects model. Conversely, if $\alpha_i$ and $x_{it}$ are correlated, we can adopt the parametric approach developed by Chamberlain or Mundlak. This reparametrization allows us to establish a relationship between the fixed-effects and the random-effects models, especially in a linear context.

According to the Mundlak approach (1978), the individual effects and the explanatory variables are correlated as follows:

$$\alpha_i = \bar{x}'_i\gamma + w_i \tag{1.2}$$

where $E(w_i \mid x_{it}, t = 1, \dots, T) = 0$ and $\bar{x}_i = \frac{1}{T}(x_{i1} + x_{i2} + \dots + x_{iT})$.

The approach taken by Chamberlain (1980, 1985) is more general. $\alpha_i$ and $x_{it}$ stand in the following linear relationship:

$$\alpha_i = \sum_{t=1}^{T} x'_{it}\gamma_t + w_i \tag{1.3}$$

with $E(w_i \mid x_{it}, t = 1, \dots, T) = 0$.

This last parameterization has its drawbacks, including the fact that it is cumbersome to estimate the $\gamma_t$ in the presence of a large number of observations.

To avoid confusion with true panel data, we adopt the notation of R. Moffitt (1993)¹. Many other authors in the pseudo-panel literature have adopted this notation, for example Ainhoa Oguiza Tovar (2012)². We formulate our model as follows:

$$y^*_{i(t)t} = x'_{i(t)t}\beta + \alpha_{i(t)} + u_{i(t)t}, \qquad i(t) = 1(t), \dots, N(t), \quad t = 1, \dots, T, \tag{1.4}$$

where we assume that $E(u_{i(t)t} \mid x_{i(t)t}) = 0$ for each $i(t)$ and $t$, i.e. the variables $x_{i(t)t}$ are exogenous. Thus, to eliminate any chance of confusion, we index the $i$-th individual at time $t$ with $i(t)$. This individual will not be the same from one period to the next. For example, the second individual in period 1, $2(1)$, will not be the same person as the second individual in period 2, $2(2)$. The number of observations can differ from one period to the next, as is indicated by $N(t)$.

Rather than labelling the model as random effects (RE) or fixed effects (FE), we distinguish whether $\alpha_{i(t)}$ is potentially correlated with the explanatory variables $x_{i(t)t}$ (FE) or uncorrelated with them (RE). We prefer the first situation because it is quite possible that the individual effect $\alpha_{i(t)}$ is correlated with $x_{i(t)t}$. In this case, the Mundlak method presented in (1.2) and the Chamberlain method in (1.3) are inapplicable: we cannot estimate the coefficients of these equations because we do not have observations on the same individuals from one period to the next.

The variable $y^*_{i(t)t}$ being latent, we observe:

$$y_{i(t)t} = \begin{cases} 1 & \text{if } y^*_{i(t)t} > 0 \\ 0 & \text{otherwise.} \end{cases} \tag{1.5}$$

Thus,

$$p_{i(t)t} \equiv P(y_{i(t)t} = 1) = \Pr(x'_{i(t)t}\beta + \alpha_{i(t)} + u_{i(t)t} > 0). \tag{1.6}$$

Working from the model in (1.1) we can use standard estimation techniques, such as maximum likelihood (after making some preliminary assumptions on $u_{it}$ and $\alpha_i$) to estimate the parameters of the model.

¹ R. Moffitt (1993), Identification and estimation of dynamic models with a time series of repeated cross-sections.

² Ainhoa Oguiza Tovar, Inmaculada Gallastegui Zulaica, Vicente Nunez-Anton (2012), Analysis of pseudo-panel data with dependent samples.


If we do not have a “true” panel, as in (1.4), the estimates obtained will not be consistent. Collado (1998), drawing on the approach in Chamberlain (1984) for individual effects, shows how to obtain consistent results using cohort averages.

In our model, given the absence of a true panel and possibly the unavailability of the dependent variable at the individual level, we work on the level of aggregates (proportions, sums) while retaining explanatory variables at the individual level.

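Assume that we have a repeated cross-sectional dataset on binary choices for $N$ individuals with $N = N(1) + N(2) + \dots + N(T)$. Let the individuals be grouped into homogeneous cohorts of variable sizes $n_{ct}$, so that $N(t) = \sum_{c=1}^{C} n_{ct}$. In each cohort $c$ ($c = 1, \dots, C$), individual $i(t)$ gives the response $y_{i(t)t} = 1$ with probability $p_{i(t)t}$, or the response $y_{i(t)t} = 0$ with probability $1 - p_{i(t)t}$. $y_{i(t)t}$ is thus a Bernoulli variable conditional on $p_{i(t)t}$, i.e. $f(y_{i(t)t} \mid p_{i(t)t}) = (p_{i(t)t})^{y_{i(t)t}} (1 - p_{i(t)t})^{1 - y_{i(t)t}}$ where $0 < p_{i(t)t} < 1$. For the reasons given above, we are interested in the cohort mean $\bar{y}_{ct} = \frac{1}{n_{ct}} \sum_{i=1}^{n_{ct}} y_{i(t)t}$, where $0 \le \bar{y}_{ct} \le 1$, $c = 1, \dots, C$. We can establish the table of the distribution of $\bar{y}_{ct}$ as follows:

| $\bar{y}_{ct}$ | $\frac{0}{n_{ct}} = 0$ | $\frac{1}{n_{ct}}$ | $\frac{2}{n_{ct}}$ | $\dots$ | $\frac{k}{n_{ct}}$ | $\dots$ | $\frac{n_{ct}}{n_{ct}} = 1$ |
|---|---|---|---|---|---|---|---|
| $\sum_{i=1}^{n_{ct}} y_{i(t)t}$ | $0$ | $1$ | $2$ | $\dots$ | $k$ | $\dots$ | $n_{ct}$ |
| $P\left(\bar{y}_{ct} = \frac{k}{n_{ct}}\right) = P\left(\sum_{i=1}^{n_{ct}} y_{i(t)t} = k\right)$ | $P_0$ | $P_1$ | $P_2$ | $\dots$ | $P_k$ | $\dots$ | $P_{n_{ct}}$ |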

Since the observations are independent (by assumption), we have:

$$P_0 = P(\bar{y}_{ct} = 0) = P\left(\sum_{i=1}^{n_{ct}} y_{i(t)t} = 0\right) = P\left(y_{i(t)t} = 0,\ \forall i \in c\right) = p^0_{1(t)t} \cdots p^0_{n_c(t)t} = \prod_{i=1}^{n_{ct}} p^0_{i(t)t} = \prod_{i=1}^{n_{ct}} \left(1 - p^1_{i(t)t}\right)$$

where $p^1_{i(t)t}$ is the probability that individual $i$ in cohort $c$ chooses one (1) at time $t$, and $p^0_{i(t)t}$ the probability of the complementary event.

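$$
\begin{aligned}
P_1 &= P\left(\bar{y}_{ct} = \frac{1}{n_{ct}}\right) = P\left(\sum_{i=1}^{n_{ct}} y_{i(t)t} = 1\right) \\
&= P\left(y_{i(t)t} = 1,\ y_{j(t)t} = 0,\ \forall i \neq j,\ i, j \in c\right) \\
&= p^1_{1(t)t}\, p^0_{2(t)t} \cdots p^0_{n_c(t)t} + p^0_{1(t)t}\, p^1_{2(t)t}\, p^0_{3(t)t} \cdots p^0_{n_c(t)t} + \dots + p^0_{1(t)t} \cdots p^0_{(n_c-1)(t)t}\, p^1_{n_c(t)t} \\
&= \sum_{i=1}^{n_{ct}} p^1_{i(t)t} \prod_{j=1,\ j \neq i}^{n_{ct}} p^0_{j(t)t} \\
&= \sum_{i=1}^{n_{ct}} p^1_{i(t)t} \prod_{j=1,\ j \neq i}^{n_{ct}} \left(1 - p^1_{j(t)t}\right).
\end{aligned}
$$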

Example: Assume that there are three individuals per cohort (n = 3). What is

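time $t$.

$$
\begin{aligned}
P\left(\bar{y}_{ct} = \tfrac{1}{3}\right) &= \Pr(\text{1 of the 3 individuals is hospitalized}) \\
&= \Pr(\text{1 hospitalized, 2 and 3 not hospitalized}) + \Pr(\text{2 hospitalized, 1 and 3 not hospitalized}) \\
&\quad + \Pr(\text{3 hospitalized, 1 and 2 not hospitalized}) \\
&= [\text{if } u_{i(t)t} \sim N(0, 1)] \\
&= \Phi(x'_{1(t)t}\beta + \alpha_1)\,[1 - \Phi(x'_{2(t)t}\beta + \alpha_2)]\,[1 - \Phi(x'_{3(t)t}\beta + \alpha_3)] \\
&\quad + \Phi(x'_{2(t)t}\beta + \alpha_2)\,[1 - \Phi(x'_{1(t)t}\beta + \alpha_1)]\,[1 - \Phi(x'_{3(t)t}\beta + \alpha_3)] \\
&\quad + \Phi(x'_{3(t)t}\beta + \alpha_3)\,[1 - \Phi(x'_{1(t)t}\beta + \alpha_1)]\,[1 - \Phi(x'_{2(t)t}\beta + \alpha_2)]
\end{aligned}
$$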
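The three-individual example can be evaluated numerically as follows. This is a minimal sketch assuming standard normal errors (Probit); the function names and all numerical values are hypothetical and serve only to illustrate the formula.

```python
import math

def norm_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_one_of_three(x, beta, alpha):
    """P(ybar = 1/3) for a cohort of three individuals under a Probit model.

    x     : three covariate vectors (lists of floats)
    beta  : coefficient vector
    alpha : three individual effects
    """
    # Individual probabilities p_i = Phi(x_i' beta + alpha_i).
    p = [norm_cdf(sum(xk * bk for xk, bk in zip(xi, beta)) + ai)
         for xi, ai in zip(x, alpha)]
    # Sum over which individual is the single "1".
    return (p[0] * (1 - p[1]) * (1 - p[2])
            + p[1] * (1 - p[0]) * (1 - p[2])
            + p[2] * (1 - p[0]) * (1 - p[1]))

# Hypothetical values chosen only for illustration.
x = [[1.0, 0.5], [1.0, -0.2], [1.0, 1.3]]
beta = [0.2, 0.4]
alpha = [0.1, -0.3, 0.0]
print(prob_one_of_three(x, beta, alpha))
```

When all three indices $x'\beta + \alpha$ equal zero, each individual probability is $0.5$ and the function returns $3 \times 0.125 = 0.375$, the binomial value.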

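$$
\begin{aligned}
P_2 &= P\left(\bar{y}_{ct} = \frac{2}{n_{ct}}\right) = P\left(\sum_{i=1}^{n_{ct}} y_{i(t)t} = 2\right) \\
&= P\left(y_{i_1(t)t} = 1,\ y_{i_2(t)t} = 1,\ y_{j(t)t} = 0,\ \forall i_1 \neq i_2 \neq j,\ i_1, i_2, j \in c\right) \\
&= p^1_{1(t)t}\, p^1_{2(t)t}\, p^0_{3(t)t} \cdots p^0_{n_c(t)t} + p^1_{1(t)t}\, p^0_{2(t)t}\, p^1_{3(t)t}\, p^0_{4(t)t} \cdots p^0_{n_c(t)t} + \dots + p^1_{1(t)t}\, p^0_{2(t)t} \cdots p^0_{(n_c-1)(t)t}\, p^1_{n_c(t)t} \\
&\quad + p^0_{1(t)t}\, p^1_{2(t)t}\, p^1_{3(t)t}\, p^0_{4(t)t} \cdots p^0_{n_c(t)t} + \dots + p^0_{1(t)t}\, p^1_{2(t)t} \cdots p^0_{(n_c-1)(t)t}\, p^1_{n_c(t)t} \\
&\quad + \dots + p^0_{1(t)t}\, p^0_{2(t)t} \cdots p^1_{(n_c-1)(t)t}\, p^1_{n_c(t)t} \\
&= \sum_{i_1=1}^{n_{ct}} \sum_{i_2 > i_1} p^1_{i_1(t)t}\, p^1_{i_2(t)t} \prod_{j=1,\ j \neq i_1, i_2}^{n_{ct}} p^0_{j(t)t} \\
&= \sum_{i_1=1}^{n_{ct}} \sum_{i_2 > i_1} p^1_{i_1(t)t}\, p^1_{i_2(t)t} \prod_{j=1,\ j \neq i_1, i_2}^{n_{ct}} \left(1 - p^1_{j(t)t}\right).
\end{aligned}
$$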

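Thus, by deduction, we can write the generic probability of the cohort’s average choice as:

$$
\begin{aligned}
P_k &= P\left(\bar{y}_{ct} = \frac{k}{n_{ct}}\right) = P\left(\sum_{i=1}^{n_{ct}} y_{i(t)t} = k\right) \\
&= \sum_{i_1=1}^{n_{ct}} \sum_{i_2 > i_1} \cdots \sum_{i_k > i_{k-1}} p^1_{i_1(t)t} \cdots p^1_{i_k(t)t} \prod_{j=1,\ j \neq i_1, \dots, i_k}^{n_{ct}} p^0_{j(t)t} \\
&= \sum_{i_1=1}^{n_{ct}} \sum_{i_2 > i_1} \cdots \sum_{i_k > i_{k-1}} p^1_{i_1(t)t} \cdots p^1_{i_k(t)t} \prod_{j=1,\ j \neq i_1, \dots, i_k}^{n_{ct}} \left(1 - p^1_{j(t)t}\right).
\end{aligned}
$$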
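The generic probability $P_k$ is the probability mass function of a sum of independent, non-identically distributed Bernoulli variables, often called the Poisson binomial distribution. The sketch below computes it in two ways: by direct enumeration of the subsets, exactly as in the formula, and by a recursion that adds one individual at a time. The recursion is a standard alternative and is not claimed to be the method used in this thesis; the function names and the probabilities are hypothetical.

```python
from itertools import combinations
from math import prod

def pk_enumeration(p, k):
    """P(sum of y = k) by summing over all subsets of size k, as in the formula."""
    n = len(p)
    total = 0.0
    for ones in combinations(range(n), k):
        chosen = set(ones)
        total += prod(p[i] if i in chosen else 1.0 - p[i] for i in range(n))
    return total

def pk_recursive(p):
    """All P_0, ..., P_n at once by adding one individual at a time."""
    dist = [1.0]  # distribution of the sum over zero individuals
    for pi in p:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1.0 - pi)   # this individual chooses 0
            new[k + 1] += mass * pi       # this individual chooses 1
        dist = new
    return dist

p = [0.2, 0.5, 0.7, 0.9]  # hypothetical individual probabilities p^1_i
dist = pk_recursive(p)
for k, mass in enumerate(dist):
    print(k, round(mass, 6), round(pk_enumeration(p, k), 6))
```

The enumeration visits $\binom{n_{ct}}{k}$ subsets for each $k$, which grows quickly with the cohort size, whereas the recursion needs on the order of $n_{ct}^2$ operations for the whole distribution.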

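When the data on the choices $y_{i(t)t}$ are available, the generic probability becomes:

$$
\begin{aligned}
P_k &= P\left(\sum_{i=1}^{n_{ct}} y_{i(t)t} = k\right) = p^1_{i_1(t)t} \cdots p^1_{i_k(t)t} \prod_{j=1,\ j \neq i_1, \dots, i_k}^{n_{ct}} p^0_{j(t)t} \\
&= p^1_{i_1(t)t} \cdots p^1_{i_k(t)t} \prod_{j=1,\ j \neq i_1, \dots, i_k}^{n_{ct}} \left(1 - p^1_{j(t)t}\right).
\end{aligned}
$$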

Here we have the likelihood of the mean of the observations on cohort $c$ at time $t$ as a function of the parameters $\beta$, the individual explanatory variables $x_{i(t)t}$, and the individual effects $\alpha_{i(t)}$, expressed in terms of the individual probabilities $p_{i(t)t}$. In this formulation, if we let the probability of choosing one (1) be the same for each individual in a given cohort (even though this is not always the case), we end up with a very common probability distribution: the binomial.
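The reduction to the binomial can be checked numerically. In this sketch the cohort size and the common probability are hypothetical, and `sum_distribution` is an illustrative helper that builds the distribution of the cohort sum one individual at a time.

```python
from math import comb

def sum_distribution(p):
    """Distribution of the sum of independent Bernoulli(p_i) variables."""
    dist = [1.0]
    for pi in p:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1.0 - pi)
            new[k + 1] += mass * pi
        dist = new
    return dist

def binomial_pmf(n, k, p):
    """Binomial probability of k successes out of n."""
    return comb(n, k) * p**k * (1.0 - p)**(n - k)

n, p_common = 6, 0.3  # hypothetical cohort size and common probability
general = sum_distribution([p_common] * n)
binom = [binomial_pmf(n, k, p_common) for k in range(n + 1)]
max_gap = max(abs(g - b) for g, b in zip(general, binom))
print(max_gap)  # numerically zero up to rounding error
```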

This specification creates a major problem in the parametrization of individual-specific effects. In the absence of a true panel, we cannot directly apply the approaches developed by Mundlak (1978) or Chamberlain (1984) for dealing with individual effects. Consequently, we decompose the individual effects $\alpha_{i(t)}$ into two parts: $\alpha^*_{ct}$ is a “cohort-specific effect” representing the mean of the individual effects of the population in cohort $c$ at time $t$, and $\xi_{i(t)}$ is the deviation of $\alpha_{i(t)}$ from that mean:

$$\alpha_{i(t)} = \alpha^*_{ct} + \underbrace{(\alpha_{i(t)} - \alpha^*_{ct})}_{\xi_{i(t)}} = \alpha^*_{ct} + \xi_{i(t)}. \tag{1.7}$$

The mean of the individual effects of the sample in cohort $c$ at time $t$, $\bar{\alpha}_{ct}$, can be decomposed as follows: the population mean in cohort $c$ at time $t$, $\alpha^*_{ct}$, and a deviation from this mean attributable to sampling error in the data, denoted $v_{c(t)}$:

$$\bar{\alpha}_{ct} = \alpha^*_{ct} + v_{c(t)}. \tag{1.8}$$

At this stage we can introduce a first-order autoregressive scheme, AR(1), in the sampling error, i.e.

$$v_{c(t)} = \rho\, v_{c(t-1)} + \eta_{c(t)},$$

where $\rho$ is the same for all cohorts.
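The AR(1) scheme for the sampling error can be sketched as a simple recursion. The starting value $v_{c(0)} = 0$, the Gaussian innovations, and the numerical values below are assumptions made for illustration; the text does not specify them.

```python
import random

def ar1_path(rho, innovations, v0=0.0):
    """Return v_1, ..., v_T with v_t = rho * v_{t-1} + eta_t, starting from v0."""
    path, prev = [], v0
    for eta in innovations:
        prev = rho * prev + eta
        path.append(prev)
    return path

rng = random.Random(0)
rho = 0.6  # hypothetical value, common to all cohorts
# One path of sampling errors per cohort, each with its own innovations.
paths = {c: ar1_path(rho, [rng.gauss(0.0, 0.1) for _ in range(5)])
         for c in range(3)}
for c, path in paths.items():
    print(c, [round(v, 4) for v in path])
```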

Substituting (1.8) into (1.7) yields:

$$\alpha_{i(t)} = \bar{\alpha}_{ct} - v_{c(t)} + \xi_{i(t)}. \tag{1.9}$$

If we assume that populations change little from one period to the next (a very important hypothesis for the construction of our model), then the population mean $\alpha^*_{ct}$ is invariant in time, i.e. $\alpha^*_{ct} = \alpha^*_c$. If we further assume that the cohort size is sufficiently large, i.e. $n_{ct} \simeq n_c \to \infty$, then the sampling error tends toward zero ($v_{c(t)} \to 0$). The two preceding assumptions together have the effect that $\alpha^*_c \simeq \bar{\alpha}_c$, and thus

$$\alpha_{i(t)} = \bar{\alpha}_c + \xi_{i(t)}. \tag{1.10}$$

Now consider the Mundlak (1978) approach:

$$\alpha_{i(t)} = \bar{x}'_i\gamma + w_{i(t)}, \tag{1.11}$$

with $E(w_{i(t)} \mid x_{i(t)t}, t = 1, \dots, T) = 0$, $w_{i(t)} \sim iid(0, \sigma^2_w)$ and $\bar{x}_i$ (average of the observations over time) unobserved.

A simple algebraic manipulation of (1.11) yields:

$$\alpha_{i(t)} = x'_{i(t)t}\gamma + (\bar{x}_i - x_{i(t)t})'\gamma + w_{i(t)}. \tag{1.12}$$

We can now take the mean of the sample observations for each cohort in each period:

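$$
\underbrace{\frac{1}{n_{ct}} \sum_{i=1}^{n_{ct}} \alpha_{i(t)}}_{\bar{\alpha}_{ct}}
= \underbrace{\frac{1}{n_{ct}} \sum_{i=1}^{n_{ct}} x'_{i(t)t}}_{\bar{x}'_{ct}}\, \gamma
+ \Bigg( \underbrace{\frac{1}{n_{ct}} \sum_{i=1}^{n_{ct}} \bar{x}_i}_{\bar{x}_c}
- \underbrace{\frac{1}{n_{ct}} \sum_{i=1}^{n_{ct}} x_{i(t)t}}_{\bar{x}_{ct}} \Bigg)' \gamma
+ \underbrace{\frac{1}{n_{ct}} \sum_{i=1}^{n_{ct}} w_{i(t)}}_{\bar{w}_{ct}}
$$

$$\bar{\alpha}_{ct} = \bar{x}'_{ct}\gamma + (\bar{x}_c - \bar{x}_{ct})'\gamma + \bar{w}_{ct}.$$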

Assume that we can estimate the conditional expectation (or the true mean) E(xi_(t)t|i ∈ c) = µct = µc ∀i, t from x¯ct. When nct ' nc → ∞ and for similar populations, we have:

¯ xct= 1

nc

X

i∈c

xi_(t)t→µct=µc

1 n_c

X

i∈c

¯ xi= 1

n_c 1 T

X

i∈c

X

t

xi_(t)t→µc.

Thus,

1 n_c

X

i∈c

¯ xi−x¯ct

!

→0.

We also have

¯ w_ct= 1

nc

X

i∈c

w_i_(t)_t→0,

¯

αct→α¯c. Consequently,

¯

αc 'x¯⁰_cγ.

Substituting all these results into (1.10),

\[ \alpha_{i(t)} = \bar{x}'_c\gamma + \xi_{i(t)}. \]


Thus, our latent variable postulated in (1.4) becomes:

\[ y^*_{i(t)t} = x'_{i(t)t}\beta + \bar{x}'_c\gamma + \xi_{i(t)} + u_{i(t)t}, \tag{1.13} \]

with $\xi_{i(t)}$ and $u_{i(t)t}$ having zero expectation and zero covariance (since the individuals change from one period to the next) and being independent of $x_{i(t)t}$ and $\bar{x}_c$. Note that there is no heterogeneity in the coefficients $\theta = (\beta, \gamma)$: they are fixed, not random.

This model is thus valid for large $n_{ct}$ ($n_{ct} \simeq n_c$).

For nonlinear models, and specifically in our case, the structural parameters only provide information on the relative magnitude of the change in $E(y_{i(t)t} \mid x_{i(t)t})$ resulting from a one-unit change in $x_{i(t)t}$, while the marginal effects provide the absolute magnitude of the change. This is why, when estimating binary choice models, we are often interested in:

i. The signs and statistical significance of the coefficients.

ii. The marginal effects. For example, in a Probit model, $E(y_{i(t)t} \mid x_{i(t)t}) = \Phi(x'_{i(t)t}\beta + \bar{x}'_c\gamma)$ and the marginal effects are computed as

\[ ME = \frac{\partial E(y_{i(t)t} \mid x_{i(t)t})}{\partial x_{i(t)t}} = \Big(\beta + \frac{1}{n_{ct}}\gamma\Big)\,\phi\big(x'_{i(t)t}\beta + \bar{x}'_c\gamma\big) \tag{1.14} \]

for a continuous variable $x_{i(t)t}$. We see that, unlike in the linear model, the marginal effect here is the product of two factors: the effect of the explanatory variable on the latent variable, $\beta + \gamma/n_{ct}$, and the derivative of the normal cumulative distribution function, $\phi$, evaluated at the systematic part of $y^*_{i(t)t}$. Furthermore, if we consider

\[ E(y_{i(t)t} \mid x_{i(t)t}, d_{i(t)t}) = \Phi\big(x'_{i(t)t}\beta + \bar{x}'_c\gamma + \delta d_{i(t)t}\big), \]

the marginal effects are

\[ ME = \Phi\big(x'_{i(t)t}\beta + \bar{x}'_c\gamma + \delta\big) - \Phi\big(x'_{i(t)t}\beta + \bar{x}'_c\gamma\big) \tag{1.15} \]

for a discrete variable $d_{i(t)t}$.
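The two marginal-effect formulas (1.14) and (1.15) can be sketched in a few lines of Python. This is a minimal illustration for a single scalar regressor, using only the standard library; the function names and the numerical values in the usage lines are assumptions made for the example, not quantities from the text.

```python
import math


def norm_pdf(z):
    """Standard normal density, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def me_continuous(index, beta_k, gamma_k, n_ct):
    """Equation (1.14) for one continuous regressor.

    index is the value of x'beta + xbar_c'gamma at which the effect is evaluated.
    """
    return (beta_k + gamma_k / n_ct) * norm_pdf(index)


def me_discrete(index_without_d, delta):
    """Equation (1.15): change in probability when d switches from 0 to 1."""
    return norm_cdf(index_without_d + delta) - norm_cdf(index_without_d)


# Hypothetical values, for illustration only.
print(me_continuous(index=0.0, beta_k=0.5, gamma_k=0.2, n_ct=100))
print(me_discrete(index_without_d=0.0, delta=1.0))
```

Both effects depend on the point at which the index is evaluated, so in practice they are reported at chosen values of the regressors or averaged over the sample.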

1.3 Maximum likelihood estimator

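How can we use the maximum likelihood method to estimate the parameters $\beta$ and $\gamma$ in this nonlinear model? Assume that we have a “pseudo-panel” of dimension $\sum_{c=1}^{C}\sum_{t=1}^{T} n_{ct}$, where $C$ is the number of cohorts, $n_{ct}$ the size of cohort $c$ at time $t$, and $T$ the number of periods. The maximum likelihood estimators of $\beta$ and $\gamma$ are the vectors $\hat{\beta}$ and $\hat{\gamma}$ that give the highest probability of obtaining $\{\bar{y}_{11}, \dots, \bar{y}_{1T}, \dots, \bar{y}_{C1}, \dots, \bar{y}_{CT}\}$ conditional on the individual explanatory variables. This joint probability is written:

\[ L(\beta, \gamma; \bar{y}_{11}, \dots, \bar{y}_{1T}, \dots, \bar{y}_{C1}, \dots, \bar{y}_{CT}) = P(\bar{y}_{11}, \dots, \bar{y}_{1T}, \dots, \bar{y}_{C1}, \dots, \bar{y}_{CT}; \beta, \gamma). \]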

By construction of the pseudo-panel, the observations are independent of each other, and so the likelihood is:

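\[
\begin{aligned}
L(\beta, \gamma; \bar{y}_{11}, \dots, \bar{y}_{1T}, \dots, \bar{y}_{C1}, \dots, \bar{y}_{CT})
&= \prod_{t=1}^{T}\prod_{c=1}^{C} P(\bar{y}_{ct}) \\
&= \prod_{t=1}^{T} P(\bar{y}_{1t})\,P(\bar{y}_{2t}) \cdots P(\bar{y}_{Ct}) \\
&= \prod_{t=1}^{T} \Bigg\{
\underbrace{\Bigg[\sum_{i_1=1}^{n_{1t}} \sum_{i_2>i_1} \cdots \sum_{i_k>i_{k-1}} p^{1}_{i_1(t)t} \cdots p^{1}_{i_k(t)t} \prod_{\substack{j=1 \\ i_1 \neq \dots \neq i_k \neq j}}^{n_{1t}} \big(1 - p^{1}_{j(t)t}\big)\Bigg]}_{\text{cohort } c = 1 \text{ at time } t} \\
&\qquad \cdots\;
\underbrace{\Bigg[\sum_{i_1=1}^{n_{Ct}} \sum_{i_2>i_1} \cdots \sum_{i_k>i_{k-1}} p^{1}_{i_1(t)t} \cdots p^{1}_{i_k(t)t} \prod_{\substack{j=1 \\ i_1 \neq \dots \neq i_k \neq j}}^{n_{Ct}} \big(1 - p^{1}_{j(t)t}\big)\Bigg]}_{\text{cohort } c = C \text{ at time } t}
\Bigg\}.
\end{aligned}
\]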
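Each bracket in the likelihood is the probability that exactly $k$ of the individuals in a cohort-period cell choose 1 when every individual has its own probability (often called the Poisson binomial distribution). The Python sketch below evaluates one bracket directly, by summing over all index sets of size $k$, and compares it with a recursion over individuals that gives the same number at much lower cost. The recursion is an alternative added here for illustration and is not taken from the text; the function names and the probabilities in `p_example` are hypothetical.

```python
from itertools import combinations


def prob_k_successes_enum(p, k):
    """Direct evaluation of one bracket: sum over all index sets of size k."""
    n = len(p)
    total = 0.0
    for chosen in combinations(range(n), k):
        chosen_set = set(chosen)
        term = 1.0
        for j in range(n):
            # p[j] for the k chosen individuals, 1 - p[j] for all the others
            term *= p[j] if j in chosen_set else 1.0 - p[j]
        total += term
    return total


def prob_k_successes_dp(p, k):
    """Same probability via a recursion over individuals (not from the thesis)."""
    dist = [1.0]  # dist[m] = P(m ones among the individuals processed so far)
    for pj in p:
        new = [0.0] * (len(dist) + 1)
        for m, q in enumerate(dist):
            new[m] += q * (1.0 - pj)   # individual j chooses 0
            new[m + 1] += q * pj       # individual j chooses 1
        dist = new
    return dist[k]


p_example = [0.2, 0.5, 0.7, 0.9]  # hypothetical p^1 values for a cohort of size 4
print(prob_k_successes_enum(p_example, 2), prob_k_successes_dp(p_example, 2))
```

The direct sum has a number of terms equal to the binomial coefficient of $n_{ct}$ and $k$, which grows quickly with the cohort size, whereas the recursion needs on the order of $n_{ct}^2$ operations.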

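To simplify the calculations, and because the logarithm is a monotonic function, it is advisable to work with the log-likelihood function. Thus:

\[
\begin{aligned}
\log L(\beta, \gamma; \bar{y}_{11}, \dots, \bar{y}_{1T}, \dots, \bar{y}_{C1}, \dots, \bar{y}_{CT})
&= \sum_{t=1}^{T} \big[\log P(\bar{y}_{1t}) + \log P(\bar{y}_{2t}) + \dots + \log P(\bar{y}_{Ct})\big] \\
&= \sum_{t=1}^{T} \Bigg\{
\log \underbrace{\Bigg[\sum_{i_1=1}^{n_{1t}} \sum_{i_2>i_1} \cdots \sum_{i_k>i_{k-1}} p^{1}_{i_1(t)t} \cdots p^{1}_{i_k(t)t} \prod_{\substack{j=1 \\ i_1 \neq \dots \neq i_k \neq j}}^{n_{1t}} \big(1 - p^{1}_{j(t)t}\big)\Bigg]}_{\text{cohort } c = 1 \text{ at time } t} \\
&\qquad + \dots +
\log \underbrace{\Bigg[\sum_{i_1=1}^{n_{Ct}} \sum_{i_2>i_1} \cdots \sum_{i_k>i_{k-1}} p^{1}_{i_1(t)t} \cdots p^{1}_{i_k(t)t} \prod_{\substack{j=1 \\ i_1 \neq \dots \neq i_k \neq j}}^{n_{Ct}} \big(1 - p^{1}_{j(t)t}\big)\Bigg]}_{\text{cohort } c = C \text{ at time } t}
\Bigg\}.
\end{aligned}
\]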
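To make the log-likelihood concrete, the sketch below evaluates it for a Probit specification with a single scalar regressor. It is a minimal illustration under several assumptions: each cohort-period cell is stored as a dictionary with hypothetical keys `x` (individual regressors), `x_bar_c` (cohort mean) and `k` (the number of ones in the cell, taken here to be $n_{ct}\bar{y}_{ct}$), and the probability of exactly `k` ones is computed with a recursion over individuals instead of the explicit multiple sum.

```python
import math


def norm_cdf(z):
    """Standard normal distribution function, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def count_pmf(p):
    """Distribution of the number of ones among independent Bernoulli(p_j) draws."""
    dist = [1.0]
    for pj in p:
        new = [0.0] * (len(dist) + 1)
        for m, q in enumerate(dist):
            new[m] += q * (1.0 - pj)
            new[m + 1] += q * pj
        dist = new
    return dist


def log_likelihood(cells, beta, gamma):
    """Sum over cohort-period cells of log P(k ones out of n_ct individuals)."""
    total = 0.0
    for cell in cells:
        # Probit probability p^1 for each individual in the cell.
        p = [norm_cdf(x * beta + cell["x_bar_c"] * gamma) for x in cell["x"]]
        total += math.log(count_pmf(p)[cell["k"]])
    return total


# Hypothetical data: two cohort-period cells with one scalar regressor.
cells = [
    {"x": [0.1, -0.4, 1.2], "x_bar_c": 0.3, "k": 2},
    {"x": [0.5, 0.0], "x_bar_c": -0.1, "k": 0},
]
print(log_likelihood(cells, beta=0.5, gamma=0.2))
```

A numerical optimizer could then be applied to the negative of this function to obtain estimates of $\beta$ and $\gamma$; that step is left out of the sketch.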


has authorized the printing of this thesis, without thereby intending to express any opinion on the propositions stated in it, which are the sole responsibility of their author.

Geneva, 17 December 2014

The Dean

Bernard MORARD

Printed from the author's manuscript.
