A fully parametric approach to minimum power-divergence estimation

LA VECCHIA, Davide, FERRARI, Davide

Abstract

We approach parameter estimation based on power divergence using the Havrda-Charvát generalized entropy. Unlike other robust estimators relying on divergence measures, the procedure is fully parametric and avoids complications related to bandwidth selection; hence, it allows for the treatment of multivariate distributions. The parameter estimator is indexed by a single constant q, which balances the trade-off between robustness and efficiency. As q approaches 1, the procedure reduces to maximum likelihood estimation; for q = 1/2, we minimize a fully parametric empirical version of the Hellinger distance. We study the mean squared error under contamination by means of a multi-parameter generalization of the change-of-variance function and devise an analytic min-max criterion for selecting q. Optimal values of q between 1/2 and 1 give remarkable robustness while incurring only negligible losses of efficiency compared to maximum likelihood. The method remains accurate in relatively large multivariate problems in the presence of a substantial fraction of bad data.

LA VECCHIA, Davide, FERRARI, Davide. A fully parametric approach to minimum power-divergence estimation. In: 2nd International Workshop of the ERCIM Working Group on Computing & Statistics, Limassol (Cyprus), 29-31 October 2009. 2 p.

Available at:

http://archive-ouverte.unige.ch/unige:75156

Disclaimer: layout of this document may differ from the published version.

A fully parametric approach to minimum power-divergence estimation

D. Ferrari¹ and D. La Vecchia²

¹ Dipartimento di Economia Politica, Università di Modena e Reggio Emilia, via Berengario 51, 41100 Modena, Italy
² Dipartimento di Economia (Metodi Quantitativi), Università di Lugano, 6904 Lugano, Switzerland

Keywords: Asymptotic efficiency; Change-of-variance function; M-estimation; Maximum likelihood; Minimum divergence estimation; Robustness.

Abstract

Let $\mathcal{F}_\Theta = \{F_t,\ t \in \Theta \subseteq \mathbb{R}^p\}$, $p \ge 1$, be a family of parametric distributions with densities $f_t$ with respect to Lebesgue measure, and let $\mathcal{G}$ be the class of all distributions $G$ having density $g$ with respect to Lebesgue measure. We assume $f_t$ and $g$ to have common support $\mathcal{X} \subseteq \mathbb{R}^k$, $k \ge 1$, and the family $\mathcal{F}_\Theta$ to be identifiable; $G$ represents the “true” distribution generating the data, which is regarded as close to, but not exactly equal to, some member of $\mathcal{F}_\Theta$. Although the focus of our presentation is on continuous distributions, our arguments apply to the discrete case as well.

One way to find parameter values is to minimize a data-based divergence measure between the candidate model $F_t$ and an empirical version of $G$. By far the most popular minimum-divergence method is maximum likelihood estimation, which minimizes the empirical version of the Kullback-Leibler divergence (Kullback (1951), Akaike (1973)). Although the maximum likelihood method is optimal when $G \in \mathcal{F}_\Theta$, even mild deviations from the assumed model can seriously affect its precision. On the other hand, traditional robust M-estimators, which can tolerate deviations from model assumptions, do not achieve first-order efficiency for most parametric families (Hampel et al. (1986)). Beran (1977) resolved the tension between robustness and efficiency by considering minimization of the Hellinger distance. Beran's minimum Hellinger distance estimator can tolerate a large fraction of bad data while maintaining full efficiency at the model. Lindsay (1994) and Basu and Lindsay (1994) extended Beran's approach by considering the power divergences, a larger class of divergences which includes the Hellinger distance as a special case. Related divergences are considered by Basu et al. (1997). The family of power divergences is defined by

$$
D_q(f_t \,\|\, dG) \;=\; -\frac{1}{q}\int_{\mathcal{X}} L_q\!\left(\frac{f_t(x)}{g(x)}\right) dG(x), \qquad (1)
$$

where $L_q(u) := (u^{1-q} - 1)/(1-q)$ and $q \in (-\infty,\infty)\setminus\{1\}$. This quantity was first considered by Cressie and Read (1984) in the context of goodness-of-fit testing. Expression (1) includes other notable divergences as special cases: the Kullback-Leibler divergence ($q \to 1$), twice the Hellinger distance ($q = 1/2$), Neyman's chi-square ($q = -1$) and Pearson's chi-square ($q = 2$). Given the data, current approaches to minimizing (1) are feasible only in low-dimensional problems, owing to the need to replace $g$ by some smooth kernel estimate. This has two serious drawbacks: (i) some degree of nonparametric analysis for the choice of the kernel bandwidth is unavoidable, with nontrivial complications when $\dim(\mathcal{X})$ is large; (ii) the accuracy of the parameter estimators relies on convergence of the kernel smoother, which suffers from the curse of dimensionality.
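
For illustration, a minimal numerical sketch (assuming Python with NumPy and SciPy; the two normal densities below are arbitrary choices, not taken from the talk) evaluates the transform $L_q$ and the divergence in (1) by numerical integration, recovering its special cases:

    import numpy as np
    from scipy import stats
    from scipy.integrate import quad

    def L_q(u, q):
        # Box-Cox-type transform L_q(u) = (u**(1-q) - 1)/(1 - q), with log(u) as q -> 1.
        if np.isclose(q, 1.0):
            return np.log(u)
        return (u ** (1.0 - q) - 1.0) / (1.0 - q)

    def power_divergence(f_t, g, q, lo=-15.0, hi=15.0):
        # D_q(f_t || dG) = -(1/q) * integral of L_q(f_t(x)/g(x)) g(x) dx, as in (1).
        # Finite limits keep both densities numerically positive in the tails.
        integrand = lambda x: L_q(f_t(x) / g(x), q) * g(x)
        value, _ = quad(integrand, lo, hi)
        return -value / q

    # Illustrative densities: f_t is the assumed model N(0, 1), g a nearby N(0.5, 1.2).
    f_t = stats.norm(0.0, 1.0).pdf
    g = stats.norm(0.5, 1.2).pdf

    for q in (0.999, 0.5, 2.0):
        print(f"q = {q}: D_q = {power_divergence(f_t, g, q):.4f}")
    # q near 1 approximates the Kullback-Leibler divergence of g from f_t;
    # q = 1/2 gives twice the squared Hellinger distance;
    # q = 2 corresponds, up to a constant factor, to Pearson's chi-square.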

In the present talk, we consider a procedure for parameter estimation based on minimization of (1) which is fully parametric and, consequently, avoids any issue related to kernel smoothing.

Therefore, the method can be applied also when $\dim(\mathcal{X})$ and $\dim(\Theta)$ are moderate or large. Our approach has an information-theoretic flavor, as it relies on the generalized entropy (or $q$-entropy) proposed by Havrda and Charvát (1967) and later employed by Tsallis (1988) in the context of statistical mechanics, a quantity which is closely related to (1). The resulting estimator of the parameters is indexed by a single constant $q$, which allows for tuning the trade-off between robustness and efficiency. If $q = 1$, our estimator is the maximum likelihood estimator; if $q = 1/2$, we minimize a version of the Hellinger distance which is fully parametric. Choices $1/2 < q < 1$ give remarkably robust estimators with negligible efficiency losses compared to maximum likelihood. The method can also be understood as the Fisher-consistent version of the estimating procedure proposed by Ferrari and Yang (2009). In the special case of location models, our estimator is related to the minimum density power divergence estimator proposed by Basu et al. (1998).
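
To make the role of $q$ tangible, the following minimal sketch (assuming Python with NumPy and SciPy, a normal location model with known unit scale, an invented contamination scenario, and the plain $L_q$-likelihood objective of Ferrari and Yang (2009) as a simple stand-in for the Fisher-consistent procedure described here) contrasts $q = 1$ with $q \in [1/2, 1)$ on contaminated data:

    import numpy as np
    from scipy import stats
    from scipy.optimize import minimize

    def lq(u, q):
        # L_q(u) = (u**(1-q) - 1)/(1 - q), with log(u) as the q -> 1 limit.
        return np.log(u) if np.isclose(q, 1.0) else (u ** (1.0 - q) - 1.0) / (1.0 - q)

    def mpd_location(x, q):
        # Location fit of a N(mu, 1) model by maximizing sum_i L_q(f_mu(x_i));
        # q = 1 recovers maximum likelihood (the sample mean).
        neg_obj = lambda mu: -np.sum(lq(stats.norm(mu[0], 1.0).pdf(x), q))
        res = minimize(neg_obj, x0=[np.median(x)], method="Nelder-Mead")
        return res.x[0]

    # Invented scenario: 5% of the observations are gross errors far from the model.
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0.0, 1.0, 950), rng.normal(8.0, 1.0, 50)])

    for q in (1.0, 0.8, 0.5):
        print(f"q = {q}: mu_hat = {mpd_location(x, q):+.3f}")
    # q = 1 (maximum likelihood) is dragged toward the contamination (roughly +0.4),
    # while q in [1/2, 1) downweights the outlying points and stays near 0.

For a symmetric location fit with known scale, the uncorrected objective is already Fisher-consistent, which is why the simplification is harmless in this sketch; for scale or multi-parameter models the Fisher-consistency correction mentioned above matters.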

Convergence results for the parameter estimator are provided, and infinitesimal robustness is addressed using two well-established measures: the influence function and the change-of-variance function. While the former has been widely employed in the literature to approximate the asymptotic bias under contamination, to our knowledge, expressions for the change-of-variance function have been obtained only for one-parameter location or scale problems by Hampel et al. (1986) and Genton and Rousseeuw (1995). We derive a general expression for multi-parameter M-estimators and use it to compute an approximate upper bound for the mean squared error under contamination. The worst-case mean squared error is employed for analytic selection of $q$ (min-max approach), yielding estimators that successfully balance robustness and efficiency, improving upon both traditional M-estimators and maximum likelihood regardless of whether the data are close to the assumed model. This is clearly seen in our numerical studies.
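
The analytic change-of-variance-based bound is not reproduced here, but the min-max idea can be mimicked by brute force. A rough Monte Carlo sketch (assuming Python with NumPy, the same normal location setting as above, and a grid of contamination fractions, sample size and outlier location chosen purely for illustration; this is a simulation stand-in, not the analytic criterion of the talk) selects $q$ by minimizing a simulated worst-case mean squared error:

    import numpy as np

    def mpd_location(x, q, iters=30):
        # Fixed-point form of the N(mu, 1) location fit maximizing sum_i L_q(f_mu(x_i)):
        # mu is a weighted mean with weights proportional to f_mu(x_i)**(1-q).
        mu = np.median(x)
        for _ in range(iters):
            w = np.exp(-(1.0 - q) * (x - mu) ** 2 / 2.0)
            mu = np.sum(w * x) / np.sum(w)
        return mu

    def worst_case_mse(q, eps_grid=(0.0, 0.05, 0.10), n=200, reps=300, shift=5.0, seed=1):
        # Simulated worst-case MSE over a grid of contamination fractions eps,
        # with outliers drawn from N(shift, 1); the true location is 0.
        rng = np.random.default_rng(seed)
        worst = 0.0
        for eps in eps_grid:
            est = []
            for _ in range(reps):
                m = rng.binomial(n, eps)
                x = np.concatenate([rng.normal(0.0, 1.0, n - m), rng.normal(shift, 1.0, m)])
                est.append(mpd_location(x, q))
            worst = max(worst, float(np.mean(np.square(est))))
        return worst

    q_grid = np.linspace(0.5, 1.0, 11)
    scores = [worst_case_mse(q) for q in q_grid]
    print("min-max choice of q:", q_grid[int(np.argmin(scores))])
    # q = 1 is penalized by its bias under contamination, while small q pays a
    # (modest, in this location example) efficiency price at the clean model.

In this toy setting the selected $q$ typically falls below 1, since maximum likelihood is dominated by its bias under contamination; the analytic criterion plays the same balancing role without simulation.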

References

H. Akaike (1973). Information theory and an extension of the maximum likelihood principle. In: 2nd International Symposium on Information Theory, eds. B. N. Petrov and F. Csáki.

A. Basu, I. R. Harris, N. L. Hjort and M. C. Jones (1998). Robust and efficient estimation by minimizing a density power divergence. Biometrika, 85:549–559.

A. Basu and B. G. Lindsay (1994). Minimum disparity estimation for continuous models: efficiency, distributions and robustness. Annals of the Institute of Statistical Mathematics, 46:683–705.

R. Beran (1977). Minimum Hellinger distance estimates for parametric models. Annals of Statistics, 5:445–463.

E. Choi, P. Hall and B. Presnell (2000). Rendering parametric procedures more robust by empirically tilting the model. Biometrika, 87:453–465.

N. Cressie and T. R. C. Read (1984). Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society, Series B: Methodological, 46:440–464.

D. Ferrari and Y. Yang (2009). Maximum Lq-likelihood estimation. The Annals of Statistics (in press).

M. Genton and P. J. Rousseeuw (1995). The change-of-variance function of M-estimators of scale under general contamination. Journal of Computational and Applied Mathematics, 64:69–80.

F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw and W. A. Stahel (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley & Sons, New York.

J. Havrda and F. Charvát (1967). Quantification method of classification processes: concept of structural a-entropy. Kybernetika, 3:30–35.

C. Tsallis (1988). Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52:479–487.
