• Aucun résultat trouvé

Bayesian Nonparametric Mixtures Why and How?

N/A
N/A
Protected

Academic year: 2021

Partager "Bayesian Nonparametric Mixtures Why and How?"

Copied!
2
0
0

Texte intégral

(1)

HAL Id: hal-01950664

https://hal.archives-ouvertes.fr/hal-01950664

Submitted on 11 Dec 2018

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Bayesian Nonparametric Mixtures Why and How?

Julyan Arbel

To cite this version:

Julyan Arbel. Bayesian Nonparametric Mixtures Why and How?. IFSS 2018 - 2nd Italian-French Statistics Seminar, Sep 2018, Grenoble, France. �hal-01950664�

(2)

Bayesian Nonparametric Mixtures Why and How?

www.julyanarbel.com, Inria, Mistis

Introduction

Bayesian nonparametric framework

Massively many parameters

Inference on curves: pdf, cdf, hazard, link. . .

Mixtures, exchangeable data Xn = (X1, . . . , Xn) X1, . . . Xn | P

( P 1 R

Θ k( · |θ)P(dθ) 2 3 , Natural uncertainty quantification

, Flexibility, avoids over-fitting by regularization (prior)

, Adapt to data complexity , Underlying clustering

/ Justify prior, expert

/ Efficient posterior sampling / Quantify truncation error

What prior for P?

Learn about data through posterior dist.

Discrete random probability measure prior

Random weights (pi)i and locations i)i

P =

X

i=1

piδθi

Dirichlet process DP(α, G0) (Ferguson, 1973) Predictive: Chinese Restaurant Process

P(Xn+1 ∈ · | Xn) = α

α + nG0 + 1 α + n

kn

X

j=1

njδX

j

Or for varying P(Xn+1 new ) . . . y

2 Density Estimation

Ecological risk assessment (Arbel et al., 2016b)

Data are species critical effect concentrations (CEC), possibly censored

Estimation of species sensitivity distribution (SSD), the density of CEC

Safe concentration which protects most of the species: 5th percentile of the SSD (HC5)

Very moderate sample sizes, 10 50

BNP describes well variability of the data, without being prone to over-fitting

Species clustering as an outcome

0.0 0.2 0.4

−2.5 0.0 2.5 5.0

Data

Density

0.0 0.1 0.2 0.3 0.4 0.5

−1.5 −1.0 −0.5 0.0 0.5 1.0

Data

Model

BNP KDE Normal

3 Survival Analysis

Bayesian hazard mixture (Arbel et al., 2016c)

Data are (remission) times possibly censored

Prior on hazard rate h(t) for every time t

Induces prior on survival function S (t)

Availability of post. mean, median, mode

Smooth estimator VS Kaplan–Meyer

Proper uncertainty quantification

Time (in weeks)

0 35 70

0.0 0.5 1.0

Time (in weeks)

0 35 70

0.0 0.5 1.0

Funding

This work is supported by the European Research Council (ERC) through StG “N-BNP” 306406.

1 Species Modeling

Data can be species, microbes, words, genes. . . Discovery probabilities (Arbel et al., 2016a)

Estimation of `-discovery

D` = P(Xn+1 is a species seen ` times)

Comparison with Good-Turing estimator

Closed form posterior and estimators

Uncertainty quantif., unavailable for GT

2nd order (fast) approximations

Diversity in ecology (Arbel et al., 2015, 2016d)

Assess impact of pollution on microbial community via study of diversity

Div = P

i pi log pi

Model detects an hormetic effect

Uncertainty quantification

Prediction across full range of covariates

Fuel concentration (g TPH/kg soil)

Shannon diversity

0 5 10 15 20 25

0 1 2 3 4 5

References & Collaborators

Arbel, J., Favaro, S., Nipoti, B., and Teh, Y. W. (2016a).

Bayesian nonparametric inference for discovery proba- bilities: credible intervals and large sample asymptotics.

Statistica Sinica.

Arbel, J., Kon Kam King, G., and Prünster, I. (2016b).

Bayesian nonparametric modelling of species sampling distributions. In preparation.

Arbel, J., Lijoi, A., and Nipoti, B. (2016c). Full Bayesian inference with hazard mixture models. Computational Statistics & Data Analysis.

Arbel, J., Mengersen, K., Raymond, B., Winsley, T., and King, C. (2015). Application of a Bayesian nonparametric model to derive toxicity estimates based on the response of Antarctic microbial communities to fuel contaminated soil. Ecology and Evolution.

Arbel, J., Mengersen, K., and Rousseau, J. (2016d). Bayesian nonparametric dependent model for partially replicated data: the influence of fuel spills on species diversity.

Annals of Applied Statistics.

Arbel, J. and Prünster, I. (2016). A moment-matching Ferguson & Klass algorithm. Statistics and Computing. Ferguson, T. (1973). A Bayesian analysis of some

nonparametric problems. The Annals of Statistics.

Miller, J. W. and Harrison, M. T. (2014). Inconsistency of Pitman-Yor process mixtures for the number of components. The Journal of Machine Learning Research.

Wade, S. and Ghahramani, Z. (2015). Bayesian cluster analysis: Point estimation and credible balls. arXiv.

Open Questions

How to best use underlying clustering?

(Wade and Ghahramani, 2015)

Find consistent estimator of number of clusters: posterior inconsistent (Miller and Harrison, 2014), what about posterior mode?

Devise efficient posterior sampling, trunca- tion error (Arbel and Prünster, 2016)

Références

Documents relatifs

Applied to RNA-Seq data from 524 kidney tumor samples, PLASMA achieves a greater power at 50 samples than conventional QTL-based fine-mapping at 500 samples: with over 17% of

Within a pair, distinct NLR proteins: (i) form homo- and hetero-complexes that are involved in important regulation pro- cesses both prior to and following pathogen recognition,

In addition by comparing dried spots on filter paper made of whole blood and nasal swabs from animals sampled in the field at different clinical stages of the disease, we dem-

Following co-infection of plants with Tomato yellow leaf curl virus (TYX) and Tomato leaf curl Comoros virus (TOX), the frequency of parental and recombinant genomes was

Using species distribution models to optimize vector control in the framework of the tsetse eradication campaign in Senegal..

Plus précisément, nous testons l’hypothèse selon laquelle la mise en conformité à la norme GlobalGAP est à l’origine d’un plus gros volume de litchis vendus aux exportateurs

Beyond simple bounds derived from base policies, the only theoretical results given explicitly for rollout algorithms are average-case results for the breakthrough problem

Two methods have been used to disentangle the P-wave spin states in the hard transitions: inclusive converted photon searches, used in a recent BABAR analysis [20] ; and