
(1)

Approximating likelihood ratios with calibrated classifiers

Gilles Louppe

June 22, 2016 MLHEP, Lund

(2)

Joint work with

Kyle Cranmer

New York University

Juan Pavez

Federico Santa María University

(3)
(4)
(5)

Testing for new physics

$$\frac{p(\text{data} \mid \text{theory} + X)}{p(\text{data} \mid \text{theory})}$$

(6)

Likelihood-free setup

• Complex simulator p parameterized by θ;

• Samples x ∼ p can be generated on-demand;

• ... but the likelihood p(x|θ) cannot be evaluated!

(7)

Simple hypothesis testing

• Assume some observed data D = {x1, . . . , xn};

• Test a null θ = θ0 against an alternative θ = θ1;

• The Neyman-Pearson lemma states that the most powerful test statistic is

$$\lambda(\mathcal{D}; \theta_0, \theta_1) = \prod_{x \in \mathcal{D}} \frac{p_X(x \mid \theta_0)}{p_X(x \mid \theta_1)}.$$

(8)

Straight approximation

1. Approximate pX(x|θ0) and pX(x|θ1) individually, using density estimation algorithms;

2. Evaluate their ratio r(x; θ0, θ1).

Works fine for low-dimensional data but, because of the curse of dimensionality, this is in general a difficult problem. Moreover, it is not even necessary!

$$\frac{p_X(x \mid \theta_0)}{p_X(x \mid \theta_1)} = r(x; \theta_0, \theta_1)$$

When solving a problem of interest, do not solve a more general problem as an intermediate step. – Vladimir Vapnik

(9)

Likelihood ratio invariance under change of variable

Theorem. The likelihood ratio is invariant under the change of variable U = s(X), provided s(x) is monotonic with r(x):

$$r(x) = \frac{p_X(x \mid \theta_0)}{p_X(x \mid \theta_1)} = \frac{p_U(s(x) \mid \theta_0)}{p_U(s(x) \mid \theta_1)}$$
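For intuition, here is a short sketch of why this holds in the simplest case of a strictly monotonic, differentiable s in 1D: the Jacobian of the change of variable cancels in the ratio. (The general statement, requiring only that s be monotonic with r, is proved in Cranmer et al., 2015.)

```latex
% Change of variable U = s(X), with s strictly monotonic and differentiable:
%   p_U(s(x) | \theta) = p_X(x | \theta) \, |s'(x)|^{-1}.
\frac{p_U(s(x) \mid \theta_0)}{p_U(s(x) \mid \theta_1)}
  = \frac{p_X(x \mid \theta_0)\,|s'(x)|^{-1}}{p_X(x \mid \theta_1)\,|s'(x)|^{-1}}
  = \frac{p_X(x \mid \theta_0)}{p_X(x \mid \theta_1)} = r(x).
```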

(10)

Approximating likelihood ratios with classifiers

• Well, a classifier trained to distinguish x ∼ p0 from x ∼ p1 approximates

$$s^*(x) = \frac{p_X(x \mid \theta_1)}{p_X(x \mid \theta_0) + p_X(x \mid \theta_1)},$$

which is monotonic with r(x).

• Estimating p(s(x)|θ) is now easy, since the change of variable s(x) projects x into a 1D space, where only the informative content of the ratio is preserved.

This can be carried out using density estimation or calibration algorithms (histograms, KDE, isotonic regression, etc.), as in the sketch below.
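As a concrete illustration, here is a minimal sketch of the full recipe on 1D Gaussians, where the exact ratio is known in closed form. The classifier choice, sample sizes, and histogram binning are illustrative assumptions, not prescriptions from the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n = 100_000

# p0 = N(0, 1) (theta0) and p1 = N(1, 1) (theta1).
x0 = rng.normal(0.0, 1.0, size=(n, 1))
x1 = rng.normal(1.0, 1.0, size=(n, 1))

# 1. Train a classifier to distinguish x ~ p0 (y=0) from x ~ p1 (y=1);
#    its output approximates s*(x) = p1 / (p0 + p1).
clf = LogisticRegression().fit(
    np.vstack([x0, x1]), np.concatenate([np.zeros(n), np.ones(n)])
)

# 2. Calibrate p(s(x)|theta) by histogramming the classifier output on
#    fresh samples drawn from each hypothesis.
s0 = clf.predict_proba(rng.normal(0.0, 1.0, size=(n, 1)))[:, 1]
s1 = clf.predict_proba(rng.normal(1.0, 1.0, size=(n, 1)))[:, 1]
edges = np.linspace(0.0, 1.0, 51)
h0, _ = np.histogram(s0, bins=edges, density=True)
h1, _ = np.histogram(s1, bins=edges, density=True)

def ratio(x):
    """Approximate r(x) = p(x|theta0) / p(x|theta1) via the calibrated s."""
    s = clf.predict_proba(np.atleast_2d(x))[:, 1]
    idx = np.clip(np.digitize(s, edges) - 1, 0, len(h0) - 1)
    return h0[idx] / np.maximum(h1[idx], 1e-12)

# The exact ratio here is exp((1 - 2x) / 2), which equals 1 at x = 0.5.
print(ratio(0.5))
```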

(11)

Inference and composite hypothesis testing

Approximated likelihood ratios can be used for inference, since

$$\hat{\theta} = \arg\max_\theta \, p(\mathcal{D} \mid \theta) = \arg\max_\theta \prod_{x \in \mathcal{D}} \frac{p(x \mid \theta)}{p(x \mid \theta_1)} = \arg\max_\theta \prod_{x \in \mathcal{D}} \frac{p(s(x; \theta, \theta_1) \mid \theta)}{p(s(x; \theta, \theta_1) \mid \theta_1)} \qquad (1)$$

where θ1 is fixed and s(x; θ, θ1) is a family of classifiers parameterized by (θ, θ1).

Accordingly, generalized (or profile) likelihood ratio tests can be evaluated in the same way.

(12)

Parameterized learning

For inference, we need to build a family s(x; θ, θ1) of classifiers.

• One could build a classifier s independently for every pair (θ, θ1). But this is computationally expensive and would not guarantee a smooth evolution of s(x; θ, θ1) as θ varies.

• Solution: build a single parameterized classifier instead, where the parameters are additional input features (Cranmer et al., 2015; Baldi et al., 2016). Training data is generated as follows (a Python rendering follows below):

T := {}
while size(T) < N do
    Draw θ0 ∼ πΘ0; draw x ∼ p(x|θ0); T := T ∪ {((x, θ0, θ1), y = 0)}
    Draw θ1 ∼ πΘ1; draw x ∼ p(x|θ1); T := T ∪ {((x, θ0, θ1), y = 1)}
end while
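A possible Python rendering of this loop is sketched below; `simulator`, `draw_theta0`, and `draw_theta1` are hypothetical stand-ins for the problem-specific simulator and the proposal distributions πΘ0 and πΘ1. Both θ0 and θ1 are drawn at the top of each iteration so the pair is well defined for both labels.

```python
import numpy as np

def build_training_set(simulator, draw_theta0, draw_theta1, n_pairs, rng):
    """Return (X, y) where each row of X is (x, theta0, theta1)."""
    X, y = [], []
    for _ in range(n_pairs):
        theta0 = draw_theta0(rng)      # theta0 ~ pi_Theta0
        theta1 = draw_theta1(rng)      # theta1 ~ pi_Theta1
        # x ~ p(x|theta0), labeled y = 0.
        X.append(np.concatenate([simulator(theta0, rng), theta0, theta1]))
        y.append(0)
        # x ~ p(x|theta1), labeled y = 1.
        X.append(np.concatenate([simulator(theta1, rng), theta0, theta1]))
        y.append(1)
    return np.asarray(X), np.asarray(y)
```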

(13)

Example: Inference from multidimensional data

Let us assume 5D data x generated from the following process p0:

1. z := (z0, z1, z2, z3, z4), such that z0 ∼ N(µ = α, σ = 1), z1 ∼ N(µ = β, σ = 3), z2 ∼ Mixture(½ N(µ = −2, σ = 1), ½ N(µ = 2, σ = 0.5)), z3 ∼ Exponential(λ = 3), and z4 ∼ Exponential(λ = 0.5);

2. x := Rz, where R is a fixed positive semi-definite 5 × 5 matrix defining a fixed projection of z into the observed space.

Our goal is to infer the values of α and β based on D; a generation sketch in Python follows below.

[Figure: pairwise scatter plots of the observed data D across its five components X0 through X4.]
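Here is a sketch of this generative process; since the slides only specify R as a fixed positive semi-definite 5 × 5 matrix, a fixed random draw is used as a stand-in assumption.

```python
import numpy as np

def generate(alpha, beta, R, n, rng):
    """Draw n samples x = R z from the 5D latent process p0."""
    z0 = rng.normal(alpha, 1.0, n)
    z1 = rng.normal(beta, 3.0, n)
    # Equal-weight mixture of N(-2, 1) and N(2, 0.5).
    pick = rng.binomial(1, 0.5, n)
    z2 = np.where(pick == 0, rng.normal(-2.0, 1.0, n), rng.normal(2.0, 0.5, n))
    z3 = rng.exponential(scale=1.0 / 3.0, size=n)   # rate lambda = 3
    z4 = rng.exponential(scale=1.0 / 0.5, size=n)   # rate lambda = 0.5
    z = np.stack([z0, z1, z2, z3, z4], axis=1)
    return z @ R.T

rng = np.random.RandomState(1)
A = rng.normal(size=(5, 5))
R = A @ A.T / 5.0   # fixed PSD projection (assumed, not given in the slides)
D = generate(1.0, -1.0, R, 1000, rng)
```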

(14)

Example: Inference from multidimensional data

Recipe:

1. Build a single parameterized classifier s(x; θ0, θ1), in this case a 2-layer NN trained on 5+2 features, with the alternative fixed to θ1 = (α = 0, β = 0).

2. Find the approximate MLE α̂, β̂ by solving Eqn. 1, using likelihood scans or through optimization. Since the generator is inexpensive, p(s(x; θ0, θ1)|θ) can be calibrated on-the-fly for every candidate (α, β), e.g. using histograms.

3. Construct the log-likelihood ratio (LLR) statistic

$$-2 \log \Lambda(\alpha, \beta) = -2 \log \frac{p(\mathcal{D} \mid \alpha, \beta)}{p(\mathcal{D} \mid \hat{\alpha}, \hat{\beta})}.$$

A sketch of steps 2-3 follows below.
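Steps 2-3 can be organized as a grid scan, sketched below. `approx_log_ratio` is a hypothetical helper implementing the calibration step: for a candidate θ0 it would draw fresh samples from p(·|θ0) and p(·|θ1), histogram the classifier output s(x; θ0, θ1) on each, and return Σ_{x∈D} log[p(s(x)|θ0)/p(s(x)|θ1)]. The grid bounds are illustrative.

```python
import numpy as np

alphas = np.linspace(0.7, 1.3, 31)
betas = np.linspace(-1.4, -0.6, 31)

# Scan the approximate log-likelihood ratio over the (alpha, beta) grid.
# `approx_log_ratio` is the hypothetical helper described above.
llr = np.array([[approx_log_ratio((a, b), D) for b in betas] for a in alphas])

# Approximate MLE = grid point maximizing the ratio against the fixed theta1.
i, j = np.unravel_index(np.argmax(llr), llr.shape)
alpha_hat, beta_hat = alphas[i], betas[j]

# -2 log Lambda(alpha, beta), relative to the maximum over the grid.
neg2_log_lambda = -2.0 * (llr - llr[i, j])
```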

(15)

[Figure: contour maps over α ∈ [0.90, 1.15] and β ∈ [−1.4, −0.6] comparing the exact −2 log Λ(α, β) (left) with the approximate LLR smoothed by a Gaussian process (right).]

(16)

Diagnostics

In practice, r̂(ŝ(x; θ0, θ1)) will not be exact. Diagnostic procedures are needed to assess the quality of the approximation.

1. For inference, the value of the MLE ˆθ should be independent of the value of θ1 used in the denominator of the ratio.

2. Train a classifier to distinguish between unweighted samples from p(x|θ0) and samples from p(x|θ1) weighted by r̂(ŝ(x; θ0, θ1)). If the approximation is good, this classifier should not be able to separate them, i.e. its ROC curve should stay on the diagonal (see the sketch below).

[Figure: left, −2 log Λ(θ) scans comparing the exact ratio with approximations using θ1 = (α = 0, β = 1), (α = 1, β = −1), and (α = 0, β = −1), with a ±1σ band; right, ROC curves for a classifier trained against p(x|θ1) samples with exact-ratio weights versus no weights.]
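Below is a sketch of diagnostic 2 on the 1D Gaussian toy from the earlier sketch, reusing its `ratio` function (an assumption of this example). If r̂ is accurate, the weighted p1 sample is indistinguishable from p0 and the weighted ROC AUC is close to 0.5.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(2)
m = 20_000
x0 = rng.normal(0.0, 1.0, size=(m, 1))        # unweighted samples from p0
x1 = rng.normal(1.0, 1.0, size=(m, 1))        # samples from p1
w = np.concatenate([np.ones(m), ratio(x1)])   # weight p1 by r_hat (see above)

X = np.vstack([x0, x1])
y = np.concatenate([np.zeros(m), np.ones(m)])

# Train a fresh classifier to separate the two weighted samples.
clf = GradientBoostingClassifier(n_estimators=50).fit(X, y, sample_weight=w)
auc = roc_auc_score(y, clf.predict_proba(X)[:, 1], sample_weight=w)
print(auc)  # ~0.5 (diagonal ROC) when the ratio approximation is good
```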

(17)

Density ratio estimation

The density ratio $r(x; \theta_0, \theta_1) = \frac{p(x \mid \theta_0)}{p(x \mid \theta_1)}$ appears in many other

fundamental statistical inference problems, including

• transfer learning,

• outlier detection,

• divergence estimation,

• ...

For all of them, the proposed approximation can be used as a drop-in replacement!

(18)

Transfer learning

Under the assumption that train and test data are drawn i.i.d. from the same distribution p,

$$\frac{1}{N} \sum_{x_i} L(\varphi(x_i)) \to \int L(\varphi(x)) \, p(x) \, dx$$

as training data increases, i.e. as N → ∞.

(19)

Transfer learning

Under the assumption that train and test data are drawn i.i.d. from the same distribution,

$$\frac{1}{N} \sum_{x_i} L(\varphi(x_i)) \to \int L(\varphi(x)) \, p_{\text{train}}(x) \, dx$$

as training data increases, i.e. as N → ∞.

But we want to do well on test data, i.e., minimize

$$\int L(\varphi(x)) \, p_{\text{test}}(x) \, dx.$$

(20)

Importance weighting

Reweight samples by $\frac{p_{\text{test}}(x_i)}{p_{\text{train}}(x_i)}$, such that

$$\frac{1}{N} \sum_{x_i} \frac{p_{\text{test}}(x_i)}{p_{\text{train}}(x_i)} L(\varphi(x_i)) \to \int \frac{p_{\text{test}}(x)}{p_{\text{train}}(x)} L(\varphi(x)) \, p_{\text{train}}(x) \, dx = \int L(\varphi(x)) \, p_{\text{test}}(x) \, dx$$

as training data increases, i.e. as N → ∞.

Again, $\frac{p_{\text{test}}(x_i)}{p_{\text{train}}(x_i)}$ cannot be evaluated directly, but it can be approximated with a calibrated classifier, exactly as before (see the sketch below).
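A sketch of this classifier-based reweighting under covariate shift follows; the toy train/test distributions are assumptions for illustration. With balanced samples, the classifier output s gives p_test/p_train ≈ s/(1 − s).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(3)
m = 5_000
x_train = rng.normal(0.0, 1.0, size=(m, 1))   # x ~ p_train
x_test = rng.normal(0.5, 1.0, size=(m, 1))    # x ~ p_test (shifted)

# Classifier separating test (y=1) from train (y=0).
X = np.vstack([x_train, x_test])
y = np.concatenate([np.zeros(m), np.ones(m)])
s = LogisticRegression().fit(X, y).predict_proba(x_train)[:, 1]

# Importance weights p_test(x)/p_train(x) ~= s / (1 - s) on the training set.
weights = s / np.maximum(1.0 - s, 1e-12)

# `weights` can now be passed as sample_weight when fitting the model of
# interest on the training data.
```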

(21)

Example

p0 : α = −2, β = 2 versus p1 : α = 0, β = 0

(22)

Example

(23)

Summary

• We proposed an approach for approximating likelihood ratios in the likelihood-free setup.

• Evaluating likelihood ratios reduces to supervised learning. Both problems are deeply connected.

• Alternative to Approximate Bayesian Computation, without the need to define a prior over parameters.

(24)

References

Baldi, P., Cranmer, K., Faucett, T., Sadowski, P., and Whiteson, D. (2016). Parameterized Machine Learning for High-Energy Physics. arXiv preprint arXiv:1601.07913.

Cranmer, K., Pavez, J., and Louppe, G. (2015). Approximating likelihood ratios with calibrated discriminative classifiers. arXiv preprint arXiv:1506.02169.

Louppe, G., Cranmer, K., and Pavez, J. (2016). carl: a likelihood-free inference toolbox. http://dx.doi.org/10.5281/zenodo.47798.
