Approximating likelihood ratios with
calibrated classifiers
Gilles Louppe
June 22, 2016 MLHEP, Lund
Joint work with
Kyle Cranmer
New York University
Juan Pavez
Federico Santa María University
Testing for new physics
p(data | theory + X) / p(data | theory)
Likelihood-free setup
• Complex simulator p parameterized by θ;
• Samples x ∼ p can be generated on-demand;
• ... but the likelihood p(x|θ) cannot be evaluated!
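A hypothetical toy simulator (the `simulate` function and its internals are invented purely for illustration) makes the setting concrete: samples can be drawn on demand for any θ, while the induced density is treated as unavailable:

```python
import numpy as np

def simulate(theta, n, rng):
    # Hypothetical black-box simulator: draws x ~ p(x|theta).
    # Internally a shifted Gaussian pushed through a nonlinearity, so the
    # induced density of x is awkward to write down, yet sampling is cheap.
    z = rng.normal(loc=theta, scale=1.0, size=n)
    return z + 0.2 * z ** 2

rng = np.random.default_rng(0)
x0 = simulate(0.0, 1000, rng)  # samples under theta_0
x1 = simulate(1.0, 1000, rng)  # samples under theta_1
```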
Simple hypothesis testing
• Assume some observed data D = {x1, . . . , xn};
• Test a null θ = θ0 against an alternative θ = θ1;
• The Neyman-Pearson lemma states that the most powerful test statistic is

λ(D; θ0, θ1) = ∏_{x ∈ D} pX(x|θ0) / pX(x|θ1).
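For a toy case where both densities are known (unit-variance Gaussians, a stand-in only; in the likelihood-free setting this statistic is exactly what cannot be computed directly), λ can be evaluated as a sketch, assuming SciPy is available:

```python
import numpy as np
from scipy.stats import norm

def lam(D, theta0, theta1):
    # Neyman-Pearson statistic: product over the data of density ratios
    return np.prod(norm.pdf(D, loc=theta0) / norm.pdf(D, loc=theta1))

rng = np.random.default_rng(1)
D = rng.normal(loc=1.0, size=50)  # data actually drawn under theta_1
# lam(D, 0.0, 1.0) comes out far below 1: evidence against theta_0
```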
Straight approximation
1. Approximate pX(x|θ0) and pX(x|θ1) individually, using density
estimation algorithms;
2. Evaluate their ratio r (x; θ0, θ1).
Works fine for low-dimensional data, but because of the curse of dimensionality, this is in general a difficult problem! Moreover, it is not even necessary!
pX(x|θ0) / pX(x|θ1) = r(x; θ0, θ1)
When solving a problem of interest, do not solve a more general problem as an intermediate step. – Vladimir Vapnik
Likelihood ratio invariance under change of variable
Theorem. The likelihood ratio is invariant under the change of variable U = s(X), provided s(x) is monotonic with r(x).

r(x) = pX(x|θ0) / pX(x|θ1) = pU(s(x)|θ0) / pU(s(x)|θ1)
Approximating likelihood ratios with classifiers
• Well, a classifier trained to distinguish x ∼ p0 from x ∼ p1 approximates

s∗(x) = pX(x|θ1) / (pX(x|θ0) + pX(x|θ1)),

which is monotonic with r(x).
• Estimating p(s(x)|θ) is now easy, since the change of variable s(x) projects x in a 1D space, where only the informative content of the ratio is preserved.
This can be carried out using density estimation or calibration algorithms (histograms, KDE, isotonic regression, etc).
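A minimal sketch of the whole procedure on a 1D toy problem, assuming scikit-learn is available (logistic regression as the classifier, histograms for the calibration step; all names are illustrative, not the paper's implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 1D toy problem standing in for simulator output under two hypotheses
x0 = rng.normal(0.0, 1.0, 5000)  # x ~ p(x|theta_0)
x1 = rng.normal(1.0, 1.0, 5000)  # x ~ p(x|theta_1)

# Step 1: train a classifier to distinguish the two samples
X = np.concatenate([x0, x1]).reshape(-1, 1)
y = np.concatenate([np.zeros(5000), np.ones(5000)])
clf = LogisticRegression().fit(X, y)

def s(x):
    return clf.predict_proba(np.asarray(x).reshape(-1, 1))[:, 1]

# Step 2: calibrate p(s(x)|theta) in the 1D score space with histograms
edges = np.linspace(0.0, 1.0, 30)
h0, _ = np.histogram(s(x0), bins=edges, density=True)
h1, _ = np.histogram(s(x1), bins=edges, density=True)

def r_hat(x, eps=1e-12):
    # Approximate ratio p(x|theta_0)/p(x|theta_1) via the calibrated score
    idx = np.clip(np.digitize(s(x), edges) - 1, 0, len(h0) - 1)
    return (h0[idx] + eps) / (h1[idx] + eps)
```

For these Gaussians the true ratio is exp(0.5 − x), so r_hat should exceed 1 well below x = 0.5 and fall below 1 well above it.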
Inference and composite hypothesis testing
Approximated likelihood ratios can be used for inference, since

θ̂ = arg max_θ p(D|θ)
  = arg max_θ ∏_{x∈D} p(x|θ) / p(x|θ1)
  = arg max_θ ∏_{x∈D} p(s(x; θ, θ1)|θ) / p(s(x; θ, θ1)|θ1)    (1)
where θ1 is fixed and s(x; θ, θ1) is a family of classifiers
parameterized by (θ, θ1).
Accordingly, generalized (or profile) likelihood ratio tests can be evaluated in the same way.
Parameterized learning
For inference, we need to build a family s(x; θ, θ1) of classifiers.
• One could build a classifier s independently for every (θ, θ1). But this is computationally expensive and would not guarantee a smooth evolution of s(x; θ, θ1) as θ varies.
• Solution: build a single parameterized classifier instead, where the parameters are additional input features (Cranmer et al., 2015; Baldi et al., 2016).

T := {}
while size(T) < N do
    Draw θ0 ∼ πΘ0; draw x ∼ p(x|θ0); T := T ∪ {((x, θ0, θ1), y = 0)}
    Draw θ1 ∼ πΘ1; draw x ∼ p(x|θ1); T := T ∪ {((x, θ0, θ1), y = 1)}
end while
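The sampling loop above might be sketched in Python as follows, with a hypothetical `simulator` and a uniform `draw_proposal` standing in for the real simulator and the proposals πΘ:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta, rng):
    # Hypothetical stand-in for the real simulator: x ~ p(x|theta)
    return rng.normal(loc=theta, scale=1.0)

def draw_proposal(rng):
    # Stand-in for the proposal pi_Theta over parameter values
    return rng.uniform(-1.0, 1.0)

N = 1000
X, y = [], []
while len(X) < N:
    theta0, theta1 = draw_proposal(rng), draw_proposal(rng)
    # one example per hypothesis; (theta0, theta1) join x as input features
    X.append([simulator(theta0, rng), theta0, theta1]); y.append(0)
    X.append([simulator(theta1, rng), theta0, theta1]); y.append(1)
X, y = np.array(X), np.array(y)
```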
Example: Inference from multidimensional data
Let us assume 5D data x generated from the following process p0:
1. z := (z0, z1, z2, z3, z4), such that z0 ∼ N(µ = α, σ = 1), z1 ∼ N(µ = β, σ = 3), z2 ∼ Mixture(½ N(µ = −2, σ = 1), ½ N(µ = 2, σ = 0.5)), z3 ∼ Exponential(λ = 3), and z4 ∼ Exponential(λ = 0.5);
2. x := Rz, where R is a fixed semi-positive definite 5 × 5 matrix defining a fixed projection of z into the observed space.
Our goal is to infer the values of α and β based on D.

[Figure: pairwise scatter plots of the observed data D over the components X0, ..., X4.]
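A sketch of this generative process, assuming NumPy; since the matrix R is not specified in the text, an arbitrary fixed mixing matrix is used here:

```python
import numpy as np

def generate(alpha, beta, n, rng):
    # Latent vector z, component by component, as described above
    z = np.column_stack([
        rng.normal(alpha, 1.0, n),                 # z0 ~ N(alpha, 1)
        rng.normal(beta, 3.0, n),                  # z1 ~ N(beta, 3)
        np.where(rng.random(n) < 0.5,              # z2 ~ equal-weight mixture
                 rng.normal(-2.0, 1.0, n),
                 rng.normal(2.0, 0.5, n)),
        rng.exponential(1.0 / 3.0, n),             # z3 ~ Exponential(lambda=3)
        rng.exponential(1.0 / 0.5, n),             # z4 ~ Exponential(lambda=0.5)
    ])
    # R is not given in the text; any fixed 5x5 mixing matrix works here
    R = np.eye(5) + 0.1 * np.ones((5, 5))
    return z @ R.T

rng = np.random.default_rng(0)
D = generate(alpha=1.0, beta=-1.0, n=1000, rng=rng)
```

Note that NumPy's `exponential` takes the scale 1/λ, hence the reciprocals.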
Example: Inference from multidimensional data
Recipe:
1. Build a single parameterized classifier s(x; θ0, θ1), in this case
a 2-layer NN trained on 5+2 features, with the alternative fixed to θ1 = (α = 0, β = 0).
2. Find the approximate MLE α̂, β̂ by solving Eqn. 1, using likelihood scans or optimization. Since the generator is inexpensive, p(s(x; θ0, θ1)|θ) can be calibrated on the fly for every candidate (α, β), e.g. using histograms.
3. Construct the log-likelihood ratio (LLR) statistic

−2 log Λ(α, β) = −2 log [ p(D|α, β) / p(D|α̂, β̂) ]
[Figure: exact −2 log Λ(α, β) surface vs. the approximate LLR (smoothed by a Gaussian process), over α ∈ [0.90, 1.15] and β ∈ [−1.4, −0.6].]
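A minimal likelihood-scan sketch; here a tractable 1D Gaussian log-likelihood stands in for the calibrated approximation p(s(x; θ, θ1)|θ), so only the scan mechanics of steps 2 and 3 are illustrated:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=1.0, scale=1.0, size=200)   # observed data, true alpha = 1

def loglik(alpha):
    # Tractable stand-in: log p(D|alpha) up to a constant. In the calibrated
    # setting this would instead sum the log of
    # p(s(x; alpha, theta1)|alpha) / p(s(x; alpha, theta1)|theta1) over D.
    return -0.5 * np.sum((D - alpha) ** 2)

grid = np.linspace(0.0, 2.0, 201)
scan = np.array([loglik(a) for a in grid])
alpha_hat = grid[np.argmax(scan)]              # approximate MLE from the scan
llr = -2.0 * (scan - scan.max())               # -2 log Lambda(alpha) profile
```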
Diagnostics
In practice r̂(ŝ(x; θ0, θ1)) will not be exact. Diagnostic procedures are needed to assess the quality of this approximation.
1. For inference, the value of the MLE ˆθ should be independent of the value of θ1 used in the denominator of the ratio.
2. Train a classifier to distinguish between unweighted samples from p(x|θ0) and samples from p(x|θ1) weighted by
r̂(ŝ(x; θ0, θ1)).

[Figures: left, exact and approximate −2 log Λ(θ) curves for several choices of θ1, with a ±1σ band for θ1 = (α = 0, β = −1); right, ROC curves (true positive rate vs. false positive rate) comparing p(x|θ1) samples weighted by the exact ratio r(x|θ0, θ1) against unweighted samples.]
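The second diagnostic can be sketched as follows: if the ratio used for weighting is exact, a classifier trained to separate the weighted and unweighted samples should do no better than chance. Here the exact 1D Gaussian ratio stands in for r̂, and scikit-learn is assumed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, 20000)   # samples from p(x|theta_0)
x1 = rng.normal(1.0, 1.0, 20000)   # samples from p(x|theta_1)

# Exact ratio p0/p1 = exp(0.5 - x) stands in for the estimate r_hat
w1 = np.exp(0.5 - x1)              # weights applied to the theta_1 sample

X = np.concatenate([x0, x1]).reshape(-1, 1)
y = np.concatenate([np.zeros_like(x0), np.ones_like(x1)])
w = np.concatenate([np.ones_like(x0), w1])

# Diagnostic classifier: with exact weights the weighted classes coincide,
# so its weighted ROC AUC should sit near 0.5 (chance level)
clf = LogisticRegression().fit(X, y, sample_weight=w)
auc = roc_auc_score(y, clf.predict_proba(X)[:, 1], sample_weight=w)
```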
Density ratio estimation
The density ratio r(x; θ0, θ1) = p(x|θ0) / p(x|θ1) appears in many other fundamental statistical inference problems, including
• transfer learning,
• outlier detection,
• divergence estimation,
• ...
For all of them, the proposed approximation can be used as a drop-in replacement!
Transfer learning
Under the assumption that train and test data are drawn i.i.d. from the same distribution ptrain,

(1/N) ∑_{xi} L(ϕ(xi)) → ∫ L(ϕ(x)) ptrain(x) dx,

as training data increases, i.e. as N → ∞.
But we want to be good on test data, i.e., minimize

∫ L(ϕ(x)) ptest(x) dx.
Importance weighting
Reweight samples by ptest(xi)/ptrain(xi), such that

(1/N) ∑_{xi} [ptest(xi)/ptrain(xi)] L(ϕ(xi)) → ∫ [ptest(x)/ptrain(x)] L(ϕ(x)) ptrain(x) dx = ∫ L(ϕ(x)) ptest(x) dx,

as training data increases, i.e. as N → ∞.
Again, ptest(xi)/ptrain(xi) cannot be evaluated directly, but it can be approximated with a calibrated classifier.
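A sketch of the classifier-based approximation in this setting, assuming scikit-learn: a train-vs-test classifier approximates s(x) = ptest/(ptrain + ptest), so s/(1 − s) approximates the importance weight directly (all names here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x_train = rng.normal(0.0, 1.0, 10000)   # ~ p_train
x_test = rng.normal(0.5, 1.0, 10000)    # ~ p_test (shifted covariates)

# Train a classifier to separate train (y=0) from test (y=1) samples,
# then turn its score into an importance-weight estimate s/(1 - s)
X = np.concatenate([x_train, x_test]).reshape(-1, 1)
y = np.concatenate([np.zeros(10000), np.ones(10000)])
s = LogisticRegression().fit(X, y).predict_proba(x_train.reshape(-1, 1))[:, 1]
w = s / (1.0 - s)                       # approx. p_test/p_train at train points

# Reweighted training expectations track the test distribution
mean_w = np.average(x_train, weights=w)  # close to the test mean 0.5
```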
Example
p0 : α = −2, β = 2 versus p1 : α = 0, β = 0
Summary
• We proposed an approach for approximating likelihood ratios in the likelihood-free setup.
• Evaluating likelihood ratios reduces to supervised learning. Both problems are deeply connected.
• Alternative to Approximate Bayesian Computation, without the need to define a prior over parameters.
References
Baldi, P., Cranmer, K., Faucett, T., Sadowski, P., and Whiteson, D. (2016). Parameterized Machine Learning for High-Energy Physics. arXiv preprint arXiv:1601.07913.
Cranmer, K., Pavez, J., and Louppe, G. (2015). Approximating likelihood ratios with calibrated discriminative classifiers. arXiv preprint arXiv:1506.02169.
Louppe, G., Cranmer, K., and Pavez, J. (2016). carl: a likelihood-free inference toolbox. http://dx.doi.org/10.5281/zenodo.47798.