Approximating likelihood ratios with
calibrated classifiers
Gilles Louppe
June 22, 2016 MLHEP, Lund
Joint work with
Kyle Cranmer
New York University
Juan Pavez
Federico Santa María University
Testing for new physics
p(data | theory + X) / p(data | theory)
Likelihood-free setup
• Complex simulator p parameterized by θ;
• Samples x ∼ p can be generated on-demand;
• ... but the likelihood p(x|θ) cannot be evaluated!
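A hypothetical toy simulator (the `simulate` function and its internals are invented purely for illustration) makes the setting concrete: samples can be drawn on demand for any θ, while the induced density is treated as unavailable:

```python
import numpy as np

def simulate(theta, n, rng):
    # Hypothetical black-box simulator: draws x ~ p(x|theta).
    # Internally a shifted Gaussian pushed through a nonlinearity, so the
    # induced density of x is awkward to write down, yet sampling is cheap.
    z = rng.normal(loc=theta, scale=1.0, size=n)
    return z + 0.2 * z ** 2

rng = np.random.default_rng(0)
x0 = simulate(0.0, 1000, rng)  # samples under theta_0
x1 = simulate(1.0, 1000, rng)  # samples under theta_1
```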
Simple hypothesis testing
• Assume some observed data D = {x1, . . . , xn};
• Test a null θ = θ0 against an alternative θ = θ1;
• The Neyman-Pearson lemma states that the most powerful test statistic is

λ(D; θ0, θ1) = ∏_{x ∈ D} pX(x|θ0) / pX(x|θ1).
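For a toy case where both densities are known (unit-variance Gaussians, a stand-in only; in the likelihood-free setting this statistic is exactly what cannot be computed directly), λ can be evaluated as a sketch, assuming SciPy is available:

```python
import numpy as np
from scipy.stats import norm

def lam(D, theta0, theta1):
    # Neyman-Pearson statistic: product over the data of density ratios
    return np.prod(norm.pdf(D, loc=theta0) / norm.pdf(D, loc=theta1))

rng = np.random.default_rng(1)
D = rng.normal(loc=1.0, size=50)  # data actually drawn under theta_1
# lam(D, 0.0, 1.0) comes out far below 1: evidence against theta_0
```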
Straight approximation
1. Approximate pX(x|θ0) and pX(x|θ1) individually, using density
estimation algorithms;
2. Evaluate their ratio r (x; θ0, θ1).
Works fine for low-dimensional data, but because of the curse of dimensionality, this is in general a difficult problem! Moreover, it is not even necessary!
pX(x|θ0) / pX(x|θ1) = r(x; θ0, θ1)
When solving a problem of interest, do not solve a more general problem as an intermediate step. – Vladimir Vapnik
Likelihood ratio invariance under change of variable
Theorem. The likelihood ratio is invariant under the change of variable U = s(X), provided s(x) is monotonic with r(x).

r(x) = pX(x|θ0) / pX(x|θ1) = pU(s(x)|θ0) / pU(s(x)|θ1)
Approximating likelihood ratios with classifiers
• Well, a classifier trained to distinguish x ∼ p0 from x ∼ p1 approximates

s∗(x) = pX(x|θ1) / (pX(x|θ0) + pX(x|θ1)),

which is monotonic with r(x).
• Estimating p(s(x)|θ) is now easy, since the change of variable s(x) projects x in a 1D space, where only the informative content of the ratio is preserved.
This can be carried out using density estimation or calibration algorithms (histograms, KDE, isotonic regression, etc).
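A minimal sketch of the whole procedure on a 1D toy problem, assuming scikit-learn is available (logistic regression as the classifier, histograms for the calibration step; all names are illustrative, not the paper's implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 1D toy problem standing in for simulator output under two hypotheses
x0 = rng.normal(0.0, 1.0, 5000)  # x ~ p(x|theta_0)
x1 = rng.normal(1.0, 1.0, 5000)  # x ~ p(x|theta_1)

# Step 1: train a classifier to distinguish the two samples
X = np.concatenate([x0, x1]).reshape(-1, 1)
y = np.concatenate([np.zeros(5000), np.ones(5000)])
clf = LogisticRegression().fit(X, y)

def s(x):
    return clf.predict_proba(np.asarray(x).reshape(-1, 1))[:, 1]

# Step 2: calibrate p(s(x)|theta) in the 1D score space with histograms
edges = np.linspace(0.0, 1.0, 30)
h0, _ = np.histogram(s(x0), bins=edges, density=True)
h1, _ = np.histogram(s(x1), bins=edges, density=True)

def r_hat(x, eps=1e-12):
    # Approximate ratio p(x|theta_0)/p(x|theta_1) via the calibrated score
    idx = np.clip(np.digitize(s(x), edges) - 1, 0, len(h0) - 1)
    return (h0[idx] + eps) / (h1[idx] + eps)
```

For these Gaussians the true ratio is exp(0.5 − x), so r_hat should exceed 1 well below x = 0.5 and fall below 1 well above it.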
Inference and composite hypothesis testing
Approximated likelihood ratios can be used for inference, since

θ̂ = arg max_θ p(D|θ)
  = arg max_θ ∏_{x∈D} p(x|θ) / p(x|θ1)
  = arg max_θ ∏_{x∈D} p(s(x; θ, θ1)|θ) / p(s(x; θ, θ1)|θ1)    (1)
where θ1 is fixed and s(x; θ, θ1) is a family of classifiers
parameterized by (θ, θ1).
Accordingly, generalized (or profile) likelihood ratio tests can be evaluated in the same way.
Parameterized learning
For inference, we need to build a family s(x; θ, θ1) of classifiers.
• One could build a classifier s independently for every (θ, θ1). But this is computationally expensive and would not guarantee a smooth evolution of s(x; θ, θ1) as θ varies.
• Solution: build a single parameterized classifier instead, where the parameters are additional input features (Cranmer et al., 2015; Baldi et al., 2016).

T := {}
while size(T) < N do
    Draw θ0 ∼ πΘ0; draw x ∼ p(x|θ0); T := T ∪ {((x, θ0, θ1), y = 0)}
    Draw θ1 ∼ πΘ1; draw x ∼ p(x|θ1); T := T ∪ {((x, θ0, θ1), y = 1)}
end while
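The sampling loop above might be sketched in Python as follows, with a hypothetical `simulator` and a uniform `draw_proposal` standing in for the real simulator and the proposals πΘ:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta, rng):
    # Hypothetical stand-in for the real simulator: x ~ p(x|theta)
    return rng.normal(loc=theta, scale=1.0)

def draw_proposal(rng):
    # Stand-in for the proposal pi_Theta over parameter values
    return rng.uniform(-1.0, 1.0)

N = 1000
X, y = [], []
while len(X) < N:
    theta0, theta1 = draw_proposal(rng), draw_proposal(rng)
    # one example per hypothesis; (theta0, theta1) join x as input features
    X.append([simulator(theta0, rng), theta0, theta1]); y.append(0)
    X.append([simulator(theta1, rng), theta0, theta1]); y.append(1)
X, y = np.array(X), np.array(y)
```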
Example: Inference from multidimensional data
Let us assume 5D data x generated from the following process p0:
1. z := (z0, z1, z2, z3, z4), such that z0 ∼ N(µ = α, σ = 1), z1 ∼ N(µ = β, σ = 3), z2 ∼ Mixture(½ N(µ = −2, σ = 1), ½ N(µ = 2, σ = 0.5)), z3 ∼ Exponential(λ = 3), and z4 ∼ Exponential(λ = 0.5);
2. x := Rz, where R is a fixed semi-positive definite 5 × 5 matrix defining a fixed projection of z into the observed space.
Our goal is to infer the values of α and β based on D.

[Figure: pairwise scatter plots of the observed data D over the components X0, ..., X4.]
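A sketch of this generative process, assuming NumPy; since the matrix R is not specified in the text, an arbitrary fixed mixing matrix is used here:

```python
import numpy as np

def generate(alpha, beta, n, rng):
    # Latent vector z, component by component, as described above
    z = np.column_stack([
        rng.normal(alpha, 1.0, n),                 # z0 ~ N(alpha, 1)
        rng.normal(beta, 3.0, n),                  # z1 ~ N(beta, 3)
        np.where(rng.random(n) < 0.5,              # z2 ~ equal-weight mixture
                 rng.normal(-2.0, 1.0, n),
                 rng.normal(2.0, 0.5, n)),
        rng.exponential(1.0 / 3.0, n),             # z3 ~ Exponential(lambda=3)
        rng.exponential(1.0 / 0.5, n),             # z4 ~ Exponential(lambda=0.5)
    ])
    # R is not given in the text; any fixed 5x5 mixing matrix works here
    R = np.eye(5) + 0.1 * np.ones((5, 5))
    return z @ R.T

rng = np.random.default_rng(0)
D = generate(alpha=1.0, beta=-1.0, n=1000, rng=rng)
```

Note that NumPy's `exponential` takes the scale 1/λ, hence the reciprocals.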
Example: Inference from multidimensional data
Recipe:
1. Build a single parameterized classifier s(x; θ0, θ1), in this case
a 2-layer NN trained on 5+2 features, with the alternative fixed to θ1 = (α = 0, β = 0).
2. Find the approximate MLE α̂, β̂ by solving Eqn. 1, using likelihood scans or optimization. Since the generator is inexpensive, p(s(x; θ0, θ1)|θ) can be calibrated on the fly for every candidate (α, β), e.g. using histograms.
3. Construct the log-likelihood ratio (LLR) statistic

−2 log Λ(α, β) = −2 log [ p(D|α, β) / p(D|α̂, β̂) ]
[Figure: exact −2 log Λ(α, β) surface vs. the approximate LLR (smoothed by a Gaussian process), over α ∈ [0.90, 1.15] and β ∈ [−1.4, −0.6].]
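A minimal likelihood-scan sketch; here a tractable 1D Gaussian log-likelihood stands in for the calibrated approximation p(s(x; θ, θ1)|θ), so only the scan mechanics of steps 2 and 3 are illustrated:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=1.0, scale=1.0, size=200)   # observed data, true alpha = 1

def loglik(alpha):
    # Tractable stand-in: log p(D|alpha) up to a constant. In the calibrated
    # setting this would instead sum the log of
    # p(s(x; alpha, theta1)|alpha) / p(s(x; alpha, theta1)|theta1) over D.
    return -0.5 * np.sum((D - alpha) ** 2)

grid = np.linspace(0.0, 2.0, 201)
scan = np.array([loglik(a) for a in grid])
alpha_hat = grid[np.argmax(scan)]              # approximate MLE from the scan
llr = -2.0 * (scan - scan.max())               # -2 log Lambda(alpha) profile
```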
Diagnostics
In practice r̂(ŝ(x; θ0, θ1)) will not be exact. Diagnostic procedures are needed to assess the quality of this approximation.
1. For inference, the value of the MLE ˆθ should be independent of the value of θ1 used in the denominator of the ratio.
2. Train a classifier to distinguish between unweighted samples from p(x|θ0) and samples from p(x|θ1) weighted by
r̂(ŝ(x; θ0, θ1)).

[Figures: left, exact and approximate −2 log Λ(θ) curves for several choices of θ1, with a ±1σ band for θ1 = (α = 0, β = −1); right, ROC curves (true positive rate vs. false positive rate) comparing p(x|θ1) samples weighted by the exact ratio r(x|θ0, θ1) against unweighted samples.]
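The second diagnostic can be sketched as follows: if the ratio used for weighting is exact, a classifier trained to separate the weighted and unweighted samples should do no better than chance. Here the exact 1D Gaussian ratio stands in for r̂, and scikit-learn is assumed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, 20000)   # samples from p(x|theta_0)
x1 = rng.normal(1.0, 1.0, 20000)   # samples from p(x|theta_1)

# Exact ratio p0/p1 = exp(0.5 - x) stands in for the estimate r_hat
w1 = np.exp(0.5 - x1)              # weights applied to the theta_1 sample

X = np.concatenate([x0, x1]).reshape(-1, 1)
y = np.concatenate([np.zeros_like(x0), np.ones_like(x1)])
w = np.concatenate([np.ones_like(x0), w1])

# Diagnostic classifier: with exact weights the weighted classes coincide,
# so its weighted ROC AUC should sit near 0.5 (chance level)
clf = LogisticRegression().fit(X, y, sample_weight=w)
auc = roc_auc_score(y, clf.predict_proba(X)[:, 1], sample_weight=w)
```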
Density ratio estimation
The density ratio r(x; θ0, θ1) = p(x|θ0) / p(x|θ1) appears in many other fundamental statistical inference problems, including
• transfer learning,
• outlier detection,
• divergence estimation,
• ...
For all of them, the proposed approximation can be used as a drop-in replacement!
Transfer learning
Under the assumption that train and test data are drawn i.i.d. from the same distribution ptrain,

(1/N) ∑_{xi} L(ϕ(xi)) → ∫ L(ϕ(x)) ptrain(x) dx,

as training data increases, i.e. as N → ∞.
But we want to be good on test data, i.e., minimize

∫ L(ϕ(x)) ptest(x) dx.
Importance weighting
Reweight samples by ptest(xi)/ptrain(xi), such that

(1/N) ∑_{xi} [ptest(xi)/ptrain(xi)] L(ϕ(xi)) → ∫ [ptest(x)/ptrain(x)] L(ϕ(x)) ptrain(x) dx = ∫ L(ϕ(x)) ptest(x) dx,

as training data increases, i.e. as N → ∞.
Again, ptest(xi)/ptrain(xi) cannot be evaluated directly, but it can be approximated with a calibrated classifier.
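A sketch of the classifier-based approximation in this setting, assuming scikit-learn: a train-vs-test classifier approximates s(x) = ptest/(ptrain + ptest), so s/(1 − s) approximates the importance weight directly (all names here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x_train = rng.normal(0.0, 1.0, 10000)   # ~ p_train
x_test = rng.normal(0.5, 1.0, 10000)    # ~ p_test (shifted covariates)

# Train a classifier to separate train (y=0) from test (y=1) samples,
# then turn its score into an importance-weight estimate s/(1 - s)
X = np.concatenate([x_train, x_test]).reshape(-1, 1)
y = np.concatenate([np.zeros(10000), np.ones(10000)])
s = LogisticRegression().fit(X, y).predict_proba(x_train.reshape(-1, 1))[:, 1]
w = s / (1.0 - s)                       # approx. p_test/p_train at train points

# Reweighted training expectations track the test distribution
mean_w = np.average(x_train, weights=w)  # close to the test mean 0.5
```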
Example
p0 : α = −2, β = 2 versus p1 : α = 0, β = 0
Summary
• We proposed an approach for approximating likelihood ratios in the likelihood-free setup.
• Evaluating likelihood ratios reduces to supervised learning. Both problems are deeply connected.
• Alternative to Approximate Bayesian Computation, without the need to define a prior over parameters.
References
Baldi, P., Cranmer, K., Faucett, T., Sadowski, P., and Whiteson, D. (2016). Parameterized Machine Learning for High-Energy Physics. arXiv preprint arXiv:1601.07913.
Cranmer, K., Pavez, J., and Louppe, G. (2015). Approximating likelihood ratios with calibrated discriminative classifiers. arXiv preprint arXiv:1506.02169.
Louppe, G., Cranmer, K., and Pavez, J. (2016). carl: a likelihood-free inference toolbox. http://dx.doi.org/10.5281/zenodo.47798.